This article describes the various data and metadata object types supplied by arrow, and documents how these objects are structured.
Arrow metadata classes
The arrow package defines the following classes for representing metadata:
- A Schema is a list of Field objects used to describe the structure of a tabular data object; where
- A Field specifies a character string name and a DataType; and
- A DataType is an attribute controlling how values are represented
Consider this example, in which a data frame is converted to an Arrow Table and its schema is printed:
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- arrow_table(df)
tb$schema
## Schema
## x: int32
## y: string
##
## See $metadata for additional Schema metadata
The schema that has been automatically inferred could also be manually created:
## Schema
## x: int32
## y: string
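The call that produces the schema printed above is not shown. As a minimal sketch, assuming the arrow package's field() constructor and type functions int32() and utf8(), a schema with the same two fields could be built like this (the variable name sch is only illustrative):

```r
library(arrow)

# Build the schema explicitly from Field objects, one per column.
# Each Field pairs a column name with an Arrow DataType.
sch <- schema(
  field("x", int32()),
  field("y", utf8())
)

# Printing the object lists each field name with its data type.
sch
```

Writing out each Field by hand is verbose, but it makes the Schema, Field, and DataType relationship described earlier explicit.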
The schema() function allows the following shorthand to define fields:
## Schema
## x: int32
## y: string
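The shorthand call itself is not shown above. A sketch of what it might look like, assuming schema() accepts name = type pairs, is given here together with a comparison against the longer field() form (the variable names are illustrative):

```r
library(arrow)

# Shorthand: each argument name becomes a field name,
# and each value is the Arrow DataType for that field.
sch_short <- schema(x = int32(), y = utf8())

# Longer form, constructing each Field explicitly.
sch_long <- schema(field("x", int32()), field("y", utf8()))

# The two schemas are expected to compare as equal.
sch_short$Equals(sch_long)
```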
Sometimes it is important to specify the schema manually, particularly if you want fine-grained control over the Arrow data types:
arrow_table(df, schema = schema(x = int64(), y = utf8()))
## Table
## 3 rows x 2 columns
## $x <int64>
## $y <string>
##
## See $metadata for additional Schema metadata
arrow_table(df, schema = schema(x = float64(), y = utf8()))
## Table
## 3 rows x 2 columns
## $x <double>
## $y <string>
##
## See $metadata for additional Schema metadata
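To confirm which type a column actually received, the schema can be inspected programmatically rather than read off the printed output. The following is a sketch that assumes the Schema method GetFieldByName() and the DataType method Equals(); the table names are illustrative:

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Let arrow infer the types, then override the type of x explicitly.
tb_default <- arrow_table(df)
tb_int64 <- arrow_table(df, schema = schema(x = int64(), y = utf8()))

# Pull the Field for column x out of each schema and look at its DataType.
type_default <- tb_default$schema$GetFieldByName("x")$type
type_int64 <- tb_int64$schema$GetFieldByName("x")$type

type_default
type_int64

# Compare against the type constructors instead of matching printed strings.
type_default$Equals(int32())
type_int64$Equals(int64())
```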
R object attributes
Arrow supports custom key-value metadata attached to Schemas. When we convert a data.frame to an Arrow Table or RecordBatch, the package stores any attributes() attached to the data.frame and to its columns in the Arrow object Schema. Attributes added to objects in this fashion are stored under the r key, as shown below:
# data frame with custom metadata
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
attr(df, "df_meta") <- "custom data frame metadata"
attr(df$y, "col_meta") <- "custom column metadata"
# when converted to a Table, the metadata is preserved
tb <- arrow_table(df)
tb$metadata
## $r
## $r$attributes
## $r$attributes$df_meta
## [1] "custom data frame metadata"
##
##
## $r$columns
## $r$columns$x
## NULL
##
## $r$columns$y
## $r$columns$y$attributes
## $r$columns$y$attributes$col_meta
## [1] "custom column metadata"
##
##
## $r$columns$y$columns
## NULL
It is also possible to assign additional string metadata under any other key you wish, using a command like this:
tb$metadata$new_key <- "new value"
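Because these metadata values are strings, anything else has to be converted on the way in and parsed on the way out. The sketch below illustrates this with a hypothetical row_limit key, which is an application-defined name and not something arrow itself interprets:

```r
library(arrow)

tb <- arrow_table(data.frame(x = 1:3))

# Store a plain string under a custom key.
tb$metadata$new_key <- "new value"

# Store a number by converting it to a string first.
# "row_limit" is a hypothetical key chosen for this example.
tb$metadata$row_limit <- as.character(100L)

# Read the values back; the number must be parsed from its string form.
tb$metadata$new_key
as.integer(tb$metadata$row_limit)
```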
Metadata attached to a Schema is preserved when writing the Table to Arrow/Feather or Parquet formats. When reading those files into R, or when calling as.data.frame() on a Table or RecordBatch, the column attributes are restored to the columns of the resulting data.frame. This means that custom data types, including haven::labelled, vctrs annotations, and others, are preserved when doing a round-trip through Arrow.
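The round trip can be checked directly. The following sketch assumes write_feather() and read_feather() from the arrow package and uses a temporary file; it tests a simple character attribute rather than a haven or vctrs type:

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
attr(df$y, "col_meta") <- "custom column metadata"

# In-memory round trip: data.frame -> Table -> data.frame.
df_mem <- as.data.frame(arrow_table(df))
attr(df_mem$y, "col_meta")

# File round trip: write to a temporary Feather file and read it back.
tf <- tempfile(fileext = ".feather")
write_feather(df, tf)
df_file <- read_feather(tf)
attr(df_file$y, "col_meta")

# Remove the temporary file.
unlink(tf)
```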
Note that the attributes stored in $metadata$r are only understood by R. If you write a data.frame with haven columns to a Feather file and read that in Pandas, the haven metadata won’t be recognized there. Similarly, Pandas writes its own custom metadata, which the R package does not consume. You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys.
Further reading
- To learn more about arrow metadata, see the documentation for schema().
- To learn more about data types, see the data types article.