pyarrow.parquet.write_table
pyarrow.parquet.write_table(table, where, row_group_size=None, version='2.6', use_dictionary=True, compression='snappy', write_statistics=True, use_deprecated_int96_timestamps=None, coerce_timestamps=None, allow_truncated_timestamps=False, data_page_size=None, flavor=None, filesystem=None, compression_level=None, use_byte_stream_split=False, column_encoding=None, data_page_version='1.0', use_compliant_nested_type=True, encryption_properties=None, write_batch_size=None, dictionary_pagesize_limit=None, store_schema=True, write_page_index=False, write_page_checksum=False, sorting_columns=None, store_decimal_as_integer=False, **kwargs)
Write a Table to Parquet format.
Parameters:
- table : pyarrow.Table
- where : str or pyarrow.NativeFile
- row_group_size : int
Maximum number of rows in each written row group. If None, the row group size will be the minimum of the Table size and 1024 * 1024.
- version : {“1.0”, “2.4”, “2.6”}, default “2.6”
Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions. Files written with version=’2.4’ or ‘2.6’ may not be readable in all Parquet implementations, so version=’1.0’ is likely the choice that maximizes file compatibility. UINT32 and some logical types are only available with version ‘2.4’. Nanosecond timestamps are only available with version ‘2.6’. Other features such as compression algorithms or the new serialized data page format must be enabled separately (see ‘compression’ and ‘data_page_version’).
- use_dictionary : bool or list, default True
Specify if we should use dictionary encoding in general or only for some columns. When encoding the column, if the dictionary size is too large, the column will fall back to PLAIN encoding. Note that the BOOLEAN type does not support dictionary encoding.
- compression : str or dict, default ‘snappy’
Specify the compression codec, either on a general basis or per-column. Valid values: {‘NONE’, ‘SNAPPY’, ‘GZIP’, ‘BROTLI’, ‘LZ4’, ‘ZSTD’}.
- write_statistics : bool or list, default True
Specify if we should write statistics in general (default is True) or only for some columns.
- use_deprecated_int96_timestamps : bool, default None
Write timestamps to INT96 Parquet format. Defaults to False unless enabled by the flavor argument. This takes priority over the coerce_timestamps option.
- coerce_timestamps : str, default None
Cast timestamps to a particular resolution. If omitted, defaults are chosen depending on version. For version='1.0' and version='2.4', nanoseconds are cast to microseconds (‘us’), while for version='2.6' (the default), they are written natively without loss of resolution. Seconds are always cast to milliseconds (‘ms’) by default, as Parquet does not have any temporal type with seconds resolution. If the casting results in loss of data, it will raise an exception unless allow_truncated_timestamps=True is given. Valid values: {None, ‘ms’, ‘us’}.
- allow_truncated_timestamps : bool, default False
Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to ‘ms’, do not raise an exception. Passing allow_truncated_timestamps=True will NOT result in the truncation exception being ignored unless coerce_timestamps is also set.
- data_page_size : int, default None
Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). If None, use the default data page size of 1 MB.
- flavor : {‘spark’}, default None
Sanitize schema or set other compatibility options to work with various target systems.
- filesystem : FileSystem, default None
If nothing is passed, the filesystem will be inferred from where if it is path-like; otherwise where is assumed to be a file-like object and no filesystem is needed.
- compression_level : int or dict, default None
Specify the compression level for a codec, either on a general basis or per-column. If None is passed, Arrow selects the compression level for the compression codec in use. The compression level has a different meaning for each codec, so you have to read the documentation of the codec you are using. An exception is thrown if the compression codec does not allow specifying a compression level.
- use_byte_stream_split : bool or list, default False
Specify if the byte_stream_split encoding should be used in general or only for some columns. If both dictionary and byte_stream_split are enabled, then dictionary is preferred. The byte_stream_split encoding is valid for integer, floating-point and fixed-size binary data types (including decimals); it should be combined with a compression codec so as to achieve size reduction.
- column_encoding : str or dict, default None
Specify the encoding scheme on a per-column basis. Can only be used when use_dictionary is set to False, and cannot be used in combination with use_byte_stream_split. Currently supported values: {‘PLAIN’, ‘BYTE_STREAM_SPLIT’, ‘DELTA_BINARY_PACKED’, ‘DELTA_LENGTH_BYTE_ARRAY’, ‘DELTA_BYTE_ARRAY’}. Certain encodings are only compatible with certain data types. Please refer to the encodings section of Reading and writing Parquet files.
- data_page_version : {“1.0”, “2.0”}, default “1.0”
The serialized Parquet data page format version to write, defaults to 1.0. This does not impact the file schema logical types and Arrow to Parquet type casting behavior; for that use the “version” option.
- use_compliant_nested_type : bool, default True
Whether to write compliant Parquet nested types (lists) as defined by the Parquet specification, defaults to True. For use_compliant_nested_type=True, this will write into a list with 3-level structure where the middle level, named list, is a repeated group with a single field named element:

<list-repetition> group <name> (LIST) {
    repeated group list {
        <element-repetition> <element-type> element;
    }
}

For use_compliant_nested_type=False, this will also write into a list with 3-level structure, where the name of the single field of the middle level list is taken from the element name for nested columns in Arrow, which defaults to item:

<list-repetition> group <name> (LIST) {
    repeated group list {
        <element-repetition> <element-type> item;
    }
}
- encryption_properties : FileEncryptionProperties, default None
File encryption properties for Parquet Modular Encryption. If None, no encryption will be done. The encryption properties can be created using CryptoFactory.file_encryption_properties().
- write_batch_size : int, default None
Number of values to write to a page at a time. If None, use the default of 1024. write_batch_size is complementary to data_page_size. If pages are exceeding the data_page_size due to large column values, lowering the batch size can help keep page sizes closer to the intended size.
- dictionary_pagesize_limit : int, default None
Specify the dictionary page size limit per row group. If None, use the default of 1 MB.
- store_schema : bool, default True
By default, the Arrow schema is serialized and stored in the Parquet file metadata (in the “ARROW:schema” key). When reading the file, if this key is available, it will be used to more faithfully recreate the original Arrow data. For example, for tz-aware timestamp columns it will restore the timezone (Parquet only stores the UTC values without timezone), or columns with duration type will be restored from the int64 Parquet column.
- write_page_index : bool, default False
Whether to write a page index in general for all columns. Writing statistics to the page index disables the old method of writing statistics to each data page header. The page index makes statistics-based filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O. Note that the page index is not yet used on the read side by PyArrow.
- write_page_checksum : bool, default False
Whether to write page checksums in general for all columns. Page checksums enable detection of data corruption, which might occur during transmission or in storage.
- sorting_columns : Sequence of SortingColumn, default None
Specify the sort order of the data being written. The writer does not sort the data nor does it verify that the data is sorted. The sort order is written to the row group metadata, which can then be used by readers.
- store_decimal_as_integer : bool, default False
Allow decimals with 1 <= precision <= 18 to be stored as integers. In Parquet, DECIMAL can be stored in any of the following physical types:
  - int32: for 1 <= precision <= 9.
  - int64: for 10 <= precision <= 18.
  - fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits.
  - binary: precision is unlimited. The minimum number of bytes to store the unscaled value is used.
By default, this is DISABLED and all decimal types annotate fixed_len_byte_array. When enabled, the writer will use the following physical types to store decimals:
  - int32: for 1 <= precision <= 9.
  - int64: for 10 <= precision <= 18.
  - fixed_len_byte_array: for precision > 18.
As a consequence, decimal columns stored in integer types are more compact.
- **kwargs : optional
Additional options for ParquetWriter.
Examples
Generate an example PyArrow Table:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
and write the Table into a Parquet file:
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
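Pinning an older format version for maximum reader compatibility (a minimal sketch using the version parameter described above):
>>> pq.write_table(table, 'example.parquet', version='1.0')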
Defining row group size for the Parquet file:
>>> pq.write_table(table, 'example.parquet', row_group_size=3)
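Page sizing can be tuned in a similar way; the byte and batch values below are purely illustrative, not recommendations:
>>> pq.write_table(table, 'example.parquet',
...                data_page_size=64 * 1024,
...                write_batch_size=128)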
Defining row group compression (default is Snappy):
>>> pq.write_table(table, 'example.parquet', compression='none')
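Supplying a compression level for a codec that supports one (ZSTD does; level 9 here is only an example):
>>> pq.write_table(table, 'example.parquet',
...                compression='zstd', compression_level=9)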
Defining row group compression and encoding per-column:
>>> pq.write_table(table, 'example.parquet',
...                compression={'n_legs': 'snappy', 'animal': 'gzip'},
...                use_dictionary=['n_legs', 'animal'])
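The use_byte_stream_split encoding applies to numeric columns such as floating-point data; the column, file name and codec below are assumptions for illustration:
>>> float_table = pa.table({'value': pa.array([1.5, 2.5, 3.5], type=pa.float64())})
>>> pq.write_table(float_table, 'example_floats.parquet',
...                use_dictionary=False,
...                use_byte_stream_split=['value'],
...                compression='zstd')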
Defining column encoding per-column:
>>> pq.write_table(table, 'example.parquet',
...                column_encoding={'animal': 'PLAIN'},
...                use_dictionary=False)
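Coercing timestamp resolution at write time; the ts column is a hypothetical example, and allow_truncated_timestamps=True suppresses the error that would otherwise be raised if precision were lost:
>>> import datetime
>>> ts_table = pa.table({'ts': [datetime.datetime(2023, 1, 1, 12, 30)]})
>>> pq.write_table(ts_table, 'example_ts.parquet',
...                coerce_timestamps='us',
...                allow_truncated_timestamps=True)
Recording a sort order in the row group metadata; the writer does not sort the data itself, so it is sorted first here (SortingColumn takes the column index, 0 for n_legs):
>>> sorted_table = table.sort_by([('n_legs', 'ascending')])
>>> pq.write_table(sorted_table, 'example_sorted.parquet',
...                sorting_columns=[pq.SortingColumn(0)])
Passing an explicit filesystem instead of relying on inference from the path (a local filesystem is used purely as an example):
>>> from pyarrow import fs
>>> pq.write_table(table, 'example.parquet',
...                filesystem=fs.LocalFileSystem())
A sketch of store_decimal_as_integer with a small-precision decimal column; the column name and precision are illustrative:
>>> import decimal
>>> dec_table = pa.table({'price': pa.array([decimal.Decimal('1.23')],
...                                         type=pa.decimal128(9, 2))})
>>> pq.write_table(dec_table, 'example_decimal.parquet',
...                store_decimal_as_integer=True)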