pyarrow.dataset.HivePartitioning

class pyarrow.dataset.HivePartitioning(Schema schema, dictionaries=None, null_fallback=u'__HIVE_DEFAULT_PARTITION__', segment_encoding=u'uri')

Bases: KeyValuePartitioning

A Partitioning for “/$key=$value/” nested directories as found in Apache Hive.

Multi-level, directory-based partitioning scheme originating from Apache Hive, with all data files stored in the leaf directories. Data is partitioned by static values of a particular column in the schema. Partition keys are represented in the form $key=$value in directory names. Field order is ignored, as are missing or unrecognized field names.

For example, given a schema <year:int16, month:int8, day:int8>, a possible path would be “/year=2009/month=11/day=15”.

Parameters:
schema : Schema

The schema that describes the partitions present in the file path.

dictionaries : dict[str, Array]

If the type of any field of schema is a dictionary type, the corresponding entry of dictionaries must be an array containing every value which may be taken by the corresponding column, or an error will be raised during parsing.

null_fallback : str, default “__HIVE_DEFAULT_PARTITION__”

If any field is None, this fallback will be used as its label.

segment_encoding : str, default “uri”

After splitting paths into segments, decode the segments. Valid values are “uri” (URI-decode segments) and “none” (leave as-is).

Returns:
HivePartitioning

Examples

>>> import pyarrow as pa
>>> from pyarrow.dataset import HivePartitioning
>>> partitioning = HivePartitioning(
...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
>>> print(partitioning.parse("/year=2009/month=11/"))
((year == 2009) and (month == 11))
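
As a minimal sketch of the dictionaries and null_fallback arguments (the “color” field, its values, and the sample paths here are illustrative assumptions, not part of the API): a dictionary-typed field must enumerate its admissible values up front, and a path segment equal to the fallback label parses as a null value for that field.

>>> import pyarrow as pa
>>> from pyarrow.dataset import HivePartitioning
>>> dict_partitioning = HivePartitioning(
...     pa.schema([("year", pa.int16()),
...                ("color", pa.dictionary(pa.int32(), pa.string()))]),
...     dictionaries={"color": pa.array(["red", "green", "blue"])})
>>> expr = dict_partitioning.parse("/year=2009/color=red")
>>> # a segment equal to null_fallback is parsed as a null year:
>>> null_expr = dict_partitioning.parse(
...     "/year=__HIVE_DEFAULT_PARTITION__/color=red")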
__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

discover([infer_dictionary, ...])

Discover a HivePartitioning.

format(self, expr)

Convert a filter expression into a tuple of (directory, filename) using the current partitioning scheme.

parse(self, path)

Parse a path into a partition expression.

Attributes

dictionaries

The unique values for each partition field, if available.

schema

The arrow Schema attached to the partitioning.

dictionaries

The unique values for each partition field, if available.

Those values are only available if the Partitioning object was created through dataset discovery from a PartitioningFactory, or if the dictionaries were manually specified in the constructor. If no dictionary field is available, this returns an empty list.
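A brief hedged sketch of reading this attribute back from a manually constructed partitioning (the field name and values are illustrative):

>>> import pyarrow as pa
>>> from pyarrow.dataset import HivePartitioning
>>> part = HivePartitioning(
...     pa.schema([("color", pa.dictionary(pa.int32(), pa.string()))]),
...     dictionaries={"color": pa.array(["red", "green", "blue"])})
>>> dicts = part.dictionaries  # one entry per partition field, if available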

static discover(infer_dictionary=False, max_partition_dictionary_size=0, null_fallback='__HIVE_DEFAULT_PARTITION__', schema=None, segment_encoding='uri')

Discover a HivePartitioning.

Parameters:
infer_dictionary : bool, default False

When inferring a schema for partition fields, yield dictionary-encoded types instead of plain types. This can be more efficient when materializing virtual columns, and Expressions parsed by the finished Partitioning will include dictionaries of all unique inspected values for each field.

max_partition_dictionary_size : int, default 0

Synonymous with infer_dictionary for backwards compatibility with 1.0: setting this to -1 or None is equivalent to passing infer_dictionary=True.

null_fallback : str, default “__HIVE_DEFAULT_PARTITION__”

When inferring a schema for partition fields, this value will be replaced by null. The default is set to __HIVE_DEFAULT_PARTITION__ for compatibility with Spark.

schema : Schema, default None

Use this schema instead of inferring a schema from partition values. Partition values will be validated against this schema before accumulation into the Partitioning’s dictionary.

segment_encoding : str, default “uri”

After splitting paths into segments, decode the segments. Valid values are “uri” (URI-decode segments) and “none” (leave as-is).

Returns:
PartitioningFactory

To be used in the FileSystemFactoryOptions.
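As a sketch of typical use (the “data/” directory and its key=value file layout are assumptions for illustration): pass the returned factory as the partitioning argument of pyarrow.dataset.dataset(), which performs the discovery and attaches the inferred partitioning to the resulting dataset.

>>> import pyarrow.dataset as ds
>>> factory = ds.HivePartitioning.discover(infer_dictionary=True)
>>> dataset = ds.dataset("data/", format="parquet", partitioning=factory)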

format(self, expr)

Convert a filter expression into a tuple of (directory, filename) using the current partitioning scheme.

Parameters:
expr : pyarrow.dataset.Expression
Returns:
tuple[str, str]

Examples

Specify the Schema for paths like “/2009/June”:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.compute as pc
>>> part = ds.partitioning(pa.schema([("year", pa.int16()),
...                                   ("month", pa.string())]))
>>> part.format(
...     (pc.field("year") == 1862) & (pc.field("month") == "Jan")
... )
('1862/Jan', '')
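
For a HivePartitioning specifically, the same expression should format to key=value segments; a hedged sketch, continuing the imports above:

>>> hive_part = ds.HivePartitioning(pa.schema([("year", pa.int16()),
...                                            ("month", pa.string())]))
>>> directory, filename = hive_part.format(
...     (pc.field("year") == 1862) & (pc.field("month") == "Jan")
... )  # directory should take the "year=1862/month=Jan" form; filename is empty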
parse(self, path)

Parse a path into a partition expression.

Parameters:
path : str
Returns:
pyarrow.dataset.Expression
schema

The arrow Schema attached to the partitioning.