pyarrow.dataset.dataset#
- pyarrow.dataset.dataset(source, schema=None, format=None, filesystem=None, partitioning=None, partition_base_dir=None, exclude_invalid_files=None, ignore_prefixes=None)[source]#
Open a dataset.
Datasets provides functionality to efficiently work with tabular, potentially larger than memory and multi-file dataset.
A unified interface for different sources, like Parquet and Feather
Discovery of sources (crawling directories, handle directory-based partitioned datasets, basic schema normalization)
Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading or fine-grained managing of tasks.
Note that this is the high-level API, to have more control over the dataset construction use the low-level API classes (FileSystemDataset, FilesystemDatasetFactory, etc.)
- Parameters:
- sourcepath,
list
of paths, dataset,list
of datasets, (list
of)RecordBatch
orTable
, iterable ofRecordBatch
, RecordBatchReader, or URI - Path pointing to a single file:
Open a FileSystemDataset from a single file.
- Path pointing to a directory:
The directory gets discovered recursively according to a partitioning scheme if given.
- List of file paths:
Create a FileSystemDataset from explicitly given files. The files must be located on the same filesystem given by the filesystem parameter. Note that in contrary of construction from a single file, passing URIs as paths is not allowed.
- List of datasets:
A nested UnionDataset gets constructed, it allows arbitrary composition of other datasets. Note that additional keyword arguments are not allowed.
- (List of) batches or tables, iterable of batches, or RecordBatchReader:
Create an InMemoryDataset. If an iterable or empty list is given, a schema must also be given. If an iterable or RecordBatchReader is given, the resulting dataset can only be scanned once; further attempts will raise an error.
- schema
Schema
, optional Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source.
- format
FileFormat
orstr
Currently “parquet”, “ipc”/”arrow”/”feather”, “csv”, “json”, and “orc” are supported. For Feather, only version 2 files are supported.
- filesystem
FileSystem
or URIstr
, defaultNone
If a single path is given as source and filesystem is None, then the filesystem will be inferred from the path. If an URI string is passed, then a filesystem object is constructed using the URI’s optional path component as a directory prefix. See the examples below. Note that the URIs on Windows must follow ‘file:///C:…’ or ‘file:/C:…’ patterns.
- partitioning
Partitioning
,PartitioningFactory
,str
,list
ofstr
The partitioning scheme specified with the
partitioning()
function. A flavor string can be used as shortcut, and with a list of field names a DirectoryPartitioning will be inferred.- partition_base_dir
str
, optional For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix will be skipped for partitioning discovery. The ignored files will still be part of the Dataset, but will not have partition information.
- exclude_invalid_filesbool, optional (default
True
) If True, invalid files will be excluded (file format specific check). This will incur IO for each files in a serial and single threaded fashion. Disabling this feature will skip the IO, but unsupported files may be present in the Dataset (resulting in an error at scan time).
- ignore_prefixes
list
, optional Files matching any of these prefixes will be ignored by the discovery process. This is matched to the basename of a path. By default this is [‘.’, ‘_’]. Note that discovery happens only if a directory is passed as source.
- sourcepath,
- Returns:
- dataset
Dataset
Either a FileSystemDataset or a UnionDataset depending on the source parameter.
- dataset
Examples
Creating an example Table:
>>> import pyarrow as pa >>> import pyarrow.parquet as pq >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021], ... 'n_legs': [2, 2, 4, 4, 5, 100], ... 'animal': ["Flamingo", "Parrot", "Dog", "Horse", ... "Brittle stars", "Centipede"]}) >>> pq.write_table(table, "file.parquet")
Opening a single file:
>>> import pyarrow.dataset as ds >>> dataset = ds.dataset("file.parquet", format="parquet") >>> dataset.to_table() pyarrow.Table year: int64 n_legs: int64 animal: string ---- year: [[2020,2022,2021,2022,2019,2021]] n_legs: [[2,2,4,4,5,100]] animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
Opening a single file with an explicit schema:
>>> myschema = pa.schema([ ... ('n_legs', pa.int64()), ... ('animal', pa.string())]) >>> dataset = ds.dataset("file.parquet", schema=myschema, format="parquet") >>> dataset.to_table() pyarrow.Table n_legs: int64 animal: string ---- n_legs: [[2,2,4,4,5,100]] animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
Opening a dataset for a single directory:
>>> ds.write_dataset(table, "partitioned_dataset", format="parquet", ... partitioning=['year']) >>> dataset = ds.dataset("partitioned_dataset", format="parquet") >>> dataset.to_table() pyarrow.Table n_legs: int64 animal: string ---- n_legs: [[5],[2],[4,100],[2,4]] animal: [["Brittle stars"],["Flamingo"],...["Parrot","Horse"]]
For a single directory from a S3 bucket:
>>> ds.dataset("s3://mybucket/nyc-taxi/", ... format="parquet")
Opening a dataset from a list of relatives local paths:
>>> dataset = ds.dataset([ ... "partitioned_dataset/2019/part-0.parquet", ... "partitioned_dataset/2020/part-0.parquet", ... "partitioned_dataset/2021/part-0.parquet", ... ], format='parquet') >>> dataset.to_table() pyarrow.Table n_legs: int64 animal: string ---- n_legs: [[5],[2],[4,100]] animal: [["Brittle stars"],["Flamingo"],["Dog","Centipede"]]
With filesystem provided:
>>> paths = [ ... 'part0/data.parquet', ... 'part1/data.parquet', ... 'part3/data.parquet', ... ] >>> ds.dataset(paths, filesystem='file:///directory/prefix, ... format='parquet')
Which is equivalent with:
>>> fs = SubTreeFileSystem("/directory/prefix", ... LocalFileSystem()) >>> ds.dataset(paths, filesystem=fs, format='parquet')
With a remote filesystem URI:
>>> paths = [ ... 'nested/directory/part0/data.parquet', ... 'nested/directory/part1/data.parquet', ... 'nested/directory/part3/data.parquet', ... ] >>> ds.dataset(paths, filesystem='s3://bucket/', ... format='parquet')
Similarly to the local example, the directory prefix may be included in the filesystem URI:
>>> ds.dataset(paths, filesystem='s3://bucket/nested/directory', ... format='parquet')
Construction of a nested dataset:
>>> ds.dataset([ ... dataset("s3://old-taxi-data", format="parquet"), ... dataset("local/path/to/data", format="ipc") ... ])