pyarrow.fs.S3FileSystem#

class pyarrow.fs.S3FileSystem(access_key=None, *, secret_key=None, session_token=None, bool anonymous=False, region=None, request_timeout=None, connect_timeout=None, scheme=None, endpoint_override=None, bool background_writes=True, default_metadata=None, role_arn=None, session_name=None, external_id=None, load_frequency=900, proxy_options=None, allow_bucket_creation=False, allow_bucket_deletion=False, check_directory_existence_before_creation=False, retry_strategy: S3RetryStrategy = AwsStandardS3RetryStrategy(max_attempts=3), force_virtual_addressing=False)#

Bases: FileSystem

S3-backed FileSystem implementation

AWS access_key and secret_key can be provided explicitly.

If role_arn is provided instead of access_key and secret_key, temporary credentials will be fetched by issuing a request to STS to assume the specified role.

If neither access_key, secret_key, nor role_arn are provided, S3FileSystem attempts to establish the credentials automatically. It will try the following methods, in order:

  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables

  • configuration files such as ~/.aws/credentials and ~/.aws/config

  • for nodes on Amazon EC2, the EC2 Instance Metadata Service
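
For instance, a minimal sketch of the common ways to configure credentials (the key values below are placeholders, not real credentials):

from pyarrow import fs

# Explicit credentials (placeholder values, not real keys)
s3_explicit = fs.S3FileSystem(access_key='YOUR_ACCESS_KEY_ID',
                              secret_key='YOUR_SECRET_ACCESS_KEY')

# Anonymous access to public buckets; no credential lookup is attempted
s3_anonymous = fs.S3FileSystem(anonymous=True)

# No credentials given: fall back to the resolution chain described above
s3_default = fs.S3FileSystem()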

Note: S3 buckets are special and the operations available on them may be limited or more expensive than desired.

When S3FileSystem creates new buckets (assuming allow_bucket_creation is True), it does not pass any non-default settings. In AWS S3, the bucket and all objects will not be publicly visible, and will have no bucket policies and no resource tags. To have more control over how buckets are created, use a different API to create them.

Parameters:
access_keystr, default None

AWS Access Key ID. Pass None to use the standard AWS environment variables and/or configuration file.

secret_keystr, default None

AWS Secret Access key. Pass None to use the standard AWS environment variables and/or configuration file.

session_tokenstr, default None

AWS Session Token. An optional session token, required if access_key and secret_key are temporary credentials from STS.

anonymousbool, default False

Whether to connect anonymously if access_key and secret_key are None. If true, will not attempt to look up credentials using standard AWS configuration methods.

role_arnstr, default None

AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role.

session_namestr, default None

An optional identifier for the assumed role session.

external_idstr, default None

An optional unique identifier that might be required when you assume a role in another account.

load_frequencyint, default 900

The frequency (in seconds) with which temporary credentials from an assumed role session will be refreshed.
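
A minimal sketch of assuming a role for temporary credentials (the ARN and identifiers below are placeholders):

from pyarrow import fs

s3 = fs.S3FileSystem(
    role_arn='arn:aws:iam::123456789012:role/my-read-role',  # placeholder ARN
    session_name='pyarrow-session',    # optional label for the STS session
    external_id='my-external-id',      # only needed if the role requires one
    load_frequency=900,                # refresh temporary credentials every 900 s
)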

regionstr, default None

AWS region to connect to. If not set, the AWS SDK will attempt to determine the region using heuristics such as environment variables, configuration profile, or EC2 metadata, or will default to ‘us-east-1’ when the SDK version is older than 1.8. One can also use pyarrow.fs.resolve_s3_region() to automatically resolve the region from a bucket name.
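
For example (the bucket name is a placeholder; resolving its region requires network access):

from pyarrow import fs

region = fs.resolve_s3_region('my-bucket')   # e.g. 'us-west-2'
s3 = fs.S3FileSystem(region=region)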

request_timeoutdouble, default None

Socket read timeouts on Windows and macOS, in seconds. If omitted, the AWS SDK default value is used (typically 3 seconds). This option is ignored on non-Windows, non-macOS systems.

connect_timeoutdouble, default None

Socket connection timeout, in seconds. If omitted, the AWS SDK default value is used (typically 1 second).

schemestr, default ‘https’

S3 connection transport scheme.

endpoint_overridestr, default None

Override the default S3 endpoint with a connect string such as “localhost:9000”.

background_writesbool, default True

Whether file writes will be issued in the background, without blocking.

default_metadatamapping or pyarrow.KeyValueMetadata, default None

Default metadata for open_output_stream. This will be ignored if non-empty metadata is passed to open_output_stream.

proxy_optionsdict or str, default None

If a proxy is used, provide the options here. Supported options are: ‘scheme’ (str: ‘http’ or ‘https’; required), ‘host’ (str; required), ‘port’ (int; required), ‘username’ (str; optional), ‘password’ (str; optional). A proxy URI (str) can also be provided, in which case these options will be derived from the provided URI. The following are equivalent:

S3FileSystem(proxy_options='http://username:password@localhost:8020')
S3FileSystem(proxy_options={'scheme': 'http', 'host': 'localhost',
                            'port': 8020, 'username': 'username',
                            'password': 'password'})
allow_bucket_creationbool, default False

Whether to allow directory creation at the bucket level (i.e. creating new buckets). This option may also be passed in a URI query parameter.

allow_bucket_deletionbool, default False

Whether to allow directory deletion at the bucket level (i.e. deleting buckets). This option may also be passed in a URI query parameter.
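
A minimal sketch of enabling these options, both directly and through URI query parameters (the bucket name is a placeholder):

from pyarrow import fs

# Passed explicitly to the constructor
s3 = fs.S3FileSystem(allow_bucket_creation=True, allow_bucket_deletion=True)

# Or passed as URI query parameters when creating the filesystem from a URI
s3_from_uri, path = fs.FileSystem.from_uri(
    's3://my-bucket/?allow_bucket_creation=true&allow_bucket_deletion=true')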

check_directory_existence_before_creationbool, default False

Whether to check for directory existence before creating it. If False, when creating a directory the code will not check whether it already exists; it is cheaper to attempt the creation and catch the error than to issue two dependent I/O calls. If True, a directory is only created when necessary, at the cost of extra I/O calls. This can be useful for key/value cloud storage with a hard rate limit on the number of object mutation operations, or for scenarios where the directories already exist and you do not have creation access.

retry_strategyS3RetryStrategy, default AwsStandardS3RetryStrategy(max_attempts=3)

The retry strategy to use with S3; fail after max_attempts. Available strategies are AwsStandardS3RetryStrategy, AwsDefaultS3RetryStrategy.
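
For example, a sketch of allowing more attempts than the default (both strategy classes are exposed in pyarrow.fs):

from pyarrow import fs

s3 = fs.S3FileSystem(
    retry_strategy=fs.AwsStandardS3RetryStrategy(max_attempts=5))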

force_virtual_addressingbool, default False

Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access.
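
A minimal sketch for an S3-compatible, non-AWS backend reachable at a custom endpoint (the endpoint and credentials are placeholders):

from pyarrow import fs

s3_compat = fs.S3FileSystem(
    access_key='YOUR_ACCESS_KEY',             # placeholder
    secret_key='YOUR_SECRET_KEY',             # placeholder
    endpoint_override='storage.example.com',  # placeholder non-AWS endpoint
    force_virtual_addressing=True,            # backend only supports virtual hosted-style access
)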

Examples

>>> from pyarrow import fs
>>> s3 = fs.S3FileSystem(region='us-west-2')
>>> s3.get_file_info(fs.FileSelector(
...    'power-analysis-ready-datastore/power_901_constants.zarr/FROCEAN', recursive=True
... ))
[<FileInfo for 'power-analysis-ready-datastore/power_901_constants.zarr/FROCEAN/.zarray...

For usage of the methods see examples for LocalFileSystem().

__init__(*args, **kwargs)#

Methods

__init__(*args, **kwargs)

copy_file(self, src, dest)

Copy a file.

create_dir(self, path, *, bool recursive=True)

Create a directory and subdirectories.

delete_dir(self, path)

Delete a directory and its contents, recursively.

delete_dir_contents(self, path, *, ...)

Delete a directory's contents, recursively.

delete_file(self, path)

Delete a file.

equals(self, FileSystem other)

from_uri(uri)

Create a new FileSystem from URI or Path.

get_file_info(self, paths_or_selector)

Get info for the given files.

move(self, src, dest)

Move / rename a file or directory.

normalize_path(self, path)

Normalize filesystem path.

open_append_stream(self, path[, ...])

Open an output stream for appending.

open_input_file(self, path)

Open an input file for random access reading.

open_input_stream(self, path[, compression, ...])

Open an input stream for sequential reading.

open_output_stream(self, path[, ...])

Open an output stream for sequential writing.

Attributes

region

The AWS region this filesystem connects to.

type_name

The filesystem's type name.

copy_file(self, src, dest)#

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

Parameters:
srcstr

The path of the file to be copied from.

deststr

The destination path where the file is copied to.

Examples

>>> local.copy_file(path,
...                 local_path + '/pyarrow-fs-example_copy.dat')

Inspect the file info:

>>> local.get_file_info(local_path + '/pyarrow-fs-example_copy.dat')
<FileInfo for '/.../pyarrow-fs-example_copy.dat': type=FileType.File, size=4>
>>> local.get_file_info(path)
<FileInfo for '/.../pyarrow-fs-example.dat': type=FileType.File, size=4>
create_dir(self, path, *, bool recursive=True)#

Create a directory and subdirectories.

This function succeeds if the directory already exists.

Parameters:
pathstr

The path of the new directory.

recursivebool, default True

Create nested directories as well.
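
A minimal sketch against S3 (the bucket name is a placeholder; the bucket must already exist unless allow_bucket_creation is True):

from pyarrow import fs

s3 = fs.S3FileSystem(region='us-west-2')
s3.create_dir('my-bucket/new/nested/prefix')  # intermediate directories are created as well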

delete_dir(self, path)#

Delete a directory and its contents, recursively.

Parameters:
pathstr

The path of the directory to be deleted.
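
A minimal sketch, reusing the s3 filesystem from the example above (placeholder bucket and prefix):

s3.delete_dir('my-bucket/old/prefix')  # removes the prefix and every object under it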

delete_dir_contents(self, path, *, bool accept_root_dir=False, bool missing_dir_ok=False)#

Delete a directory’s contents, recursively.

Like delete_dir, but doesn’t delete the directory itself.

Parameters:
pathstr

The path of the directory to be deleted.

accept_root_dirbool, default False

Allow deleting the root directory’s contents (if path is empty or “/”)

missing_dir_okbool, default False

If False then an error is raised if path does not exist
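
A minimal sketch, reusing the same s3 filesystem (placeholder prefix):

s3.delete_dir_contents('my-bucket/staging', missing_dir_ok=True)  # empty the prefix but keep it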

delete_file(self, path)#

Delete a file.

Parameters:
pathstr

The path of the file to be deleted.
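
A minimal sketch, reusing the same s3 filesystem (placeholder object key):

s3.delete_file('my-bucket/staging/part-0.parquet')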

equals(self, FileSystem other)#
Parameters:
otherpyarrow.fs.FileSystem
Returns:
bool
static from_uri(uri)#

Create a new FileSystem from URI or Path.

Recognized URI schemes are “file”, “mock”, “s3”, “gs”, “gcs”, “hdfs” and “viewfs”. In addition, the argument can be a pathlib.Path object, or a string describing an absolute local path.

Parameters:
uristr

URI-based path, for example: file:///some/local/path.

Returns:
tuple of (FileSystem, str path)

With (filesystem, path) tuple where path is the abstract path inside the FileSystem instance.

Examples

Create a new FileSystem subclass from a URI:

>>> uri = 'file:///{}/pyarrow-fs-example.dat'.format(local_path)
>>> local_new, path_new = fs.FileSystem.from_uri(uri)
>>> local_new
<pyarrow._fs.LocalFileSystem object at ...
>>> path_new
'/.../pyarrow-fs-example.dat'

Or from a s3 bucket:

>>> fs.FileSystem.from_uri("s3://usgs-landsat/collection02/")
(<pyarrow._s3fs.S3FileSystem object at ...>, 'usgs-landsat/collection02')
get_file_info(self, paths_or_selector)#

Get info for the given files.

Any symlink is automatically dereferenced, recursively. A non-existing or unreachable file returns a FileInfo object with a FileType of value NotFound. An exception indicates a truly exceptional condition (low-level I/O error, etc.).

Parameters:
paths_or_selectorFileSelector, path-like or list of path-likes

Either a selector object, a path-like object or a list of path-like objects. The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, use allow_not_found.

Returns:
FileInfo or list of FileInfo

A single FileInfo object is returned for a single path, otherwise a list of FileInfo objects is returned.

Examples

>>> local
<pyarrow._fs.LocalFileSystem object at ...>
>>> local.get_file_info("/{}/pyarrow-fs-example.dat".format(local_path))
<FileInfo for '/.../pyarrow-fs-example.dat': type=FileType.File, size=4>
move(self, src, dest)#

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).

Parameters:
srcstr

The path of the file or the directory to be moved.

deststr

The destination path where the file or directory is moved to.

Examples

Create a new folder with a file:

>>> local.create_dir('/tmp/other_dir')
>>> local.copy_file(path,'/tmp/move_example.dat')

Move the file:

>>> local.move('/tmp/move_example.dat',
...            '/tmp/other_dir/move_example_2.dat')

Inspect the file info:

>>> local.get_file_info('/tmp/other_dir/move_example_2.dat')
<FileInfo for '/tmp/other_dir/move_example_2.dat': type=FileType.File, size=4>
>>> local.get_file_info('/tmp/move_example.dat')
<FileInfo for '/tmp/move_example.dat': type=FileType.NotFound>

Delete the folder:

>>> local.delete_dir('/tmp/other_dir')

normalize_path(self, path)#

Normalize filesystem path.

Parameters:
pathstr

The path to normalize

Returns:
normalized_pathstr

The normalized path

open_append_stream(self, path, compression='detect', buffer_size=None, metadata=None)#

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

Note

Some filesystem implementations do not support efficient appending to an existing file, in which case this method will raise NotImplementedError. Consider writing to multiple files (using e.g. the dataset layer) instead.

Parameters:
pathstr

The source to open for writing.

compressionstr optional, default ‘detect’

The compression algorithm to use for on-the-fly compression. If “detect” and source is a file path, then compression will be chosen based on the file extension. If None, no compression will be applied. Otherwise, a well-known algorithm name must be supplied (e.g. “gzip”).

buffer_sizeint optional, default None

If None or 0, no buffering will happen. Otherwise the size of the temporary write buffer.

metadatadict optional, default None

If not None, a mapping of string keys to string values. Some filesystems support storing metadata along the file (such as “Content-Type”). Unsupported metadata keys will be ignored.

Returns:
streamNativeFile

Examples

Append new data to an existing, non-empty file:

>>> with local.open_append_stream(path) as f:
...     f.write(b'+newly added')
12

Print out the content of the file:

>>> with local.open_input_file(path) as f:
...     print(f.readall())
b'data+newly added'
open_input_file(self, path)#

Open an input file for random access reading.

Parameters:
pathstr

The source to open for reading.

Returns:
streamNativeFile

Examples

Print the data from the file with open_input_file():

>>> with local.open_input_file(path) as f:
...     print(f.readall())
b'data'
open_input_stream(self, path, compression='detect', buffer_size=None)#

Open an input stream for sequential reading.

Parameters:
pathstr

The source to open for reading.

compressionstr optional, default ‘detect’

The compression algorithm to use for on-the-fly decompression. If “detect” and source is a file path, then compression will be chosen based on the file extension. If None, no compression will be applied. Otherwise, a well-known algorithm name must be supplied (e.g. “gzip”).

buffer_sizeint optional, default None

If None or 0, no buffering will happen. Otherwise the size of the temporary read buffer.

Returns:
streamNativeFile

Examples

Print the data from the file with open_input_stream():

>>> with local.open_input_stream(path) as f:
...     print(f.readall())
b'data'
open_output_stream(self, path, compression='detect', buffer_size=None, metadata=None)#

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

Parameters:
pathstr

The source to open for writing.

compressionstr optional, default ‘detect’

The compression algorithm to use for on-the-fly compression. If “detect” and source is a file path, then compression will be chosen based on the file extension. If None, no compression will be applied. Otherwise, a well-known algorithm name must be supplied (e.g. “gzip”).

buffer_sizeint optional, default None

If None or 0, no buffering will happen. Otherwise the size of the temporary write buffer.

metadatadict optional, default None

If not None, a mapping of string keys to string values. Some filesystems support storing metadata along the file (such as “Content-Type”). Unsupported metadata keys will be ignored.

Returns:
streamNativeFile

Examples

>>> local = fs.LocalFileSystem()
>>> with local.open_output_stream(path) as stream:
...     stream.write(b'data')
4
region#

The AWS region this filesystem connects to.

type_name#

The filesystem’s type name.