pyarrow.array
- pyarrow.array(obj, type=None, mask=None, size=None, from_pandas=None, bool safe=True, MemoryPool memory_pool=None)
Create a pyarrow.Array instance from a Python object.
- Parameters:
  - obj : sequence, iterable, ndarray, pandas.Series, or Arrow-compatible array
    If both type and size are specified, obj may be a single-use iterable. If not strongly-typed, the Arrow type will be inferred for the resulting array. Any Arrow-compatible array that implements the Arrow PyCapsule Protocol (has an __arrow_c_array__ or __arrow_c_device_array__ method) can be passed as well.
  - type : pyarrow.DataType
    Explicit type to attempt to coerce to, otherwise it will be inferred from the data.
  - mask : array[bool], optional
    Indicate which values are null (True) or not null (False).
  - size : int64, optional
    Number of elements to convert. If the input is larger than size, conversion stops at this length. For iterators, if size is larger than the input iterator this will be treated as a "max size", but will involve an initial allocation of size followed by a resize to the actual size (so if you know the exact size, specifying it correctly will give you better performance).
  - from_pandas : bool, default None
    Use pandas's semantics for inferring nulls from values in ndarray-like data. If passed, the mask takes precedence, but if a value is unmasked (not null) yet still null according to pandas semantics, then it is null. Defaults to False if not passed explicitly by the user, or True if a pandas object is passed in. See the sketch after this parameter list.
  - safe : bool, default True
    Check for overflows or other unsafe conversions.
  - memory_pool : pyarrow.MemoryPool, optional
    If not passed, memory will be allocated from the currently-set default memory pool.
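A minimal sketch of how from_pandas, mask, and size interact, assuming pandas and NumPy are installed (variable names are illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> # With pandas semantics (from_pandas=True, the default when a pandas
>>> # object is passed), NaN/None values are inferred as nulls.
>>> with_nulls = pa.array(pd.Series([1.0, None, 2.0]), from_pandas=True)
>>> # with_nulls should hold [1, null, 2]
>>> # An explicit mask takes precedence: True marks a position as null.
>>> masked = pa.array([1, 2, 3], mask=np.array([False, True, False]))
>>> # masked should hold [1, null, 3]
>>> # With both type and size given, a single-use iterator can be consumed;
>>> # conversion stops after size elements.
>>> first_five = pa.array(iter(range(100)), type=pa.int64(), size=5)
>>> # first_five should hold [0, 1, 2, 3, 4]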
- Returns:
  - array : pyarrow.Array or pyarrow.ChunkedArray
    A ChunkedArray instead of an Array is returned if:
    - the object data overflowed binary storage.
    - the object's __arrow_array__ protocol method returned a chunked array (a sketch of this case follows below).
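A minimal sketch of the __arrow_array__ case; ChunkedWrapper is a hypothetical class used only for illustration, not part of pyarrow:

>>> import pyarrow as pa
>>> class ChunkedWrapper:
...     """Hypothetical wrapper implementing the __arrow_array__ protocol."""
...     def __init__(self, chunks):
...         self.chunks = chunks
...     def __arrow_array__(self, type=None):
...         # Returning a ChunkedArray here makes pa.array() return a
...         # ChunkedArray rather than a plain Array.
...         return pa.chunked_array(self.chunks, type=type)
>>> result = pa.array(ChunkedWrapper([[1, 2], [3, 4]]))
>>> isinstance(result, pa.ChunkedArray)
True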
Notes
The timezone is preserved in the returned array for timezone-aware data, while naive timestamps produce a timestamp type without a timezone. Internally, timezone-aware data is stored as UTC values, with the timezone recorded in the data type.
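For example (a sketch assuming pandas is installed; the timezone name is arbitrary):

>>> import pandas as pd
>>> import pyarrow as pa
>>> # Timezone-aware input keeps its zone in the Arrow type; the values
>>> # themselves are stored as UTC.
>>> aware = pa.array(pd.Series(pd.date_range("2024-01-01", periods=2, tz="Europe/Paris")))
>>> # aware.type is expected to be timestamp[ns, tz=Europe/Paris]
>>> # Naive timestamps produce a timestamp type with no timezone.
>>> naive = pa.array(pd.Series(pd.date_range("2024-01-01", periods=2)))
>>> # naive.type is expected to be timestamp[ns]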
Pandas's DateOffset and dateutil.relativedelta.relativedelta objects are converted to a MonthDayNanoIntervalArray by default. relativedelta leapdays are ignored, as are all absolute fields on both objects. datetime.timedelta can also be converted to a MonthDayNanoIntervalArray, but this requires passing MonthDayNanoIntervalType explicitly.
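A sketch of both conversions (assuming pandas is installed; pa.month_day_nano_interval() constructs the explicit interval type):

>>> import datetime
>>> import pandas as pd
>>> import pyarrow as pa
>>> # DateOffset-like objects become a MonthDayNanoIntervalArray by default.
>>> offsets = pa.array([pd.DateOffset(months=1, days=2)])
>>> # datetime.timedelta is normally converted to a duration type; pass the
>>> # interval type explicitly to get a MonthDayNanoIntervalArray instead.
>>> intervals = pa.array([datetime.timedelta(days=3)],
...                      type=pa.month_day_nano_interval())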
Converting to dictionary array will promote to a wider integer type for indices if the number of distinct values cannot be represented, even if the index type was explicitly set. This means that if there are more than 127 values the returned dictionary array’s index type will be at least pa.int16() even if pa.int8() was passed to the function. Note that an explicit index type will not be demoted even if it is wider than required.
Examples
>>> import pandas as pd
>>> import pyarrow as pa
>>> pa.array(pd.Series([1, 2]))
<pyarrow.lib.Int64Array object at ...>
[
  1,
  2
]

>>> pa.array(["a", "b", "a"], type=pa.dictionary(pa.int8(), pa.string()))
<pyarrow.lib.DictionaryArray object at ...>
...
-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1,
    0
  ]

>>> import numpy as np
>>> pa.array(pd.Series([1, 2]), mask=np.array([0, 1], dtype=bool))
<pyarrow.lib.Int64Array object at ...>
[
  1,
  null
]

>>> arr = pa.array(range(1024), type=pa.dictionary(pa.int8(), pa.int64()))
>>> arr.type.index_type
DataType(int16)