pyarrow.array
- pyarrow.array(obj, type=None, mask=None, size=None, from_pandas=None, bool safe=True, MemoryPool memory_pool=None)
Create a pyarrow.Array instance from a Python object.
- Parameters:
  - obj : sequence, iterable, ndarray, pandas.Series, or Arrow-compatible array
    If both type and size are specified, obj may be a single-use iterable. If not strongly-typed, the Arrow type will be inferred for the resulting array. Any Arrow-compatible array that implements the Arrow PyCapsule Protocol (has an __arrow_c_array__ or __arrow_c_device_array__ method) can be passed as well.
  - type : pyarrow.DataType
    Explicit type to attempt to coerce to, otherwise it will be inferred from the data.
  - mask : array[bool], optional
    Indicate which values are null (True) or not null (False).
  - size : int64, optional
    Number of elements to convert. If the input is larger than size, conversion stops at this length. For iterators, if size is larger than the input iterator this will be treated as a "max size", but will involve an initial allocation of size followed by a resize to the actual size (so if you know the exact size, specifying it correctly will give you better performance).
  - from_pandas : bool, default None
    Use pandas's semantics for inferring nulls from values in ndarray-like data. If passed, the mask takes precedence, but if a value is unmasked (not null) yet still null according to pandas semantics, then it is null. Defaults to False if not passed explicitly by the user, or True if a pandas object is passed in. See the sketch after this parameter list.
  - safe : bool, default True
    Check for overflows or other unsafe conversions.
  - memory_pool : pyarrow.MemoryPool, optional
    If not passed, memory will be allocated from the currently-set default memory pool.
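A minimal sketch of how from_pandas, mask, and size interact, assuming pandas and NumPy are installed (variable names are illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> # With pandas semantics (from_pandas=True, the default when a pandas
>>> # object is passed), NaN/None values are inferred as nulls.
>>> with_nulls = pa.array(pd.Series([1.0, None, 2.0]), from_pandas=True)
>>> # with_nulls should hold [1, null, 2]
>>> # An explicit mask takes precedence: True marks a position as null.
>>> masked = pa.array([1, 2, 3], mask=np.array([False, True, False]))
>>> # masked should hold [1, null, 3]
>>> # With both type and size given, a single-use iterator can be consumed;
>>> # conversion stops after size elements.
>>> first_five = pa.array(iter(range(100)), type=pa.int64(), size=5)
>>> # first_five should hold [0, 1, 2, 3, 4]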
- Returns:
  - array : pyarrow.Array or pyarrow.ChunkedArray
    A ChunkedArray instead of an Array is returned if:
    - the object data overflowed binary storage.
    - the object's __arrow_array__ protocol method returned a chunked array (a sketch of this case follows below).
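A minimal sketch of the __arrow_array__ case; ChunkedWrapper is a hypothetical class used only for illustration, not part of pyarrow:

>>> import pyarrow as pa
>>> class ChunkedWrapper:
...     """Hypothetical wrapper implementing the __arrow_array__ protocol."""
...     def __init__(self, chunks):
...         self.chunks = chunks
...     def __arrow_array__(self, type=None):
...         # Returning a ChunkedArray here makes pa.array() return a
...         # ChunkedArray rather than a plain Array.
...         return pa.chunked_array(self.chunks, type=type)
>>> result = pa.array(ChunkedWrapper([[1, 2], [3, 4]]))
>>> isinstance(result, pa.ChunkedArray)
True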
Notes
The timezone is preserved in the returned array for timezone-aware data, while naive timestamps produce a timestamp type without a timezone. Internally, timezone-aware data is stored as UTC values, with the timezone recorded in the data type.
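For example (a sketch assuming pandas is installed; the timezone name is arbitrary):

>>> import pandas as pd
>>> import pyarrow as pa
>>> # Timezone-aware input keeps its zone in the Arrow type; the values
>>> # themselves are stored as UTC.
>>> aware = pa.array(pd.Series(pd.date_range("2024-01-01", periods=2, tz="Europe/Paris")))
>>> # aware.type is expected to be timestamp[ns, tz=Europe/Paris]
>>> # Naive timestamps produce a timestamp type with no timezone.
>>> naive = pa.array(pd.Series(pd.date_range("2024-01-01", periods=2)))
>>> # naive.type is expected to be timestamp[ns]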
Pandas's DateOffset and dateutil.relativedelta.relativedelta objects are converted to a MonthDayNanoIntervalArray by default. relativedelta leapdays are ignored, as are all absolute fields on both objects. datetime.timedelta can also be converted to a MonthDayNanoIntervalArray, but this requires passing MonthDayNanoIntervalType explicitly.
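A sketch of both conversions (assuming pandas is installed; pa.month_day_nano_interval() constructs the explicit interval type):

>>> import datetime
>>> import pandas as pd
>>> import pyarrow as pa
>>> # DateOffset-like objects become a MonthDayNanoIntervalArray by default.
>>> offsets = pa.array([pd.DateOffset(months=1, days=2)])
>>> # datetime.timedelta is normally converted to a duration type; pass the
>>> # interval type explicitly to get a MonthDayNanoIntervalArray instead.
>>> intervals = pa.array([datetime.timedelta(days=3)],
...                      type=pa.month_day_nano_interval())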
Converting to dictionary array will promote to a wider integer type for indices if the number of distinct values cannot be represented, even if the index type was explicitly set. This means that if there are more than 127 values the returned dictionary array’s index type will be at least pa.int16() even if pa.int8() was passed to the function. Note that an explicit index type will not be demoted even if it is wider than required.
Examples
>>> import pandas as pd
>>> import pyarrow as pa
>>> pa.array(pd.Series([1, 2]))
<pyarrow.lib.Int64Array object at ...>
[
  1,
  2
]

>>> pa.array(["a", "b", "a"], type=pa.dictionary(pa.int8(), pa.string()))
<pyarrow.lib.DictionaryArray object at ...>
...
-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1,
    0
  ]

>>> import numpy as np
>>> pa.array(pd.Series([1, 2]), mask=np.array([0, 1], dtype=bool))
<pyarrow.lib.Int64Array object at ...>
[
  1,
  null
]

>>> arr = pa.array(range(1024), type=pa.dictionary(pa.int8(), pa.int64()))
>>> arr.type.index_type
DataType(int16)