pyarrow.ChunkedArray

class pyarrow.ChunkedArray

Bases: pyarrow.lib._PandasConvertible

An array-like composed from a (possibly empty) collection of pyarrow.Arrays

Warning

Do not call this class’s constructor directly.

__init__(*args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(*args, **kwargs)

Initialize self.

cast(self, target_type[, safe])

Cast array values to another data type

chunk(self, i)

Select a chunk by its index

dictionary_encode(self)

Compute dictionary-encoded representation of array

equals(self, ChunkedArray other)

Return whether the contents of two chunked arrays are equal.

fill_null(self, fill_value)

See pyarrow.compute.fill_null docstring for usage.

filter(self, mask[, null_selection_behavior])

Select values from a chunked array.

flatten(self, MemoryPool memory_pool=None)

Flatten this ChunkedArray.

format(self, **kwargs)

is_null(self)

Return BooleanArray indicating the null values.

is_valid(self)

Return BooleanArray indicating the non-null values.

iterchunks(self)

length(self)

slice(self[, offset, length])

Compute zero-copy slice of this ChunkedArray

take(self, indices)

Select values from a chunked array.

to_numpy(self)

Return a NumPy copy of this array (experimental).

to_pandas(self[, memory_pool, categories, …])

Convert to a pandas-compatible NumPy array or DataFrame, as appropriate

to_pylist(self)

Convert to a list of native Python objects.

to_string(self, int indent=0, int window=10)

Render a “pretty-printed” string representation of the ChunkedArray

unique(self)

Compute distinct elements in array

validate(self, *[, full])

Perform validation checks.

value_counts(self)

Compute counts of unique elements in array.

Attributes

chunks

data

nbytes

Total number of bytes consumed by the elements of the chunked array.

null_count

Number of null entries

num_chunks

Number of underlying chunks

type

cast(self, target_type, safe=True)

Cast array values to another data type

See pyarrow.compute.cast for usage

chunk(self, i)

Select a chunk by its index

Parameters

i (int) – Index of the chunk to select.

Returns

pyarrow.Array

chunks
data
dictionary_encode(self)

Compute dictionary-encoded representation of array

Returns

pyarrow.ChunkedArray – Same chunking as the input, all chunks share a common dictionary.

equals(self, ChunkedArray other)

Return whether the contents of two chunked arrays are equal.

Parameters

other (pyarrow.ChunkedArray) – Chunked array to compare against.

Returns

are_equal (bool)

fill_null(self, fill_value)

See pyarrow.compute.fill_null docstring for usage.

filter(self, mask, null_selection_behavior='drop')

Select values from a chunked array. See pyarrow.compute.filter for full usage.

flatten(self, MemoryPool memory_pool=None)

Flatten this ChunkedArray. If it has a struct type, the column is flattened into one array per struct field.

Parameters

memory_pool (MemoryPool, default None) – Memory pool to use for allocations, if any are required. Uses the default pool if not passed.

Returns

result (List[ChunkedArray])

format(self, **kwargs)
is_null(self)

Return BooleanArray indicating the null values.

is_valid(self)

Return BooleanArray indicating the non-null values.

iterchunks(self)
length(self)
nbytes

Total number of bytes consumed by the elements of the chunked array.

null_count

Number of null entries

Returns

int

num_chunks

Number of underlying chunks

Returns

int

slice(self, offset=0, length=None)

Compute zero-copy slice of this ChunkedArray

Parameters
  • offset (int, default 0) – Offset from start of array to slice

  • length (int, default None) – Length of slice (default is until end of array starting from offset)

Returns

sliced (ChunkedArray)

take(self, indices)

Select values from a chunked array. See pyarrow.compute.take for full usage.

to_numpy(self)

Return a NumPy copy of this array (experimental).

Returns

array (numpy.ndarray)

to_pandas(self, memory_pool=None, categories=None, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=True, bool timestamp_as_object=False, bool use_threads=True, bool deduplicate_objects=True, bool ignore_metadata=False, bool safe=True, bool split_blocks=False, bool self_destruct=False, types_mapper=None)

Convert to a pandas-compatible NumPy array or DataFrame, as appropriate

Parameters
  • memory_pool (MemoryPool, default None) – Arrow MemoryPool to use for allocations. Uses the default memory pool if not passed.

  • strings_to_categorical (bool, default False) – Encode string (UTF8) and binary types to pandas.Categorical.

  • categories (list, default empty) – List of fields that should be returned as pandas.Categorical. Only applies to table-like data structures.

  • zero_copy_only (bool, default False) – Raise an ArrowException if this function call would require copying the underlying data.

  • integer_object_nulls (bool, default False) – Cast integers with nulls to objects

  • date_as_object (bool, default True) – Cast dates to objects. If False, convert to datetime64[ns] dtype.

  • timestamp_as_object (bool, default False) – Cast non-nanosecond timestamps (np.datetime64) to objects. This is useful if you have timestamps that don’t fit in the normal date range of nanosecond timestamps (1678 CE-2262 CE). If False, all timestamps are converted to datetime64[ns] dtype.

  • use_threads (bool, default True) – Whether to parallelize the conversion using multiple threads.

  • deduplicate_objects (bool, default True) – Do not create multiple copies of Python objects when created, to save on memory use. Conversion will be slower.

  • ignore_metadata (bool, default False) – If True, do not use the ‘pandas’ metadata to reconstruct the DataFrame index, if present

  • safe (bool, default True) – For certain data types, a cast is needed in order to store the data in a pandas DataFrame or Series (e.g. timestamps are always stored as nanoseconds in pandas). This option controls whether it is a safe cast or not.

  • split_blocks (bool, default False) – If True, generate one internal “block” for each column when creating a pandas.DataFrame from a RecordBatch or Table. While this can temporarily reduce memory, note that various pandas operations can trigger “consolidation” which may balloon memory use.

  • self_destruct (bool, default False) – EXPERIMENTAL: If True, attempt to deallocate the originating Arrow memory while converting the Arrow object to pandas. If you use the object after calling to_pandas with this option, it will crash your program.

  • types_mapper (function, default None) – A function mapping a pyarrow DataType to a pandas ExtensionDtype. This can be used to override the default pandas type for conversion of built-in pyarrow types or in absence of pandas_metadata in the Table schema. The function receives a pyarrow DataType and is expected to return a pandas ExtensionDtype or None if the default conversion should be used for that type. If you have a dictionary mapping, you can pass dict.get as function.

Returns

pandas.Series or pandas.DataFrame depending on type of object

to_pylist(self)

Convert to a list of native Python objects.

to_string(self, int indent=0, int window=10)

Render a “pretty-printed” string representation of the ChunkedArray

type
unique(self)

Compute distinct elements in array

Returns

pyarrow.Array

validate(self, *, full=False)

Perform validation checks. An exception is raised if validation fails.

By default only cheap validation checks are run. Pass full=True for thorough validation checks (potentially O(n)).

Parameters

full (bool, default False) – If True, run expensive checks, otherwise cheap checks only.

Raises

ArrowInvalid

value_counts(self)

Compute counts of unique elements in array.

Returns

An array of <input type “Values”, int64_t “Counts”> structs