pyarrow.dataset.partitioning

pyarrow.dataset.partitioning(schema=None, field_names=None, flavor=None)

Specify a partitioning scheme.

The supported schemes include:

“DirectoryPartitioning”: this scheme expects one segment in the file path for each field in the specified schema (all fields are required to be present). For example, given schema<year:int16, month:int8>, the path “/2009/11” would be parsed to (“year” == 2009 and “month” == 11).
“HivePartitioning”: a scheme for “/$key=$value/” nested directories as found in Apache Hive. This is a multi-level, directory-based partitioning scheme in which data is partitioned by static values of a particular column in the schema. Partition keys are represented in the form $key=$value in directory names. Field order is ignored, as are missing or unrecognized field names. For example, given schema<year:int16, month:int8, day:int8>, a possible path would be “/year=2009/month=11/day=15” (but the field order does not need to match).
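
To make the difference between the two layouts concrete, the following is a minimal pure-Python sketch of how path segments map to fields under each scheme. It is not PyArrow's implementation: the helper names parse_directory_path and parse_hive_path are hypothetical, and plain Python converters such as int stand in for Arrow types.

```python
# Illustrative only: mimics the two path layouts described above.
# The helper names are hypothetical and are not part of pyarrow.

def parse_directory_path(path, schema):
    """Map positional segments to fields.

    schema is a list of (name, converter) pairs in path order;
    every field must have a matching segment.
    """
    segments = [s for s in path.split("/") if s]
    if len(segments) != len(schema):
        raise ValueError("expected one path segment per schema field")
    return {name: convert(seg) for (name, convert), seg in zip(schema, segments)}


def parse_hive_path(path, schema):
    """Map key=value segments to fields.

    schema is a dict of name -> converter; segment order is ignored,
    and missing or unrecognized keys are skipped.
    """
    fields = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key in schema:
            fields[key] = schema[key](value)
    return fields


directory_fields = parse_directory_path("/2009/11", [("year", int), ("month", int)])
hive_fields = parse_hive_path(
    "/month=11/year=2009/region=eu",
    {"year": int, "month": int, "day": int},
)
print(directory_fields)
print(hive_fields)
```

In the Hive-style call, the out-of-order keys are accepted, the unrecognized “region” key is dropped, and the absent “day” field is simply left out, whereas the directory-style helper raises an error when the number of segments does not match the schema.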

Parameters

schema (pyarrow.Schema, default None) – The schema that describes the partitions present in the file path. If not specified, and field_names and/or flavor are specified, the schema will be inferred from the file path (and a PartitioningFactory is returned).
field_names (list of str, default None) – A list of strings (field names). If specified, the schema’s types are inferred from the file paths (only valid for DirectoryPartitioning).
flavor (str, default None) – The default is DirectoryPartitioning. Specify flavor="hive" for a HivePartitioning.

Returns

Partitioning or PartitioningFactory

Examples

Specify the Schema for paths like “/2009/June”:

>>> import pyarrow as pa
>>> from pyarrow.dataset import partitioning
>>> partitioning(pa.schema([("year", pa.int16()), ("month", pa.string())]))

or let the types be inferred by only specifying the field names:

>>> partitioning(field_names=["year", "month"])

For paths like “/2009/June”, the year will be inferred as int32 while month will be inferred as string.
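
The inference rule stated above can be approximated in a few lines of plain Python. This is a deliberately simplified sketch, not PyArrow's actual inference logic: the function infer_partition_type is hypothetical, and it only distinguishes "every observed value looks like an integer" from "anything else".

```python
# Simplified illustration of type inference from path segments.
# infer_partition_type is a hypothetical helper, not a pyarrow function.

def infer_partition_type(values):
    """Return "int32" if every sample is a plain decimal integer, else "string"."""
    for value in values:
        digits = value[1:] if value[:1] in "+-" else value
        # Accept only ASCII digits so values such as "1_000" stay strings.
        if not (digits.isascii() and digits.isdigit()):
            return "string"
    return "int32"


year_type = infer_partition_type(["2009", "2010"])
month_type = infer_partition_type(["June", "July"])
print(year_type, month_type)
```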

Create a Hive scheme for a path like “/year=2009/month=11”:

>>> partitioning(
...     pa.schema([("year", pa.int16()), ("month", pa.int8())]),
...     flavor="hive")

A Hive scheme can also be discovered from the directory structure (and types will be inferred):

>>> partitioning(flavor="hive")