pyarrow.dataset.HivePartitioning

class pyarrow.dataset.HivePartitioning

Bases: pyarrow._dataset.Partitioning

A Partitioning for “/$key=$value/” nested directories as found in Apache Hive.

A multi-level, directory-based partitioning scheme originating from Apache Hive, with all data files stored in the leaf directories. Data is partitioned by static values of a particular column in the schema. Partition keys are represented in the form $key=$value in directory names. Field order is ignored, as are missing or unrecognized field names.

For example, given schema<year:int16, month:int8, day:int8>, a possible path would be “/year=2009/month=11/day=15”.
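The path convention described above can be illustrated with a small pure-Python sketch. This is not how pyarrow implements parsing; the helper name parse_hive_path and the field_types mapping are hypothetical and exist only to show that segment order does not matter and that unrecognized or missing keys are skipped.

```python
def parse_hive_path(path, field_types):
    """Illustrative only: map '/key=value/' segments to typed values."""
    values = {}
    for segment in path.strip("/").split("/"):
        key, sep, raw = segment.partition("=")
        if not sep or key not in field_types:
            # Skip segments that are not key=value pairs for a known field.
            continue
        values[key] = field_types[key](raw)
    return values


# Hypothetical stand-in for schema<year:int16, month:int8, day:int8>.
fields = {"year": int, "month": int, "day": int}

full = parse_hive_path("/year=2009/month=11/day=15", fields)
# Different order, an unknown key, and a missing "month" field.
partial = parse_hive_path("/day=15/unknown=1/year=2009", fields)

print(full)
print(partial)
```

Under these assumptions, the second call yields only the year and day values, since month is absent from the path and unknown is not part of the field mapping.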

Parameters

schema (Schema) – The schema that describes the partitions present in the file path.

Returns

HivePartitioning

Examples

>>> import pyarrow as pa
>>> from pyarrow.dataset import HivePartitioning
>>> partitioning = HivePartitioning(
...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
>>> print(partitioning.parse("/year=2009/month=11"))
((year == 2009:int16) and (month == 11:int8))

__init__(*args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(*args, **kwargs) – Initialize self.

discover – Discover a HivePartitioning.

parse

Attributes

schema – The arrow Schema attached to the partitioning.

static discover()

Discover a HivePartitioning.

Parameters

max_partition_dictionary_size (int or None, default 0) – The maximum number of unique values to consider for dictionary encoding. By default no field will be inferred as dictionary encoded. If -1 is provided dictionary encoding will be used for every string field.

Returns

PartitioningFactory – To be used in the FileSystemFactoryOptions.

parse()

schema

The arrow Schema attached to the partitioning.