pyarrow.dataset.HivePartitioning

class pyarrow.dataset.HivePartitioning

Bases: pyarrow._dataset.Partitioning

A Partitioning for “/$key=$value/” nested directories as found in Apache Hive.

Multi-level, directory based partitioning scheme originating from Apache Hive with all data files stored in the leaf directories. Data is partitioned by static values of a particular column in the schema. Partition keys are represented in the form $key=$value in directory names. Field order is ignored, as are missing or unrecognized field names.

For example, given schema<year:int16, month:int8, day:int8>, a possible path would be “/year=2009/month=11/day=15”.
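The path convention described above can be illustrated with a small plain-Python sketch. This is not pyarrow's implementation: the helper name `parse_hive_path` and the use of Python callables in place of Arrow types are assumptions made for illustration only. It shows that segment order does not matter and that missing or unrecognized field names are skipped.

```python
def parse_hive_path(path, field_types):
    """Extract partition values from a Hive-style path (illustrative only).

    field_types maps a field name to a callable that converts the raw
    string value, standing in for the Arrow types of a real schema.
    """
    values = {}
    for segment in path.strip("/").split("/"):
        key, sep, raw = segment.partition("=")
        # Skip segments that are not key=value or that name an unknown field.
        if not sep or key not in field_types:
            continue
        values[key] = field_types[key](raw)
    return values


# Hypothetical stand-in for schema<year:int16, month:int8, day:int8>.
schema = {"year": int, "month": int, "day": int}

full = parse_hive_path("/year=2009/month=11/day=15", schema)
# Reordered keys, an unrecognized field (region), and a missing field (day).
partial = parse_hive_path("/month=11/year=2009/region=eu", schema)

print(full)
print(partial)
```

Fields absent from the path are simply left out of the result, which mirrors the statement that missing field names are ignored.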

Parameters: schema (Schema) – The schema that describes the partitions present in the file path.
Returns: HivePartitioning

Examples

>>> import pyarrow as pa
>>> from pyarrow.dataset import HivePartitioning
>>> partitioning = HivePartitioning(
...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
>>> print(partitioning.parse("/year=2009/month=11"))
((year == 2009:int16) and (month == 11:int8))

__init__(*args, **kwargs): Initialize self. See help(type(self)) for accurate signature.

Methods

`__init__(*args, **kwargs)`	Initialize self.
`discover`	Discover a HivePartitioning.
`parse`	Parse a path into a partition expression.

Attributes

`schema`	The arrow Schema attached to the partitioning.

static discover()

Discover a HivePartitioning.

Parameters: max_partition_dictionary_size (int or None, default 0) – The maximum number of unique values to consider for dictionary encoding. By default, no field will be inferred as dictionary encoded. If -1 is provided, dictionary encoding will be used for every string field.

Returns: PartitioningFactory – To be used in the FileSystemFactoryOptions.

parse()

Parse a path into a partition expression, as shown in the Examples section above.

schema: The arrow Schema attached to the partitioning.