Data Types

Data types govern how physical data is interpreted. Their specification allows binary interoperability between different Arrow implementations, including those written in different programming languages and runtimes (for example, it is possible to access the same data, without copying, from both Python and Java using the pyarrow.jvm bridge module).

Information about a data type in C++ can be represented in three ways:

  1. Using an arrow::DataType instance (e.g. as a function argument)

  2. Using an arrow::DataType concrete subclass (e.g. as a template parameter)

  3. Using an arrow::Type::type enum value (e.g. as the condition of a switch statement)

The first form (using an arrow::DataType instance) is the most idiomatic and flexible. Runtime-parametric types can only be fully represented with a DataType instance. For example, an arrow::TimestampType needs to be constructed at runtime with an arrow::TimeUnit::type parameter; an arrow::Decimal128Type with precision and scale parameters; an arrow::ListType with a full child type (itself an arrow::DataType instance).

The two other forms can be used where performance is critical, in order to avoid paying the price of dynamic typing and polymorphism. However, some amount of runtime switching can still be required for parametric types. It is not possible to reify all possible types at compile time, since Arrow data types allow arbitrary nesting.

Creating data types

To instantiate data types, it is recommended to call the provided factory functions:

std::shared_ptr<arrow::DataType> type;

// A 16-bit integer type
type = arrow::int16();
// A 64-bit timestamp type (with microsecond granularity)
type = arrow::timestamp(arrow::TimeUnit::MICRO);
// A list type of single-precision floating-point values
type = arrow::list(arrow::float32());