Reading and writing Parquet files

The Parquet C++ library is part of the Apache Arrow project and benefits from tight integration with Arrow C++.

The arrow::FileReader class reads data for an entire file or row group into an ::arrow::Table.

The arrow::WriteTable() function writes an entire ::arrow::Table to an output file.

The StreamReader and StreamWriter classes allow for data to be written using a C++ input/output streams approach to read/write fields column by column and row by row. This approach is offered for ease of use and type-safety. It is of course also useful when data must be streamed as files are read and written incrementally.

Please note that the performance of the StreamReader and StreamWriter classes will not be as good due to the type checking and the fact that column values are processed one at a time.

FileReader

The Parquet arrow::FileReader requires a ::arrow::io::RandomAccessFile instance representing the input file.

#include "arrow/parquet/arrow/reader.h"

{
   // ...
   arrow::Status st;
   arrow::MemoryPool* pool = default_memory_pool();
   std::shared_ptr<arrow::io::RandomAccessFile> input = ...;

   // Open Parquet file reader
   std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
   st = parquet::arrow::OpenFile(input, pool, &arrow_reader);
   if (!st.ok()) {
      // Handle error instantiating file reader...
   }

   // Read entire file as a single Arrow table
   std::shared_ptr<arrow::Table> table;
   st = arrow_reader->ReadTable(&table);
   if (!st.ok()) {
      // Handle error reading Parquet data...
   }
}

Finer-grained options are available through the arrow::FileReaderBuilder helper class.

WriteTable

The arrow::WriteTable() function writes an entire ::arrow::Table to an output file.

#include "parquet/arrow/writer.h"

{
   std::shared_ptr<arrow::io::FileOutputStream> outfile;
   PARQUET_ASSIGN_OR_THROW(
      outfile,
      arrow::io::FileOutputStream::Open("test.parquet"));

   PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));
}

StreamReader

The StreamReader allows for Parquet files to be read using standard C++ input operators which ensures type-safety.

Please note that types must match the schema exactly i.e. if the schema field is an unsigned 16-bit integer then you must supply a uint16_t type.

Exceptions are used to signal errors. A ParquetException is thrown in the following circumstances:

  • Attempt to read field by supplying the incorrect type.

  • Attempt to read beyond end of row.

  • Attempt to read beyond end of file.

#include "arrow/io/file.h"
#include "parquet/stream_reader.h"

{
   std::shared_ptr<arrow::io::ReadableFile> infile;

   PARQUET_ASSIGN_OR_THROW(
      infile,
      arrow::io::ReadableFile::Open("test.parquet"));

   parquet::StreamReader os{parquet::ParquetFileReader::Open(infile)};

   std::string article;
   float price;
   uint32_t quantity;

   while ( !os.eof() )
   {
      os >> article >> price >> quantity >> parquet::EndRow;
      // ...
   }
}

StreamWriter

The StreamWriter allows for Parquet files to be written using standard C++ output operators. This type-safe approach also ensures that rows are written without omitting fields and allows for new row groups to be created automatically (after certain volume of data) or explicitly by using the EndRowGroup stream modifier.

Exceptions are used to signal errors. A ParquetException is thrown in the following circumstances:

  • Attempt to write a field using an incorrect type.

  • Attempt to write too many fields in a row.

  • Attempt to skip a required field.

#include "arrow/io/file.h"
#include "parquet/stream_writer.h"

{
   std::shared_ptr<arrow::io::FileOutputStream> outfile;

   PARQUET_ASSIGN_OR_THROW(
      outfile,
      arrow::io::FileOutputStream::Open("test.parquet"));

   parquet::WriterProperties::Builder builder;
   std::shared_ptr<parquet::schema::GroupNode> schema;

   // Set up builder with required compression type etc.
   // Define schema.
   // ...

   parquet::StreamWriter os{
      parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

   // Loop over some data structure which provides the required
   // fields to be written and write each row.
   for (const auto& a : getArticles())
   {
      os << a.name() << a.price() << a.quantity() << parquet::EndRow;
   }
}