Choosing a columnar format

Denis Gabaydulin
6 min read · Mar 29, 2017


Today we have two popular open-source columnar storage formats: ORC and Parquet. Both are inspired by the Dremel paper. The goal of this article is to shed light on how to choose an optimal columnar format in a concrete case by applying knowledge of the formats' internals.

I use Spark 2.0.0. Results may vary in older releases!

Let’s look at a quite popular case: a wide schema which represents an event stream.

root 
|-- accountType: string (nullable = true)
|-- actionSource: string (nullable = true)
|-- ageFrom: long (nullable = true)
|-- ageTo: long (nullable = true)
|-- amount: long (nullable = true)
|-- avatarCandidateId: long (nullable = true)
|-- avatarId: long (nullable = true)
|-- balance: long (nullable = true)
|-- birthDate: string (nullable = true)
|-- blockingReasonType: string (nullable = true)
|-- bot: boolean (nullable = true)
|-- city: struct (nullable = true)
| |-- country: struct (nullable = true)
| | |-- id: long (nullable = true)
| | |-- name: string (nullable = true)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
|-- count: long (nullable = true)
|-- created: string (nullable = true)
|-- deleted: string (nullable = true)
|-- eventType: string (nullable = true)
|-- finished: string (nullable = true)
|-- followingId: long (nullable = true)
|-- geo: struct (nullable = true)
| |-- currentCity: struct (nullable = true)
| | |-- country: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- name: string (nullable = true)
| |-- currentLocation: struct (nullable = true)
| | |-- lat: double (nullable = true)
| | |-- lon: double (nullable = true)
| |-- residenceCity: struct (nullable = true)
| | |-- country: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- name: string (nullable = true)
|-- giftCategory: long (nullable = true)
|-- giftId: long (nullable = true)
|-- gold: long (nullable = true)
|-- hasAvatar: boolean (nullable = true)
|-- hasPurchasesBefore: boolean (nullable = true)
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- packId: long (nullable = true)
|-- paymentSystem: string (nullable = true)
|-- postId: long (nullable = true)
|-- previousBalance: long (nullable = true)
|-- price: struct (nullable = true)
| |-- amount: double (nullable = true)
| |-- currency: string (nullable = true)
|-- rebill: boolean (nullable = true)
|-- recipientId: long (nullable = true)
|-- recommendedFeedId: string (nullable = true)
|-- selfGift: boolean (nullable = true)
|-- sex: string (nullable = true)
|-- snapshotId: long (nullable = true)
|-- started: string (nullable = true)
|-- state: string (nullable = true)
|-- subscriptionType: long (nullable = true)
|-- surname: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- terms: boolean (nullable = true)
|-- tid: string (nullable = true)
|-- time: string (nullable = true)
|-- toAddress: string (nullable = true)
|-- tokenId: long (nullable = true)
|-- transportType: string (nullable = true)
|-- type: string (nullable = true)
|-- updated: string (nullable = true)
|-- userId: long (nullable = true)
|-- userRequestContext: struct (nullable = true)
| |-- advertisement: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)
| |-- connectionType: string (nullable = true)
| |-- ip: string (nullable = true)
| |-- loginInfo: struct (nullable = true)
| | |-- loginType: string (nullable = true)
| | |-- sessionId: string (nullable = true)
| |-- reqId: string (nullable = true)
| |-- uiContext: string (nullable = true)
| |-- userAgent: string (nullable = true)
|-- wishId: long (nullable = true)
|-- zoneType: string (nullable = true)

Dataset #1 has 7822979 rows. Let’s try to save it in ORC/Parquet format w/o compression.
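
The baseline writes are one-liners per format; here is a rough sketch of how they might look (the df variable name and the /data/test1 paths are illustrative, not the exact ones from my runs):

// uncompressed baseline; "none" disables the codec for both writers
df.write.option("compression", "none").parquet("/data/test1/parquet.no.comp")
df.write.option("compression", "none").orc("/data/test1/orc.no.comp")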

Dataset #1. No compression

As you can see, there’s no surprise. Both formats are pretty good, and ORC is a little better. How much can you win by using a good columnar format?
Well, the original size of the dataset is 3.5 GB; converted to ORC it shrinks to 692 MB.

Now let’s see how a compression codec affects the size. Here are the results for two compression codecs: gzip/zlib and snappy.
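
Switching codecs is just a matter of the compression option; a sketch of the writes (again, the /data/test2 paths are illustrative):

// gzip for Parquet, its zlib counterpart for ORC; snappy is spelled the same for both
df.write.option("compression", "gzip").parquet("/data/test2/parquet.gzip")
df.write.option("compression", "snappy").parquet("/data/test2/parquet.snappy")
df.write.option("compression", "zlib").orc("/data/test2/orc.zlib")
df.write.option("compression", "snappy").orc("/data/test2/orc.snappy")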

Dataset #1. Compression codecs

Wow! A convincing win for the gzip codec: it beats snappy by up to 35%. Note that I did a similar test on the older Spark 1.6.2 release, and there was no such significant difference between gzip and snappy.

OK, now that we have a baseline, let’s try something more interesting. What if I change the number of partitions in the output?

// coalesce(4) merges the existing partitions into 4 without a shuffle
df.coalesce(4)
  .write
  .option("compression", "snappy")
  .parquet("/data/test3/parquet.snappy")

df.coalesce(4)
  .write
  .option("compression", "gzip")
  .parquet("/data/test3/parquet.gzip")

The sizes change, but not dramatically (larger by up to 4%). Now let’s try to repartition our df.

The repartition operation adds an additional shuffle step before writing the result.

// repartition(4) redistributes the rows across 4 partitions via a full shuffle,
// destroying the original (time-ordered) row order
df.repartition(4)
  .write
  .option("compression", "snappy")
  .parquet("/data/test4/parquet.snappy")

df.repartition(4)
  .write
  .option("compression", "gzip")
  .parquet("/data/test4/parquet.gzip")

The test reveals some interesting results.

Dataset #1. Shuffling

The shuffled version is larger by up to 20%! The explanation of this fact is pretty simple: a columnar format has many optimizations, such as RLE, which work best on sorted data*.

*The original order was specified by the timeline.

But what about small datasets? Let’s see what happens with dataset #2, which is a subset of dataset #1.

root 
|-- userid: long (nullable = true)
|-- eventtype: string (nullable = true)
|-- created: string (nullable = true)
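
Dataset #2 is just a projection of dataset #1; a possible way to build it (the lower-cased names match the schema above; the exact renames are my assumption):

import org.apache.spark.sql.functions.col

// hypothetical projection of dataset #1 down to three columns
val df2 = df.select(
  col("userId").as("userid"),
  col("eventType").as("eventtype"),
  col("created"))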

It has the same number of rows (7822979). So, let’s run our benchmarks.

Dataset #2

The results are interesting. The worst performer is ORC w/o any compression. Parquet is a little better than ORC, and the difference between snappy and gzip shrinks to 20%. The tricky version* of ORC w/o compression shows why sorting is important.

*In this version the dataset was sorted by the user_id column.

df.orderBy("userId")
.coalesce(32)
.write
.option("compression", "none")
.orc("/data/test8/orc.no.comp")

If you compare the first two partitions, you will see the result of the optimization. Both partitions have the same structure and more than 200k rows.

Optimization

As you can see, the user_id column has a much smaller range in the sorted version. It means that user_id values are no longer spread over all partitions: every value is located in a single partition only.
For example, the second partition covers the range
621367090409900754–664836511522359025.
In the regular version, values are distributed evenly, and every partition contains most of the whole user_id range:
-9223372036854775808 — 680134850249755185.
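
A quick way to see those per-partition ranges from Spark itself, without digging into the files (a sketch; spark_partition_id tags each row with its partition):

import org.apache.spark.sql.functions.{spark_partition_id, min, max}

// min/max of userId inside each partition of the DataFrame about to be written
df.withColumn("pid", spark_partition_id())
  .groupBy("pid")
  .agg(min("userId"), max("userId"))
  .orderBy("pid")
  .show()
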
The optimization can be illustrated with the following example. Suppose you have an event with a single column, user_id. A sample event stream looks like:

1, 2, 3, 4, 3, 2, 2, 1, 4, 1, 2, 3

A random partitioner splits the data into three partitions:

{1, 2, 3, 4} 
{3, 2, 2, 1}
{4, 1, 2, 3}

A range partitioner splits the same data into three partitions:

{1, 1, 1} 
{2, 2, 2, 2}
{3, 3, 3, 4, 4}

After applying RLE encoding to each partition, the second case gives a significant reduction in space.
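
A toy run-length encoder makes the difference concrete (this is only a sketch of the idea, not how ORC or Parquet implement RLE internally):

// toy RLE: collapse consecutive equal values into (value, runLength) pairs
def rle(xs: Seq[Int]): Seq[(Int, Int)] =
  xs.foldLeft(List.empty[(Int, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
    case (acc, x)                      => (x, 1) :: acc
  }.reverse

rle(Seq(1, 2, 3, 4))    // 4 runs -> no gain on a random partition
rle(Seq(2, 2, 2, 2))    // 1 run  -> big gain on a range partition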

Conclusions are boring. There’s no silver bullet.

  • ORC/Parquet have a similar compression rate w/o any compression codec
  • zlib/gzip is better than snappy (10–35%) in terms of space
  • sorting the data may give a significant reduction in used space
