From Apache Flume to Apache Impala using HDFS

March 31st, 2016 | cloudera, flume, hadoop, hbase, hdfs, impala |

Let’s use HDFS as a database!

So, we have data coming from one of our services, and that service is the source of a Flume agent. Now, we want to be able to query that data in a scalable fashion without using HBase or any other database, to stay lean.

One way is to use HDFS as a database (Flume has an HDFS sink that handles partitioning), create a Hive table on top to query its content, and, because we want something performant and fast, actually query the data with Impala using the Apache Parquet format.

Here’s a little diagram of the stack:

[Diagram: flume to impala]

  • Apache Oozie is used to regularly export the HDFS content to Parquet.
  • We store our data in HDFS in Apache Avro format (Snappy-compressed) because of all its advantages and because we already use it everywhere.

Let’s review the stack one piece at a time, starting with the Flume configuration.

Flume

Let’s say we have an Avro source where our custom service is sending events:

agent1.sources.events_source.type = avro
agent1.sources.events_source.bind = 0.0.0.0
agent1.sources.events_source.port = 9876

Let’s first configure the Flume HDFS sink where we are going to export our events. The configuration is pretty long, but every piece has its importance:

agent1.sinks = ... s

agent1.sinks.s.type = hdfs
agent1.sinks.s.hdfs.path = /user/flume/events/ymd=%Y-%m-%d/h=%H
agent1.sinks.s.hdfs.inUsePrefix = .
agent1.sinks.s.hdfs.fileType = DataStream
agent1.sinks.s.hdfs.filePrefix = events
agent1.sinks.s.hdfs.fileSuffix = .avro
agent1.sinks.s.hdfs.rollInterval = 300
agent1.sinks.s.hdfs.rollSize = 0
agent1.sinks.s.hdfs.rollCount = 0
agent1.sinks.s.serializer = com.company.CustomAvroSerializer$Builder
agent1.sinks.s.channel = events_channel

Let’s review this config.

Partition

Because of our volume of data, we want to partition it by year-month-day, then by hour. The “ymd=” and “h=” in the path are important: they are the “column” names of the time dimensions that will be queryable later.

Note that your Flume events must have a “timestamp” header for Flume to know which time dimension to use. If you don’t have this info, you can simply set hdfs.useLocalTimeStamp = true to use the ingestion time, but it’s discouraged: it means you don’t have any timestamp column in your data, and you’re going to get stuck later when doing the Impala partitioning.
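
If your service doesn’t set that header itself, Flume’s built-in timestamp interceptor can add it at the source. A minimal sketch, reusing the events_source from above (the interceptor name “ts” is arbitrary):

agent1.sources.events_source.interceptors = ts
agent1.sources.events_source.interceptors.ts.type = timestamp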

Roll interval

We decide to roll a new file every 5 minutes, and not based on size or count (those have to be explicitly set to 0 because they have non-zero default values).

By default, Flume buffers into a .tmp file we can’t rely on, and because we want to access fresh data quickly, 5 minutes is a good start. This is going to generate a bunch of files (288 per day), but we don’t care: we are going to export them later into an hourly Parquet format and clean up the old HDFS content.

Moreover, if Flume suddenly dies, you’re going to lose at most 5 minutes of data instead of the whole buffer. Fortunately, stopping Flume properly flushes the buffer.

File name

Setting inUsePrefix to “.” hides the in-progress files from Hive during a query (files starting with a dot are treated as hidden and ignored). If you don’t, some MapReduce jobs can fail: at first, Hive sees a file Flume is buffering into (a .tmp), then by the time the MR runs, it’s not there anymore (it got rolled), and kaboom, the MR fails:

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://hadoop01:8020/user/flume/events/ymd=2016-03-17/h=15/events.1458223202146.avro.tmp

File type

By default, the file type is SequenceFile. We don’t want that: it makes Flume wrap the output stream in a SequenceFile that Hive won’t be able to read, because the Avro schema won’t be inside. Setting it to DataStream lets the data through unaltered.

FYI, a typical SequenceFile body:

SEQ♠!org.apache.hadoop.io.LongWritable”org.apache.hadoop.io.BytesWritable ▒c(g▒s▒▒►▒▒▒|TF..

A Snappy-compressed Avro file body:

Obj♦avro.schema▒8{"type":"record","name":"Order",...}avro.codecsnappy ▒▒T▒▒▒▒♣

Avro serializer

Our custom serializer does some conversion of the original event and simply emits Avro using a DataFileWriter with the Snappy codec:

// Emit Avro with the schema embedded in the file, Snappy-compressed
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(writer)
                    .setCodec(CodecFactory.snappyCodec())
                    .create(schema, out);
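
To give an idea of where that snippet lives, here is a rough, simplified skeleton of a Flume EventSerializer (the conversion is reduced to the single-field Order schema shown later in this post; our real CustomAvroSerializer does more than this, so treat it as a sketch):

package com.company;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

public class CustomAvroSerializer implements EventSerializer {

  // Simplified schema; the real one has more fields (see the Hive section)
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Order\",\"namespace\":\"com.company\","
    + "\"fields\":[{\"name\":\"type\",\"type\":\"string\"}]}");

  private final OutputStream out;
  private DataFileWriter<GenericRecord> dataFileWriter;

  private CustomAvroSerializer(Context context, OutputStream out) {
    this.out = out;
  }

  @Override
  public void afterCreate() throws IOException {
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(SCHEMA);
    dataFileWriter = new DataFileWriter<>(writer)
        .setCodec(CodecFactory.snappyCodec())
        .create(SCHEMA, out);
  }

  @Override
  public void write(Event event) throws IOException {
    // Placeholder conversion: the real serializer maps the incoming event to an Order record
    GenericRecord record = new GenericData.Record(SCHEMA);
    record.put("type", new String(event.getBody(), StandardCharsets.UTF_8));
    dataFileWriter.append(record);
  }

  @Override
  public void flush() throws IOException {
    dataFileWriter.flush();
  }

  @Override
  public void beforeClose() throws IOException {
    // nothing to do: the HDFS sink closes the underlying stream itself
  }

  @Override
  public boolean supportsReopen() {
    return false;
  }

  @Override
  public void afterReopen() throws IOException {
    // reopen not supported (supportsReopen() returns false)
  }

  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      return new CustomAvroSerializer(context, out);
    }
  }
}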

Multiple Flume agents?

Be careful if multiple Flume agents are writing to the same location: the buffer is not shareable and you could get name conflicts. By default, Flume names the files with a timestamp (in milliseconds). That’s fine most of the time, but you never know if they’re going to collide one day. Consider having two different configurations with a different prefix or suffix for the filename, as in the sketch below.
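
For instance (a minimal sketch; the host suffixes are arbitrary):

# agent running on host A
agent1.sinks.s.hdfs.filePrefix = events-hostA

# agent running on host B
agent1.sinks.s.hdfs.filePrefix = events-hostB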

Performance considerations

Don’t forget to monitor your Flume agent when adding an HDFS sink. The overhead is noticeable: there are a lot more I/O threads (the pool defaults to 10, but I noticed way more threads with VisualVM), and CPU usage slightly increases. The relevant knob is shown below.
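
The I/O thread pool is configurable on the sink if you need to tune it (shown here with its default value of 10):

agent1.sinks.s.hdfs.threadsPoolSize = 10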

Our Flume agent is now properly sinking to HDFS; let’s check it out.

HDFS

The first thing to do is verify the data is correct and readable.

We check the partitioning is present:

$ hdfs dfs -ls /user/flume/events/ymd=2016-01-29
/user/flume/events/ymd=2016-01-29/h=00
/user/flume/events/ymd=2016-01-29/h=01
...

Then we check a file to see if it’s parsable:

$ hdfs dfs -cat /user/flume/events/ymd=2016-03-14/h=15/events.1458147890641.avro
Objavro.schema�8{"type":"record","name":"OrderRecord","namespace":"com.company.avro","fields":[...

Our schema is there, and it’s not a SequenceFile, good! Let’s use avro-tools to deserialize the content. It’s simply a .jar with some useful commands (getschema, tojson), downloadable from any Apache mirror:

$ curl -sLO http://apache.crihan.fr/dist/avro/avro-1.7.7/java/avro-tools-1.7.7.jar
$ java -jar avro-tools-1.7.7.jar getschema events.1458147890641.avro
{
  "type" : "record",
  "name" : "Order",
  "namespace" : "com.company",
  "fields" : [ {
    "name" : "type",
    "type" : "string"
  }, {
...
$ java -jar avro-tools-1.7.7.jar tojson events.1458147890641.avro
{"type":"AD12","customer_id":2458189, ...
{"type":"AD12","customer_id":9515711, ...

Our HDFS is in place and gets new data streamed in. Let’s now configure Hive to create a table on top of the files.

Hive

Fortunately, Avro is a standard in the Hadoop world, and Hive has everything needed to create a table from Avro files.

AvroSerDe

The magic happens when using the AvroSerDe (Avro Serializer/Deserializer). It is used to read Avro files to create a table and, vice versa, to create Avro files from a table (with an INSERT). It also detects and decompresses the files when they are Snappy-compressed.

Under the hood, it simply uses the DataFile[Writer|Reader] API to read and write the Avro content.

https://cwiki.apache.org/confluence/display/Hive/AvroSerDe
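
To make that concrete, here’s a minimal sketch of reading one of our files back with that same API (the file name is one of the examples above):

// Read an Avro container file: the schema travels with the data
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
try (DataFileReader<GenericRecord> fileReader =
         new DataFileReader<>(new File("events.1458147890641.avro"), datumReader)) {
    System.out.println(fileReader.getSchema());   // roughly what avro-tools getschema shows
    for (GenericRecord record : fileReader) {
        System.out.println(record);               // roughly what avro-tools tojson shows
    }
}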

Create the table

We create a Hive external table mapped onto our .avro files, using the AvroSerDe and specifying an Avro schema to read the data (it will probably be the same as the schema used to write the data, BUT it could be different, such as only a subset of the fields):

CREATE EXTERNAL TABLE avro_events
PARTITIONED BY (ymd STRING, h INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/flume/events/'
TBLPROPERTIES ('avro.schema.literal' = '
{
 "type" : "record",
 "name" : "Order",
 "namespace" : "com.company",
 "fields" : [ {
   "name" : "type",
   "type" : "string"
 } ]
}');

The PARTITIONED BY matches our structure and names:

$ hdfs dfs -ls /user/flume/events/ymd=2016-03-17/h=12
/user/flume/events/ymd=2016-03-17/h=12/events.1458212422229.avro
/user/flume/events/ymd=2016-03-17/h=12/events.1458212456756.avro
...

It’s also possible to externalize the schema into its own file on HDFS and use it:

TBLPROPERTIES ('avro.schema.url' = '/user/flume/Order.avsc');

You can generate the schema if you don’t have it (if it’s generated on the fly by the code with some Scala macros for instance):

$ java -jar avro-tools-1.7.7.jar getschema events.avro > Order.avsc
$ hdfs dfs -put Order.avsc /user/flume/

ERR: Long schemas

If you have a looong schema, and therefore a long query, you could end up with this error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:javax.jdo.JDODataStoreException: Put request failed : INSERT INTO "TABLE_PARAMS" ("PARAM_VALUE","TBL_ID","PARAM_KEY") VALUES (?,?,?)
org.datanucleus.store.rdbms.exceptions.MappedDatastoreException: INSERT INTO "TABLE_PARAMS" ("PARAM_VALUE","TBL_ID","PARAM_KEY") VALUES (?,?,?)
Caused by: org.postgresql.util.PSQLException: ERROR: value too long for type character varying(4000)

In that case, the schema literal is too long for the metastore column (a VARCHAR(4000) here); you must use an external schema file instead.

If you’re using a UI like Hue to run the query and notice something weird, use the hive shell: the error will be properly displayed there, whereas Hue won’t show it.

ERR: Partition and field name conflicts

If a column in your data has the same name as one of the partition dimensions (ymd and h in our case), you are going to get this error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Partition column name hour conflicts with table columns.

You need to change the partition column name you are using.

Notify Hive of the new partitions

Our data is in HDFS and our Hive table is mapped onto it, good. But Hive still needs to discover the new data: it’s not automatic, because we are short-circuiting it (we are inserting data without any Hive INSERT).

msck repair table to the rescue. It simply scans the table folder to discover new directories and adds them to the metastore.

hive> msck repair table avro_events;
OK
Partitions not in metastore: avro_events:ymd=2016-03-17/h=12
Repair: Added partition to metastore avro_events:ymd=2016-03-17/h=12
Time taken: 2.339 seconds, Fetched: 1 row(s)

Fortunately, we are not going to do that manually each time we need it: we are going to use Oozie later in the process, and go through Impala directly.

Drop partitions

If you’re playing around a bit and see some “Partitions missing from filesystem” messages when doing a msck repair, you can remove those partitions (the metadata) manually:

hive> alter table avro_events DROP IF EXISTS PARTITION (ymd='2016-03-21') PURGE;
Dropped the partition ymd=2016-03-21/h=13
Dropped the partition ymd=2016-03-21/h=14
Dropped the partition ymd=2016-03-21/h=15
Dropped the partition ymd=2016-03-21/h=16
Dropped the partition ymd=2016-03-21/h=17
Dropped the partition ymd=2016-03-21/h=18
Dropped the partition ymd=2016-03-21/h=19
Dropped the partition ymd=2016-03-21/h=20
Dropped the partition ymd=2016-03-21/h=21
Dropped the partition ymd=2016-03-21/h=22
Dropped the partition ymd=2016-03-21/h=23

Querying

You can already query the Hive table to get proper records.

For instance, if you have a “timestamp” column, it’s nice to check the time of the latest ingested data:

select max(`timestamp`) from avro_events;

Hive has some magic columns we can use to get more insights about the table content. To know how many records each file contains, we can use the virtual column INPUT__FILE__NAME:

hive> select INPUT__FILE__NAME, COUNT(*)
FROM avro_events
GROUP BY INPUT__FILE__NAME
ORDER BY INPUT__FILE__NAME;
hdfs://hadoop:8020/events/ymd=2016-03-22/h=00/...avro      910
hdfs://hadoop:8020/events/ymd=2016-03-22/h=00/...avro      1572
hdfs://hadoop:8020/events/ymd=2016-03-22/h=00/...avro      4884

That can come in handy if you just want to look at what’s inside a specific file:

hive> SELECT COUNT(*) FROM avro_events
      WHERE INPUT__FILE__NAME = '...';

Because we are not crazy enough to query through Hive, let’s focus on querying the data through Impala to get blazing fast responses.

Impala

We can use Impala to query the Avro table directly, but for performance reasons, we are going to export it into a Parquet table afterwards. This step is mostly here to show that it’s possible.

Query the Hive avro table

If right now, in Impala, you do:

> show tables;

You won’t see the Hive table yet. You need to make Impala aware it’s there:

> INVALIDATE METADATA avro_events;

When it’s done, you’ll be able to query it, but the data will still be frozen at that point in time. It’s the same problem as with Hive.

For Impala to see the latest data of the existing partitions, we can use REFRESH:

> REFRESH avro_events;

But that still won’t discover new partitions (i.e. if Flume just created a new hour partition).

We have to use the equivalent of Hive’s msck repair table, but for Impala:

> ALTER TABLE avro_events RECOVER PARTITIONS;

This RECOVER PARTITIONS does what REFRESH does (see the latest data of the existing partitions), but discovers the new partitions too. I don’t know the impact and processing time on big tables with tons of partitions.
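
To double-check what Impala now knows about the table, you can list its partitions:

> SHOW PARTITIONS avro_events;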

Query a Parquet table

Because we want to be taken seriously, we want to store our data in Parquet files to get fast queries. Parquet stores the data in columns rather than in rows, enabling über-fast filtering because a query doesn’t need to read every row, only the columns it touches.

First, we need to create a partitioned Impala table stored in Parquet:

CREATE TABLE avro_events_as_parquet (type STRING, ...)
PARTITIONED BY(ymd STRING, h INT)
STORED AS PARQUETFILE;

It doesn’t have to follow the same partitioning as the Hive table, but for the sake of clarity, it does.

It’s empty; we are going to fill it from the Avro table. To do that, we are going to base our partition logic on a “timestamp” column you should have in your data. We can’t retrieve the partition values from the Avro folder names of the Hive table, since those aren’t queryable.

-- We ensure we're viewing the latest partitions of the Hive table
-- where Flume is sinking its content

ALTER TABLE avro_events RECOVER PARTITIONS;
REFRESH avro_events;

-- We insert the data overwriting the partitions we are going to
-- write into, to be sure we don't append stuff to existing (that
-- would do duplicates and wrong stats!)

INSERT OVERWRITE avro_events_as_parquet
PARTITION(ymd, h)
  SELECT type, ..., 
    FROM_UNIXTIME(FLOOR(`timestamp`/1000), 'yyyy-MM-dd') AS ymd,
    CAST(FROM_UNIXTIME(FLOOR(`timestamp`/1000), 'HH') AS INT) AS h
  FROM avro_events
  [ WHERE `timestamp` >= $min AND `timestamp` < $max ];

-- We compute the stats of the new partitions
COMPUTE INCREMENTAL STATS avro_events_as_parquet;

We specify the partition values by doing some transformations on our “timestamp” column (note: FROM_UNIXTIME depends on the server’s time zone configuration, so double-check it matches your expectations).

The WHERE is not necessary the first time: you just want to load everything. Later, when you have a job scheduled every hour for instance, you’ll want to filter which partition you write, hence the WHERE.
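
For example, assuming the “timestamp” column is in epoch milliseconds (as the FLOOR(`timestamp`/1000) above suggests), writing only the ymd=2016-03-31/h=11 partition would use bounds like:

-- 1459422000000 = 2016-03-31T11:00Z, 1459425600000 = 2016-03-31T12:00Z
WHERE `timestamp` >= 1459422000000 AND `timestamp` < 1459425600000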

The end result is something like:

$ hdfs dfs -ls .../avro_events_as_parquet/ymd=2016-03-23/h=13
78.5 M avro_events_as_parquet/ymd=2016-03-23/h=13/644a2fbc8ad76fb2-2bd5c46c0a9a2cba_497985827_data.0.parq

A nice .parq file combining all the 5-minute Avro files Flume created for that hour.
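
A quick sanity check on the freshly written partition doesn’t hurt:

> SELECT COUNT(*) FROM avro_events_as_parquet WHERE ymd = '2016-03-23' AND h = 13;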

Now, because Flume keeps streaming, and because you want to query Impala without doing all these manual updates yourself, you plan a recurring Oozie job to take care of them.

Oozie

It’s quite straightforward: it’s just an encapsulation of the statements we just used.

First, we define a coordinator running every hour that will write the previous hour’s partition. (The coordinator could trigger the workflow every 5 minutes to have less lag in Impala if necessary; the same Parquet partition would just be overwritten 12 times per hour, with more and more data each time.)

We take care of adding a small lag after the exact hour: instead of running the workflow at 12:00, we run it at 12:05, to be sure Flume has flushed its data (the 5 minutes is not random, it’s the same value as the Flume rollInterval).

That means we can define a property in the coordinator.xml like this:

<property>
    <name>partitionDate</name>
    <value>${coord:dateOffset(coord:dateOffset(coord:nominalTime(), -1, 'HOUR'), -5, 'MINUTE')}</value>
</property>

If the workflow runs at 12:05, then partitionDate=2016-03-31T11:00Z. The partition ymd=2016-03-31/h=11 will contain all the data in [11:00, 12:00).
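
For reference, the surrounding coordinator definition could look like this (a sketch: the app name, dates, workflow path variable and XSD version are illustrative; what matters is the hourly frequency and the start time at :05 past the hour):

<coordinator-app name="events-to-parquet" frequency="${coord:hours(1)}"
                 start="2016-03-31T00:05Z" end="2026-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
            <configuration>
                <property>
                    <name>partitionDate</name>
                    <value>${coord:dateOffset(coord:dateOffset(coord:nominalTime(), -1, 'HOUR'), -5, 'MINUTE')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>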

Then, in the workflow, we create a shell action where we pass this value:

<shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>

    <exec>${insertOverwriteScript}</exec>
    <argument>${partitionDate}</argument>
    ...
</shell>

And the script uses it to build the WHERE condition we talked about in the Impala part, with some shell transformations (with what we did, we only get a plain UTC date, but we need at least two timestamps (min and max), or any other time values we could use from our data if we have them, such as “day” or “hour”).
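
Here’s a minimal sketch of what such a script could look like (the script name, the GNU date usage, the staging .sql file and the impala-shell host are assumptions, not our exact production script):

#!/usr/bin/env bash
# insert-overwrite.sh (hypothetical): derives the [min, max) epoch-millis bounds
# of the hour from the Oozie partitionDate, then runs the Impala statements.
set -euo pipefail

partition_date="$1"    # e.g. 2016-03-31T11:00Z, passed by the Oozie action

# Assumes GNU date, which parses this ISO 8601 form
min_ms=$(( $(date -u -d "${partition_date}" +%s) * 1000 ))
max_ms=$(( min_ms + 3600 * 1000 ))

cat > /tmp/insert_overwrite.sql <<SQL
ALTER TABLE avro_events RECOVER PARTITIONS;
REFRESH avro_events;
INSERT OVERWRITE avro_events_as_parquet
PARTITION(ymd, h)
  SELECT type, ...,  -- other columns elided, as above
    FROM_UNIXTIME(FLOOR(\`timestamp\`/1000), 'yyyy-MM-dd') AS ymd,
    CAST(FROM_UNIXTIME(FLOOR(\`timestamp\`/1000), 'HH') AS INT) AS h
  FROM avro_events
  WHERE \`timestamp\` >= ${min_ms} AND \`timestamp\` < ${max_ms};
COMPUTE INCREMENTAL STATS avro_events_as_parquet;
SQL

impala-shell -i your-impalad-host -f /tmp/insert_overwrite.sql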

I assume you know Oozie and what you’re doing, so I’m not providing the full scripts here.

Improvements

I think the stack could be simplified by using a serializer that writes Parquet directly to HDFS, but that would still create a bunch of tiny Parquet files (because we want to flush often), so we would still need to merge them automatically at the end. What do you think?

Conclusion

I hope I made myself clear and that the process makes sense.

Don’t forget we did all that just because we didn’t want to use any database, only HDFS to store our data, for reasons of our own. We wanted to be able to query it quickly through Impala to have “almost-realtime” data (with a coordinator every 5 minutes, for instance, that would do it).

Another solution would be to sink Flume into HBase and query over that, or create a Hive/Impala table on top of it.

Oh, this post is missing some gifs.

If you’re not me, come say hi on Twitter!