Why is it important to log using slf4j?

September 7th, 2016 | code quality, hadoop, java, logging, scala |

You are a Java or Scala programmer. You are logging stuff with different levels of severity. And you have probably already used slf4j without even noticing.

This post is a global overview of its ecosystem: why it exists and how it works. Using something every day doesn't mean you know the details, right?

Why does slf4j even exist?

Why do we need something as complicated as a logging framework to do something as simple as printing a message on stdout? Because not everybody wants to use only stdout, and because your dependencies come with their own logging logic too.

slf4j needs love

slf4j is an API that exposes logging methods (logger.info, logger.error, and so on). It's just a facade, an abstraction, an interface. By itself, it can't log anything. It needs an implementation, a binding, something that truly logs the message somewhere. slf4j is just the entry point; it needs an exit.

slf4j breathes in logs

But it can also serve as an exit for other logging systems, thanks to logging adapters/bridges that redirect other logging frameworks to slf4j. Hence, you can make all your application logs go through the same pipe even if they originate from different logging systems.

slf4j is magic

The magic in all that? You can do all of this and swap the implementation without altering the existing code.

 

We are going to see several logging implementations slf4j can be bound to.

I’m going to use Scala code because it’s more concise, but that’s exactly the same in Java.

Simple logging using JUL

JUL stands for java.util.logging. This package has existed since JDK 1.4 (JSR 47). It's quite simple to use and does the job:

val l = java.util.logging.Logger.getLogger("My JUL")
l.info("coucou")

Output:

Aug 18, 2016 11:41:00 PM App$ delayedEndpoint$App$1
INFO: coucou

App is my class, delayedEndpoint is the method.

It’s configurable through its API:

import java.util.logging.{ConsoleHandler, FileHandler, Level, SimpleFormatter}

// we create a logger that accepts ALL levels
val l = java.util.logging.Logger.getLogger("My JUL")
l.setLevel(Level.ALL)
// we output ALL the logs to the console
val h = new ConsoleHandler
h.setLevel(Level.ALL)

// and to a file, but only greater than or equal to WARNING
val f = new FileHandler("warn.log", true)
f.setLevel(Level.WARNING)
f.setFormatter(new SimpleFormatter)

l.addHandler(h)
l.addHandler(f)

// log stuff
l.entering(classOf[App].toString, "myMethod")
l.info("hello there")
l.severe("badaboom")
l.exiting(classOf[App].toString, "myMethod")

That can output something like :

sept. 07, 2016 11:16:53 PM interface scala.App myMethod
FINER: ENTRY
sept. 07, 2016 11:16:53 PM com.App$ myMethod
INFO: hello there
sept. 07, 2016 11:16:53 PM com.App$ myMethod
INFO: hello there
sept. 07, 2016 11:16:53 PM com.App$ myMethod
SEVERE: badaboom
sept. 07, 2016 11:16:53 PM com.App$ myMethod
SEVERE: badaboom
sept. 07, 2016 11:16:53 PM interface scala.App myMethod
FINER: RETURN

The default format is horrible, but we can see our logs. You'll notice we got INFO and SEVERE twice, but FINER only once. That's because, by default, there is already a console handler attached that logs everything from INFO and up.

It’s also configurable through a properties file often named “logging.properties”.

For instance, on OSX, you can find the JVM global JUL configuration here (that contains the default console handler we just talked about):

/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home/jre/lib/logging.properties

You can use a file of yours by specifying its path in the system properties:

-Djava.util.logging.config.file=src/main/resources/logging.properties

Some values inside must be class references (FQCNs) that will be loaded dynamically; the others are simple properties (think beans).

.level = INFO
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = INFO
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
java.util.logging.SimpleFormatter.format=%4$s: %5$s [%1$tc]%n

We can get a reference to the global logger to change its minimum level:

java.util.logging.Logger.getGlobal.setLevel(Level.ALL)

The output is better:

FINER: ENTRY [Wed Sep 07 23:32:48 CEST 2016]
INFO: hello there [Wed Sep 07 23:32:48 CEST 2016]
SEVERE: badaboom [Wed Sep 07 23:32:48 CEST 2016]
FINER: RETURN [Wed Sep 07 23:32:48 CEST 2016]

Be careful: a custom configuration file is not merged with the default one, it replaces it! If you forget something (especially handlers=), you might not see any logs at all.

Note that we used the handler java.util.logging.ConsoleHandler, but a FileHandler is also available (if unconfigured, it logs into $HOME/java0.log).

LogManagers

All the Loggers created in the application are managed by a LogManager.

By default, an instance is created on startup. It's possible to provide another one by specifying the property java.util.logging.manager.

It's often used along with log4j2, which implements a custom LogManager (available in the artifact org.apache.logging.log4j:log4j-jul):

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

This way, the custom manager gets a hand on every Logger created in the application.

It can change their behavior and where they read their configuration from, for instance. This is what we call a Logging Adapter or a bridge: you can log using JUL in the code and use log4j features to manipulate and save the logs. We'll go into more details later in this post.
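With sbt, pulling that artifact would look something like this (the version is only an example, aligned with the log4j2 version used later in this post):

libraryDependencies += "org.apache.logging.log4j" % "log4j-jul" % "2.6.2"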

A smarter logging with slf4j-api

Let’s go into the main subject: slf4j.

The API

First, we need to add a dependency to its API:

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
val sl: Logger = LoggerFactory.getLogger("My App")
sl.info("hello")

We are getting some logs, but not what we expect:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

slf4j is using a org.slf4j.helpers.NOPLogger to log, but unfortunately, as the name says, all methods are empty shells:

// org.slf4j.helpers.NOPLogger.java
final public void info(String msg, Throwable t) {
    // NOP
}

The application still works, but without logs. slf4j tries to find a class org.slf4j.impl.StaticLoggerBinder in the classpath. If it doesn't find one, it falls back to the NOPLogger.

A simple slf4j binding

Fortunately, there is a simple implementation of slf4j :

libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.21"

Now it can find a org.slf4j.impl.StaticLoggerBinder to create a Logger (a SimpleLogger in this case).

By default, this logger publishes messages to System.err, but it can actually write to System.out or any file.

val sl: Logger = LoggerFactory.getLogger("My App")
sl.info("message from {}", "slf4j!")

Output:

[main] INFO My App - message from slf4j!

The style and destination can be configured using System variables or via a properties file.

-Dorg.slf4j.simpleLogger.showDateTime=true
-Dorg.slf4j.simpleLogger.dateTimeFormat="yyyy-MM-dd HH:mm:ss"
-Dorg.slf4j.simpleLogger.levelInBrackets=true
-Dorg.slf4j.simpleLogger.logFile=simpleLogger.log

Here, we say we want to log into a file “simpleLogger.log”.

For the sake of clarity and organization, it’s preferable to put those props in a dedicated file such as src/main/resources/simplelogger.properties:

org.slf4j.simpleLogger.showDateTime=true
org.slf4j.simpleLogger.dateTimeFormat="yyyy-MM-dd HH:mm:ss"
org.slf4j.simpleLogger.levelInBrackets=true
org.slf4j.simpleLogger.logFile=simpleLogger.log

This was our first slf4j logging implementation. But we already saw another one: JUL!

slf4j to JUL

slf4j can redirect its logs to JUL, which provides the "writing" piece, as we already saw.

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-jdk14" % "1.7.21"

The name "slf4j-jdk14" comes from the fact that the JUL package appeared in JDK 1.4, as we said. Strange name to pick, but well.

Output:

INFO: message from slf4j! [Thu Aug 18 23:45:15 CEST 2016]

The code is the same as previously, we just changed the implementation. Notice the output is different from the SimpleLogger's.

This logger is actually an instance of JDK14LoggerAdapter. It's using the style we defined at the beginning in logging.properties, used by JUL, remember?

Note that you don't have full control over the Logger via the API, as we had when using java.util.logging.Logger directly, which exposes more methods. We only have access to slf4j's. This is why the configuration files come in handy.
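One escape hatch in this particular setup, as a sketch: JUL caches loggers by name, so the slf4j adapter and the JUL API return the same underlying logger, and you can tune it with the full JUL API next to your slf4j calls:

import java.util.logging.{Level, Logger => JulLogger}
import org.slf4j.LoggerFactory

val sl = LoggerFactory.getLogger("My App")   // slf4j facade (a JDK14LoggerAdapter here)
val jul = JulLogger.getLogger("My App")      // same underlying JUL logger, fetched by name
jul.setLevel(Level.FINE)                     // tweak it outside of slf4j
sl.debug("visible only if a handler accepts FINE")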

Multiple implementations

If we have multiple implementations available, slf4j will have to pick between them, and it will leave you a small warning about that.

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-jdk14" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.21"

Output:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [.../slf4j-simple/jars/slf4j-simple-1.7.21.jar!...]
SLF4J: Found binding in [.../org.slf4j/slf4j-jdk14/jars/slf4j-jdk14-1.7.21.jar!...]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]
"2016-08-18 23:53:54" [main] [INFO] My App - message from slf4j!

As we said, org.slf4j.impl.StaticLoggerBinder is the class slf4j-api looks for in the classpath to get an implementation. This is the class that must exist in an slf4j implementation jar.

This message is just a warning, the logging will work: slf4j simply picks one of the available logging implementations and deals with it. But it's a bad smell that should be fixed, because maybe it won't pick the one you want.

It often happens when pom.xml or build.sbt imports dependencies that themselves depend on one of the slf4j implementations.

They have to be excluded, and your own program should import an slf4j implementation itself. If you don't, you could run into a no-logging issue.

A real case causing logs loss

For a real case, let’s import the hadoop client lib:

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"

If we restart our program, it gets more verbose and we get a surprise:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [.../org.slf4j/slf4j-log4j12/jars/slf4j-log4j12-1.7.5.jar!...]
SLF4J: Found binding in [.../org.slf4j/slf4j-jdk14/jars/slf4j-jdk14-1.7.21.jar!...]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (My App).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

We can see some warnings from log4j, which we never imported, and we don't even see our own message! Where did it go?

It went into log4j, which is not configured, meaning into a black hole.

One way is to exclude the log4j impl from the dependencies:

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-jdk14" % "1.7.21"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0" exclude("org.slf4j", "slf4j-log4j12")

If we restart our program, we can see that our JUL console logs are back.

Note that the hadoop logs will still be voided, because hadoop still relies on a log4j configuration we never provided.

One way to fix this and get the hadoop logs would be to redirect the log4j API to the slf4j API. It's possible: we simply need to add a dependency on org.slf4j:log4j-over-slf4j.

Again, we'll see that in detail later in this article, but the point is: you shouldn't have multiple logging implementations available in one program.

slf4j implementations should be declared as optional

A best practice when writing a library, or any module that can be imported somewhere, is to declare the slf4j implementation dependency as "optional":

libraryDependencies += "org.slf4j" % "slf4j-jdk14" % "1.7.21" % "optional"

Or the Maven equivalent:

<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-jdk14</artifactId>
  <optional>true</optional>
</dependency>

With optional, the dependency won't be pulled in transitively.

The program which depends on it can use anything, no need to exclude it. More details here https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html.

JCL/ACL

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-jcl" % "1.7.21"

JCL stands for Jakarta Commons Logging.

Jakarta is an old, retired Apache project; basically, it's known as ACL now, Apache Commons Logging. It's not maintained anymore (since 2014), but we can find it in old projects.

It serves the same purpose as slf4j, meaning it’s an abstraction over different logging frameworks such as log4j or JUL.

slf4j’s getLogger() will return a JCLLoggerAdapter that will look for a specific “Log” implementation set by the System variable “org.apache.commons.logging.Log”.

If not set, it will try to fall back on any implementation it can find in the classpath (log4j, JUL…).

New projects should forget about it. But if they depend on an old project that itself depends on JCL, then adding a bridge to redirect the JCL logs to the project's own implementation should be considered.
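For the record, that bridge is the jcl-over-slf4j artifact listed later in this post; with sbt it would be something like:

libraryDependencies += "org.slf4j" % "jcl-over-slf4j" % "1.7.21"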

log4j

log4j is a widely-used logging framework. v1.x has been heavily refactored and improved to create v2.x, called log4j2.

Again, it can be used as an abstraction over a logging implementation, but it can be used as an implementation as well.

log4j1.2

log4j 1.2 reached its end of life in 2015.

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-log4j12" % "1.7.21"

Note that this will pull the log4j 1.2 library too. Here is the dependency tree:


[info] +-org.slf4j:slf4j-log4j12:1.7.21
[info]   +-log4j:log4j:1.2.17
[info]   +-org.slf4j:slf4j-api:1.7.21

When calling slf4j’s getLogger(“My App”), it will use log4j API to create the logger:

org.apache.log4j.LogManager.getLogger(name);

Note that this LogManager has nothing to do with the JUL’s one.

When you don’t have slf4j but just log4j, this is the method you call to get a Logger. slf4j-log4j12 just does the same.

Anyway, that’s not enough:

log4j:WARN No appenders could be found for logger (My App).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

log4j needs a configuration file. We can create a simple properties file “src/main/resources/log4j.properties”:

log4j.rootLogger=DEBUG, STDOUT
log4j.appender.STDOUT=org.apache.log4j.ConsoleAppender
log4j.appender.STDOUT.layout=org.apache.log4j.PatternLayout
log4j.appender.STDOUT.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n

If we restart our program, we can see our message:

0 [main] INFO My App - message from slf4j!

Or if we like xml (nobody?), we can create a file "log4j.xml" (notice the lowercase tags), something like:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">

  <appender name="STDOUT" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d %-5p [%t] %c (%F:%L) - %m%n"/>
    </layout>
  </appender>

  <root>
    <priority value="debug"/>
    <appender-ref ref="STDOUT"/>
  </root>

</log4j:configuration>

Output:

2016-08-22 01:06:38,194 INFO [main] App$ (App.scala:11) - message from slf4j!

But you shouldn't use it anymore: log4j 1.2 has reached its end of life, prefer log4j2 or logback.

log4j2

Now, let's say we want to use the latest version of log4j. It may be the most popular slf4j binding nowadays.

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.6.2"

Notice the organization of the binding is “org.apache.logging.log4j”, and not “org.slf4j” like log4j12’s.

Adding only this dependency is not enough:

Failed to instantiate SLF4J LoggerFactory
Reported exception:
java.lang.NoClassDefFoundError: org/apache/logging/log4j/spi/AbstractLoggerAdapter
...

We need to add log4j-api dependency ourselves:

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.6.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-api" % "2.6.2"

Not enough yet!

ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console…

We need to add the log4j-core dependency too:

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.6.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-api" % "2.6.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-core" % "2.6.2"

We get another error message (!) :

ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.

That’s better, we just need a configuration file, that’s the last step.

Let's create a sample log4j2.xml (notice the capitalized tags this time), for instance:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">

  <Appenders>
    <Console name="STDOUT" target="SYSTEM_OUT">
      <PatternLayout pattern="%d %-5p [%t] %c (%F:%L) - %m%n"/>
    </Console>
    <File name="A1" fileName="A1.log">
      <PatternLayout pattern="%d %-5p [%t] %c (%F:%L) - %m%n"/>
    </File>
  </Appenders>

  <Loggers>
    <Root level="debug">
      <AppenderRef ref="STDOUT"/>
      <AppenderRef ref="A1"/>
    </Root>
  </Loggers>

</Configuration>

Our message is finally back and a file A1.log is created too:

2016-08-22 01:51:49,912 INFO [run-main-a] App$ (App.scala:8) - message from slf4j!

log4j2 is excellent because it has a vast collection of Appenders to write the logs to: https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html

  • Console, File, RollingFile, MemoryMappedFile
  • Flume, Kafka, JDBC, JMS, Socket
  • SMTP (emails on errors, woo!)
  • Any Appender can be wrapped in an Async one too (the logging is done in another thread, so the main thread is not blocked by the I/O)

logback

logback has the same father as log4j; it was meant to be log4j's successor.

The syntax of the configuration is therefore quite similar.

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.1.7"

"logback-classic" will pull down "logback-core" as a dependency, no need to add it.

It will run without configuration (finally!):

02:17:43.032 [run-main-1f] INFO My App — message from slf4j!

But of course, you can create a logback.xml to customize its behavior, something like:

<?xml version="1.0" encoding="UTF-8"?>
<configuration debug="true" scan="true">
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="debug">
        <appender-ref ref="STDOUT" />
    </root>
</configuration>

  • debug: displays some info about the logging system creation on startup
  • scan: modifications to the file are taken into account live. This is particularly useful in production when you just want to get debug messages for a short amount of time.
  • notice the xml style is log4j1.2's

It’s also possible to use a custom config file

-Dlogback.configurationFile=src/main/resources/logback.xml

logback has a collection of appenders similar to log4j's; some of them are not part of the official package.

TLDR

Add a dependency to slf4j, which is the logging interface: "org.slf4j" % "slf4j-api", and add a logging implementation:

Implementation             | Dependency(ies)                                             | Configuration / Note
to the console             | "org.slf4j" % "slf4j-simple"                                | simplelogger.properties
to java.util.logging (JUL) | "org.slf4j" % "slf4j-jdk14"                                 | logging.properties
to JCL/ACL                 | "org.slf4j" % "slf4j-jcl"                                   | (deprecated)
to log4j1.2                | "org.slf4j" % "slf4j-log4j12" (deprecated)                  | log4j.[properties|xml]
to log4j2                  | "org.apache.logging.log4j" % "log4j-[slf4j-impl|api|core]"  | log4j2.xml
to logback                 | "ch.qos.logback" % "logback-classic"                        | logback.xml

A very nice picture to sum up what we just saw (we didn't talk about slf4j-nop, it's just a black hole):


http://www.slf4j.org/manual.html

 

So we learned about multiple implementations/bindings of slf4j’s api.

But if your project depends on other projects that are not using slf4j but JUL or log4j directly, it's possible to redirect them to your own slf4j implementation, thanks to the bridges.

Bridges

Previously, we imported hadoop-client and our logs disappeared because it was using a log4j logger we never configured.

We excluded that implementation from the program and could see our logs again, but the hadoop-client library was still using log4j, and therefore its logs went into the void.

To avoid that, it's possible to add a bridge that sends the log4j messages to slf4j, which we then dispatch where we want.

The bridge package generally contains both sides in its name, such as "org.apache.logging.log4j" % "log4j-to-slf4j" % "2.6.2".

For instance, with those dependencies :

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-jdk14" % "1.7.21"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"
libraryDependencies += "org.apache.logging.log4j" % "log4j-to-slf4j" % "2.6.2"

The path of the logs is:
hadoop’s log(…) → ACL → log4j → bridge → slf4j → JUL → System.err
Phew!

import java.net.InetSocketAddress
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.DFSClient

val sl: Logger = LoggerFactory.getLogger("My App")
sl.info("message from {}", "slf4j!")
// generate some hadoop logs
new DFSClient(new InetSocketAddress(1337), new Configuration)

We are actually "lucky" because 2 implementations were available to slf4j: log4j (pulled in by hadoop-client) and "slf4j-jdk14".

Fortunately for us, slf4j picked "slf4j-jdk14". Otherwise we would have been trapped in an infinite loop:

hadoop's log(…) → ACL → log4j → bridge → slf4j → log4j → bridge → slf4j → log4j → bridge → slf4j → …

Output:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [...slf4j-jdk14-1.7.21.jar!...]
SLF4J: Found binding in [...slf4j-log4j12-1.7.5.jar!...]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
INFO: message from slf4j! [Fri Aug 19 01:08:46 CEST 2016]
FINE: dfs.client.use.legacy.blockreader.local = false [Fri Aug 19 01:08:46 CEST 2016]
FINE: dfs.client.read.shortcircuit = false [Fri Aug 19 01:08:46 CEST 2016]
FINE: dfs.client.domain.socket.data.traffic = false [Fri Aug 19 01:08:46 CEST 2016]
FINE: dfs.domain.socket.path = [Fri Aug 19 01:08:46 CEST 2016]
…

Another bridge supposedly doing the same exists: "org.slf4j" % "log4j-over-slf4j" % "1.7.21". Unfortunately, it creates the infinite loop in our case, because slf4j picks "slf4j-log4j12":

SLF4J: Found binding in [...slf4j-log4j12-1.7.5.jar!...]
SLF4J: Found binding in [...slf4j-jdk14-1.7.21.jar!...]
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.StackOverflowError

But we can explicitly exclude the other implementation:

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0" exclude("org.slf4j", "slf4j-log4j12")

If we do, both bridges work as expected.

As you can see, without altering anything in the hadoop library, we made it generate logs where we wanted and with the format we wanted.

Bridges between those common implementations are available (they couldn't agree on the naming, it seems…):

  • jcl-over-slf4j
  • log4j-over-slf4j
  • jul-to-slf4j (this one needs a small installation step, see the sketch below)
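Unlike the two others, jul-to-slf4j is not just a drop-in jar: the JUL handlers have to be redirected explicitly, typically once at startup. A minimal sketch:

import org.slf4j.bridge.SLF4JBridgeHandler

// remove the default JUL console handler, then route JUL records to slf4j
SLF4JBridgeHandler.removeHandlersForRootLogger()
SLF4JBridgeHandler.install()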

That’s the power of slf4j and its implementations. It’s completely decoupled from the source.

TLDR

Here's a picture summarizing the available bridges to slf4j:


http://www.slf4j.org/legacy.html

Performance

Some applications can generate a tremendous amount of logs. Some precautions should be taken:

  • async logging should always be preferred (another thread does the logging, not the caller's). This is often available in the logging configuration itself (an AsyncAppender in log4j2 or logback, for instance)
  • you shouldn't need to add level guards (if (logger.isDebugEnabled) …) before logging, which brings us to the next point:
  • do not concatenate strings yourself in the message: use the placeholder syntax such as log.info("the values are {} and {}", item, item2), as shown in the sketch after this list. The .toString() won't be computed if it's not needed (it can be CPU intensive, and it's simply useless to call it if the log level is not high enough).
  • In Scala, you can generally use https://github.com/Log4s/log4s to avoid this and just use classic string interpolation. It's based on macros and will automatically add the guard conditions.
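As a small illustration of the placeholder point (just a sketch, the logger name and the values are arbitrary):

import org.slf4j.LoggerFactory

val log = LoggerFactory.getLogger("perf-demo")
val (item, item2) = (List(1, 2, 3), Map("a" -> 1))

// Bad: the concatenation (and the toString calls) happens even if DEBUG is disabled
log.debug("the values are " + item + " and " + item2)

// Good: the message is only formatted when DEBUG is actually enabled
log.debug("the values are {} and {}", item, item2)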

Some benchmarks and comparisons: https://logging.apache.org/log4j/2.x/performance.html

Last notes

slf4j is useful combined with a powerful implementation such as log4j2 or logback.

But be careful when the application is managed by another application, like supervisor, because it can handle part of the logging itself too, such as file rolling or shipping to logstash. Often, keeping the logging configuration simple (stdout) is enough.

A lot of frameworks have traits, abstract classes, or globals to provide the logging directly:

  • Akka: provides LoggingReceive, ActorLogging, akka.event.Logging, akka-slf4j.jar
  • Spark: it uses log4j and had a trait org.apache.spark.Logging (removed in 2.0.0)
  • Play Framework: it uses logback and provides a Logger object/class on top of slf4j's Logger

 

 

From Apache Flume to Apache Impala using HDFS

March 31st, 2016 | cloudera, flume, hadoop, hbase, hdfs, impala |

Let’s use HDFS as a database !

So, we have data coming from one of our services, which is the source of a Flume agent. Now, we want to be able to query that data in a scalable fashion without using HBase or any other database, to stay lean.

One way is to use HDFS as a database (Flume has an HDFS sink that handles partitioning), create a Hive table on top to query its content, and, because we want something performant and fast, actually use Impala to query the data stored in the Apache Parquet format.

Here’s a little diagram of the stack :

flume to impala

  • Apache Oozie is used to regularly export the HDFS content to a parquet format.
  • We are storing our data into HDFS in an Apache Avro format (Snappy compressed) because of all its advantages and because we are already using it everywhere.

Let’s review the stack one piece at a time, starting with the Flume configuration.

Flume

Let's say we have an avro source where our custom service is sending events:

agent1.sources.events_source.type = avro
agent1.sources.events_source.bind = 0.0.0.0
agent1.sources.events_source.port = 9876

Let's first configure the Flume HDFS sink where we are going to export our events. The configuration is pretty long, but every piece has its importance:

agent1.sinks = ... s

agent1.sinks.s.type = hdfs
agent1.sinks.s.hdfs.path = /user/flume/events/ymd=%Y-%m-%d/h=%H
agent1.sinks.s.hdfs.inUsePrefix = .
agent1.sinks.s.hdfs.fileType = DataStream
agent1.sinks.s.hdfs.filePrefix = events
agent1.sinks.s.hdfs.fileSuffix = .avro
agent1.sinks.s.hdfs.rollInterval = 300
agent1.sinks.s.hdfs.rollSize = 0
agent1.sinks.s.hdfs.rollCount = 0
agent1.sinks.s.serializer = com.company.CustomAvroSerializer$Builder
agent1.sinks.s.channel = events_channel

Let’s review this config.

Partition

Because of our volume of data, we want to partition it by year-month-day, then by hour. The "ymd=" and "h=" in the path are important: they represent the "column" names of the time dimensions that will be queryable later.

Note that your Flume events must have a "timestamp" header for Flume to know what the time dimension is. If you don't have this info, you can simply add hdfs.useLocalTimeStamp = true to use the ingestion time, but it's discouraged: that means you don't have any timestamp column in your data, and you're going to get stuck later when doing the Impala partitioning.
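For illustration, here is a hedged sketch of the sending side using Flume's RPC client, attaching the "timestamp" header to each event (the host, port and body are made up):

import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._
import org.apache.flume.api.RpcClientFactory
import org.apache.flume.event.EventBuilder

val client = RpcClientFactory.getDefaultInstance("flume-host", 9876)   // hypothetical host
val headers = Map("timestamp" -> System.currentTimeMillis().toString).asJava
val event = EventBuilder.withBody("my serialized event".getBytes(StandardCharsets.UTF_8), headers)
client.append(event)   // may throw EventDeliveryException
client.close()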

Roll interval

We decide to roll a new file every 5 minutes, and not based on size or count (those have to be explicitly set to 0 because they have non-zero default values).

By default, Flume buffers into a .tmp file we can't rely on, and because we want to access fresh data quickly, 5 minutes is a good start. This is going to generate a bunch of files (12 per hour, 288 per day), but we don't care: we are going to export them later into an hourly Parquet format and clean up the old HDFS content.

Moreover, if Flume suddenly dies, you are going to lose at most 5 minutes of data, instead of the whole buffer. Stopping Flume properly flushes the buffer, fortunately.

File name

Setting inUsePrefix to "." hides the working files from Hive during a query (it ignores dot-prefixed, aka hidden, files). If you don't, some MapReduce jobs can fail: at first, Hive sees a file Flume is buffering into (a .tmp), then by the time the MR executes, the file is not there anymore (because of a flush), and kaboom, the MR fails:

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://hadoop01:8020/user/flume/events/ymd=2016-03-17/h=15/events.1458223202146.avro.tmp

File type

By default, the file type is SequenceFile. We don't want that, because it makes Flume convert the output stream to a SequenceFile that Hive will not be able to read, since the avro schema won't be inside. Setting it to DataStream lets the data be written unaltered.

FYI, a typical SequenceFile body :

SEQ♠!org.apache.hadoop.io.LongWritable”org.apache.hadoop.io.BytesWritable ▒c(g▒s▒▒►▒▒▒|TF..

A snappy compressed avro file body :

Obj♦avro.schema▒8{"type":"record","name":"Order",...}avro.codecsnappy ▒▒T▒▒▒▒♣

Avro serializer

Our custom serializer is doing some conversion of the original event, and simply emits some avro using a DataFileWriter with the snappyCodec:

DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
dataFileWriter = new DataFileWriter<>(writer)
                    .setCodec(CodecFactory.snappyCodec())
                    .create(schema, out);

Multiple Flume ?

Be careful if multiple Flume agents are writing to the same location: the buffer is not shareable and you could have name conflicts. By default, Flume names the files with a timestamp (in milliseconds). That's fine most of the time, but you never know if they are going to collide one day. Consider having two different configurations with a different prefix or suffix for the filename.

Performance consideration

Don't forget to monitor your Flume agent when adding an HDFS sink. The overhead is noticeable: there are a lot more I/O threads (by default it's 10, but I noticed way more threads with VisualVM), and the CPU usage slightly increases.

Our Flume is now properly sinking to HDFS, let’s check it out.

HDFS

First thing is to verify the data are correct and readable.

We check the partitioning is present :

$ hdfs dfs -ls /user/flume/events/ymd=2016-01-29
/user/flume/events/ymd=2016-01-29/h=00
/user/flume/events/ymd=2016-01-29/h=01
...

Then we check a file to see if it’s parsable :

$ hdfs dfs -cat /user/flume/events/ymd=2016-03-14/h=15/events.1458147890641.avro
Objavro.schema�8{"type":"record","name":"OrderRecord","namespace":"com.company.avro","fields":[...

Our schema is there, it's not a SequenceFile, good! Let's use the avro tools to deserialize the content. It's simply a .jar with some useful functions (getschema, tojson), downloadable from the Apache mirrors:

$ curl -sLO http://apache.crihan.fr/dist/avro/avro-1.7.7/java/avro-tools-1.7.7.jar
$ java -jar avro-tools-1.7.7.jar getschema events.1458147890641.avro
{
  "type" : "record",
  "name" : "Order",
  "namespace" : "com.company",
  "fields" : [ {
    "name" : "type",
    "type" : "string"
  }, {
...
$ java -jar avro-tools-1.7.7.jar tojson logs.1458147890641.avro
{"type":"AD12","customer_id":2458189, ...
{"type":"AD12","customer_id":9515711, ...

Our HDFS is in place and gets new data streamed in. Let's now configure Hive to create a table on top of the files.

Hive

Avro is, fortunately, a standard in Hadoop, and Hive has everything needed to read avro files and create a table from them.

AvroSerDe

The magic happens when using the AvroSerDe (Avro Serializer/Deserializer). It is used to read avro files to create a table, and vice versa, to create avro files from a table (with some INSERT). It also detects and uncompresses the files when they are compressed with Snappy.

Under the hood, it’s simply using DataFile[Writer|Reader] code to read/write avro content.

https://cwiki.apache.org/confluence/display/Hive/AvroSerDe
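As an aside, here is a sketch of what that reading code looks like, handy to double check a file copied locally from HDFS (the file name is just an example; the Snappy codec is handled transparently as long as the snappy library is on the classpath):

import java.io.File
import scala.collection.JavaConverters._
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

val reader = new DataFileReader[GenericRecord](
  new File("events.1458147890641.avro"),
  new GenericDatumReader[GenericRecord]())

println(reader.getSchema)                            // the writer schema embedded in the file
reader.iterator().asScala.take(5).foreach(println)   // a few records, printed as GenericRecords
reader.close()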

Create the table

We create a Hive external table mapped onto our .avro files using the AvroSerDe, specifying an avro schema to read the data (it will probably be the same as the schema used to write the data, BUT it could be different, such as only a slice of it):

CREATE EXTERNAL TABLE avro_events
PARTITIONED BY (ymd STRING, h INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/flume/events/'
TBLPROPERTIES ('avro.schema.literal' = '
{
 "type" : "record",
 "name" : "Order",
 "namespace" : "com.company",
 "fields" : [ {
   "name" : "type",
   "type" : "string"
 } ]
}');

The PARTITIONED BY matches our structure and names :

$ hdfs dfs -ls /user/flume/events/ymd=2016-03-17/h=12
/user/flume/events/ymd=2016-03-17/h=12/events.1458212422229.avro
/user/flume/events/ymd=2016-03-17/h=12/events.1458212456756.avro
...

It's possible to externalize the schema into its own file on HDFS and use it:

TBLPROPERTIES ('avro.schema.url' = '/user/flume/Order.avsc');

You can generate the schema if you don’t have it (if it’s generated on the fly by the code with some Scala macros for instance):

$ java -jar avro-tools-1.7.7.jar getschema events.avro > Order.avsc
$ hdfs dfs -put Order.avsc /user/flume/

ERR: Long schemas

If you have a looong schema, therefore a long query, you could end up with this error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:javax.jdo.JDODataStoreException: Put request failed : INSERT INTO "TABLE_PARAMS" ("PARAM_VALUE","TBL_ID","PARAM_KEY") VALUES (?,?,?)
org.datanucleus.store.rdbms.exceptions.MappedDatastoreException: INSERT INTO "TABLE_PARAMS" ("PARAM_VALUE","TBL_ID","PARAM_KEY") VALUES (?,?,?)
Caused by: org.postgresql.util.PSQLException: ERROR: value too long for type character varying(4000)

You must use an external schema in this case, your query is too long.

If you're using a UI like Hue to run the query and notice some weird things, use the hive shell: the error will be properly displayed there, Hue won't show it.

ERR: Partition and field names conflicts

If you have a column in your data whose name is the same as one of the partition dimension names (ymd and h in our case), you are going to get this error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Partition column name hour conflicts with table columns.

You need to change the partition column name you are using.

Notify Hive of the new partitions

Our data are in HDFS, our Hive table is mapped onto it, good. But Hive needs to discover the data now; it's not automatic because we are short-circuiting it (we are inserting data without any Hive INSERT).

msck repair table to the rescue. It simply looks into the folder to discover new directories and adds them to the metadata.

hive> msck repair table avro_events;
OK
Partitions not in metastore: avro_events:ymd=2016-03-17/hour=12
Repair: Added partition to metastore avro_events:ymd=2016-03-17/h=12
Time taken: 2.339 seconds, Fetched: 1 row(s)

Fortunately, we are not going to do that manually each time; we are going to use Oozie later in the process, and do it directly via Impala.

Drop partitions

If you're playing around a bit and see some "Partitions missing from filesystem" when doing a msck repair, you can remove the partition metadata manually:

hive> alter table avro_events DROP IF EXISTS PARTITION (ymd='2016-03-21') PURGE;
Dropped the partition ymd=2016-03-21/h=13
Dropped the partition ymd=2016-03-21/h=14
Dropped the partition ymd=2016-03-21/h=15
Dropped the partition ymd=2016-03-21/h=16
Dropped the partition ymd=2016-03-21/h=17
Dropped the partition ymd=2016-03-21/h=18
Dropped the partition ymd=2016-03-21/h=19
Dropped the partition ymd=2016-03-21/h=20
Dropped the partition ymd=2016-03-21/h=21
Dropped the partition ymd=2016-03-21/h=22
Dropped the partition ymd=2016-03-21/h=23

Querying

You can already query the Hive table to get proper records.

For instance, if you have a column “timestamp”, it’s nice to check the latest ingested data time :

select max(`timestamp`) from avro_events;

Hive has some magic columns we can use to get more insights about the table content. To know how many records each partition contains, we can use the virtual column INPUT__FILE__NAME:

hive> select INPUT__FILE__NAME, COUNT(*)
FROM events
GROUP BY INPUT__FILE__NAME
ORDER BY INPUT__FILE__NAME;
hdfs://hadoop:8020/events/ymd=2016-03-22/h=00/...avro      910
hdfs://hadoop:8020/events/ymd=2016-03-22/h=00/...avro      1572
hdfs://hadoop:8020/events/ymd=2016-03-22/h=00/...avro      4884

That can come in handy if you just want to look at what's inside a specific file:

hive> SELECT COUNT(*) FROM avro_events
      WHERE INPUT__FILE__NAME = '...';

Because we are not crazy enough to query through Hive, let’s focus on querying the data through Impala to get blazing fast responses.

Impala

We can use Impala to query the avro table, but for performance reasons, we are going to export it into a Parquet table afterwards. This step is mostly here to show that it's possible.

Query the Hive avro table

If right now, in Impala, you do:

> show tables;

You won’t see the Hive table yet. You need to make Impala aware it’s there:

> INVALIDATE METADATA avro_events;

When it's done, you'll be able to query it, but still with frozen data. It's the same problem as with Hive.

For Impala to see the latest data of the existing partitions, we can use REFRESH :

> REFRESH avro_events;

But that still won't discover the new partitions (i.e. if Flume just created a new hour partition).

We have to use the equivalent of Hive’s msck repair table, but for Impala:

> ALTER TABLE avro_events RECOVER PARTITIONS;

This RECOVER PARTITIONS will do what REFRESH does (see the latest data of the existing partitions), but will discover the new ones too. I don’t know the impact and process time on big tables with tons of partitions.

Query a Parquet table

Because we want to be taken seriously, we want to store our data in Parquet files to get fast queries. Parquet stores the data in columns rather than in rows, supporting über-fast filtering because it doesn't need to parse every row.

First, we need to create a partitioned Impala table stored in Parquet :

CREATE TABLE avro_events_as_parquet (type STRING, ...)
PARTITIONED BY(ymd STRING, h INT)
STORED AS PARQUETFILE;

It doesn't have to follow the same partitioning as the Hive table, but for the sake of clarity, it does.

It's empty; we are going to fill it from the avro table. To do that, we are going to base our partition logic on a "timestamp" column you should have in your data. We can't retrieve the partition values from the Hive table's avro folder names because they are not queryable.

-- We ensure we're viewing the latest partitions of the Hive table
-- where Flume is sinking its content

ALTER TABLE avro_events RECOVER PARTITIONS;
REFRESH avro_events;

-- We insert the data overwriting the partitions we are going to
-- write into, to be sure we don't append stuff to existing (that
-- would do duplicates and wrong stats!)

INSERT OVERWRITE avro_events_as_parquet
PARTITION(ymd, h)
  SELECT type, ..., 
    FROM_UNIXTIME(FLOOR(`timestamp`/1000), 'yyyy-MM-dd') AS ymd,
    CAST(FROM_UNIXTIME(FLOOR(`timestamp`/1000), 'HH') AS INT) AS h
  FROM avro_events
  [ WHERE `timestamp` >= $min AND `timestamp` < $max ];

-- We compute the stats of the new partitions
COMPUTE INCREMENTAL STATS avro_events_as_parquet;

We specify the partition values by doing some transformations on our "timestamp" column (note: FROM_UNIXTIME uses the current time zone).

The WHERE is not necessary the first time: you just want to load everything. Later, when you have a scheduled job running every hour for instance, you'll want to filter which partition to write, hence the WHERE.
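To make the transformation concrete, here is the same ymd/h computation the INSERT does, written in plain Scala (assuming the "timestamp" column is in epoch milliseconds; UTC is used here only for the example):

import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

val ts  = 1458747000123L                                         // e.g. a "timestamp" value in millis
val utc = Instant.ofEpochMilli(ts).atZone(ZoneOffset.UTC)

val ymd = utc.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))  // "2016-03-23"
val h   = utc.getHour                                            // 15

println(s"partition: ymd=$ymd/h=$h")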

The end result is something like :

$ hdfs dfs -ls .../avro_events_as_parquet/ymd=2016-03-23/h=13
78.5 M avro_events_as_parquet/ymd=2016-03-23/h=13/644a2fbc8ad76fb2-2bd5c46c0a9a2cba_497985827_data.0.parq

A nice .parq file combining all the 5min-range avro files Flume created.

Now, because Flume is streaming, and because you want to query Impala without doing all the manual updates yourself, you’re planning a recurring Oozie job to take care of them.

Oozie

It's quite straightforward, it's just an encapsulation of the scripts we just used.

First, we define a coordinator running every hour that will write the previous hour's partition. (The coordinator could trigger the workflow every 5 minutes to have less lag in Impala if necessary; the same Parquet partition would just be overwritten 12 times per hour with more and more data each time.)

We take care of adding a small lag after the exact hour: instead of running the workflow at 12:00, we run it at 12:05, to be sure Flume has flushed its data (the 5 minutes are not random, it's the same value as Flume's rollInterval).

That means we can define a property in the coordinator.xml like that:

<property>
  <name>partitionDate</name>
  <value>${coord:dateOffset(coord:dateOffset(coord:nominalTime(), -1, 'HOUR'), -5, 'MINUTE')}</value>
</property>
If the workflow runs at 12:05, then partitionDate=2016-03-31T11:00Z. The partition ymd=2016-03-31/h=11 will contain all the data in [11:00, 12:00).

Then, in the workflow, we create an action (a shell action, for instance) where we pass this value:

<shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>

    <exec>${insertOverwriteScript}</exec>
    <argument>${partitionDate}</argument>
    ...
</shell>

And the script uses it to create the WHERE condition we just talked about in the Impala part, using some shell transformations (with what we did, we just have a plain date with a timezone, but we need at least two timestamps, min and max, or any other time value we could use from our data if we had it, such as "day" or "hour").
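As an illustration of that step (only a sketch; the real job does it in shell, and the variable names are made up), the bounds for the [11:00, 12:00) window could be derived like this:

import java.time.{Duration, Instant}

// partitionDate as produced by the coordinator, e.g. 2016-03-31T11:00Z (normalized to a full ISO instant here)
val partitionDate = Instant.parse("2016-03-31T11:00:00Z")

val min = partitionDate.toEpochMilli                            // inclusive lower bound
val max = partitionDate.plus(Duration.ofHours(1)).toEpochMilli  // exclusive upper bound

println(s"WHERE `timestamp` >= $min AND `timestamp` < $max")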

I assume you know Oozie and what you're doing, so I won't provide the full scripts here.

Improvements

I think the stack could be simplified by using a serializer to export directly into Parquet on HDFS, but that would still create a bunch of tiny parquet files (because we want to flush often), so we would still need to merge them automatically at the end. What do you think?

Conclusion

I hope I made myself clear and that the process makes sense.

Don't forget we did all that just because we didn't want to use any database, but only HDFS to store our data, for our own reasons. We wanted to be able to query quickly through Impala to have "almost-realtime" data (with a coordinator every 5 minutes, for instance, that would do it).

Another solution would be to sink Flume into HBase then query over it, or create a Hive/Impala table on top.

Oh, this post is missing some gifs.


HBase merge and split impact in HDFS

December 24th, 2015 | cloudera, hadoop, hbase, hdfs |

Why merge ?

I had a table with a lot of regions (almost a thousand), more than I wanted, and more than I should have, according to the HBase book.
The max size of the HFiles, hbase.hregion.max.filesize, was 10G.
I raised it to 20G and recompacted the whole table, thinking I was done. But nothing happened.
Why? Because HBase does not merge regions automatically.
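For reference, the setting was raised globally (through Cloudera in my case), but the equivalent per-table override plus the recompaction can be done from the HBase shell, something like this (the table name is just an example):

hbase> alter 'my_table', MAX_FILESIZE => '21474836480'
hbase> major_compact 'my_table'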

Compaction is used to merge StoreFiles within a same HStore (one HStore per column family, per region).
A region that exists is going to exist forever, unless we delete it or merge it manually.

I then decided to merge some regions, first to give it a try, and second to see the impact in HDFS, because I'm curious.
If you are wondering too, you're in the right place, keep reading. It's not complex.

I'll first do a merge without raising hbase.hregion.max.filesize, to see what happens.
Then I’ll raise the max and do another merge, and check the differences.

HBase version : 1.0.0-cdh5.4.7
Hadoop Version : 2.6.0-cdh5.4.7

Merge still at 10G max

First, you need to find 2 consecutive regions to merge together.
The adjacency is important: you can merge regions that are not consecutive, but it's not recommended (it creates overlaps).
e.g.: if you have 2 regions whose start/end keys are 0-9 and 9-A, you want to create a region whose start/end keys are 0-A.

In HDFS, there is no order, it's all GUIDs. To know what they correspond to, one way is to go to the HBase admin UI and select the table.
That will display each of its regions' name, uuid, and start/end keys.

http://hadoopmaster:60010/table.jsp?name=my_table
# or :16010 if recent

A region name is something like :

my_table,0115d0b6f99f58a34...2a9e72781c7,1440840915183.fbb00c100422d1cc0f9b7e39d6c6bd91.
# meaning:
[table],[start key],[timestamp].[encoded ID]

The encoded ID is what we are interested in. It is the folder in HDFS (/hbase/data/default/my_table/fbb00c100422d1cc0f9b7e39d6c6bd91) where the data of this region lives.

Let’s merge it with its following one.

hbase> merge_region 'fbb00c100422d1cc0f9b7e39d6c6bd91', 'a12acd303c0b7e512c8926666c5f02eb'

That creates a new region 65bd... that contains an HFile whose size is growing slowly, as we can see in HDFS:
(here is a diff from before and after the merge_region)

>          0 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a
>        226 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/.regioninfo
>          0 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/.tmp
> 2684354560 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/.tmp/752530e58ae8478d812696b066edcc9f
>          0 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/recovered.edits
>          0 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/recovered.edits/2206186528.seqid
>          0 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/t
>        109 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/t/ccd883e710664f1fbf605590deaf2868.a12acd303c0b7e512c8926666c5f02eb
>        109 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/t/e17b4ea9b9fa47c1839999426ef9ffe7.fbb00c100422d1cc0f9b7e39d6c6bd91

<          0 2015-12-23 12:13 .../my_table/a12acd303c0b7e512c8926666c5f02eb/recovered.edits
<          0 2015-12-23 12:13 .../my_table/a12acd303c0b7e512c8926666c5f02eb/recovered.edits/2198106631.seqid
---
>          0 2015-12-23 12:24 .../my_table/a12acd303c0b7e512c8926666c5f02eb/recovered.edits
>          0 2015-12-23 12:24 .../my_table/a12acd303c0b7e512c8926666c5f02eb/recovered.edits/2198106637.seqid

<          0 2015-12-23 11:45 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91
---
>          0 2015-12-23 12:24 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91
>          0 2015-12-23 12:24 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91/.merges

<          0 2015-12-23 12:13 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91/recovered.edits
<          0 2015-12-23 12:13 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91/recovered.edits/2206186546.seqid
---
>          0 2015-12-23 12:24 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91/recovered.edits
>          0 2015-12-23 12:24 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91/recovered.edits/2206186549.seqid

What we can see:
– a new region in the folder 65b... with an HFile in .tmp (2.7GB, growing) and a .regioninfo (very important, that's the metadata identifying what this region is)
– a new empty folder .merges in one of the regions we are merging
– recovered.edits folders. Don't mind them, I won't display them anymore. For more info, check this nice Cloudera blog post about them.

After a few minutes, it was done, the HFile had grown to 17GB (which was over the limit of 10GB).
HBase then started the reverse process: it split the big region I had just made! :-(

>           0 2015-12-23 13:05 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40
>         226 2015-12-23 13:05 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/.regioninfo
>           0 2015-12-23 13:05 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/t
>         109 2015-12-23 13:05 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/t/752530e58ae8478d812696b066edcc9f.65bd82b5477fcc2090804c351d89700a

>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a
>         226 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/.regioninfo
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/.splits
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/.tmp
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/t
> 17860937303 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/t/752530e58ae8478d812696b066edcc9f

>           0 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8
>         226 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/.regioninfo
>           0 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/.tmp
>   134217728 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/.tmp/e377603958894f8ca1ec598112b95bf4
>           0 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/t
>         109 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/t/752530e58ae8478d812696b066edcc9f.65bd82b5477fcc2090804c351d89700a

<           0 2015-12-23 11:45 .../my_table/a12acd303c0b7e512c8926666c5f02eb

<           0 2015-12-23 11:45 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91

– the two old regions are removed (a12... and fbb...)
– the split region has a .splits folder
– 2 new regions appeared: 2c1... and 743...
– only one of these 2 regions has an HFile that is slowly growing (meaning: sequential process)

Meanwhile, in the logs…

// 2 new regions from a SPLIT
2015-12-23 13:05:32,817 INFO org.apache.hadoop.hbase.master.RegionStates: Transition null to {2c142664dc0929d7c6cc5fa6fe3b4e40 state=SPLITTING_NEW, ts=1450872332817, server=hadoopslave04,60020,1450869198826}
2015-12-23 13:05:32,817 INFO org.apache.hadoop.hbase.master.RegionStates: Transition null to {743bfa035be56bf412d00803abe433b8 state=SPLITTING_NEW, ts=1450872332817, server=hadoopslave04,60020,1450869198826}

// the region we are splitting was OPEN
// it goes to SPLITTING then SPLIT, and is set offline for the time being
2015-12-23 13:05:32,817 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {65bd82b5477fcc2090804c351d89700a state=OPEN, ts=1450869854560, server=hadoopslave04,60020,1450869198826} to {65bd82b5477fcc2090804c351d89700a state=SPLITTING, ts=1450872332817, server=hadoopslave04,60020,1450869198826}
2015-12-23 13:05:34,767 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {65bd82b5477fcc2090804c351d89700a state=SPLITTING, ts=1450872334767, server=hadoopslave04,60020,1450869198826} to {65bd82b5477fcc2090804c351d89700a state=SPLIT, ts=1450872334767, server=hadoopslave04,60020,1450869198826}
2015-12-23 13:05:34,767 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 65bd82b5477fcc2090804c351d89700a from hadoopslave04,60020,1450869198826

// both 2 new regions switch from SPLITTING_NEW to OPEN
2015-12-23 13:05:34,767 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {2c142664dc0929d7c6cc5fa6fe3b4e40 state=SPLITTING_NEW, ts=1450872334767, server=hadoopslave04,60020,1450869198826} to {2c142664dc0929d7c6cc5fa6fe3b4e40 state=OPEN, ts=1450872334767, server=hadoopslave04,60020,1450869198826}
2015-12-23 13:05:34,767 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {743bfa035be56bf412d00803abe433b8 state=SPLITTING_NEW, ts=1450872334767, server=hadoopslave04,60020,1450869198826} to {743bfa035be56bf412d00803abe433b8 state=OPEN, ts=1450872334767, server=hadoopslave04,60020,1450869198826}

// daughter a and b = new regions with start keys; the parent being the split region
2015-12-23 13:05:34,873 INFO org.apache.hadoop.hbase.master.AssignmentManager: Handled SPLIT event; parent=my_table,fe7f...,1450869853820.65bd82b5477fcc2090804c351d89700a., daughter a=my_table,fe7f...,1450872332556.2c142664dc0929d7c6cc5fa6fe3b4e40., daughter b=my_table,feff7...,1450872332556.743bfa035be56bf412d00803abe433b8., on hadoopslave04,60020,1450869198826

// then the split region saw it reference in the metadata (zookeeper) removed
2015-12-23 13:08:28,965 INFO org.apache.hadoop.hbase.MetaTableAccessor: Deleted references in merged region my_table,fe7f...,1450869853820.65bd82b5477fcc2090804c351d89700a., qualifier=mergeA and qualifier=mergeB

Back to HDFS

After a while, HBase is done with the daughter b region 743b..., and starts creating the daughter a region 2c14....

>           0 2015-12-23 13:25 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40
>         226 2015-12-23 13:05 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/.regioninfo
>           0 2015-12-23 13:41 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/.tmp
>           0 2015-12-23 13:41 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/t
>  8732040437 2015-12-23 13:41 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/t/2388513b0d55429888478924914af494

>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a
>         226 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/.regioninfo
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/.splits
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/.tmp
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/t
> 17860937303 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/t/752530e58ae8478d812696b066edcc9f

>           0 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8
>         226 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/.regioninfo
>           0 2015-12-23 13:25 .../my_table/743bfa035be56bf412d00803abe433b8/.tmp
>           0 2015-12-23 13:25 .../my_table/743bfa035be56bf412d00803abe433b8/t
>  8733203481 2015-12-23 13:25 .../my_table/743bfa035be56bf412d00803abe433b8/t/e377603958894f8ca1ec598112b95bf4

It’s done. The region has been successfully split.
After a few minutes, the big region 65bd... will be removed automatically.

2015-12-23 13:43:28,908 INFO org.apache.hadoop.hbase.MetaTableAccessor: Deleted my_table,fe7f...,1450869853820.65bd82b5477fcc2090804c351d89700a.
2015-12-23 13:43:28,908 INFO org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 722 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)

Note: if we do the difference between the sum of the daughter region sizes and the big one, we get a delta of +395MB (the single HFile is bigger).

We've successfully merged 2 regions into one, which was then automatically split back into two. Hurray!

Raising hbase.hregion.max.filesize to avoid splitting

Now, let's change hbase.hregion.max.filesize to 20G in Cloudera and merge again to get a big region, without a split this time.

We apply the same process as before and manually merge the 2 regions we got previously, 2c14... and 743b....
That creates a new region 1e64... whose size is surprisingly smaller than in our previous merge (we only get a delta of 212KB), and which is not going to be split.

$ hdfs dfs -ls -R /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7
-rw-r--r--   3 hbase hbase         226 2015-12-23 13:45 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/.regioninfo
drwxr-xr-x   - hbase hbase           0 2015-12-23 14:12 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/.tmp
drwxr-xr-x   - hbase hbase           0 2015-12-23 13:48 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/recovered.edits
-rw-r--r--   3 hbase hbase           0 2015-12-23 13:48 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/recovered.edits/2206186536.seqid
drwxr-xr-x   - hbase hbase           0 2015-12-23 14:12 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/t
-rw-r--r--   3 hbase hbase 17465031518 2015-12-23 14:12 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/t/d1109c52de404b0c9d07e2e9c7fdeb5e

So that worked, I have one less region! Let's continue with the thousand others I have.

Why should I know this ?

Knowing what's going on in HDFS with HBase is important when you are facing issues and errors in the HBase table structure.
To know if you have that kind of issue, just give it a try with:

$ hbase hbck my_table

If you see some ERRORS such as :
– No HDFS region dir found
– Region not deployed on any region server.
– Region found in META, but not in HDFS or deployed on any region server.
– First region should start with an empty key. You need to create a new region and regioninfo in HDFS to plug the hole.
– You need to create a new .regioninfo and region dir in hdfs to plug the hole.
– ERROR: Found lingering reference file hdfs://…

That's going to be a bit useful. (And maybe complicated for you, that depends on the problem!)

That can happen quite easily, unfortunately (I successfully ran into those issues just with a merge_region, not sure why exactly).