Webpack Hot Reloading and React

December 29th, 2015 | hot reloading, javascript, react, webpack |

This post is an extract of a GitHub repo I’m working on, chtefi/react-stack-step-by-step, which explains step-by-step, from scratch, a full React stack.

– You already use HR?

If you have a project using webpack and React, you have probably followed some tutorials and blog posts to add the Hot Reloading piece.

Here, we are going to explain what’s going on under the hood, and how the HR projects/plugins you installed work together (webpack HR middlewares, react-transform…).

We are going to use the latest React HR project to date, react-transform-hmr, and not the older react-hot-loader, which has been sentenced to the maximum penalty: the-death-of-react-hot-loader. I guess it still works properly, but you know the Javascript community is always on the bleeding edge!

– You don’t use HR yet?

If the HR part does not sound familiar to you, that’s fine: I’m going to explain step-by-step why HR is a must-have nowadays, and how to install and use it in your project.

 

Let’s just start by reviewing why everybody is now talking about HR (beyond Javascript).

Why should we enjoy Hot Reloading?

In one word: productivity.
In another word: DX (Developer eXperience).

Some more explanations:

– you broke your F5 key years ago, you can’t refresh anymore
– you don’t want to develop blindly, then refresh, then try again; that’s a backend thing. We have a UI, it’s alive
– you don’t want to lose your application state (by doing a full refresh) when you just fixed a typo or a color
– it can display your compilation and runtime errors directly in the browser, in the exact spot in the UI where the code is used
– you can do HR with Javascript, CSS, anything; we’ll just focus on React components here
– no caveats

As soon as you put it in place, it will work forever without any modification, for any future file you’re going to add. It’s just some pipes to plug together to make it work, and it lasts forever.

2015-12-28: actually, there is one caveat (!): the React HR plugin we are going to use does not handle the stateless functional components created with the simpler React v0.14 syntax: const App = (props) => <div>Hello {props.name}</div>;.
Refer to the comment.

Now, let’s tackle the packages we need to install to use it in a nodejs project.

What packages to install to use HR?

Let’s suppose you already have a base project using:

– webpack to compile your javascript bundle(s)
– React for your frontend, of course
– a nodejs server running an HTTP server such as expressjs to serve your static content (html, js, css…) while you’re developing

You now want to experiment with HR.

webpack to the rescue

webpack is actually the main actor dealing with HR: it already exposes an API to handle part of the HR pipeline.

We just need to add some wrappers around it to use its API, and some more logic to manage the React components’ state: you don’t want to lose the current state of your components when you change something in a js file (a style, some constant, a prop, adding a React component inside an existing one, etc.).

We need to install 4 packages: 2 for webpack, 2 for React.

webpack-dev-middleware
– a classic expressjs middleware, where requests are passed on
– it automatically watches the sources and recompiles the javascript bundle server-side when a javascript source has changed
– it always serves an up-to-date bundle

webpack-hot-middleware
– a classic expressjs middleware, where requests are passed on
– it automatically subscribes to the bundle recompilation events (such as “start” and “done”) to notify the frontend that something has changed and that it needs to update itself
– it uses SSE (Server-Sent Events) to communicate with the frontend

Specific packages for React HR

babel-plugin-react-transform
– it can add any code around React component methods during the Babel compilation from ES6+JSX to ES5. You should already have configured your babel loader in webpack, such as:

 module: {
    loaders: [{
      test: /\.js$/,
      loader: 'babel',
      include: path.join(__dirname, 'src'),
    }]
  },

react-transform-hmr
– it is used by babel-plugin-react-transform to add the specific code around the React components that properly handles HR and their current state

That gives us:

$ npm install --save-dev webpack-dev-middleware
$ npm install --save-dev webpack-hot-middleware
$ npm install --save-dev babel-plugin-react-transform@beta
$ npm install --save-dev react-transform-hmr
# or, all together for copy/paste:
$ npm i -D webpack-dev-middleware webpack-hot-middleware babel-plugin-react-transform@beta react-transform-hmr

2015-12-28: we explicitly ask for the beta (>=2.0.0) of babel-plugin-react-transform because, for now, the latest published version does not work with Babel 6. But the work has been done and is just waiting to be merged.

 

Now that we have installed the necessary packages, it’s time to configure them.

Configure Babel

We need to configure Babel (we are going to use .babelrc) so that babel-plugin-react-transform and react-transform-hmr add the HR code around our components.

The most basic configuration is:

{
  "presets": ["react", "es2015"],
  "env": {
    "development": {
      "plugins": [
        ["react-transform", {
          "transforms": [{
            "transform": "react-transform-hmr",
            "imports": ["react"],
            "locals": ["module"]
          }]
        }]
      ]
    }
  }
}

Basically:
– it adds the transform babel-plugin-react-transform for the development NODE_ENV only
– this transform retrieves all the React components it can find in the source code
– it passes them down to each of the processors defined in "transforms" to let them add their custom code (in our case, react-transform-hmr adds the HR code)

 

For the record, babel-plugin-react-transform handles as many "transforms" as we want. They are just called one after another. For instance:

{
  "presets": ["react", "es2015"],
  "env": {
    "development": {
      "plugins": [
        ["react-transform", {
          "transforms": [{
            "transform": "react-transform-hmr",
            "imports": ["react"],
            "locals": ["module"]
          }, {
            "transform": "react-transform-catch-errors",
            "imports": ["react", "redbox-react"]
          }]
        }]
      ]
    }
  }
}

The React component code will pass through react-transform-hmr, then through react-transform-catch-errors.
Each of them will add its code around each component.

FYI, the latter is used to catch errors thrown in the render() method of the React components. It then uses its "imports" property to redirect the error to a visual React component: here, redbox-react displays a big red screen of death with the stacktrace, for instance. But it could be anything else.

Basically, react-transform-catch-errors just adds try { render() } catch (e) { ... } around the original render() method of your components, and in the catch, it returns the React component you gave in "imports". Makes sense, right?
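Conceptually, the wrapping looks like this (a simplified sketch, not the plugin’s actual generated code; `wrapRender` and the error element shape are hypothetical, just to show the control flow):

```javascript
// Simplified sketch of what react-transform-catch-errors conceptually does.
// "createErrorElement" stands for rendering the component given in "imports"
// (e.g. redbox-react); the real plugin generates different code.
function wrapRender(originalRender, createErrorElement) {
  return function () {
    try {
      return originalRender.call(this);
    } catch (error) {
      // instead of crashing, render the error-reporting component
      return createErrorElement(error);
    }
  };
}

// Toy usage without React, to show the control flow:
var component = {
  render: function () { throw new Error('boom'); }
};
component.render = wrapRender(component.render, function (error) {
  return { type: 'ErrorReporter', props: { error: error } };
});

var result = component.render();
console.log(result.props.error.message); // "boom"
```

A render that does not throw is passed through untouched, so the wrapper is invisible in the happy path.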

Now our React code is ready to handle HR.
We now have to make the server tell the browser that the code has changed and that it needs to update.

Handle server/client communication to send/receive HR updates

Bundle recompilation on the fly

First, we need to make the server aware that the source code has changed, so it can recompile the bundle and then notify the browser.
That’s the role of webpack-dev-middleware and webpack-hot-middleware.

– webpack-dev-middleware automatically starts watching the source code for changes and recompiles the bundle
– webpack-hot-middleware is notified when a new bundle is compiled and notifies the browser

We just need to plug them into expressjs as middlewares to start them:

var express = require('express');
var webpack = require('webpack');
var path = require('path');

var webpackDevMiddleware = require("webpack-dev-middleware");
var webpackHotMiddleware = require("webpack-hot-middleware");
var webpackConfig = require('../webpack.config');

var app = express();

var compiler = webpack(webpackConfig);
app.use(webpackDevMiddleware(compiler));
app.use(webpackHotMiddleware(compiler));
app.use(express.static('src'));

app.listen(3000);

But how is the browser going to handle the updates? That’s where webpack itself comes into play.

Browser live update

We need to add some code client-side to deal with it; otherwise, it’s not possible, since HR is not native to the browser or anything.

Therefore, to inject this extra code, we will use the webpack bundle entry point in webpack.config.js and add another entry.
A bundle can have several entry points.
They just tell webpack: “hey, resolve the import dependency tree (to make the bundle) starting from those files!”.

entry: [
  'webpack-hot-middleware/client',
  path.join(__dirname, 'src', 'App.js'),
],

The entry point webpack-hot-middleware/client simply refers to the file node_modules/webpack-hot-middleware/client.js.
It contains the code that will be used in the browser to handle the SSE communication with the server (to intercept the update notifications).

Then we need to add a specific webpack internal plugin, HotModuleReplacementPlugin, to expose the generic webpack HR API in the browser:

plugins: [
  new webpack.optimize.OccurenceOrderPlugin(), // recommended by webpack
  new webpack.HotModuleReplacementPlugin(),
  new webpack.NoErrorsPlugin() // recommended by webpack
]

This API will be used by the code injected from webpack-hot-middleware/client when it receives “update” events through SSE (specifically, it will call module.hot.apply(..) provided by HotModuleReplacementPlugin).
Still following?
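To give an idea of the shape of that API, here is a hedged sketch: `module.hot.accept` is webpack’s real HR API, but `registerHotHandler` and the callback below are hypothetical helpers, and `module.hot` only exists when HotModuleReplacementPlugin is active (hence the guard):

```javascript
// Sketch of using the webpack HR API exposed by HotModuleReplacementPlugin.
// In a real bundle, `module.hot` is injected by webpack; we guard for it.
function registerHotHandler(hot, moduleId, onUpdate) {
  if (!hot) {
    return false; // HR not enabled: nothing to do
  }
  // accept updates for the given dependency and run our callback
  hot.accept(moduleId, onUpdate);
  return true;
}

// In a bundle this would be: registerHotHandler(module.hot, './App', rerender)
// Toy demonstration with a fake `hot` object:
var accepted = [];
var fakeHot = { accept: function (id, cb) { accepted.push(id); cb(); } };
var updates = 0;
registerHotHandler(fakeHot, './App', function () { updates++; });
console.log(accepted[0], updates); // ./App 1
```

The React-specific plugins we installed generate this kind of plumbing for us, which is why we never write it by hand.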

 

Nothing more to do, you’re good to go!

See it in action

Process:
– start your nodejs server
– open the page using the bundle in your browser
– go to your code editor and change some javascript bits
– see the live update!
– if not, check the browser console for some nice error messages

Behind the scenes:
– the server automatically recompiles the bundle when some javascript code is updated and notifies the browser via SSE to update itself
– the bundle used in the browser contains some SSE code to intercept those notifications, some generic HR code to “patch” the javascript, and some custom HR code for the React components so they don’t lose their current state

A boilerplate on github

You can check out this boilerplate from @dan_abramov to give it a try: https://github.com/gaearon/react-transform-boilerplate.
It’s very simple and does exactly what we just talked about.
You will see a few more options used, but nothing fancy.

Its .babelrc uses react-hmre instead of our two transform packages (babel-plugin-react-transform@beta and react-transform-hmr), but it is actually exactly the same: react-hmre simply encapsulates them.

More bits to learn

Let’s explain in more depth some aspects of the code we used.

webpack-dev-middleware is optional, but…

Without webpack-dev-middleware, you just need to launch the webpack watch yourself:

// app.use(webpackDevMiddleware(compiler));
// replace the dev-middleware with a simple watch() (args are mandatory)
compiler.watch({}, function(){});
app.use(webpackHotMiddleware(compiler));

Because webpackHotMiddleware subscribes to the bundle compilation events (no matter what started the compilation), it will still work.

But you’ll suffer some consequences: a bunch of .js and .json files will appear in your project each time a compilation occurs.

They contain the deltas sent to the client to update itself (webpack only sends the updated chunks, not the whole bundle each time). The advantage of using webpack-dev-middleware is that you won’t see those files: it handles them in-memory. That’s why you want to install this particular package too.

babel-plugin-react-transform and react-transform-hmr

Without the code added by react-transform-hmr, webpack would not be able to hot-update the React components, and you would get this in the browser console:

[HMR] bundle rebuilding
[HMR] bundle rebuilt in 160ms
[HMR] Checking for updates on the server...
[HMR] The following modules couldn't be hot updated: (Full reload needed)
[HMR]  - ./src/App.js

What’s inside the SSE?

First, the browser initializes an SSE request (thanks to the code webpack-hot-middleware/client.js injected into the bundle) on a specific url: GET localhost:3000/__webpack_hmr (it never returns). It’s handled server-side by the webpack-hot-middleware expressjs middleware, which knows it’s SSE.

Then, when a Javascript file is edited, webpack-dev-middleware (which started a watch on the sources) recompiles the bundle, webpack-hot-middleware is notified (it subscribed to the recompilation events), and it notifies the frontend via SSE with a new module map (used by webpack), such as:

data: {"action":"building"}

// few ms later ...

data: {"action":"built","time":260,"hash":"6b625811aa23ea1ec259","warnings":[],"errors":[],"modules":{"0":"multi main","1":"./~/fbjs/lib/invariant.js","2":"./~/react/lib/Object.assign.js","3":"./~/fbjs/lib/warning.js","4":"./~/fbjs/lib/ExecutionEnvironment.js","5":"./~/react/lib/ReactMount.js","6":"./~/react/lib/ReactElement.js", ...
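To illustrate how the client side can react to those payloads, here is a toy handler (purely illustrative: the real webpack-hot-middleware/client.js uses EventSource and does much more) that parses an SSE `data:` line and dispatches on the `action` field:

```javascript
// Toy parser for the SSE messages shown above (illustrative only;
// the real client in webpack-hot-middleware/client.js is more involved).
function handleSseLine(line) {
  if (line.indexOf('data: ') !== 0) {
    return null; // not a data line (SSE also carries comments, retries, etc.)
  }
  var payload = JSON.parse(line.slice('data: '.length));
  switch (payload.action) {
    case 'building':
      return 'server is rebuilding the bundle';
    case 'built':
      return 'bundle rebuilt in ' + payload.time + 'ms, hash ' + payload.hash;
    default:
      return 'unknown action: ' + payload.action;
  }
}

console.log(handleSseLine('data: {"action":"building"}'));
// "server is rebuilding the bundle"
console.log(handleSseLine('data: {"action":"built","time":260,"hash":"6b6258"}'));
// "bundle rebuilt in 260ms, hash 6b6258"
```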

The frontend then asks for these two files:
(I guess the hash used in the GET comes from the SSE data? I didn’t check)

GET localhost:3000/0119cbdcd4c2cf8d27c2.hot-update.json

{"h":"6b625811aa23ea1ec259","c":[0]}

GET localhost:3000/0.0119cbdcd4c2cf8d27c2.hot-update.js

webpackHotUpdate(0,{

/***/ 97:
/***/ function(module, exports, __webpack_require__) {

  /* WEBPACK VAR INJECTION */(function(module) {'use strict';

  ...

Those particular urls are served by webpack-dev-middleware, which keeps those files (*.hot-update.js[on]) in memory and serves them like any classic static file. They are requested by the code injected by webpack.HotModuleReplacementPlugin(), which handles the responses and hot-updates the javascript with the new code.

Conclusion

Before I dug into it, I found HR a bit complicated: how does it work, who are the actors, how do they communicate, etc. I was never sure which package was used for what exactly, and when. I hope these explanations were clear enough and that you now know as much as me! (or more!) Don’t hesitate to add more details or correct me if I’m wrong.

Finally, it’s not that complex, it just needs a lot of “pipes” to plug everything together. But it all makes sense.

Unfortunately, there are still some caveats with the setup I exposed:
– it doesn’t work for stateless functional React components such as const App = (props) => <div>Hello {props.name}</div>;
– babel-plugin-react-transform is simply an experiment, maybe it will die
– work is in progress to handle HR at the function level directly, to make it more generic
– webpack is an awesome tool

HBase merge and split impact in HDFS

December 24th, 2015 | cloudera, hadoop, hbase, hdfs |

Why merge ?

I had a table with a lot of regions (almost a thousand), more than I wanted, and more than I should have, according to the HBase book.
The max size of the HFile hbase.hregion.max.filesize was 10G.
I raised it to 20G and recompacted the whole table, thinking I was done. But nothing happened.
Why? Because HBase does not merge regions automatically.

Compaction is used to merge StoreFiles within a same HStore (one HStore per column family, per region).
A region that exists is going to exist forever, unless we delete it or merge it manually.

I then decided to merge some regions, first to give it a try, and second to see the impact in HDFS, because I’m curious.
If you are wondering too, you’re in the right place, keep reading. It’s not complex.

I’ll first do a merge without raising hbase.hregion.max.filesize, to see what happens.
Then I’ll raise the max, do another merge, and check the differences.

HBase version: 1.0.0-cdh5.4.7
Hadoop version: 2.6.0-cdh5.4.7

Merge still at 10G max

First, you need to find 2 consecutive regions to merge together.
Consecutiveness is important: you can merge regions that are not consecutive, but it’s not recommended (it creates overlapping).
e.g.: if you have 2 regions whose start/end keys are 0-9 and 9-A, you want to create a region whose start/end keys are 0-A.

In HDFS, there is no order, it’s all guids. To know what they correspond to, one way is to go to the HBase admin and select the table.
That will display each of its regions’ name, uuid, and start/end keys.

http://hadoopmaster:60010/table.jsp?name=my_table
# or :16010 if recent

A region name is something like :

my_table,0115d0b6f99f58a34...2a9e72781c7,1440840915183.fbb00c100422d1cc0f9b7e39d6c6bd91.
# meaning:
[table],[start key],[timestamp].[encoded ID]

The encoded ID is what we are interested in. It is the folder in HDFS (/hbase/data/default/my_table/fbb00c100422d1cc0f9b7e39d6c6bd91) where the data of this region is stored.
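To make the naming scheme concrete, here is a tiny illustrative parser (the shortened start key in the example is a placeholder, and this is just for demonstration, not anything HBase ships):

```javascript
// Split an HBase region name into [table],[start key],[timestamp].[encoded ID]
// Illustrative parser for the naming scheme described above.
function parseRegionName(name) {
  var parts = name.split(',');
  var table = parts[0];
  var startKey = parts[1];
  // the remainder is "<timestamp>.<encoded ID>." (note the trailing dot)
  var rest = parts.slice(2).join(',');
  var dot = rest.indexOf('.');
  return {
    table: table,
    startKey: startKey,
    timestamp: rest.slice(0, dot),
    encodedId: rest.slice(dot + 1).replace(/\.$/, '')
  };
}

var info = parseRegionName(
  'my_table,0115d0b6f99f,1440840915183.fbb00c100422d1cc0f9b7e39d6c6bd91.');
console.log(info.encodedId); // "fbb00c100422d1cc0f9b7e39d6c6bd91"
```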

Let’s merge it with the region that follows it.

hbase> merge_region 'fbb00c100422d1cc0f9b7e39d6c6bd91', 'a12acd303c0b7e512c8926666c5f02eb'

That creates a new region 65bd... containing a HFile whose size is growing slowly, as we can see in HDFS:
(here is a diff from before and after the merge_region)

>          0 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a
>        226 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/.regioninfo
>          0 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/.tmp
> 2684354560 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/.tmp/752530e58ae8478d812696b066edcc9f
>          0 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/recovered.edits
>          0 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/recovered.edits/2206186528.seqid
>          0 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/t
>        109 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/t/ccd883e710664f1fbf605590deaf2868.a12acd303c0b7e512c8926666c5f02eb
>        109 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/t/e17b4ea9b9fa47c1839999426ef9ffe7.fbb00c100422d1cc0f9b7e39d6c6bd91

<          0 2015-12-23 12:13 .../my_table/a12acd303c0b7e512c8926666c5f02eb/recovered.edits
<          0 2015-12-23 12:13 .../my_table/a12acd303c0b7e512c8926666c5f02eb/recovered.edits/2198106631.seqid
---
>          0 2015-12-23 12:24 .../my_table/a12acd303c0b7e512c8926666c5f02eb/recovered.edits
>          0 2015-12-23 12:24 .../my_table/a12acd303c0b7e512c8926666c5f02eb/recovered.edits/2198106637.seqid

<          0 2015-12-23 11:45 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91
---
>          0 2015-12-23 12:24 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91
>          0 2015-12-23 12:24 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91/.merges

<          0 2015-12-23 12:13 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91/recovered.edits
<          0 2015-12-23 12:13 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91/recovered.edits/2206186546.seqid
---
>          0 2015-12-23 12:24 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91/recovered.edits
>          0 2015-12-23 12:24 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91/recovered.edits/2206186549.seqid

What we can see:
– a new region in the folder 65bd... with a HFile in .tmp (2.7GB, growing) and a .regioninfo file (very important: that’s the metadata identifying this region)
– a new empty folder .merges in one of the regions we are merging
– recovered.edits folders. Don’t mind them, I won’t display them anymore. For more info, check this nice Cloudera blog post.

After a few minutes, it was done, and the HFile had grown to 17GB (which was over the 10GB limit).
HBase then started the reverse process: it split the big region I had just made! :-(

>           0 2015-12-23 13:05 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40
>         226 2015-12-23 13:05 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/.regioninfo
>           0 2015-12-23 13:05 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/t
>         109 2015-12-23 13:05 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/t/752530e58ae8478d812696b066edcc9f.65bd82b5477fcc2090804c351d89700a

>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a
>         226 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/.regioninfo
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/.splits
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/.tmp
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/t
> 17860937303 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/t/752530e58ae8478d812696b066edcc9f

>           0 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8
>         226 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/.regioninfo
>           0 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/.tmp
>   134217728 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/.tmp/e377603958894f8ca1ec598112b95bf4
>           0 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/t
>         109 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/t/752530e58ae8478d812696b066edcc9f.65bd82b5477fcc2090804c351d89700a

<           0 2015-12-23 11:45 .../my_table/a12acd303c0b7e512c8926666c5f02eb

<           0 2015-12-23 11:45 .../my_table/fbb00c100422d1cc0f9b7e39d6c6bd91

– the two old regions are removed (a12... and fbb...)
– the split region has a .splits folder
– 2 new regions appeared: 2c1... and 743...
– only one of these 2 regions has a HFile that is slowly growing (meaning: a sequential process)

Meanwhile, in the logs…

// 2 new regions from a SPLIT
2015-12-23 13:05:32,817 INFO org.apache.hadoop.hbase.master.RegionStates: Transition null to {2c142664dc0929d7c6cc5fa6fe3b4e40 state=SPLITTING_NEW, ts=1450872332817, server=hadoopslave04,60020,1450869198826}
2015-12-23 13:05:32,817 INFO org.apache.hadoop.hbase.master.RegionStates: Transition null to {743bfa035be56bf412d00803abe433b8 state=SPLITTING_NEW, ts=1450872332817, server=hadoopslave04,60020,1450869198826}

// the region we are splitting was OPEN
// it goes to SPLITTING then SPLIT, and is set offline for the time being
2015-12-23 13:05:32,817 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {65bd82b5477fcc2090804c351d89700a state=OPEN, ts=1450869854560, server=hadoopslave04,60020,1450869198826} to {65bd82b5477fcc2090804c351d89700a state=SPLITTING, ts=1450872332817, server=hadoopslave04,60020,1450869198826}
2015-12-23 13:05:34,767 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {65bd82b5477fcc2090804c351d89700a state=SPLITTING, ts=1450872334767, server=hadoopslave04,60020,1450869198826} to {65bd82b5477fcc2090804c351d89700a state=SPLIT, ts=1450872334767, server=hadoopslave04,60020,1450869198826}
2015-12-23 13:05:34,767 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 65bd82b5477fcc2090804c351d89700a from hadoopslave04,60020,1450869198826

// both 2 new regions switch from SPLITTING_NEW to OPEN
2015-12-23 13:05:34,767 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {2c142664dc0929d7c6cc5fa6fe3b4e40 state=SPLITTING_NEW, ts=1450872334767, server=hadoopslave04,60020,1450869198826} to {2c142664dc0929d7c6cc5fa6fe3b4e40 state=OPEN, ts=1450872334767, server=hadoopslave04,60020,1450869198826}
2015-12-23 13:05:34,767 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {743bfa035be56bf412d00803abe433b8 state=SPLITTING_NEW, ts=1450872334767, server=hadoopslave04,60020,1450869198826} to {743bfa035be56bf412d00803abe433b8 state=OPEN, ts=1450872334767, server=hadoopslave04,60020,1450869198826}

// daughter a and b = new regions with start keys; the parent being the split region
2015-12-23 13:05:34,873 INFO org.apache.hadoop.hbase.master.AssignmentManager: Handled SPLIT event; parent=my_table,fe7f...,1450869853820.65bd82b5477fcc2090804c351d89700a., daughter a=my_table,fe7f...,1450872332556.2c142664dc0929d7c6cc5fa6fe3b4e40., daughter b=my_table,feff7...,1450872332556.743bfa035be56bf412d00803abe433b8., on hadoopslave04,60020,1450869198826

// then the references to the merged parents (mergeA/mergeB) are deleted from the meta table
2015-12-23 13:08:28,965 INFO org.apache.hadoop.hbase.MetaTableAccessor: Deleted references in merged region my_table,fe7f...,1450869853820.65bd82b5477fcc2090804c351d89700a., qualifier=mergeA and qualifier=mergeB

Back to HDFS

After a while, HBase is done with the daughter b region 743b..., and starts creating the daughter a region 2c14....

>           0 2015-12-23 13:25 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40
>         226 2015-12-23 13:05 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/.regioninfo
>           0 2015-12-23 13:41 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/.tmp
>           0 2015-12-23 13:41 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/t
>  8732040437 2015-12-23 13:41 .../my_table/2c142664dc0929d7c6cc5fa6fe3b4e40/t/2388513b0d55429888478924914af494

>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a
>         226 2015-12-23 12:24 .../my_table/65bd82b5477fcc2090804c351d89700a/.regioninfo
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/.splits
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/.tmp
>           0 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/t
> 17860937303 2015-12-23 13:05 .../my_table/65bd82b5477fcc2090804c351d89700a/t/752530e58ae8478d812696b066edcc9f

>           0 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8
>         226 2015-12-23 13:05 .../my_table/743bfa035be56bf412d00803abe433b8/.regioninfo
>           0 2015-12-23 13:25 .../my_table/743bfa035be56bf412d00803abe433b8/.tmp
>           0 2015-12-23 13:25 .../my_table/743bfa035be56bf412d00803abe433b8/t
>  8733203481 2015-12-23 13:25 .../my_table/743bfa035be56bf412d00803abe433b8/t/e377603958894f8ca1ec598112b95bf4

It’s done. The region has been successfully split.
After a few minutes, the big region 65bd... will be removed automatically.

2015-12-23 13:43:28,908 INFO org.apache.hadoop.hbase.MetaTableAccessor: Deleted my_table,fe7f...,1450869853820.65bd82b5477fcc2090804c351d89700a.
2015-12-23 13:43:28,908 INFO org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 722 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)

Note: if we compute the difference between the big region and the sum of the daughter region sizes, we get a delta of +395MB (the single HFile is bigger than the two daughters combined).
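That delta can be checked from the HDFS listings above (sizes in bytes, a quick sanity check and nothing more):

```javascript
// File sizes taken from the HDFS listings above, in bytes
var mergedRegion = 17860937303; // 65bd.../t/752530e5...
var daughterA = 8732040437;     // 2c14.../t/2388513b...
var daughterB = 8733203481;     // 743b.../t/e3776039...

// the single merged HFile is bigger than the two daughters combined
var delta = mergedRegion - (daughterA + daughterB);
console.log(delta);                          // 395693385
console.log(Math.floor(delta / 1e6) + 'MB'); // 395MB
```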

We’ve successfully merged 2 regions into one, which was then automatically split back into two. Hurray!

Raising hbase.hregion.max.filesize to avoid splitting

Now, let’s change hbase.hregion.max.filesize to 20G in Cloudera and merge again to get a big region, without a split.

We apply the same process as before and manually merge the 2 regions we obtained previously, 2c14... and 743b....
That creates a new region 1e64... whose size is surprisingly smaller than our previous merge result (we only get a delta of 212KB), and which is not going to be split.

$ hdfs dfs -ls -R /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7
-rw-r--r--   3 hbase hbase         226 2015-12-23 13:45 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/.regioninfo
drwxr-xr-x   - hbase hbase           0 2015-12-23 14:12 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/.tmp
drwxr-xr-x   - hbase hbase           0 2015-12-23 13:48 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/recovered.edits
-rw-r--r--   3 hbase hbase           0 2015-12-23 13:48 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/recovered.edits/2206186536.seqid
drwxr-xr-x   - hbase hbase           0 2015-12-23 14:12 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/t
-rw-r--r--   3 hbase hbase 17465031518 2015-12-23 14:12 /hbase/data/default/my_table/1e64aa6f3f5cf067f6d5339230ef6db7/t/d1109c52de404b0c9d07e2e9c7fdeb5e

So that worked, I have one less region! Let’s continue with the many more I have.

Why should I know this?

Knowing what’s going on in HDFS with HBase is important when you are facing issues and errors in the HBase table structure.
To know whether you have that kind of issue, just give it a try with:

$ hbase hbck my_table

If you see some ERRORS such as:
– No HDFS region dir found
– Region not deployed on any region server.
– Region found in META, but not in HDFS or deployed on any region server.
– First region should start with an empty key. You need to create a new region and regioninfo in HDFS to plug the hole.
– You need to create a new .regioninfo and region dir in hdfs to plug the hole.
– ERROR: Found lingering reference file hdfs://…

Then this knowledge is going to be quite useful (and maybe the fix will be complicated, depending on the problem!).

Unfortunately, that can happen quite easily (I successfully ran into those issues with just a merge_region, not sure why exactly).