Java CLI, GC, memory, and tools overview

January 10th, 2016 | gc, java, performance |

Back to the Java world, I’ve made my mind and knew that I didn’t know enough, that I wasn’t enough confident.

Therefore I’ve looked around myself at some “simple” aspects of Java (CLI, GC, tools) to consolidate my knowledge, and made this post to give a global overview of what the Java CLI has to offer, how to configure the memory heap, what are the GC principles and most useful options, and introduce some tools to debug and profile the JVM.

I assume you already know Java and know what do the Garbage Collector with the Young Generation and the Old Generation. Hopefully, this post will learn you some new tricks.

I won’t talk about framework here, just about the basics :

  • Java command line and options
  • Concise summary of the Garbage Collectors and theirs logs
  • Memory tuning and its limits
  • UI tools to debug and profile a JVM
  • CLI tools shipped with the JVM

java version “1.8.0_66″
Java(TM) SE Runtime Environment (build 1.8.0_66-b18)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b18, mixed mode)

Java flags overview

I’ll introduce some of the most useful flags we can enable with java to get more info and understand a bit more what’s going on under the hood.

Quick note :

  • -XX:+[option] : enable the following option
  • -XX:-[option] : disable the following option
  • -XX:[property]= : give a value to the property

-XX:+PrintCommandLineFlags

First, it’s interesting to know what are the default options of the JVM.

$ java -XX:+PrintCommandLineFlags -version
-XX:InitialHeapSize=268055680 -XX:MaxHeapSize=4288890880 -XX:+PrintCommandLineFlags
-XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:-UseLargePagesIndividualAllocation
-XX:+UseParallelGC
  • -XX:InitialHeapSize=268055680 : 256m, default to 1/64 of the RAM (alias for -Xms)
  • -XX:MaxHeapSize=4288890880 : 4g, default to 1/4 of the RAM (alias for -Xmx)
  • -XX:+UseParallelGC : Parallel GC (PSYoungGen (Parallel Scavenge) + ParOldGen). Checkout the GC chapter if it’s scary.
  • the others are for misc optimisations

-XX:+PrintFlagsFinal

We can list every existing flags and their values. There are hundreds of them.
Below is just the list of all the Print* we can use to display more info in the logs and some GC related ones.

$ java -XX:+PrintFlagsFinal -version
[Global flags]
...
bool PrintAdaptiveSizePolicy             = false      {product}
bool PrintCMSInitiationStatistics        = false      {product}
intx PrintCMSStatistics                  = 0          {product}
bool PrintClassHistogram                 = false      {manageable}
bool PrintClassHistogramAfterFullGC      = false      {manageable}
bool PrintClassHistogramBeforeFullGC     = false      {manageable}
bool PrintCodeCache                      = false      {product}
bool PrintCodeCacheOnCompilation         = false      {product}
bool PrintCommandLineFlags               = false      {product}
bool PrintCompilation                    = false      {product}
bool PrintConcurrentLocks                = false      {manageable}
intx PrintFLSCensus                      = 0          {product}
intx PrintFLSStatistics                  = 0          {product}
bool PrintFlagsFinal                    := true       {product}
bool PrintFlagsInitial                   = false      {product}
bool PrintGC                             = false      {manageable}
bool PrintGCApplicationConcurrentTime    = false      {product}
bool PrintGCApplicationStoppedTime       = false      {product}
bool PrintGCCause                        = true       {product}
bool PrintGCDateStamps                   = false      {manageable}
bool PrintGCDetails                      = false      {manageable}
bool PrintGCID                           = false      {manageable}
bool PrintGCTaskTimeStamps               = false      {product}
bool PrintGCTimeStamps                   = false      {manageable}
bool PrintHeapAtGC                       = false      {product rw}
bool PrintHeapAtGCExtended               = false      {product rw}
bool PrintHeapAtSIGBREAK                 = true       {product}
bool PrintJNIGCStalls                    = false      {product}
bool PrintJNIResolving                   = false      {product}
bool PrintOldPLAB                        = false      {product}
bool PrintOopAddress                     = false      {product}
bool PrintPLAB                           = false      {product}
bool PrintParallelOldGCPhaseTimes        = false      {product}
bool PrintPromotionFailure               = false      {product}
bool PrintReferenceGC                    = false      {product}
bool PrintSafepointStatistics            = false      {product}
intx PrintSafepointStatisticsCount       = 300        {product}
intx PrintSafepointStatisticsTimeout     = -1         {product}
bool PrintSharedArchiveAndExit           = false      {product}
bool PrintSharedDictionary               = false      {product}
bool PrintSharedSpaces                   = false      {product}
bool PrintStringDeduplicationStatistics  = false      {product}
bool PrintStringTableStatistics          = false      {product}
bool PrintTLAB                           = false      {product}
bool PrintTenuringDistribution           = false      {product}
bool PrintTieredEvents                   = false      {product}
bool PrintVMOptions                      = false      {product}
bool PrintVMQWaitTime                    = false      {product}
bool PrintWarnings                       = true       {product}
...
bool UseParNewGC                         = false      {product}
bool UseParallelGC                      := true       {product}
bool UseParallelOldGC                    = true       {product}
...

The := means that the default value was overridden by something (you or the JVM Ergonomics).
You can see the JVM Ergonomics estimates java could use the ParallelGC on my PC.

Moreover, you can know what is the value of any flag the JVM handles.
For instance, you can find out what is the Young Generation size (“NewSize”) doing some | grep NewSize :

uintx MaxNewSize                       := 1430257664  {product}
uintx NewSize                          := 89128960    {product}

More details on how to read that on javaworld or codecentric.

Get more details in the logs

As a reminder :

-XX:+PrintGC / -verbose:gc

This is the first step to know what’s going on with your program and its GC.

[GC (Allocation Failure)  954K->896K(1536K), 0.0008951 secs]
[Full GC (Ergonomics)  896K->290K(1536K), 0.0026976 secs]
[GC (Allocation Failure)  778K->290K(1536K), 0.0006170 secs]

We can see the total heap going from 954K to 896K, with a max at 1536K for instance.

  • Allocation Failure : the JVM couldn’t find any more space in the Young Generation and had to clean it up. This is a normal behavior.
  • Ergonomics : the JVM decided to start a Full GC on its own.
  • Metadata GC Threshold : Metaspace size is exhausted. Raise the default MetaspaceSize and maybe the max MaxMetaspaceSize.

-XX:+PrintGCDetails

It’s more interesting, you still see the heap size changes, but you also see the young generation PSYoungGen, the old generation ParOldGen, and the Metaspace changes (because I was running with the Parallel GC, it’s different according to which GC is used).

[GC (Allocation Failure)
  [PSYoungGen: 465K->400K(1024K)]
  954K->896K(1536K), 0.0011948 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]
[Full GC (Ergonomics)
  [PSYoungGen: 400K->0K(1024K)]
  [ParOldGen: 496K->290K(512K)]
  896K->290K(1536K),
  [Metaspace: 2520K->2520K(1056768K)], 0.0032720 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC (Allocation Failure)
  [PSYoungGen: 488K->0K(1024K)]
  778K->290K(1536K), 0.0010046 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]

-XX:+PrintReferenceGC

This options works with the previous one.
It adds information about the different *Reference type variables (Soft, Weak, Final, Phantom, JNI) the program might used.

The PhantomReference are quite tricky to work with the GC, be careful. But if you’re using them, I’m pretty sure you know it, right ? plumbr has some nice tips about it.

[GC (Allocation Failure)
  [SoftReference, 0 refs, 0.0003665 secs]
  [WeakReference, 9 refs, 0.0001271 secs]
  [FinalReference, 7 refs, 0.0001104 secs]
  [PhantomReference, 0 refs, 0 refs, 0.0001707 secs]
  [JNI Weak Reference, 0.0002208 secs]
  [PSYoungGen: 465K->400K(1024K)]
  954K->896K(1536K), 0.0026939 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]

-XX:+PrintGCTimeStamps / -XX:+PrintGCDateStamps

It’s useful to know when things happen and how often.
The date is useful to be able to match easily with other logs.

2016-01-11T01:12:48.878+0100: 1.071: [GC (Allocation Failure)  954K->928K(1536K), 0.0020453 secs]
2016-01-11T01:12:48.878+0100: 1.072: [Full GC (Ergonomics)  928K->290K(1536K), 0.0031099 secs]
2016-01-11T01:12:49.883+0100: 2.075: [GC (Allocation Failure)  778K->290K(1536K), 0.0012529 secs]

-XX:+PrintGCApplicationStoppedTime

It’s useful to know how much time your application didn’t do anything, because the World was Stopped.
You really want to minimize those times.

Total time for which application threads were stopped: 0.0000492 seconds,
  Stopping threads took: 0.0000179 seconds
Total time for which application threads were stopped: 0.0033140 seconds,
  Stopping threads took: 0.0000130 seconds
Total time for which application threads were stopped: 0.0004002 seconds,
  Stopping threads took: 0.0000161 seconds

-XX:+PrintAdaptiveSizePolicy

This displays some metrics about survivals and promotions that the JVM Ergonomics is using to tune and optimize the GC behavior (by modifying space sizes).

[GC (Allocation Failure)
  AdaptiveSizePolicy::update_averages:  survived: 409616  promoted: 8192  overflow: false
  AdaptiveSizeStart: 1.087 collection: 1
  PSAdaptiveSizePolicy::compute_eden_space_size: costs minor_time: 0.000377 major_cost: 0.000000
    mutator_cost: 0.999623 throughput_goal: 0.990000 live_space: 268845056 free_space: 1048576
    old_eden_size: 524288 desired_eden_size: 524288
  AdaptiveSizeStop: collection: 1
 954K->896K(1536K), 0.0022971 secs]

-XX:-UseAdaptiveSizePolicy

The JVM Ergonomics tries to enhance the latency and the throughput of your application by tuning the GC behavior such as modifying the space sizes.
You can disable this behavior if you know you don’t need it.

And you can still have the details about the survivors and promotions if combined with the previous flag.

$ java -XX:+PrintAdaptiveSizePolicy -XX:-UseAdaptiveSizePolicy -XX:+PrintGC ...
[GC (Allocation Failure)
  AdaptiveSizePolicy::update_averages:  survived: 442384  promoted: 8192  overflow: false
  954K->928K(1536K), 0.0027480 secs]

Memory tuning

Heap size

Heap = Young Generation (Eden + Survivors) + Old Generation (Tenured)

This is the big part that you can impact for the better or for the worse.
If you think you need to change it, you need to be sure it’s necessary, know the existing GC cycles, know that you have reached the limit (or not).
Or you can just give a try and check the behavior, latency, and throughput of your application. ;-)

  • -Xms / -XX:InitialHeapSize : initial heap size
  • -Xmx / -XX:MaxHeapSize : maximum heap size

The MaxHeapSize influences the InitialHeapSize up until 256m.

if MaxHeapSize=2m   then InitialHeapSize=2m   (max)
if MaxHeapSize=256m then InitialHeapSize=256m (max)
if MaxHeapSize=512m then InitialHeapSize=256m (half)

Default size

As we already said, the default MaxHeapSize is 1/4 of the machine RAM, and the InitialHeapSize is 1/64.

For instance, on my machine, I have 16GB of RAM, that gives :

InitialHeapSize = 268435456  = 256m
MaxHeapSize     = 4290772992 = 4092m

Be careful with big numbers and PrintFlagsFinal, it won’t display them properly if greater than 4094m, because it displays them as uint, thus the limit is 4,294,967,295.

$ java -Xmx4g -XX:+PrintFlagsFinal -version |  grep "MaxHeapSize"
 uintx MaxHeapSize                              := 0                                   {product}

Minimum size

The minimum heap size you can set is 1M (your program doesn’t do much but it’s still possible!).

If you try to put less, you’ll end up with :

Too small initial heap

But actually, even if you ask for 1m, you’ll end with 2m :

$ java -Xmx1m -XX:+PrintFlagsFinal -version |  grep HeapSize
    uintx InitialHeapSize                          := 2097152                             {product}
    uintx MaxHeapSize                              := 2097152                             {product}

You will always get a MaxHeapSize divisible by 2.

Not enough heap ?

And if your program needs more heap and can’t find it with a lovely OOM :

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Young generation (in heap)

Young Generation = Eden + Survivors

This part of the heap is where all variables start their lifecycle. They are born here, and they will likely evolve to survivors, then ends up in the old generation; if alive long enough.

  • -XX:NewSize : young generation initial size
  • -XX:MaxNewSize : young generation maximum size
  • -Xmn : shortcut for both

The MaxHeapSize and InitialHeapSize influences the MaxNewSize and NewSize.

if MaxHeapSize=1g (InitialHeapSize=256m) then MaxNewSize=341m       and NewSize=85m
if MaxHeapSize=4g (InitialHeapSize=256m) then MaxNewSize=1365m (x4) and NewSize=85m
if MaxHeapSize=4g and InitialHeapSize=1g then MaxNewSize=1365m      and NewSize=341m (x4)

By default, the ratio is 3/1 between MaxHeap/MaxNewSize and InitialHeapSize/NewSize.

Default size

We just saw that NewSize/MaxNewSize are linked to InitialHeapSize/MaxHeapSize.

The default of MaxHeapSize is 1/4 of the machine RAM, and the InitialHeapSize is 1/64.
Therefore, the default of MaxNewSize is (1/4)/3 of the RAM, and NewSize is (1/4)/3 of InitialHeapSize.

On my machine, I have 16GB of RAM, that gives :

InitialHeapSize = 256m
MaxHeapSize     = 4092m

MaxNewSize = 1430257664 = 1364m (= 4092m/3)
NewSize    = 89128960   = 85m   (= 256m/3)

Minimum size

You can’t have MaxNewSize < NewSize : [code] Java HotSpot(TM) 64-Bit Server VM warning: NewSize (1536k) is greater than the MaxNewSize (1024k). A new max generation size of 1536k will be used. [/code] The 1536k will be equally separated between the Eden Space, the from Survivor space, and the to Survivor space. (512k each)

You can’t neither have the MaxNewSize >= HeapSize (young gen size can’t be greater than the total heap size) :

$ java -Xmx2m -XX:MaxNewSize=2m -XX:+PrintFlagsFinal -version | grep NewSize
Java HotSpot(TM) 64-Bit Server VM warning:
MaxNewSize (2048k) is equal to or greater than the entire heap (2048k).
A new max generation size of 1536k will be used.
    uintx MaxNewSize                               := 1572864                             {product}
    uintx NewSize                                  := 1572864                             {product}

Not enough space ?

Even if you have a MaxNewSize of 1m and your program tries to allocate 1g bytes, it will work if you have a big enough heap size. The allocation will just directly go into the old generation space.

Thread Stack (off heap)

Each and every threads in the program will allocate this size for their stack.

This is where they will store the function parameters values they are currently executing (and they are removed when the function exits). The deeper calls you do, the deeper you’re going into the stack. (FILO)

Recursive calls can go very deeply because of their intrinsic natures. This is where you have to be careful in your logic and maybe increase the default ThreadStackSize.

  • -Xss / -XX:ThreadStackSize : thread stack size

Default size

If you look yourself, you’ll find out it’s 0 :

$ java -XX:+PrintFlagsFinal -version | grep ThreadStackSize
intx ThreadStackSize                           = 0                                   {pd product}

0 means it will fallback to the OS default.
Check out Oracle website for unix or Windows, it’s between 320kB and 1MB.

Minimum size

The usage says you must specify at least 108k, but you can run *only* with 65k to start a simple program.

The stack size specified is too small, Specify at least 108k
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

StackOverflow

Be specially careful with recursion and its stop condition :

Exception in thread "main" java.lang.StackOverflowError

A simple program with a recursive function taking 2 int parameters can be called up to :

ThreadStackSize at 65k  : 888 times
ThreadStackSize at 130k : 1580 times
ThreadStackSize at 260k : 2944 times

But the more parameters to the function you add, the less times you’ll be able to call it.

Metaspace (off heap)

This is where the class definitions are. It’s a special space because class definitions are not mutable, they are loaded once and for all.

You will probably never touched the default configuration.

  • (Java < 8) -XX:MaxPermSize (it was a fixed size, difficult to estimate)
  • (Java >= 8) -XX:MaxMetaSpaceSize (is unlimited by default)

Default size

As we said, it’s unlimited.
Well, if we look closely, it’s defined at 4GB for me :

$ java -XX:+PrintFlagsFinal -version | grep Metaspace
{product}
uintx MaxMetaspaceSize                          = 4294901760                          {product}
uintx MetaspaceSize                             = 21807104                            {pd product}

MaxHeapSize has no impact on it, it’s a memory off-heap.

Minimum size

You can’t set a too small size (< 5m) otherwise you'll end up with those errors : [code] Error occurred during initialization of VM OutOfMemoryError: Metaspace - java.lang.OutOfMemoryError: Metaspace <> - Exception in thread "main" java.lang.OutOfMemoryError: Metaspace [/code] But it's a good idea to set a max size, to be sure the JVM will never take an “unlimited” memory (and breaks the other apps on the server) in case of some bugs.

But you can mess with it

I have a program that create dynamically classes on the fly and load them up. (https://gist.github.com/chtefi/018493089f4c75f36662)

$ java -XX:MaxMetaspaceSize=10m
-Djava.home="C:\Program Files\Java\jdk1.8.0_60\jre"
-classpath "C:\wip\out\production\test"
com.company.TestMetaspaceLimit

With 10m of metaspace, it crashes around 300 classes loaded (it’s not a lot in the Java world if you’re using some frameworks).

If you enable the GCDetails logs, you’ll see a lot of cool errors :

[Full GC (Metadata GC Threshold)
  [PSYoungGen: 64K->0K(1396224K)]
  [PSOldGen: 6548K->6548K(2793472K)] 6612K->6548K(4189696K),
  [Metaspace: 9954K->9954K(1058816K)], 0.0174232 secs]
  [Times: user=0.02 sys=0.00, real=0.02 secs]
[GC (Last ditch collection)
  [PSYoungGen: 0K->0K(1396224K)] 6548K->6548K(4189696K), 0.0006371 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]
[Full GC (Last ditch collection)
  [PSYoungGen: 0K->0K(1396224K)]
  [PSOldGen: 6548K->6548K(2793472K)] 6548K->6548K(4189696K),
  [Metaspace: 9954K->9954K(1058816K)], 0.0183340 secs]
  [Times: user=0.01 sys=0.00, real=0.02 secs]

Garbage Collectors

Each GC deals differently with the Young Generation Space (new objects) and the Old Generation Space (variables referenced since a while), because the Young is a very fast-paced space, not the old one.

The Young Generation space should never be too big, 2GB seems like a good limit. Otherwise, algorithms could not be as performant when processing it.

-XX:+UseSerialGC

It’s the basic GC : the Serial GC.

It’s using a single-core and Stop the World when processing.

  • Young Generation : Mark + Copy (using survivor spaces) / in the logs : DefNew
  • Old Generation : Mark + Sweep + Compact / in the logs : Tenured

Example of output :

[GC (Allocation Failure)
  [DefNew: 652352K->81514K(733888K), 0.2248788 secs]
  1630766K->1620630K(2364580K), 0.2255284 secs]
  [Times: user=0.19 sys=0.03, real=0.22 secs]
[GC (Allocation Failure)
  [DefNew: 733839K->81489K(733888K), 0.2495329 secs]
  [Tenured: 2180251K->1993562K(2180276K), 0.3855474 secs]
  2272954K->1993562K(2914164K),
  [Metaspace: 2765K->2765K(1056768K)], 0.6373276 secs]
  [Times: user=0.55 sys=0.09, real=0.64 secs]

-XX:+UseParallelGC -XX:+UseOldParallelGC

The Parallel GC.

It’s an evolution of the Serial one.
It’s doing the same, but faster because it’s using multiple cores to do the job.
And again, it’s Stopping the World when processing.

With Java 8, specifying -XX:+UseParallelGC automatically set -XX:+UseOldParallelGC.

  • Young Generation : Parallel Mark + Copy (using survivor spaces) / in the logs : PSYoungGen
  • Old Generation : Parallel Mark + Sweep + Compact / in the logs : ParOldGen

Example of output :

[GC (Allocation Failure)
  [PSYoungGen: 76221K->10729K(141824K)]
  127345K->126994K(316928K), 0.0173292 secs]
  [Times: user=0.05 sys=0.02, real=0.02 secs]
[Full GC (Ergonomics)
  [PSYoungGen: 10729K->0K(141824K)]
  [ParOldGen: 116265K->126876K(287744K)]
  126994K->126876K(429568K),
  [Metaspace: 2742K->2742K(1056768K)], 0.0224399 secs]
  [Times: user=0.03 sys=0.00, real=0.02 secs]
  • GC (Allocation Failure) : a minor GC (Young generation) was done because space was not available
  • Full GC (Ergonomics) : the JVM decided to do a Full GC (Young + Old generations) because of some thresholds

But you can force-disable it by doing -XX:-UseOldParallelGC : you’ll end up using the PSOldGen old generation collector. It’s not parallel anymore but serial (as the SerialGC). You should probably not used it.

You can control how many threads the parallel phases are using -XX:ParallelGCThreads=N.
It is by default the number of cores the computer has. (must be at least 1)

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC

The Concurrent Mark and Sweep GC.

It’s an evolution of the Parallel GC. This time, it’s not a Stop The World algo everywhere.
It can collect the old generation concurrently while the application is still running, meaning you should have a better latency.

ParNewGC, while collecting the young generation, will send some stats to the ConcMarkSweepGC, that will estimate if it should run a GC (according to the trend of the promotion rates in the young generation). This is why the CMS works with this one and not the classic parallel UseParallelGC.

Moreover, while being mostly concurrent, it has just of few phases where it still must Stop The World but they are very short period, contrary to the previous algorithms.

With Java 8, specifying -XX:+UseConcMarkSweepGC automatically set -XX:+UseParNewGC.

  • Young Generation : Mark + Copy (using survivor spaces) / in the logs : ParNew
  • Old Generation : Mark + Sweep : do NOT Stop the World (mostly) / in the logs : CMS Initial Mark, CMS Final Remark

Example of output (times were removed for the sake of clarity) :

[GC (CMS Initial Mark) [1 CMS-initial-mark: 1446700K(1716408K)] 1456064K(1795064K), 0.0006139 secs]
[CMS-concurrent-mark-start]
[CMS-concurrent-mark: 0.014/0.014 secs]
[CMS-concurrent-preclean-start]
[CMS-concurrent-preclean: 0.003/0.003 secs]
[CMS-concurrent-abortable-preclean-start]
[CMS-concurrent-abortable-preclean: 0.021/0.381 secs]
[GC (CMS Final Remark)
  [YG occupancy: 14224 K (78656 K)]
  [Rescan (parallel) , 0.0013603 secs]
  [1 CMS-remark: 1585968K(1716408K)] 1600193K(1795064K), 0.0032058 secs]
[CMS-concurrent-sweep-start]
[CMS-concurrent-sweep: 0.003/0.003 secs]
[CMS-concurrent-reset-start]
[CMS-concurrent-reset: 0.004/0.004 secs]

The Stop The World events happen during the CMS Initial Mark and CMS Final Remark.

You can have notice that the Old Generation is not compacted at the end, meaning it can still exists holes in the memory that is not used because it’s too small.

If Java can’t find any more memory because of that, it will trigger a Full GC by calling the Parallel GC (well, a GC that does the compaction but Stop The World). Moreover, this can happen also when a CMS compaction is currently in progress (concurrently), then suddenly, a lot of survivors are promoted to the old generation and boom, no more space.
This is why the CMS must be triggered way before the space is filled.

It is the role of the flag +XX:CMSInitiatingOccupancyFraction, by default, it’s around 92% according to Oracle.

Moreover, you can control how many threads to use for the concurrent part using -XX:ConcGCThreads=N. (measure before change)

-XX:+UseG1GC

The latest Java HotSpot VM GC.

It handles the space differently compared to its predecessors (being closer to the ConcMarkSweepGC).

There is no more Young and Old regions only. There are a bunch of regions of different sizes (certains will be automatically resized on the fly by the GC to enhance performance), each of them deal with only one type of generation : an Eden, a Survivor, or a Old. (and some with Humongous objects : they are so big they span on several regions). It targets around 2000 regions, each of them between 1MB and 32MB.

It is oriented for quite big heaps (> 4GB), and for small latency environments : you specify how much pause time (max) you desire for GCs (default: 0.5s).

It is mostly concurrent (does not affect too much the latency of the application) and parallel (for the Stop the World phases), but is a bit more computing intensive (compute stats to enhance behavior, predict what to clean, to reach the desired pause time).

It’s a bit more complicated than the others, you can refer to those two great resources to get more details : Getting Started with the G1 Garbage Collector, and Garbage First Garbage Collector Tuning.

It’s a bit like the CMS GC :

  • you have a STW Mark
  • then a concurrent scan from the marked references
  • then a STW Remark (to take into account the updates since the Mark)
  • then the cleaning and copy of regions

Example of output :

[GC pause (G1 Evacuation Pause) (young) 1478M->1475M(3764M), 0.0540170 secs]
[GC pause (G1 Evacuation Pause) (young) 1767M->1766M(3830M), 0.0581689 secs]
[GC pause (G1 Evacuation Pause) (young) (initial-mark) 2105M->2106M(3830M), 0.0674928 secs]
[GC concurrent-root-region-scan-start]
[GC concurrent-root-region-scan-end, 0.0004460 secs]
[GC concurrent-mark-start]
[GC concurrent-mark-end, 0.0153593 secs]
[GC remark, 0.0065189 secs]
[GC cleanup 2126M->2114M(3830M), 0.0021820 secs]
[GC concurrent-cleanup-start]
[GC concurrent-cleanup-end, 0.0001478 secs]
[GC pause (G1 Evacuation Pause) (young) 2483M->2484M(3830M), 0.0773962 secs]
[GC pause (G1 Evacuation Pause) (mixed) 2620M->2586M(3830M), 0.0467784 secs]
[GC pause (G1 Evacuation Pause) (young) 3029M->3023M(3830M), 0.0782551 secs]
[GC pause (G1 Evacuation Pause) (young) (initial-mark) 3248M->3237M(3830M), 0.0752451 secs]
[GC concurrent-root-region-scan-start]
[GC concurrent-root-region-scan-end, 0.0003445 secs]
[GC concurrent-mark-start]
[GC concurrent-mark-end, 0.0189316 secs]
[GC remark, 0.0083292 secs]
[GC cleanup 3278M->2968M(3830M), 0.0026447 secs]
[GC concurrent-cleanup-start]
[GC concurrent-cleanup-end, 0.0004819 secs]
[GC pause (G1 Evacuation Pause) (young) 3082M->3078M(3830M), 0.0309070 secs]
[GC pause (G1 Evacuation Pause) (mixed) 3245M->3078M(3830M), 0.0408398 secs]
  • G1 Evacuation Pause : copy alive objects (Eden or Survivors) to another region(s) compacting them and promoting them if old enough (to an Old Generation region). It’s a Stop The World process
  • concurrent-* : marks and scan alive objects and do some cleaning while the application is still running
  • (mixed) : both young and old generations copied (“evacuated”) elsewhere at the same time

Profiling

ASCII profiling

If you’re an hardcore player, you can use the Java agent hpref to retrieve a human-readable heap dump with the Java profile of your application (when it ends).
It’s bundled by default in the HotSpot JVM.

$ java -agentlib:hprof=heap=sites com.company.MyApp

That will generate a file java.hprof.txt where you can easily find out what are the most expensive function calls :

SITES BEGIN (ordered by live bytes) Tue Jan 12 22:38:06 2016
          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
    1 14.87% 14.87%   2103552 30499   2103552 30499 302579 char[]
    2 10.35% 25.21%   1463952 30499   1463952 30499 302580 com.sun.tools.javac.file.ZipFileIndex$Entry
    3  9.27% 34.48%   1311424   11   1311424    11 301304 com.sun.tools.javac.util.SharedNameTable$NameImpl[]

So, it seems I’ve allocated a tons of char[] (2M, 31000 objects).
To know the callstack, find the trace value in the file, you’ll end up with something like this :

TRACE 302579:
        java.lang.StringCoding$StringDecoder.decode(:Unknown line)
        java.lang.StringCoding.decode(:Unknown line)
        java.lang.String.(:Unknown line)
        com.sun.tools.javac.file.ZipFileIndex$ZipDirectory.readEntry(ZipFileIndex.java:665)

Et voilà, this is it. (it was not my fault!)

Another option is to collect function calls count and CPU usage using cpu=times :

$ java -agentlib:hprof=cpu=times com.company.MyApp
...
TRACE 312480:
        com.sun.tools.javac.file.ZipFileIndex$ZipDirectory.readEntry(ZipFileIndex.java:Unknown line)
        com.sun.tools.javac.file.ZipFileIndex$ZipDirectory.buildIndex(ZipFileIndex.java:Unknown line)
        com.sun.tools.javac.file.ZipFileIndex$ZipDirectory.access$000(ZipFileIndex.java:Unknown line)
        com.sun.tools.javac.file.ZipFileIndex.checkIndex(ZipFileIndex.java:Unknown line)
...
CPU TIME (ms) BEGIN (total = 17046) Tue Jan 12 22:52:08 2016
rank   self  accum   count trace method
   1  3.64%  3.64%   30711 312480 com.sun.tools.javac.file.ZipFileIndex$ZipDirectory.readEntry
   2  2.53%  6.17%    7392 312914 java.io.WinNTFileSystem.normalize
   3  2.38%  8.54%    3984 301205 java.lang.String$CaseInsensitiveComparator.compare
   4  2.09% 10.64%  324312 301204 java.lang.String.charAt

In a few seconds :

  • 30711 calls to ZipDirectory.readEntry
  • 324312 calls to String.charAt

That’s quite straight-forward, can be processed by third-party tools, or gather for comparaisons.

If you want the live stats, this is not your tool. An IDE with a true Profiler will be a better solution.
But anyway, that can come in handy !

There are a few more options, check out hprof.

JMX

A nice and easy way to get in-depth of your live program (local or distant), is to enable JMX when starting the application. JMX can be secured, but if you don’t want to be bother with that, start the JVM with those settings :

  • -Dcom.sun.management.jmxremote.port=5432
  • -Dcom.sun.management.jmxremote.authenticate=false
  • -Dcom.sun.management.jmxremote.ssl=false

If will expose its internals through the protocol JMX on port 5432.

You need a program to read from it. Fortunately, there is one installed by default : jvisualvm.
Just start your Java program somewhere, then start jvisualvm.

If on the same computer, it will automatically find it.
Install the plugin VisualGC if you don’t have it, to monitor the GC in details it’s a win.
You can even do the CPU and Memory profiling live.


Alternatives exist (ofc) such as JProfiler and YourKit. Check out .

You can also use jconsole (shipped with java). You don’t even need to start your process with JMX, jconsole can just plug itself.

Java CLI tools

The HotSpot JVM has some useful console tools shipped within too.

If you encounter any odd errors, ensure you have access to folder /tmp/hsperfdata_ being the user that started the Java process.

jps

List the java process running on the machine. (remember doing ps aux | grep java ?)

$ jps
11080 Launcher
11144 Jps
12140 TestMetaspaceLimit
$ jps -lvV
11080 org.jetbrains.jps.cmdline.Launcher -Xmx700m -D...
12140 com.company.TestMetaspaceLimit -Djava.home=C:\Program Files\Java\jdk1.8.0_60\jre -D...
6028 sun.tools.jps.Jps -Dapplication.home=C:\Program Files\Java\jdk1.8.0_60 -Xms8m

Official documentation.

jstat

Monitor some aspects of a running JVM (no JMX needed).

List of aspects :

-class
-compiler
-gc
-gccapacity
-gccause
-gcmetacapacity
-gcnew
-gcnewcapacity
-gcold
-gcoldcapacity
-gcutil
-printcompilation

Monitor the GC, show the timestamp in front, pull every 1s :

$ jstat -gc -t 7844 1s
Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
           14,0 40960,0 53248,0 40947,8  0,0   506880,0 343724,3  175104,0   62801,2   13440,0 12979,7 1664,0 1552,3      8    0,144   0      0,000    0,144
           15,1 40960,0 53248,0 40947,8  0,0   506880,0 454765,2  175104,0   62801,2   13440,0 12979,7 1664,0 1552,3      8    0,144   0      0,000    0,144
           16,1 77824,0 53248,0  0,0   53240,9 506880,0 40423,7   175104,0   104781,8  13952,0 13581,6 1664,0 1596,0      9    0,203   0      0,000    0,203

Official documentation.

jinfo

Get the value of any flag of a running Java process.

$ jinfo -flag MaxHeapSize 5044
-XX:MaxHeapSize=4290772992

Official documentation.

jstack

Get the current stack trace of all the threads of a running Java process.
Useful if you wonder what’s going on with a process.

$ jstack 1204
...
"main" #1 prio=5 os_prio=0 tid=0x0000000002c9e000 nid=0x2d88 runnable [0x000000000347e000]
   java.lang.Thread.State: RUNNABLE
        at java.io.RandomAccessFile.length(Native Method)
        at java.io.RandomAccessFile.skipBytes(Unknown Source)
        at com.sun.tools.javac.file.ZipFileIndex.readBytes(ZipFileIndex.java:381)
        ...
        at com.sun.tools.javac.api.JavacTaskImpl.call(JavacTaskImpl.java:138)
        at com.company.TestMetaspaceLimit.createClass(TestMetaspaceLimit.java:42)
        at com.company.TestMetaspaceLimit.main(TestMetaspaceLimit.java:28)
...

Official documentation.

jmap

You can display the config and usage of the whole heap.
It’s useful but I think it’s more useful to use some UI to monitor that such as jvisualvm or jconsole.

$ jmap -heap 11080
Attaching to process ID 11080, please wait...          
Debugger attached successfully.                        
Server compiler detected.                              
JVM version is 25.60-b23                               
                                                       
using thread-local object allocation.                  
Parallel GC with 4 thread(s)                           
                                                       
Heap Configuration:                                    
   MinHeapFreeRatio         = 0                        
   MaxHeapFreeRatio         = 100                      
   MaxHeapSize              = 734003200 (700.0MB)      
   NewSize                  = 89128960 (85.0MB)        
   MaxNewSize               = 244318208 (233.0MB)      
   OldSize                  = 179306496 (171.0MB)      
   NewRatio                 = 2                        
   SurvivorRatio            = 8                        
   MetaspaceSize            = 21807104 (20.796875MB)   
   CompressedClassSpaceSize = 1073741824 (1024.0MB)    
   MaxMetaspaceSize         = 17592186044415 MB        
   G1HeapRegionSize         = 0 (0.0MB)                
                                                       
Heap Usage:                                            
PS Young Generation                                    
Eden Space:                                            
   capacity = 67108864 (64.0MB)                        
   used     = 8111152 (7.7353973388671875MB)           
   free     = 58997712 (56.26460266113281MB)           
   12.08655834197998% used                             
From Space:                                            
   capacity = 11010048 (10.5MB)                        
   used     = 6575688 (6.271064758300781MB)            
   free     = 4434360 (4.228935241699219MB)            
   59.72442626953125% used                             
To Space:                                              
   capacity = 11010048 (10.5MB)                        
   used     = 0 (0.0MB)                                
   free     = 11010048 (10.5MB)                        
   0.0% used                                           
PS Old Generation                                      
   capacity = 179306496 (171.0MB)                      
   used     = 81936 (0.0781402587890625MB)             
   free     = 179224560 (170.92185974121094MB)         
   0.04569605777138158% used                           
                                                       
6521 interned Strings occupying 524504 bytes.          

Official documentation.

Resources

I hope this overview was clear and wide enough to make you feel stronger about the basis of Java and that you learned some new tricks. I did.

React inline styles VS CSS : benchmark

August 17th, 2015 | javascript, performance, react |

CSS vs Inline

So, I was wondering if there was any performance downside to use React inline styles (I like them!) vs keep the old plain css stylesheets (that we love to override!). Of course there is, but I wanted some numbers!

I think both ways are useful actually, as Michael Chan said at React Europe ():
– the css can be used for the layout (such as the bootstrap layout classes: row-fluid, span*, etc.)
– the inline styles should be used for the “state-styles” (the style that can change according to a state, which is dynamic).

Why stupid

I decided to try something stupid and not truly relevant but anyway :

Generate a big table, one with inline styles, the other (in another page), with the same data, but with css, and then check the performance/size/time of both.

That’s why it’s not relevant because you would not generate a big table in your DOM. You will more likely have some kind of virtual scrolling / lazy rendering.
But, what about a test, just to see if that’s more slow when I scroll or not, and what about the rendering time ?

 
Chrome 45b, React 0.13.3 (debug and production)

Test case

So I generated a

of 1,000 rows with 9 columns, all of them with the same content “”, then I did the same with 10,000 rows to check the trend.

I’m going to test :
– the smoothness when I scroll down the page through Chrome DevTools Timeline
size of the DOM : useful to know for isomorphic universal javascript (server-side rendering)
– time to mount the DOM : I’m going to use componentDidMount, this is where the DOM is available after render is called
– time to be rendered by the browser : I’m going to use requestAnimationFrame to know when the next frame after render is ready

Basically, the skeleton of the React component is :

const createOneInlineStyleRow = (i) => {
  return (
... ); }; export default class AppInline extends React.Component { componentDidMount() { console.timeEnd('didMount'); } render() { console.time('render'); console.time('didMount'); requestAnimationFrame(function() { console.timeEnd('render'); }); const rows = times(NB_ROWS, createOneInlineStyleRow); return (
{rows}
); } }

The style of the :

const style = {
  fontFamily: 'Consolas',
  padding: 10,
  color: "#444",
  border: "3px solid orange",
  position: "relative",
  width: "15%",
  height: "25px",
  letterSpacing: 0,
  overflow: "hidden",
  fontSize: 10,
  fontVariant: "small-caps"
};

And the same in css for the css table :

.tableCss td {
   font-family: Consolas;
   padding: 10px;
   color: #444;
   border: none;
   border: 3px solid orange;
   position: relative;
   width: 15%;
   height: 25px;
   letter-spacing: 0;
   overflow: hidden;
   font-size: 10px;
   font-variant: small-caps;
}

Basically, here is this beautiful design :

Results

Timeline

I found no difference when scrolling, Chrome DevTools timeline supports my perception : no difference, no fps drop.
The fact that the DOM is inline or in a proper css seems to not matter, I guess it’s already in the browser memory when it’s rendered so it does not influence the smoothness.

Size of the DOM

Without surprise, the DOM size is huge with the inline styles and is linearly proportional to the number of rows.

Type 1,000 rows 10,000 rows Ratio
CSS 517,725 chars ~5,177,250 chars x10
Inline 2,283,742 chars ~22,837,420 chars x10
x4.4 x4.4

That’s clearly something to take into account if you do some universal javascript. 2,000,000 chars is 2MB of data. (but who sends a big table anyway?!)

Time to mount the DOM

This is where componentDidMount is useful. We start a timer in the render method and end it when componentDidMount is called.
React has done its job converting the virtual DOM into a proper DOM and has injected it into our mounted node.

I let the times with the development version of React, because I started with and used the production version afterwards. They should not be taken into account, it’s not relevant in our case.

Type 1,000 rows 10,000 rows Ratio
CSS (debug) 650ms-750ms 6,000ms ~x10
Inline (debug) 1,000ms-1100ms 10,000ms ~x10
CSS 310ms 2,300ms ~x7.5
Inline 600ms 4,900ms ~x8
x2 x2.1

Yes, it takes more time to stringify our virtual DOM / mount it into the DOM.
The delta decreases for less rows but it’s still significant: 70ms (css) against 105ms (inline) for 100 rows (x1.5).

Rendering time

The DOM is mounted, the browser needs to render the table now. We can assume it will take longer to render the inline styles, because it has more to parse and store every style attribute of every td.

Type 1,000 rows 10,000 rows Ratio
CSS (debug) 1,100ms-1,200ms 9,500ms ~x10
Inline (debug) 1,500-1,600ms 14,000ms ~x10
CSS 720ms 6,000ms ~x8
Inline 1,080ms 9,400ms ~x8.7
x1.5 x1.56

Still a small overhead. I guess we could see some nice differences according to which browser is tested out!

What about another browser ? Edge !

I gave a try on Edge with 10,000 rows (React production version) :

Type mount time rendering time
CSS 4,115ms 11,591ms
Inline 11,418ms 21,113ms

It’s basically the double of Chrome times. :cry:

We knew that

So yes, we were kinda aware of all these behaviors, but now, we have some numbers !

This global analysis demonstrated that the inline styles take way more size and more time to be handled by the browser, but it has no more impact as soon as it’s rendered.

Compared to all the advantages the inline styles offer, we just have to be aware of those potential performance bottlenecks when thinking about the .

Further

Some plugins exists to extract the inline styles to css such as :
https://github.com/petehunt/jsxstyle : code inline then extract the constants to a stylesheet
https://github.com/js-next/react-style + https://github.com/js-next/react-style-webpack-plugin :

Or if you want the full power of css in javascript (media queries, pseudo-selectors, modifiers, and more!) :
Radium

#perfmatters

March 1st, 2015 | javascript, performance |

Goal : be jank-free at 60fps

Jank is when missed frame appears when navigating a website, you don’t want that. http://jankfree.org/

You can find a lot of talks and articles from Paul Lewis, Tom Wiltzius and Nat Duca, who are great speakers btw, that explain clearly how to avoid jank, how to diagnose it, and how to fix it. Check the resources at the end of this post to have some pointers.

I decided to resume quickly everything I learned from a lot of different sources in order to not do those mistakes again and help you and me to understand what can cause this jank. I don’t explain in details everything, every term I use, check out the resources of the post to have a better understanding if needed.

I’ll start with general javascript concepts and then I’ll continue with deeper concepts such as the layout/painting/layers processes the browser do and how to optimize them, then I’ll finish with some network advices.

Some of the items are maybe not relevant anymore nowadays for certain browsers because it was optimized since. I tried to filter them but still.

Feel free to correct me if I’m wrong.

Object instances

If you have a bunch of object to create, forget about module pattern, prototypes are faster to create, or create plain simple object. http://jsperf.com/prototypal-performance/55

Holey array

Don’t do holey array (it can be converted to a dictionary internally thus being way slower). Try to not use the delete keyword. It can change on the fly the internal class being used for the object you just deleted from (for instance, an array with a delete in the middle can transform it into a map). http://jsperf.com/packed-vs-holey-arrays

Objects pooling

If you are creating and destructing a lot of objects/references, consider using an object pool to avoid triggering the Garbage Collector (which would make you lose some precious ms at random times). Check http://www.html5rocks.com/en/tutorials/speed/static-mem-pools/ for more details.

Image resizing

Resize your image exactly as the size you are using in your page (we are not talking about multi-support here) to avoid the browser to resize the image itself (cpu consuming). Consider using srcset on or use if you want to handle high DPI devices.

Background-size

Beware of background-size, keep the image at the original size instead of asking the browser to resize on the fly. Create multiple version of the same image if needed. You can also create a spritesheet image of your images to do only one HTTP request (and maybe have a faster memory access but don’t take this into consideration, that’s way too insignificant to be considered).

CSS gradients

Beware of repeating gradients. That slows down a lot the browser rendering. For instance, if you put it on the background of your page in which you can scroll. Prefer to use a tiled background image.

Custom fonts instead of image

Don’t forget that you can use nice custom fonts instead of images to draw symbolish things. Such as FontAwesome. You will save bandwidth and designer time. http://fortawesome.github.io/Font-Awesome/icons/

Or you can use unicode characters (it has a lot of icons, not only characters) or svg (light and scalable).

Image lazy loading

Consider using an image lazy loading plugin to load images only if the user can see them of if everything else has been loaded first. If by default the image is not displayed in the first view at loading time, you can defer its request when the user is scrolling into for instance.

If you have a carousel of images, consider to not load every images of the carousel at the beginning. Either wait the user to hover it or load them after everything else is loaded. If the user does not even use the carousel, it’s useless to load the other images.

Event listeners

Try to not bind event listeners on the whole document just to handle the event on a deep child. Try to bind listeners on the element you need or a nearest parent. Otherwise the compositor (a thread dedicated to handle inputs and scrolling) will be triggered for nothing each time the event will happen on the page, that could have no link with your element.

Memory management / variable scoping

Try to not use global variables. Release what you can when you can. With ES6, use only let, no more var. Remove the event listeners you don’t need. Be especially careful with closures, they can contain a lot of references that will never be released if you are careless.

Events throttling

Throttle the events that can be called numerous times : onscroll, onmousemove, onresize, ondragover. They will be called only a few times and that will be enough.

requestAnimationFrame() is your best friend

Do not use setTimeout() or setInterval() to throttle event handlers such as onscroll or onresize. Use requestAnimationFrame() to only change the visual state when a new frame is going to be rendered, to not compute things that won’t be render in-between 2 frames. Check http://www.html5rocks.com/en/tutorials/speed/animations/ for more details.

Event handlers logic

Don’t do any complex logic in the event handlers. Just save somewhere the values which you care of the current state (scrollTop, mouse position…), and let the browser pursues its events pipeline. If you are doing something heavy in a handlers, you will delay every other events to be handle and will delay the rendering (then you call a requestAnimationFrame(fn) to use those saved values). Try to bind global events only once, such as onscroll or mousemove. Just save their positions in those handlers and when you need them, just call functions in requestAnimationFrame() using those saved values.

:hover and scrolling

Be careful with :hover effect on elements in the middle of a scrollable page. If the user scrolls using its mousewheel on top of those elements, they will trigger the :hover at the same time the scrolling is occurring. That can slow down dramatically the fps and you end with jank (missed frames during scrolling becausee of a lot of restyle/paint/composite layers operations). You can for instance add a class on the container when it’s scrolling to not trigger the :hover effect while it’s present.

Dynamic class adds

Don’t add classes on top of your document if that only change an item deep in the DOM (that could cause a full repainting from the parent). Try to target specifically the element or a near parent.

Avoid Layout Trashing

When you need to read some positions or sizes from the DOM (.offsetTop, .offsetWidth …), don’t alternate them with writes of positions or sizes on the DOM (element.style.width = width). That would cause what is called “Layout Trashing” because each time you change a size or position on the DOM, that triggers a forced synchronous layout (painting). Do all your readings at the beginning, store them into variables, then do your writings. You can call a requestAnimationFrame() to read those values, then inside, call another requestAnimationFrame() to write your transformations.
To know if have to kind of cycles, check your timeline in the debugger and look for cycles of :
– recalculate style
– layout
– recalculate style
– layout
You can know what property triggers what on http://csstriggers.com/ (only Chrome for now)

transformZ(0) aka the null transform hack

There is a trick called the “null transform hack” to force the browser to create a distinct layer containing a DOM element that will be painted using the hardware acceleration (and independently of the other layers) using transform: translateZ(0) or transform: translate3d(0,0,0). Those statements changes nothing visually but on a desktop computer, that can really boost your fps. If you combine that on elements that have a shadow (like, a lot on your page), you clearly see the difference when playing with. But be careful, mobiles do not have the same capability than desktops, and the performances could be worse.

Will-change

Instead of this hack, consider implementing the fairly new css property will-change: [transform|scroll-position|contents| to hint the browser to create a distinct composite layer for this element. You just notify it that the particular property will probably change for this element. The browser will try to optimize what it can (ie: create a new layer for it). But don’t abuse it to not create another problem (it’s useless to do * { will-change: * } the browser is already trying hard).

If you know you are going to change the text of your item, you can set will-change: contents. Check https://dev.opera.com/articles/css-will-change-property/ for more details.

Remove this property when you are done with the change. Do not apply it on a global rule, just apply it when just before it’s needed. For instance, listen to the :hover state of its container to set it on the child using css : no hover on parent = no will-change on child. Or set it using javascript and reset it to auto when you’re done.

Composite layer

What creates a new composite layer is the when you use translationZ/3d, and  (they are automatically embedded in a new layer), animations with opacity and transform, and some css filters that use hardware acceleration. It has nothing to do with z-index layers. Check http://www.html5rocks.com/en/tutorials/speed/layers/ for more details.

Animation/Transitions

Do not animate with javascript. Use css animations/transitions. The animations will be processed by another thread than the main one. Even if your js does complex calculations and slows down the main thread, the css animation will still be smooth.

Transform to animate position

When you have element which position is dynamic, do not use css top/left, use transform: translate. top/left should be reserved for initial positioning only, just to organize the layout. css transform is especially done for transformations, it does not reflow/redraw whereas top/left does. In DevTools, enable Show Paint Rectangle (Timeline > Rendering) to see what is repainting on your page.

Moreover, the translate transform animation is using subpixel rendering instead of pixel per pixel for top/left animation. The rendering is smoother.

FLIP

It’s a technique used to animate an element the cheapest way possible when the positions/dimensions are unknown/dynamic. Paul Lewis explains it:

Basically, it stands for: First, Last, Invert, Play. From the initial state, you set directly its last state, read some values from it (position, size), apply an inverted transform style to put it back where it was, and clear the transform style to trigger the animation.

Fixed elements

If you have a top bar in fixed position, consider to put it in its own layer. Otherwise, scrolling the content underneath cause the element to be repainted each time. You can check with the Show Paint Rectangle toggle.

Chrome DevTools <3

In Chrome DevTools, you can press h on an element to toggle its visibility. That avoids you to add a display: none manually on its style. Life saver. Of course, their performance tools is uberly useful and you can use about:tracing if you are desperate (and you will probably be more desperate with that ;-).

To check if you have some weird or expansive painting behavior, enable the continuous painting mode and toggle elements visibility to see which one(s) is(are) causing the problem : Timeline > Rendering (press to toggle the panel visibility if not there) > Show paint rectangles + Enable continuous page repainting. You can enable the fps meter too. Now, scroll into your page, do some actions, hide some elements to see if you can find the culprit (if you notice a boost).

Multiple adds to the DOM / DocumentFragment

When you want to append several items to the DOM, use document.createFragment(). It’s available in all browsers since a while (jQuery internally use it when you call append() with an array for instance). You can use traditional DOM operations on it (appendChild, style etc.) without the browser to render it while you create it. Then when you are done, you just append it to the real DOM in one shot.

Avoid to read value from CSS

Try to not use getComputedStyle() (or .css() getter from jquery, it’s using that internally) to read value from CSS directly, that’s calling the “Recalculate Style” process. Save the values you need before somewhere if you can and use them, or if it’s for animate something, use css transition/animation.

getBoundingClientRect()

This magic function returns in one call the position and the dimension of the item it’s called on : height, width, top, left, bottom, right. Use it once when you need to read those differents values.

User reaction

When a user clicks on a button, you have about 100ms before doing any animation. The user won’t notice anything before, it’s its reaction time. Therefore, you can prepare anything you need (get sizes, positions…) and have an ugly 100ms frame if you need, he won’t notice. ;-)

If you want to deal with mobiles, add in your . That could enable the GPU rasterization on the phone (Androids). But basically, add that even if it’s not an Android, that won’t do harm on the contrary.

DOM element count

Try to not have too many DOM nodes on your page. For instance, 1000 for a mobile app is a good number. (desktop version : 4000 twitter.com, 2500 for amazon or facebook, mobile version: 1000 for m.twitter.com, 1700 for m.facebook)

CSS stylesheets

CSS are blocking resources for the browser. It waits for them to be loaded and parsed before rendering anything to avoid a flickering effect. So, keep it light, avoid to load unnecessary css rules. You can split your stylesheet by media for instance. The browser won’t wait for them if their rule does not apply :



Put your CSS in the for the browser to request them as soon as possible.

Avoid to use @import in stylesheets to not block again the browser (because it will request again a stylesheet).

Async JS scripts

Inline blocks the DOM construction when the browser encounters them, which will therefore delay the initial render. Generally, you put your