Flame graphs are a nifty debugging tool to determine where CPU time is being spent. Using the Java Flight Recorder, you can do this for Java processes without adding significant runtime overhead.
Shivaram Venkataraman and I have found these flame recordings to be useful for diagnosing coarse-grained performance problems. We started using them at the suggestion of Josh Rosen, who quickly made one for the Spark scheduler when we were talking to him about why the scheduler caps out at a throughput of a few thousand tasks per second. Josh generated a graph similar to the one below, which illustrates that a significant amount of time is spent in serialization (if you click in the top right hand corner and search for "serialize", you can see that 78.6% of the sampled CPU time was spent in serialization). We used this insight to speed up the scheduler by switching to use Kryo instead of the Java serializer.
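For reference, switching Spark to Kryo is a one-line change in spark-defaults.conf (this is the standard Spark setting; depending on your application, you may also want to register your classes with Kryo for best performance):

```
spark.serializer org.apache.spark.serializer.KryoSerializer
```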
The flight recorder is only available in the Hotspot JVM, so if you're using OpenJDK, you'll need to replace it with the Oracle JDK (version 7u40 or higher). These instructions describe how to replace OpenJDK on Red Hat.
To see which JDK you're using, run:
java -version
If you have OpenJDK, first remove it:
yum -y remove java*
Now if you try java -version again, it should say "No such file or directory".
Next, download the Oracle JDK and use the Red Hat package manager to install it (the extra wget flags are necessary to avoid an error saying you need to accept Oracle's license):
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/7u60-b19/jdk-7u60-linux-x64.rpm
rpm -ivh jdk-7u60-linux-x64.rpm
You may also need to fix JAVA_HOME, which Spark uses, to point to the correct location. If you're using EC2 instances launched by the spark-ec2 script, the easiest way to do this is to create a symlink from the expected JAVA_HOME to the new installation:
ln -s /usr/java/jdk1.7.0_60/ /usr/lib/jvm/java-1.7.0
You may also need to add $JAVA_HOME/bin to your path so that Java commands like jps work correctly (export PATH=$PATH:$JAVA_HOME/bin).
If you're running on a cluster of machines, you'll need to update Java on all of the machines. The easiest way to do this, if you're using the spark-ec2 scripts, is to put the 4 commands above into a file (e.g., upgrade_java.sh), copy that file to all of the machines on the cluster:
/root/spark-ec2/copy-dir /root/upgrade_java.sh
and then use the helpful slaves.sh script to run the script on all of the workers:
/root/spark/sbin/slaves.sh upgrade_java.sh
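As a sketch, upgrade_java.sh just collects the four commands from earlier into one script (it uses the same JDK 7u60 RPM; adjust the URL and version to match whatever you downloaded):

```
#!/bin/bash
# upgrade_java.sh: remove OpenJDK and install the Oracle JDK.
yum -y remove java*
wget --no-check-certificate --no-cookies \
  --header "Cookie: oraclelicense=accept-securebackup-cookie" \
  http://download.oracle.com/otn-pub/java/jdk/7u60-b19/jdk-7u60-linux-x64.rpm
rpm -ivh jdk-7u60-linux-x64.rpm
# Point the JAVA_HOME location Spark expects at the new installation.
ln -s /usr/java/jdk1.7.0_60/ /usr/lib/jvm/java-1.7.0
```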
Next, you'll need to re-compile Spark and copy the updated build to all of the machines:
/root/spark-ec2/copy-dir --delete /root/spark
and re-start Spark:
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh
To enable the flight recorder, you need to add -XX:+UnlockCommercialFeatures -XX:+FlightRecorder when you start the JVM. For Spark, you can do this by adding the following to spark-defaults.conf:
# Enable flight recording on the driver.
spark.driver.extraJavaOptions -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
# Enable flight recording on the executors.
spark.executor.extraJavaOptions -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
You can do this after you've started the master, but you need to do it before starting your application.
To collect a flight recording from your running Spark job, you'll first need to figure out the process ID. If you want to collect a flight recording from an executor, for example, run jps on the worker machine:
$ jps
18977 Worker
19163 CoarseGrainedExecutorBackend
5465 DataNode
19224 Jps
You want the ID of CoarseGrainedExecutorBackend (this will have a different name if you're using YARN or Mesos -- look for the process name that includes ExecutorBackend). Then use jcmd to create a flight recording (the example command below creates a 10 second recording). Note that this runs asynchronously, so the command will return immediately, but the file won't be available for 10 seconds.
$ jcmd 19163 JFR.start filename=spark.jfr duration=10s
19163:
Started recording 1. The result will be written to:
/root/spark/work/app-20160628234739-0001/1/spark.jfr
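As an aside, you can avoid copying the process ID by hand by pulling it out of the jps output. This is a small sketch (not part of the original instructions) that assumes exactly one ExecutorBackend process per machine; the sample listing here mirrors the jps output shown above, but on a real worker you'd pipe jps directly:

```shell
# Sample jps output (replace with: jps_output=$(jps) on a real worker).
jps_output="18977 Worker
19163 CoarseGrainedExecutorBackend
5465 DataNode
19224 Jps"

# Pick out the PID of the executor process.
pid=$(printf '%s\n' "$jps_output" | awk '/ExecutorBackend/ {print $1}')
echo "$pid"

# Then start the recording using the extracted PID:
# jcmd "$pid" JFR.start filename=spark.jfr duration=10s
```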
When you run jcmd, if you get an error that says:
java.lang.IllegalArgumentException: Unknown diagnostic command
it probably means that the extra Java options didn't get passed to the JVM correctly. To see all of the valid commands for a particular process, you can use jcmd with the help command:
$ jcmd 19163 help
The following commands are available:
VM.native_memory
VM.commercial_features
ManagementAgent.stop
ManagementAgent.start_local
ManagementAgent.start
Thread.print
GC.class_histogram
GC.heap_dump
GC.run_finalization
GC.run
VM.uptime
VM.flags
VM.system_properties
VM.command_line
VM.version
help
In the output above, JFR.start isn't listed as a valid command. You'll notice that VM.flags is a valid command; you can use this to check what options got passed to the JVM:
$ jcmd 19163 VM.flags
Double-check that the output of the above command includes the Java options that you added above. In Spark, you can also check that the Java options got added correctly by clicking the "Environment" tab in the UI and checking the value of the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions variables.
First, if you created the recording on a remote machine, copy the file back to your laptop (this will make it easier to view the flame graph). Next, follow the instructions at this GitHub repo to generate a FlameGraph.
In order to run the install-mc-jars.sh script to install the flame graph tools, you'll need to have the JAVA_HOME environment variable set correctly. On a Mac, you can figure out where Java is installed with:
/usr/libexec/java_home -v 1.7
Set JAVA_HOME to whatever was returned by that command before running the install script. The GitHub repository linked above describes all of the remaining steps needed to create a flame graph.
While flame graphs can be useful for spotting big performance issues, we've found them to be less useful for fine-grained performance issues. After fixing some of the major issues with the scheduler, for example, we found the flame graph to be minimally useful, because it doesn't convey anything about the critical path, and many of the later scheduler performance issues we found were subtle issues along the critical path.
Also, flame graphs only display time that the JVM spent using the CPU. They will not show time spent blocked waiting on I/O, nor will they display time spent in native code. This means that, for example, they're not useful for diagnosing when network bandwidth to ship tasks to workers becomes a bottleneck.
Essentially all of the information on this page came from Josh Rosen and Eric Liang, who started using flame graphs for Spark and then showed them to me.
The Flame Graph tool was written by Brendan Gregg, and is described (including a video about how to interpret them) on his blog. He has a blog post on how to generate flame graphs that include CPU time spent in the kernel and in system libraries (in addition to the CPU time spent in Java) here. He also wrote an ACMQueue article about flame graphs.
Isuru Perera wrote the tool to generate a Flame Graph from Java Flight Recordings, which is described in his blog.
Marcus Hirt's blog has some useful writeups about using the Java flight recorder and Java mission control -- e.g., http://hirt.se/blog/?p=364. There's also a great writeup about Java Mission Control on Takipi's blog.
This page describes in more detail how to remove OpenJDK and install Oracle JDK.