kmhofmann/building_tensorflow.md

## building_tensorflow.md

      
Building TensorFlow from source (TF 2.1.0, Ubuntu 19.10)

Why build from source?

The official instructions on installing TensorFlow are here: https://www.tensorflow.org/install.
If you want to install TensorFlow just using pip, you are running a supported Ubuntu LTS distribution, and you're happy to install the respective tested CUDA versions (which often are outdated), by all means go ahead. A good alternative may be to run a Docker image.
I am usually unhappy with installing what in effect are pre-built binaries. These binaries are often not compatible with the Ubuntu version I am running, the CUDA version that I have installed, and so on. Furthermore, they may be slower than binaries optimized for the target architecture, since certain instructions are not being used (e.g. AVX2, FMA).
So installing TensorFlow from source becomes a necessity. The official instructions on building TensorFlow from source are here: https://www.tensorflow.org/install/install_sources.
What they don't mention there is that on supposedly "unsupported" configurations (i.e. up-to-date Linux systems), this can be a task from hell. In fact, building TensorFlow either way is a veritable clusterfuck. I don't know if that is due to the inherent complexity of such a framework or just lazy engineering, but the TensorFlow developers are certainly not trying to make one's life easy. My conservative guess is that quite a few developer years have been wasted out there because of the seemingly bonkers choices that have been made during TensorFlow development.
Or should I say: Building TensorFlow is as intuitive as using its API? ;-)
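Whether the point about unused instructions (AVX2, FMA) applies to your machine is easy to check. The following is a minimal sketch, assuming a Linux x86 system where /proc/cpuinfo has a "flags" line; the helper names cpu_flags and missing_features are made up for this example.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(flags, wanted=("avx2", "fma")):
    """List the wanted instruction-set extensions the CPU does not report."""
    return [name for name in wanted if name not in flags]


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # assumption: Linux; other systems lack this file
    if path.exists():
        flags = cpu_flags(path.read_text())
        print("missing:", missing_features(flags) or "none")
```

If nothing is reported missing, a build optimized for the local machine can at least make use of those extensions.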
Described configuration

I am describing the steps necessary to build TensorFlow in (currently) the following configuration:

Ubuntu 19.10
NVIDIA driver 440.44
CUDA 10.2 / cuDNN v7.6.5
TensorFlow v2.1.0

At the time of writing (2020-01-11), these were the latest available versions.
Note that I am not interested in running an outdated Ubuntu version (this includes the truly ancient 18.04 LTS), installing a CUDA/cuDNN version that is not the latest, or using a TensorFlow version that is not the latest. Regressing to any of these is nonsensical to me. Therefore, the instructions below may or may not be useful to you. Please also note that they are likely outdated, since I only update them occasionally. Don't just copy these instructions, but check what the respective latest versions are and use those instead!
Prerequisites

Installing the NVIDIA driver

Download and install the latest NVIDIA graphics driver from here: https://www.nvidia.com/en-us/drivers/unix/. Note that every CUDA version requires a minimum version of the driver; check this beforehand.
Ubuntu 19.10 offers installation of the NVIDIA driver version 435.00 through its built-in 'Additional Drivers' mechanism, but CUDA 10.2 requires a newer version that cannot be obtained this way.
The CUDA runfile also includes a version of the NVIDIA graphics driver, but I prefer to install the two separately, as installing them in combination can be more brittle on distributions that CUDA considers "unsupported".
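The "minimum driver version" check boils down to a numeric comparison of dotted version strings. Here is a small sketch; the minimum of 440.33 is an assumption based on the driver version bundled with the CUDA 10.2 runfile (look up the actual requirement in the CUDA release notes), and the function names are made up for illustration.

```python
def parse_version(text):
    """Turn '440.33.01' into (440, 33, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_sufficient(installed, minimum):
    """True if the installed driver version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Assumption: taken from the driver bundled with the CUDA 10.2 runfile.
MIN_DRIVER_FOR_CUDA_10_2 = "440.33"

# The installed version can be read with, e.g.:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
print(driver_is_sufficient("440.44", MIN_DRIVER_FOR_CUDA_10_2))  # True
print(driver_is_sufficient("435.00", MIN_DRIVER_FOR_CUDA_10_2))  # False
```

Comparing tuples of integers avoids the classic trap of comparing version strings lexically.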
Installing CUDA

Download the latest CUDA version here. For example, I downloaded:
$ wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run

Here's the first roadblock: Ubuntu 19.10 ships with GCC 9.2.1 by default, but CUDA 10.2 pretends to only support Ubuntu 18.04 and GCC versions up to version 8. When trying to install CUDA on an up-to-date system, it will fail.
Uhm... this is insane. I understand when code needs to be built with a certain minimum version of a compiler, but no well written piece of software ever should specify a maximum version.
You would now think that you can simply install GCC 8 (something along the lines of sudo apt install gcc-8 g++-8 and running CC=$(which gcc-8) CXX=$(which g++-8) ./cuda_10.2.89_440.33.01_linux.run as root) and be happy, but alas, no. The CUDA installer conveniently disregards any such environment variables.
Time for more desperate measures. Go ahead and install CUDA like this:
$ sudo sh cuda_10.2.89_440.33.01_linux.run --override

The --override flag overrides the compiler check, and you can now go on. Deselect the driver if it was installed earlier, but install the rest. Try to build the samples. You will notice that this fails, again with a message such as
unsupported GNU version! gcc versions later than 8 are not supported!

Thanks for nothing, NVIDIA. Thankfully we can disable this error by commenting out the #error directive in /usr/local/cuda/include/crt/host_config.h. Do so. This is what it looks like for me:
#if defined(__GNUC__)

#if __GNUC__ > 8

//#error -- unsupported GNU version! gcc versions later than 8 are not supported!

#endif /* __GNUC__ > 8 */

I have no idea what the implications are, but so far I haven't found any. There's a similar section on Clang just below, in case you decide to compile TensorFlow with Clang. (I have not tried yet, but it should be a good adventure.)
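If you would rather not edit the header by hand, the same change can be scripted. This is only a sketch: it works on the text of the header and comments out any #error line mentioning "unsupported GNU version"; the function name is made up, and writing the result back to /usr/local/cuda requires root and is left to you (keep a backup of the original file).

```python
import re

# Matches an active "#error ... unsupported GNU version ..." line,
# but not one that is already commented out with "//".
_GCC_CHECK = re.compile(
    r"^([ \t]*)(#error\b.*unsupported GNU version.*)$", re.MULTILINE
)


def disable_gcc_version_check(header_text):
    """Comment out the GCC version #error; everything else stays untouched."""
    return _GCC_CHECK.sub(r"\1//\2", header_text)


if __name__ == "__main__":
    sample = (
        "#if __GNUC__ > 8\n"
        "#error -- unsupported GNU version! gcc versions later than 8 are not supported!\n"
        "#endif /* __GNUC__ > 8 */\n"
    )
    print(disable_gcc_version_check(sample))
```

Running it twice on the same text changes nothing the second time, so it is safe to re-apply after reinstalling CUDA.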
Installing cuDNN

Just go here and follow the instructions. You'll have to log in, so downloading of the right cuDNN binary packages cannot be easily automated. Meh.
System packages

According to the official instructions, TensorFlow requires Python and pip:
$ sudo apt install python3-dev python3-pip

Installing Bazel

Bazel is Google's monster of a build system and is required to build TensorFlow.
Google apparently did not want to make developers' lives easy and use a de-facto standard build system such as CMake. Life could be so nice. No, Google is big and dangerous enough to force their own creation upon everyone and thus make everyone else's life miserable.
I wouldn't complain if Bazel was nice and easy to use. But I don't think there was a single time when I built TensorFlow and did not have issues with Bazel.
And oh my, there are some issues right here:
There are instructions on how to install Bazel using Ubuntu's APT repository mechanism. Forget those, they won't work for our purposes. Neither will compiling the latest Bazel version (2.0.0 at the time of writing) from source.
This is because TensorFlow actually requires a pretty old version of Bazel (0.29.1, as opposed to 2.0.0 or greater) to be built with. I don't know if this says more about the state of Bazel or TensorFlow, but either way, it's not confidence inducing.
Okay, so let's just try to build the latest supported version of Bazel, 0.29.1, from source. We simply install the prerequisites mentioned in the instructions above, download the respective distribution build, compile it with env EXTRA_BAZEL_ARGS="--host_javabase=@local_jdk//:jdk" bash ./compile.sh, and... it doesn't compile. :-( There's a beautiful error message saying error: ambiguating new declaration of 'long int gettid()'.
Long story short: some dependency of Bazel (gRPC) used some function names it shouldn't have been using, and fails in combination with glibc 2.30. This was fixed upstream several months ago, but Bazel developers didn't bother to fix it in a maintenance release (0.29.2 anyone?). They only updated the dependency in Bazel 2.0.0, which does not work with TensorFlow. D'oh.
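To find out whether your system is affected at all, check the glibc version: the clash only appears from glibc 2.30 on, since that release started declaring its own gettid(). A minimal sketch using Python's standard platform module (the helper name is made up for this example):

```python
import platform


def gettid_clash_expected(glibc_version):
    """True for glibc >= 2.30, which declares gettid() itself."""
    major, minor = (int(part) for part in glibc_version.split(".")[:2])
    return (major, minor) >= (2, 30)


if __name__ == "__main__":
    libc, version = platform.libc_ver()
    if libc == "glibc" and version:
        print("glibc", version, "-> clash expected:", gettid_clash_expected(version))
    else:
        print("could not determine the glibc version")
```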
Anyway. The easiest way to use Bazel for compiling TensorFlow 2.1.0 on Ubuntu 19.10 that I know of is to download a pre-built binary, e.g. using
$ wget https://github.com/bazelbuild/bazel/releases/download/0.29.1/bazel-0.29.1-linux-x86_64
$ mv bazel-0.29.1-linux-x86_64 bazel   # and make sure this is on the PATH
$ chmod +x bazel                       # the downloaded file is not executable yet

This is utterly sad.
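Since picking up the wrong Bazel is an easy mistake to make, it may be worth verifying which version is actually found on the PATH before starting the build. A sketch, assuming that `bazel version` prints a "Build label: X.Y.Z" line; the helper name is made up for illustration.

```python
import shutil
import subprocess

REQUIRED_BAZEL = "0.29.1"  # the version used in this guide for TensorFlow 2.1.0


def parse_bazel_version(output):
    """Extract the version from the 'Build label:' line of `bazel version`."""
    for line in output.splitlines():
        if line.startswith("Build label:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    if shutil.which("bazel") is None:
        print("bazel not found on PATH")
    else:
        result = subprocess.run(["bazel", "version"], capture_output=True, text=True)
        found = parse_bazel_version(result.stdout)
        status = "OK" if found == REQUIRED_BAZEL else "mismatch"
        print("found:", found, "required:", REQUIRED_BAZEL, "->", status)
```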
Building TensorFlow

Guess what: not fun either. Actually, the same issue of the gRPC dependency that plagued us with Bazel is coming back here. And this time, we have no choice but to actually fix it.
Cloning and patching

First clone the sources, and check out the desired branch. At the time of writing, v2.1.0 was the latest version; adjust if necessary.
  $ git clone https://github.com/tensorflow/tensorflow
  $ cd tensorflow
  $ git checkout v2.1.0

If we now just went ahead and tried to build TensorFlow, we would soon hit the same beautiful error message that we hit when trying to compile Bazel 0.29.1.
To fix this, I have recreated the proposed fix on the sources that get downloaded by Bazel. See the resulting patch file in the Appendix below. Create a file named grpc_gettid_fix.patch and add it to the ./third_party directory of the TensorFlow repository.
We now need to tell Bazel to apply the patch, by adding it to the workspace file (tensorflow/workspace.bzl). See the Appendix for the diff, which also fixes another issue that would hit us during the build step (it removes the --bin2c-path argument in third_party/nccl/build_defs.bzl.tpl). Apply this diff manually - it's only one line in each of the two files. (I'm not providing a complete, unified patch file here, because it's likely only valid and applicable for a short amount of time.)
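For the workspace file, the manual edit amounts to adding one attribute to the grpc rule. If you prefer to script it, here is a sketch that operates on the file's text; it assumes the rule contains a line reading exactly `name = "grpc",` (as in the Appendix diff), and the function name is made up for this example.

```python
def add_grpc_patch(workspace_text, patch_label="//third_party:grpc_gettid_fix.patch"):
    """Insert a patch_file attribute after the `name = "grpc",` line, once."""
    new_attr = 'patch_file = clean_dep("%s"),' % patch_label
    if new_attr in workspace_text:
        return workspace_text  # already patched, nothing to do
    out = []
    for line in workspace_text.splitlines(keepends=True):
        out.append(line)
        if line.strip() == 'name = "grpc",':
            # Reuse the indentation of the line we matched.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + new_attr + "\n")
    return "".join(out)


if __name__ == "__main__":
    sample = '    tf_http_archive(\n        name = "grpc",\n        sha256 = "...",\n    )\n'
    print(add_grpc_patch(sample))
```

Compare the result against the Appendix diff with git diff before building; if the file layout has changed upstream, the exact-match assumption will silently do nothing.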
Configuration

Create a Python 3 virtual environment, if you have not done this yet. For example:
  $ python3 -m venv ~/.virtualenvs/tf_dev

Activate it with source ~/.virtualenvs/tf_dev/bin/activate. This can later be deactivated with deactivate.
Install the Python packages mentioned in the official instructions:
$ pip install -U pip six numpy wheel setuptools mock 'future>=0.17.1'
$ pip install -U keras_applications --no-deps
$ pip install -U keras_preprocessing --no-deps

(If you choose to not use a virtual environment, you'll need to add --user to each of the above commands.)
Now run the TensorFlow configuration script:
  $ ./configure

We all like interactive scripts called ./configure, don't we? (Whoever devised this atrocity has never used GNU tools before.)
Carefully go through the options. You can leave most defaults, but do specify the required CUDA compute capabilities (as below, or similar):
  CUDA support -> Y
  CUDA compute capability -> 5.2,6.1,7.0

Some of the compute capabilities of popular GPU cards might be good to know:

Maxwell TITAN X: 5.2
Pascal TITAN X (2016): 6.1
GeForce GTX 1080 Ti: 6.1
Tesla V100: 7.0

(See here for the full list.)
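If you build for several machines, the comma-separated capability string can be derived from the cards you care about. A small sketch using only the values listed above; the dictionary and function names are made up for illustration.

```python
# Values copied from the list above; extend as needed.
COMPUTE_CAPABILITY = {
    "Maxwell TITAN X": "5.2",
    "Pascal TITAN X (2016)": "6.1",
    "GeForce GTX 1080 Ti": "6.1",
    "Tesla V100": "7.0",
}


def capability_list(gpus):
    """Return a sorted, de-duplicated, comma-separated capability string."""
    caps = {COMPUTE_CAPABILITY[name] for name in gpus}
    ordered = sorted(caps, key=lambda c: tuple(int(p) for p in c.split(".")))
    return ",".join(ordered)


print(capability_list(COMPUTE_CAPABILITY))  # 5.2,6.1,7.0
```

Every additional capability makes the build take longer and the binaries bigger, so list only what you actually need.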
Building

Now we can start the TensorFlow build process.
$ bazel build --config=opt -c opt //tensorflow/tools/pip_package:build_pip_package

Totally intuitive, right? :-D
This command will build TensorFlow using optimized settings for the current machine architecture.


Add -c dbg --strip=never in case you do not want debug symbols to be stripped (e.g. for debugging purposes). Note that -c is the short form of --compilation_mode, so this also switches to a debug build.
Usually, you won't need to add this option.


Add --compilation_mode=dbg (or its short form, -c dbg) to build in debug instead of release mode, i.e. without optimizations.
You shouldn't do this unless you really want to.


This will take some time. Have a coffee, or two, or three. Cook some dinner. Watch a movie.
Building & installing the Python package

Once the above build step has completed without error, the remainder is easy. Build the Python package, which the build_pip_package script puts into the directory given as its argument (outside of the build tree, yay! </s>).
  $ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

And install the built wheel package:
  $ pip install /tmp/tensorflow_pkg/tensorflow-2.1.0-cp37-cp37m-linux_x86_64.whl
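The exact wheel file name depends on your Python version (cp37 above stands for CPython 3.7), so pip will refuse a wheel built for a different interpreter. Here is a sketch that splits the file name into its tags and compares the Python tag with the running interpreter. It assumes CPython and a file name without the optional build tag; the helper names are made up for this example.

```python
import sys


def wheel_tags(filename):
    """Split a wheel file name into name, version, python, abi and platform."""
    stem = filename[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def matches_interpreter(filename, version_info=None):
    """True if the wheel's Python tag matches the given (or running) CPython."""
    major, minor = (version_info or sys.version_info)[:2]
    return wheel_tags(filename)["python"] == "cp%d%d" % (major, minor)


if __name__ == "__main__":
    wheel = "tensorflow-2.1.0-cp37-cp37m-linux_x86_64.whl"
    print(wheel_tags(wheel))
    print("matches this interpreter:", matches_interpreter(wheel))
```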

Testing the installation

Google suggests testing the TensorFlow installation with the following command:
$ python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

This does not make explicit use of CUDA yet, but will emit a whole bunch of initialization messages that can give an indication whether all libraries could be loaded. And it should print that requested sum.
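For a more explicit check of the CUDA side, you can ask TensorFlow whether it was built with CUDA and which GPUs it sees. A minimal sketch, not part of the official instructions; the function name gpu_report is made up, and it returns None if TensorFlow cannot be imported.

```python
def gpu_report():
    """Return build and device info, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        "version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [device.name for device in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    print(gpu_report())
```

An empty GPU list with built_with_cuda set to True usually points at a driver or library loading problem rather than at the build itself; the initialization messages should tell you which library failed to load.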
It worked? Great! Be happy and hope you won't have to build TensorFlow again any time soon...

Appendix

Content of the grpc_gettid_fix.patch file, to be added to the third_party directory

diff -rc grpc/src/core/lib/gpr/log_linux.cc grpc-patched/src/core/lib/gpr/log_linux.cc
*** grpc/src/core/lib/gpr/log_linux.cc  2019-04-03 21:06:27.000000000 +0200
--- grpc-patched/src/core/lib/gpr/log_linux.cc  2019-12-22 17:17:02.000000000 +0100
***************
*** 40,46 ****
  #include <time.h>
  #include <unistd.h>

! static long gettid(void) { return syscall(__NR_gettid); }

  void gpr_log(const char* file, int line, gpr_log_severity severity,
               const char* format, ...) {
--- 40,46 ----
  #include <time.h>
  #include <unistd.h>

! static long sys_gettid(void) { return syscall(__NR_gettid); }

  void gpr_log(const char* file, int line, gpr_log_severity severity,
               const char* format, ...) {
***************
*** 70,76 ****
    gpr_timespec now = gpr_now(GPR_CLOCK_REALTIME);
    struct tm tm;
    static __thread long tid = 0;
!   if (tid == 0) tid = gettid();

    timer = static_cast<time_t>(now.tv_sec);
    final_slash = strrchr(args->file, '/');
--- 70,76 ----
    gpr_timespec now = gpr_now(GPR_CLOCK_REALTIME);
    struct tm tm;
    static __thread long tid = 0;
!   if (tid == 0) tid = sys_gettid();

    timer = static_cast<time_t>(now.tv_sec);
    final_slash = strrchr(args->file, '/');
diff -rc grpc/src/core/lib/gpr/log_posix.cc grpc-patched/src/core/lib/gpr/log_posix.cc
*** grpc/src/core/lib/gpr/log_posix.cc  2019-04-03 21:06:27.000000000 +0200
--- grpc-patched/src/core/lib/gpr/log_posix.cc  2019-12-22 17:17:30.000000000 +0100
***************
*** 31,37 ****
  #include <string.h>
  #include <time.h>

! static intptr_t gettid(void) { return (intptr_t)pthread_self(); }

  void gpr_log(const char* file, int line, gpr_log_severity severity,
               const char* format, ...) {
--- 31,37 ----
  #include <string.h>
  #include <time.h>

! static intptr_t sys_gettid(void) { return (intptr_t)pthread_self(); }

  void gpr_log(const char* file, int line, gpr_log_severity severity,
               const char* format, ...) {
***************
*** 86,92 ****
    char* prefix;
    gpr_asprintf(&prefix, "%s%s.%09d %7" PRIdPTR " %s:%d]",
                 gpr_log_severity_string(args->severity), time_buffer,
!                (int)(now.tv_nsec), gettid(), display_file, args->line);

    fprintf(stderr, "%-70s %s\n", prefix, args->message);
    gpr_free(prefix);
--- 86,92 ----
    char* prefix;
    gpr_asprintf(&prefix, "%s%s.%09d %7" PRIdPTR " %s:%d]",
                 gpr_log_severity_string(args->severity), time_buffer,
!                (int)(now.tv_nsec), sys_gettid(), display_file, args->line);

    fprintf(stderr, "%-70s %s\n", prefix, args->message);
    gpr_free(prefix);
diff -rc grpc/src/core/lib/iomgr/ev_epollex_linux.cc grpc-patched/src/core/lib/iomgr/ev_epollex_linux.cc
*** grpc/src/core/lib/iomgr/ev_epollex_linux.cc 2019-04-03 21:06:27.000000000 +0200
--- grpc-patched/src/core/lib/iomgr/ev_epollex_linux.cc 2019-12-22 17:18:12.000000000 +0100
***************
*** 1103,1109 ****
  }

  #ifndef NDEBUG
! static long gettid(void) { return syscall(__NR_gettid); }
  #endif

  /* pollset->mu lock must be held by the caller before calling this.
--- 1103,1109 ----
  }

  #ifndef NDEBUG
! static long sys_gettid(void) { return syscall(__NR_gettid); }
  #endif

  /* pollset->mu lock must be held by the caller before calling this.
***************
*** 1123,1129 ****
  #define WORKER_PTR (&worker)
  #endif
  #ifndef NDEBUG
!   WORKER_PTR->originator = gettid();
  #endif
    if (grpc_polling_trace.enabled()) {
      gpr_log(GPR_INFO,
--- 1123,1129 ----
  #define WORKER_PTR (&worker)
  #endif
  #ifndef NDEBUG
!   WORKER_PTR->originator = sys_gettid();
  #endif
    if (grpc_polling_trace.enabled()) {
      gpr_log(GPR_INFO,

git diff of the TensorFlow repository, identifying modified files

$ git diff
diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 77e605fe76..d2dcef48d7 100755
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -514,6 +514,7 @@ def tf_repositories(path_prefix = "", tf_repo_name = ""):
     # WARNING: make sure ncteisen@ and vpai@ are cc-ed on any CL to change the below rule
     tf_http_archive(
         name = "grpc",
+        patch_file = clean_dep("//third_party:grpc_gettid_fix.patch"),
         sha256 = "67a6c26db56f345f7cee846e681db2c23f919eba46dd639b09462d1b6203d28c",
         strip_prefix = "grpc-4566c2a29ebec0835643b972eb99f4306c4234a3",
         system_build_file = clean_dep("//third_party/systemlibs:grpc.BUILD"),
diff --git a/third_party/nccl/build_defs.bzl.tpl b/third_party/nccl/build_defs.bzl.tpl
index 5719139855..5f5c3a1008 100644
--- a/third_party/nccl/build_defs.bzl.tpl
+++ b/third_party/nccl/build_defs.bzl.tpl
@@ -113,7 +113,6 @@ def _device_link_impl(ctx):
             "--cmdline=--compile-only",
             "--link",
             "--compress-all",
-            "--bin2c-path=%s" % bin2c.dirname,
             "--create=%s" % tmp_fatbin.path,
             "--embedded-fatbin=%s" % fatbin_h.path,
         ] + images,