
Build: Link CUDA runtime library statically by default #3867

Merged: maxhgerlach merged 3 commits into master from cudart_static on Mar 21, 2023
Conversation

@maxhgerlach (Collaborator) commented Mar 20, 2023

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This change follows the discussion in #3831 to avoid the NCCL error "Cuda failure 'CUDA driver is a stub library'".

  • CMAKE_CUDA_RUNTIME_LIBRARY setting is only effective with CMake 3.17+ (and there is no Horovod build environment variable to control it at this time); with CMake 3.13 the CUDA runtime library is always linked statically on my system. https://cmake.org/cmake/help/latest/variable/CMAKE_CUDA_RUNTIME_LIBRARY.html
  • added warning if dynamic CUDA RT (non-default) is combined with static NCCL (default)
  • added CMake status message printing full NCCL_VERSION_CODE (e.g., 21505)
  • made build environment variable HOROVOD_NCCL_LINK case-insensitive
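The case-insensitive handling mentioned in the last bullet can be sketched in shell. This is illustrative only: the function name `normalize_nccl_link` is invented here, and it simply mirrors what a `string(TOLOWER ...)` comparison in the CMake build does with the `HOROVOD_NCCL_LINK` value.

```shell
# Illustrative sketch: normalize HOROVOD_NCCL_LINK the way a
# case-insensitive CMake comparison (string(TOLOWER ...)) treats it.
normalize_nccl_link() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

for v in STATIC Static static SHARED Shared; do
  echo "$v -> $(normalize_nccl_link "$v")"
done
```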

Related code in the add_cuda macro that is meant to explicitly add libraries depending on CMAKE_CUDA_RUNTIME_LIBRARY may not actually have the intended effect (horovod/CMakeLists.txt, lines 183 to 188 in b1d0ce8):

```cmake
string(TOLOWER "${CMAKE_CUDA_RUNTIME_LIBRARY}" lowercase_CMAKE_CUDA_RUNTIME_LIBRARY)
if (lowercase_CMAKE_CUDA_RUNTIME_LIBRARY STREQUAL "static")
    list(APPEND LINKER_LIBS CUDA::cudart_static)
elseif (lowercase_CMAKE_CUDA_RUNTIME_LIBRARY STREQUAL "shared")
    list(APPEND LINKER_LIBS CUDA::cudart)
endif()
```

At least I couldn't get it to change anything with CMake 3.13: the build would always link the CUDA runtime library statically, regardless of what was appended to LINKER_LIBS there. Edit: Removed this in a follow-up commit, see below.

Open question: Should we add a build environment variable HOROVOD_CUDA_RUNTIME_LIBRARY so users can easily change this setting when pip installing Horovod? Caveat is that it would only be effective with CMake 3.17+.
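If such a variable were ever added, forwarding it to CMake could look roughly like the sketch below. To be clear about what is real and what is not: `HOROVOD_CUDA_RUNTIME_LIBRARY` and the helper `cuda_rt_cmake_flag` are hypothetical (the PR only raises the question), while `CMAKE_CUDA_RUNTIME_LIBRARY` and its accepted values `Static`, `Shared`, and `None` are the actual CMake variable, honored from CMake 3.17 on.

```shell
# Hypothetical sketch only: HOROVOD_CUDA_RUNTIME_LIBRARY does not exist in
# Horovod. This shows how such a variable could be mapped to the real CMake
# cache variable CMAKE_CUDA_RUNTIME_LIBRARY (values: Static, Shared, None).
cuda_rt_cmake_flag() {
  value="${1:-Static}"   # static linking as the new default
  case "$value" in
    Static|Shared|None)
      echo "-DCMAKE_CUDA_RUNTIME_LIBRARY=$value" ;;
    *)
      echo "unsupported HOROVOD_CUDA_RUNTIME_LIBRARY: $value" >&2
      return 1 ;;
  esac
}

cuda_rt_cmake_flag Static
```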

Fixes #3831, and it should hopefully also help with the CUDA 11.8 build problems seen in #3845.

- setting is only effective with CMake 3.17+ (and there is no Horovod build environment variable to control it at this time); with CMake 3.13 the CUDA runtime library is always linked statically on my system
- related code in add_cuda macro that adds libraries may not actually have any effect
- added warning if dynamic CUDA RT is combined with static NCCL (default) as this can lead to NCCL error "Cuda failure 'CUDA driver is a stub library'"
- added CMake status message printing full NCCL_VERSION_CODE (e.g., 21505)
- made build environment variable HOROVOD_NCCL_LINK case-insensitive

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@cliffwoolley commented:

Thanks for picking this up, @maxhgerlach!

@maxhgerlach (Collaborator, Author) commented Mar 20, 2023

Linker observation on master (with set(CMAKE_CUDA_RUNTIME_LIBRARY "Shared")), Buildkite build 9074, job test-gpu-gloo-py3_8-tf2_9_3-keras2_9_0-torch1_11_0-mxnet1_7_0_p1-pyspark3_3_1, using CMake 3.13:

/usr/bin/c++ -fPIC -I/usr/local/lib/python3.8/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=1 -DEIGEN_MAX_ALIGN_BYTES=64  -pthread -fPIC -Wall -ftree-vectorize -mf16c -mavx -mfma -O3 -g -DNDEBUG  -Wl,--version-script=/tmp/pip-req-build-aw9p2z72/horovod.lds -Wl,-Bsymbolic-functions -Wl,-z,relro,-z,now -shared -Wl,-soname,mpi_lib.cpython-38-x86_64-linux-gnu.so -o ../../../../lib.linux-x86_64-3.8/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so CMakeFiles/tensorflow.dir/__/common/common.cc.o CMakeFiles/tensorflow.dir/__/common/controller.cc.o CMakeFiles/tensorflow.dir/__/common/fusion_buffer_manager.cc.o CMakeFiles/tensorflow.dir/__/common/group_table.cc.o CMakeFiles/tensorflow.dir/__/common/half.cc.o CMakeFiles/tensorflow.dir/__/common/logging.cc.o CMakeFiles/tensorflow.dir/__/common/message.cc.o CMakeFiles/tensorflow.dir/__/common/operations.cc.o CMakeFiles/tensorflow.dir/__/common/parameter_manager.cc.o CMakeFiles/tensorflow.dir/__/common/process_set.cc.o CMakeFiles/tensorflow.dir/__/common/response_cache.cc.o CMakeFiles/tensorflow.dir/__/common/stall_inspector.cc.o CMakeFiles/tensorflow.dir/__/common/thread_pool.cc.o CMakeFiles/tensorflow.dir/__/common/timeline.cc.o CMakeFiles/tensorflow.dir/__/common/tensor_queue.cc.o CMakeFiles/tensorflow.dir/__/common/ops/collective_operations.cc.o CMakeFiles/tensorflow.dir/__/common/ops/operation_manager.cc.o CMakeFiles/tensorflow.dir/__/common/optim/bayesian_optimization.cc.o CMakeFiles/tensorflow.dir/__/common/optim/gaussian_process.cc.o CMakeFiles/tensorflow.dir/__/common/utils/env_parser.cc.o CMakeFiles/tensorflow.dir/__/common/ops/cuda_operations.cc.o CMakeFiles/tensorflow.dir/__/common/ops/gpu_operations.cc.o CMakeFiles/tensorflow.dir/__/common/ops/nccl_operations.cc.o CMakeFiles/tensorflow.dir/__/common/nvtx_op_range.cc.o CMakeFiles/tensorflow.dir/__/common/gloo/gloo_context.cc.o CMakeFiles/tensorflow.dir/__/common/gloo/gloo_controller.cc.o CMakeFiles/tensorflow.dir/__/common/gloo/http_store.cc.o 
CMakeFiles/tensorflow.dir/__/common/gloo/memory_store.cc.o CMakeFiles/tensorflow.dir/__/common/ops/gloo_operations.cc.o CMakeFiles/tensorflow.dir/mpi_ops.cc.o CMakeFiles/tensorflow.dir/xla_mpi_ops.cc.o CMakeFiles/tensorflow.dir/cmake_device_link.o  -L/usr/local/cuda/targets/x86_64-linux/lib -Wl,-rpath,/usr/local/cuda/lib64 /usr/lib/x86_64-linux-gnu/libnccl_static.a -ldl -L/usr/local/lib/python3.8/dist-packages/tensorflow -l:libtensorflow_framework.so.2 /usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so ../../third_party/gloo/gloo/libgloo.a ../common/ops/cuda/libhorovod_cuda_kernels.a /usr/local/cuda/lib64/libcudart.so -lpthread -lcudadevrt -lcudart_static -lrt -lpthread -ldl

-> This adds both dynamic /usr/local/cuda/lib64/libcudart.so and static -lcudart_static, which doesn't sound safe to me. I suppose -lcudart_static is added by CMake's enable_language(CUDA), which doesn't understand the CMAKE_CUDA_RUNTIME_LIBRARY setting at version 3.13. The libcudart.so is probably added by the code in our add_cuda macro, which depends on CMAKE_CUDA_RUNTIME_LIBRARY.
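The mixed linkage is easy to spot by filtering the linker tokens. The snippet below is only a sketch; `link_tail` is a hand-abridged excerpt of the command line above, not the full log.

```shell
# Filter CUDA-runtime tokens out of an (abridged) excerpt of the master
# linker line quoted above. Both the shared library path and the static
# -lcudart_static flag show up, illustrating the double linkage.
link_tail='/usr/local/cuda/lib64/libcudart.so -lpthread -lcudadevrt -lcudart_static -lrt'
for tok in $link_tail; do
  case "$tok" in
    *cudart*) echo "$tok" ;;
  esac
done
```

Note that `-lcudadevrt` does not match the `*cudart*` pattern, so only the two runtime entries are printed.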

--

The same observation for the CI run for this PR (with set(CMAKE_CUDA_RUNTIME_LIBRARY "Static")), Buildkite Build 9076, test-gpu-gloo-py3_8-tf2_9_3-keras2_9_0-torch1_11_0-mxnet1_7_0_p1-pyspark3_3_1, also using CMake 3.13:

/usr/bin/c++ -fPIC -I/usr/local/lib/python3.8/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=1 -DEIGEN_MAX_ALIGN_BYTES=64  -pthread -fPIC -Wall -ftree-vectorize -mf16c -mavx -mfma -O3 -g -DNDEBUG  -Wl,--version-script=/tmp/pip-req-build-jslmu5sl/horovod.lds -Wl,-Bsymbolic-functions -Wl,-z,relro,-z,now -shared -Wl,-soname,mpi_lib.cpython-38-x86_64-linux-gnu.so -o ../../../../lib.linux-x86_64-3.8/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so CMakeFiles/tensorflow.dir/__/common/common.cc.o CMakeFiles/tensorflow.dir/__/common/controller.cc.o CMakeFiles/tensorflow.dir/__/common/fusion_buffer_manager.cc.o CMakeFiles/tensorflow.dir/__/common/group_table.cc.o CMakeFiles/tensorflow.dir/__/common/half.cc.o CMakeFiles/tensorflow.dir/__/common/logging.cc.o CMakeFiles/tensorflow.dir/__/common/message.cc.o CMakeFiles/tensorflow.dir/__/common/operations.cc.o CMakeFiles/tensorflow.dir/__/common/parameter_manager.cc.o CMakeFiles/tensorflow.dir/__/common/process_set.cc.o CMakeFiles/tensorflow.dir/__/common/response_cache.cc.o CMakeFiles/tensorflow.dir/__/common/stall_inspector.cc.o CMakeFiles/tensorflow.dir/__/common/thread_pool.cc.o CMakeFiles/tensorflow.dir/__/common/timeline.cc.o CMakeFiles/tensorflow.dir/__/common/tensor_queue.cc.o CMakeFiles/tensorflow.dir/__/common/ops/collective_operations.cc.o CMakeFiles/tensorflow.dir/__/common/ops/operation_manager.cc.o CMakeFiles/tensorflow.dir/__/common/optim/bayesian_optimization.cc.o CMakeFiles/tensorflow.dir/__/common/optim/gaussian_process.cc.o CMakeFiles/tensorflow.dir/__/common/utils/env_parser.cc.o CMakeFiles/tensorflow.dir/__/common/ops/cuda_operations.cc.o CMakeFiles/tensorflow.dir/__/common/ops/gpu_operations.cc.o CMakeFiles/tensorflow.dir/__/common/ops/nccl_operations.cc.o CMakeFiles/tensorflow.dir/__/common/nvtx_op_range.cc.o CMakeFiles/tensorflow.dir/__/common/gloo/gloo_context.cc.o CMakeFiles/tensorflow.dir/__/common/gloo/gloo_controller.cc.o CMakeFiles/tensorflow.dir/__/common/gloo/http_store.cc.o 
CMakeFiles/tensorflow.dir/__/common/gloo/memory_store.cc.o CMakeFiles/tensorflow.dir/__/common/ops/gloo_operations.cc.o CMakeFiles/tensorflow.dir/mpi_ops.cc.o CMakeFiles/tensorflow.dir/xla_mpi_ops.cc.o CMakeFiles/tensorflow.dir/cmake_device_link.o  -L/usr/local/cuda/targets/x86_64-linux/lib /usr/lib/x86_64-linux-gnu/libnccl_static.a -ldl -L/usr/local/lib/python3.8/dist-packages/tensorflow -l:libtensorflow_framework.so.2 /usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so ../../third_party/gloo/gloo/libgloo.a ../common/ops/cuda/libhorovod_cuda_kernels.a /usr/local/cuda/lib64/libcudart_static.a /usr/lib/x86_64-linux-gnu/librt.so -ldl -lpthread -lcudadevrt -lcudart_static -lrt -lpthread -ldl

-> This adds both /usr/local/cuda/lib64/libcudart_static.a and -lcudart_static, so the static runtime is listed twice. Probably no harm done, but also nothing gained.

--

On CMake 3.13 (the minimum currently supported by Horovod) it doesn't seem to be possible to use enable_language(CUDA) and explicitly control how the CUDA runtime library is linked. It seems to default to static here, though. (Locally with CMake 3.25 I could also link the shared library.)

I think the code in add_cuda() that appends to LINKER_LIBS can just be removed.
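Since CMAKE_CUDA_RUNTIME_LIBRARY is only honored from CMake 3.17 on, any future override would need a version gate. A rough shell sketch (the helper name `cmake_supports_cuda_rt` is invented for illustration):

```shell
# Sketch: decide whether CMAKE_CUDA_RUNTIME_LIBRARY would be honored,
# given a CMake version string like "3.13.4" (threshold: 3.17).
cmake_supports_cuda_rt() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 17 ]; }
}

cmake_supports_cuda_rt "3.13.4" && echo yes || echo no   # Horovod's minimum
cmake_supports_cuda_rt "3.25.1" && echo yes || echo no
```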

…aneous and on CMake < 3.17 may contradict the setting imposed by `enable_language(CUDA)`

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@github-actions bot commented Mar 20, 2023

Unit Test Results

1 047 files (−70) · 1 047 suites (−70) · 12h 55m 55s ⏱️ (−48m 39s)
886 tests (±0): 770 ✔️ (−62), 116 💤 (+62), 0 ❌ (±0)
23 535 runs (−1 169): 16 154 ✔️ (−1 353), 7 381 💤 (+184), 0 ❌ (±0)

Results for commit da3c5c9. ± Comparison against base commit b1d0ce8.

This pull request skips 62 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_allgather_object
test.parallel.test_mxnet2.MX2Tests ‑ test_broadcast_object
test.parallel.test_mxnet2.MX2Tests ‑ test_compression_fp16
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allgather
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allgather_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allgather_process_sets
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allgather_type_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allgather_variable_size
…

♻️ This comment has been updated with latest results.

@github-actions bot commented Mar 20, 2023

Unit Test Results (with flaky tests)

1 210 files (−21) · 1 210 suites (−21) · 14h 22m 22s ⏱️ (+1m 20s)
886 tests (±0): 766 ✔️ (−66), 116 💤 (+62), 4 ❌ (+4)
27 394 runs (−196): 18 287 ✔️ (−908), 9 101 💤 (+706), 6 ❌ (+6)

For more details on these failures, see this check.

Results for commit da3c5c9. ± Comparison against base commit b1d0ce8.

This pull request skips the same 62 tests listed in the previous comment.

♻️ This comment has been updated with latest results.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@nvcastet (Collaborator) left a comment:

@maxhgerlach Thanks a lot for investigating the cudart/nccl linking and making the changes!

> Open question: Should we add a build environment variable HOROVOD_CUDA_RUNTIME_LIBRARY so users can easily change this setting when pip installing Horovod? Caveat is that it would only be effective with CMake 3.17+.

That is a good question. Since it would only work for specific versions of CMake, I would not worry about it for now unless we have a real use case for it.

@romerojosh (Collaborator) left a comment:

LGTM! Thanks for picking this up @maxhgerlach.

@maxhgerlach (Collaborator, Author) commented:
Thanks for the quick review!

> Open question: Should we add a build environment variable HOROVOD_CUDA_RUNTIME_LIBRARY so users can easily change this setting when pip installing Horovod? Caveat is that it would only be effective with CMake 3.17+.
>
> That is a good question. Since it would only work for specific versions of cmake, I would not worry about it for now except if we have a real use case for it.

OK, if we don't see a pressing use case for linking the shared runtime library now, we can leave it as it is. If we want to introduce the option eventually, it might make sense to just bump the CMake requirement to 3.17.

--

For completeness:

/usr/bin/c++ -fPIC -I/usr/local/lib/python3.8/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=1 -DEIGEN_MAX_ALIGN_BYTES=64  -pthread -fPIC -Wall -ftree-vectorize -mf16c -mavx -mfma -O3 -g -DNDEBUG  -Wl,--version-script=/tmp/pip-req-build-ksyvdwir/horovod.lds -Wl,-Bsymbolic-functions -Wl,-z,relro,-z,now -shared -Wl,-soname,mpi_lib.cpython-38-x86_64-linux-gnu.so -o ../../../../lib.linux-x86_64-3.8/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so CMakeFiles/tensorflow.dir/__/common/common.cc.o CMakeFiles/tensorflow.dir/__/common/controller.cc.o CMakeFiles/tensorflow.dir/__/common/fusion_buffer_manager.cc.o CMakeFiles/tensorflow.dir/__/common/group_table.cc.o CMakeFiles/tensorflow.dir/__/common/half.cc.o CMakeFiles/tensorflow.dir/__/common/logging.cc.o CMakeFiles/tensorflow.dir/__/common/message.cc.o CMakeFiles/tensorflow.dir/__/common/operations.cc.o CMakeFiles/tensorflow.dir/__/common/parameter_manager.cc.o CMakeFiles/tensorflow.dir/__/common/process_set.cc.o CMakeFiles/tensorflow.dir/__/common/response_cache.cc.o CMakeFiles/tensorflow.dir/__/common/stall_inspector.cc.o CMakeFiles/tensorflow.dir/__/common/thread_pool.cc.o CMakeFiles/tensorflow.dir/__/common/timeline.cc.o CMakeFiles/tensorflow.dir/__/common/tensor_queue.cc.o CMakeFiles/tensorflow.dir/__/common/ops/collective_operations.cc.o CMakeFiles/tensorflow.dir/__/common/ops/operation_manager.cc.o CMakeFiles/tensorflow.dir/__/common/optim/bayesian_optimization.cc.o CMakeFiles/tensorflow.dir/__/common/optim/gaussian_process.cc.o CMakeFiles/tensorflow.dir/__/common/utils/env_parser.cc.o CMakeFiles/tensorflow.dir/__/common/ops/cuda_operations.cc.o CMakeFiles/tensorflow.dir/__/common/ops/gpu_operations.cc.o CMakeFiles/tensorflow.dir/__/common/ops/nccl_operations.cc.o CMakeFiles/tensorflow.dir/__/common/nvtx_op_range.cc.o CMakeFiles/tensorflow.dir/__/common/gloo/gloo_context.cc.o CMakeFiles/tensorflow.dir/__/common/gloo/gloo_controller.cc.o CMakeFiles/tensorflow.dir/__/common/gloo/http_store.cc.o 
CMakeFiles/tensorflow.dir/__/common/gloo/memory_store.cc.o CMakeFiles/tensorflow.dir/__/common/ops/gloo_operations.cc.o CMakeFiles/tensorflow.dir/mpi_ops.cc.o CMakeFiles/tensorflow.dir/xla_mpi_ops.cc.o CMakeFiles/tensorflow.dir/cmake_device_link.o  -L/usr/local/cuda/targets/x86_64-linux/lib /usr/lib/x86_64-linux-gnu/libnccl_static.a -ldl -L/usr/local/lib/python3.8/dist-packages/tensorflow -l:libtensorflow_framework.so.2 /usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so ../../third_party/gloo/gloo/libgloo.a ../common/ops/cuda/libhorovod_cuda_kernels.a -lpthread -lcudadevrt -lcudart_static -lrt -lpthread -ldl

After the last changes on this PR, the library appears only once in the linker command line: -lcudart_static. (https://buildkite.com/horovod/horovod/builds/9079#01870096-f93a-4351-af3b-9626b9da31c2)
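A quick sanity-check sketch for that observation: count the cudart tokens in the final linker line. As before, `fixed_tail` is a hand-abridged excerpt of the command line above, used only for illustration.

```shell
# Count cudart entries in an (abridged) excerpt of the fixed linker line;
# after the PR, exactly one static runtime token should remain.
fixed_tail='libhorovod_cuda_kernels.a -lpthread -lcudadevrt -lcudart_static -lrt -lpthread -ldl'
count=0
for tok in $fixed_tail; do
  case "$tok" in
    *cudart*) count=$((count + 1)) ;;
  esac
done
echo "cudart entries: $count"
```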


Labels: none yet

Development: successfully merging this pull request may close the issue "NCCL WARN Cuda failure 'CUDA driver is a stub library', then segfault".

4 participants