Releases
v0.28.0: Keras 2.11+ optimizers, faster reducescatter, fixes for latest TensorFlow, CUDA, NCCL
Added
TensorFlow: Added a new get_local_and_global_gradients method to PartialDistributedGradientTape to retrieve local and non-local gradients separately. (#3859)
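A minimal sketch of how this might be used. The local_layers argument and the method signature shown below are assumptions based on the existing PartialDistributedGradientTape API; the exact signature of get_local_and_global_gradients is not stated in this entry.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
inputs = tf.random.uniform((16, 4))
labels = tf.random.uniform((16, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(inputs) - labels))

# Gradients of layers listed in local_layers stay worker-local; all other
# gradients are allreduced across workers.
tape = hvd.PartialDistributedGradientTape(tape, local_layers=[model.layers[0]])

# Assumed signature: returns the worker-local and the globally reduced
# gradients as two separate lists instead of one merged list.
local_grads, global_grads = tape.get_local_and_global_gradients(
    loss, model.trainable_variables)
```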
Changed
Improved reducescatter performance by allocating output tensors before enqueuing the operation. (#3824)
TensorFlow: Ensured that tf.logical_and within the allreduce tf.cond runs on CPU. (#3885)
TensorFlow: Added support for Keras 2.11+ optimizers; see the sketch after this list. (#3860)
The CUDA_VISIBLE_DEVICES environment variable is no longer passed to remote nodes. (#3865)
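As referenced above, a minimal sketch of wrapping an optimizer with hvd.DistributedOptimizer; with this release the wrapped optimizer may be a Keras 2.11+ optimizer. The toy model and hyperparameters are illustrative only.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker to one GPU based on its local rank.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(4,)),
                             tf.keras.layers.Dense(1)])

# Scale the learning rate by the number of workers, then wrap the optimizer.
opt = tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss='mse')
```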
Fixed
Fixed the build with ROCm. (#3839, #3848)
Fixed the build of the horovod-nvtabular Docker image. (#3851)
Fixed linking against recent NCCL versions by defaulting CUDA runtime library linkage to static and ensuring that weak symbols are overridden. (#3867, #3846)
Fixed compatibility with TensorFlow 2.12 and recent nightly versions. (#3864, #3894, #3906, #3907)
Fixed missing arguments of the Keras allreduce function; see the sketch after this list. (#3905)
Updated the with_device functions in MXNet and PyTorch to skip unnecessary cudaSetDevice calls. (#3912)
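As noted in the Keras allreduce item above, a minimal sketch of the wrapper that was fixed; which keyword arguments were previously missing is not stated in this entry, so only the basic call is shown.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Each worker contributes a different tensor; allreduce averages them
# across all workers by default.
value = tf.constant([1.0, 2.0, 3.0]) * float(hvd.rank() + 1)
averaged = hvd.allreduce(value)
```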