cutlass

Here are 23 public repositories matching this topic...

bytedance / flux

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

gpu cuda pytorch cutlass

Updated Aug 28, 2025
C++

NVlabs / vibetensor

Our first fully AI-generated deep learning system

machine-learning cuda pytorch cutlass vibe-coding

Updated Feb 2, 2026
Python

coderonion / awesome-cuda-and-hpc

🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.

Updated Aug 2, 2025

DD-DuDa / Cute-Learning

Examples of CUDA implementations using CUTLASS CuTe

gpu cuda cutlass

Updated Jul 1, 2025
Makefile

leimao / CUTLASS-Examples

CUTLASS and CuTe Examples

docker cuda cutlass

Updated Nov 30, 2025
Cuda

Bruce-Lee-LY / flash_attention_inference

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

gpu cuda inference nvidia cutlass mha multi-head-attention llm tensor-core large-language-model flash-attention flash-attention-2

Updated Feb 27, 2025
C++

bikrammajhi / 100-days-of-GPU

This is my 🔥 100 Days of GPU — a wild, hands-on journey through CUDA/CUTLASS kernels, Triton spells, and PTX sorcery.

mojo cuda triton cutlass ptx nsight-compute thunderkittens

Updated Mar 11, 2026
HTML

YashasSamaga / ConvolutionBuildingBlocks

GEMM and Winograd based convolutions using CUTLASS

deep-learning cuda convolution cutlass

Updated Jul 15, 2020
Cuda

yester31 / Cutlass_EX

A study of CUTLASS

cmake cuda cpp17 cutlass linux-programming parallel-programming

Updated Nov 10, 2024
Cuda

Bruce-Lee-LY / cutlass_gemm

Multiple GEMM operators built with CUTLASS to support LLM inference.

gpu cublas nvidia cutlass gemm cublaslt llm matrix-multiply tensor-core

Updated Aug 3, 2025
C++

qdLMF / LightGlue-with-FlashAttentionV2-TensorRT

A CUTLASS CuTe implementation of a head-dim-64 FlashAttention-2 TensorRT plugin for LightGlue. Runs on Jetson Orin NX 8GB with TensorRT 8.5.2.

cuda transformer cutlass cute tensorrt feature-matching multihead-attention superpoint lightglue flash-attention flash-attention-2

Updated Mar 3, 2025
Cuda

psmarter / CUDA-Practice

CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.

parallel-computing cuda high-performance-computing cuda-kernels quantization cutlass gemm performance-optimization nccl gpu-programming roofline-model tensor-core llm-inference flash-attention nsight-compute

Updated Mar 17, 2026
Cuda

sgl-project / whl

Kernel Library Wheel for SGLang

cuda cutlass sglang flashinfer

Updated Mar 16, 2026
HTML

cjmcv / ai-infra-notes

Reading notes on the open source code of AI infrastructure (sglang, llm, cutlass, hpc, etc.)

hpc gpu cuda inference simd cutlass heterogeneous-computing mlsys llm sglang

Updated Nov 2, 2025

Bruce-Lee-LY / DeepGEMMPerTensor

DeepGEMMPerTensor: clean and efficient FP8 GEMM per tensor kernels without scales

gpu cuda nvidia cutlass tensor-core deep-gemm fp8-gemm

Updated Sep 7, 2025
Python

digital-nomad-cheng / tvm_project_course

neural-network compiler cuda cutlass tensorrt tvm dl-compiler

Updated Nov 2, 2023
Python

steleman / nvshmem

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmers to perform one-sided communication from within CUDA kernels and on CUDA streams.

linux fedora cuda nvidia cutlass openshmem cuda-programming nvshmem

Updated Feb 19, 2026
C++

Routhleck / blocksparse-pytorch-implement

A PyTorch implementation of block-sparse operations

python cuda pytorch matrix-multiplication cutlass blocksparse tilesparse

Updated May 13, 2023
C++

prateekshukla1108 / cutlass3

Docs

cutlass

Updated May 14, 2025
HTML

peterlau123 / Lolly

Lightweight, production-level open-source C++ library

c cpp cuda cutlass

Updated May 7, 2025
C++
