DeepGEMM is a specialized library for efficient FP8 general matrix multiplications (GEMMs), supporting both normal (dense) and Mixture-of-Experts (MoE) grouped GEMMs. Written in CUDA for NVIDIA Hopper tensor cores, it compiles its kernels at runtime, keeps a simple design with about 300 lines of core code, and delivers performance that matches or exceeds other expert-tuned libraries.
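
To make the "grouped" case concrete, here is a minimal pure-Python sketch of what an MoE grouped matrix multiplication computes. It is a reference for the semantics only, not DeepGEMM's API or kernel: the names `matmul`, `grouped_matmul`, and `group_sizes` are hypothetical. It assumes the rows routed to each expert are stored contiguously, and it uses ordinary Python floats rather than FP8.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain (M x K) @ (K x N) product, used as the per-expert building block."""
    k = len(b)
    n = len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def grouped_matmul(x: Matrix, weights: List[Matrix], group_sizes: List[int]) -> Matrix:
    """Hypothetical reference for a grouped GEMM.

    Assumption: rows of `x` are laid out contiguously per expert, so the first
    group_sizes[0] rows use weights[0], the next group_sizes[1] rows use
    weights[1], and so on.
    """
    assert len(weights) == len(group_sizes), "one weight matrix per group"
    assert sum(group_sizes) == len(x), "group sizes must cover every row of x"
    out: Matrix = []
    start = 0
    for w, size in zip(weights, group_sizes):
        if size:  # an expert may receive no rows
            out.extend(matmul(x[start:start + size], w))
        start += size
    return out


# Three token rows with K = 2 and N = 2.
# The first two rows go to expert 0, the last row to expert 1.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w0 = [[1.0, 0.0], [0.0, 1.0]]  # expert 0: identity
w1 = [[2.0, 0.0], [0.0, 2.0]]  # expert 1: scale by 2
y = grouped_matmul(x, [w0, w1], [2, 1])
print(y)  # [[1.0, 2.0], [3.0, 4.0], [10.0, 12.0]]
```

A normal GEMM is the special case with a single group. The grouped form lets one call cover many experts whose row counts differ, which is the situation MoE layers produce.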