ML Engineer and Researcher with 4+ years of experience specializing in large language model (LLM) systems, GPU kernel engineering, and production machine learning infrastructure. Trained a 700M-parameter hybrid Mamba-2/Transformer LLM from scratch; implemented FlashAttention-2 CUDA kernels achieving 2.1x throughput over the PyTorch baseline; built a speculative decoding inference runtime delivering a 2.4x CPU inference speedup with mathematically identical output guarantees. Founding Engineer at a funded Web3 startup, with deep expertise in distributed systems, model alignment (RLHF, DPO, GRPO), mechanistic interpretability research, and scalable ML deployment. Proven track record bridging ML research and production engineering, from CUDA kernel profiling to architecture design to cloud-scale deployment on AWS.
2020 - Present: B.Tech student @ JSS Academy of Technical Education, Noida