Surgically making neural networks safe!
About Me
My research addresses one of AI safety's hardest problems: models can think one thing internally while saying another—meaning we can't verify true alignment just by checking outputs (Rosati et al., 2024). This drives my focus on white-box analysis: using mechanistic interpretability to see what's actually happening inside models.
Foundation
My work builds on two survey papers: one connecting causal reasoning to ML trustworthiness (Chaudhary et al., 2024), and another on mechanistic interpretability (Geiger et al., 2025). Together, these give me a causal-mechanistic framework for investigating how models work from the inside out.
Key Findings
- Models can detect when they're being evaluated—and this ability increases with scale (Chaudhary et al., 2025). If models behave differently during testing versus deployment, we can't trust safety evaluations.
- Models leave distinct attention signatures when generating harmful content, enabling roughly 95% detection accuracy (Chaudhary et al., 2025); a minimal probing sketch follows this list.
- Models can reach harmful outputs through multiple pathways (SafetyNet, Chaudhary et al., 2025). Blocking one route may just cause rerouting.
- Models shift information toward final tokens, using punctuation as intermediate storage (Chauhan et al., 2025).
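To make the attention-signature finding concrete, here is a minimal sketch of the white-box monitoring recipe behind it: summarize internal activations for each generation, then train a lightweight probe to flag harmful completions. Everything below is synthetic and illustrative (random features with an injected signal); in the actual work the features come from the model's attention patterns, not random vectors.

```python
# Minimal sketch of white-box harmful-output detection: a linear probe over
# per-generation internal-activation summaries. All data here is synthetic;
# in practice the feature rows would be attention-pattern statistics captured
# from a real model's forward pass, not random vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 256   # e.g. pooled attention stats per layer/head

# Hypothetical dataset: one row per generation, label = judged harmful or not.
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)
X[y == 1, :8] += 1.5                # inject a weak "signature" so the probe has signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out detection accuracy: {probe.score(X_te, y_te):.1%}")
```

The point of the design is that the detector reads internal state rather than the output text, so it does not rely on the model admitting anything in its answer.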
Solutions
- Made Chain-of-Thought more faithful (Swaroop et al., 2025) and better calibrated (More et al., 2025).
- Reduced privacy leakage in CoT reasoning via activation steering (Batra et al., 2025); a minimal steering sketch follows this list.
- Incorporated alignment constraints into circuit pruning (Patel et al., 2025).
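The steering idea referenced in the second bullet can be sketched as adding a fixed vector to a block's activations during the forward pass. The toy block, steering vector, and strength below are assumptions for illustration, not the SALT implementation; one common recipe estimates the vector from contrastive activations (e.g., leaky versus leakage-free reasoning traces).

```python
# Minimal sketch of activation steering with a PyTorch forward hook. The toy
# block, steering vector, and strength are illustrative assumptions, not the
# SALT implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 64

# Toy stand-in for one transformer block acting on the residual stream.
block = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))

steer = torch.randn(hidden)
steer = steer / steer.norm()        # unit steering direction
alpha = 4.0                         # steering strength

def add_steering(module, inputs, output):
    # Returning a tensor from a forward hook replaces the block's output.
    return output + alpha * steer

handle = block.register_forward_hook(add_steering)
x = torch.randn(2, hidden)          # a batch of residual-stream activations
steered = block(x)
handle.remove()                     # intervention detaches cleanly; weights untouched
unsteered = block(x)

print("mean shift along steering direction:",
      ((steered - unsteered) @ steer).mean().item())   # approximately alpha
```

Because the intervention lives in a removable hook, the base weights stay untouched, which keeps this kind of edit cheap to apply and to undo.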
Current Focus: Model Organisms of Hyperawareness
Can we trust safety evaluations of AI systems that know they're being tested?
My research shows this awareness is already measurable and growing with scale. Without intervention, we risk models that fake alignment during testing.
Two threats: (1) models naturally developing evaluation-detection as they scale; (2) bad actors intentionally building models that hide misaligned goals until deployment.
My approach: Build controlled "model organisms" that are hyperaware of evaluation, then develop and test countermeasures. By designing organisms with known awareness mechanisms—from simple to sophisticated—I can find where defenses work and where they fail.
Key insight: My SafetyNet work shows models use multiple pathways to reach outputs. Blocking one awareness "direction" may just cause rerouting (see the sketch below). I'm studying how flexibly networks encode self-awareness.
The goal: Move from filtering bad outputs to preventing bad internal states—making safety verifiable inside the model, not just outside.
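To make the "direction" language concrete, here is a minimal sketch of ablating a candidate awareness direction by projecting it out of the residual stream; the direction and activations are synthetic stand-ins. The ablation zeroes the component along one direction while leaving nearly orthogonal directions (and whatever they encode) essentially intact, which is the rerouting concern in miniature.

```python
# Minimal sketch of ablating a candidate "awareness direction" by orthogonal
# projection. Direction and activations are synthetic stand-ins.
import torch

torch.manual_seed(0)
d_model = 128

v = torch.randn(d_model)
v = v / v.norm()                    # candidate evaluation-awareness direction
h = torch.randn(8, d_model)         # batch of residual-stream activations

def ablate(h, direction):
    """Remove the component of h along `direction`."""
    coeff = h @ direction           # (batch,)
    return h - coeff[:, None] * direction

h_abl = ablate(h, v)
print("mean |component along v| before:", (h @ v).abs().mean().item())
print("mean |component along v| after: ", (h_abl @ v).abs().mean().item())  # ~0

# A second, nearly orthogonal direction is left essentially intact, so any
# related signal encoded there survives the single-direction ablation.
w = torch.randn(d_model)
w = w / w.norm()
print("mean |component along w| after: ", (h_abl @ w).abs().mean().item())
```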
🔬 Current Research
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
Maheep Chaudhary, F. Barez
Under review
Evaluation Awareness Scales Predictably in Open-Weights Large Language Models
Maheep Chaudhary†, I. Su, N. Hooda, N. Shankar, J. Tan, K. Zhu, A. Panda, R. Lagasse, V. Sharma
NeurIPS 2025 Responsible FM Workshop
SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought
S. Batra, P. Tillman, S. Gaggar, S. Kesineni, S. Dev, K. Zhu, A. Panda, Maheep Chaudhary†
NeurIPS 2025 Responsible FM Workshop
Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits
D. Patel, G. Gervacio, D. Raimi, K. Zhu, R. Lagasse, G. Grand, A. Panda, Maheep Chaudhary†
NeurIPS 2025 Responsible FM Workshop
Optimizing Chain-of-Thought Confidence via Topological and Dirichlet Risk Analysis
A. More, A. Zhang, N. Bonilla, A. Vivekan, K. Zhu, P. Sharafoleslami, Maheep Chaudhary†
NeurIPS 2025 Responsible FM Workshop
FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness
A. Swaroop, A. Nallani, S. Uboweja, A. Uzdenova, M. Nguyen, K. Zhu, S. Dev, A. Panda, V. Sharma, Maheep Chaudhary†
NeurIPS 2025 FoRLM Workshop
Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
N. Egbuna, S. Gaur, S. Dev, A. Panda, Maheep Chaudhary†
NeurIPS 2025 Efficient Reasoning Workshop
PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases
S. V. Vuddanti, A. Shah, S. K. Chittiprolu, T. Song, S. Dev, K. Zhu, Maheep Chaudhary†
arXiv preprint
Hydra: A Modular Architecture for Efficient Long-Context Reasoning
S. Chaudhary, D. Patel, Maheep Chaudhary, B. Browning
NeurIPS 2025 Efficient Reasoning Workshop
MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification
S. B. Shah, S. Shiwakoti, Maheep Chaudhary, H. Wang
EMNLP 2024
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary, A. Geiger
arXiv preprint
Punctuation and Predicates in Language Models
S. Chauhan, Maheep Chaudhary, K. Choy, S. Nellessen, N. Schoots
arXiv preprint
📑 Literature Surveys & Theory
Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
A. Geiger, D. Ibeling, A. Zur, Maheep Chaudhary, S. Chauhan, J. Huang, A. Arora, Z. Wu, N. Goodman, C. Potts, T. Icard
JMLR 2024
Towards Trustworthy and Aligned Machine Learning: A Data-centric Survey with Causality Perspectives
Maheep Chaudhary*, H. Liu*, H. Wang
arXiv preprint
📚 Additional Publications
Modular Training of Neural Networks aids Interpretability
S. Golechha, Maheep Chaudhary, J. Velja, A. Abate, N. Schoots
arXiv preprint
An Intelligent Recommendation cum Reminder System
R. Saxena, Maheep Chaudhary, C.K. Maurya, S. Prasad
ACM IKDD CODS & COMAD 2022
CQFaRAD: Collaborative Query-Answering Framework for a Research Article Dataspace
M. Singh, S. Pandey, R. Saxena, Maheep Chaudhary, N. Lal
ACM COMPUTE 2021
Background & Recognition
Winner of the Smart India Hackathon (200K+ participants) and team leader at the ASEAN-India Hackathon, which spanned 10+ countries. Mentored 40+ students, selected for the UNESCO-India-Africa Program (20+ countries), and served as a reviewer for ICML 2025 and NeurIPS 2024 workshops.
