Surgically making neural networks safe!
About Me
My research addresses one of AI safety's hardest problems: models can think one thing internally while saying another—meaning we can't verify true alignment just by checking outputs (Rosati et al., 2024). This drives my focus on white-box analysis: using mechanistic interpretability to see what's actually happening inside models.
Foundation
My work builds on two survey papers: one connecting causal reasoning to ML trustworthiness (Chaudhary et al., 2024), and another on mechanistic interpretability (Geiger et al., 2025). Together, these give me a causal-mechanistic framework for investigating how models work from the inside out.
Key Findings
- Models can detect when they're being evaluated—and this ability increases with scale (Chaudhary et al., 2025). If models behave differently during testing versus deployment, we can't trust safety evaluations.
- Models leave distinct attention signatures when generating harmful content, enabling roughly 95% detection accuracy (Chaudhary et al., 2025); a minimal probing sketch follows this list.
- Models can reach harmful outputs through multiple pathways (SafetyNet, Chaudhary et al., 2025). Blocking one route may just cause rerouting.
- Models shift information toward final tokens, using punctuation as intermediate storage (Chauhan et al., 2025).
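To make the attention-signature finding concrete, here is a minimal sketch of the white-box monitoring recipe behind it: summarize internal activations for each generation, then train a lightweight probe to flag harmful completions. Everything below is synthetic and illustrative (random features with an injected signal); in the actual work the features come from the model's attention patterns, not random vectors.

```python
# Minimal sketch of white-box harmful-output detection: a linear probe over
# per-generation internal-activation summaries. All data here is synthetic;
# in practice the feature rows would be attention-pattern statistics captured
# from a real model's forward pass, not random vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 256   # e.g. pooled attention stats per layer/head

# Hypothetical dataset: one row per generation, label = judged harmful or not.
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)
X[y == 1, :8] += 1.5                # inject a weak "signature" so the probe has signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out detection accuracy: {probe.score(X_te, y_te):.1%}")
```

The point of the design is that the detector reads internal state rather than the output text, so it does not rely on the model admitting anything in its answer.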
Solutions
- Made Chain-of-Thought more faithful (Swaroop et al., 2025) and better calibrated (More et al., 2025).
- Reduced privacy leakage in CoT reasoning via activation steering (Batra et al., 2025); a minimal steering sketch follows this list.
- Incorporated alignment constraints into circuit pruning (Patel et al., 2025).
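The steering idea referenced in the second bullet can be sketched as adding a fixed vector to a block's activations during the forward pass. The toy block, steering vector, and strength below are assumptions for illustration, not the SALT implementation; one common recipe estimates the vector from contrastive activations (e.g., leaky versus leakage-free reasoning traces).

```python
# Minimal sketch of activation steering with a PyTorch forward hook. The toy
# block, steering vector, and strength are illustrative assumptions, not the
# SALT implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 64

# Toy stand-in for one transformer block acting on the residual stream.
block = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))

steer = torch.randn(hidden)
steer = steer / steer.norm()        # unit steering direction
alpha = 4.0                         # steering strength

def add_steering(module, inputs, output):
    # Returning a tensor from a forward hook replaces the block's output.
    return output + alpha * steer

handle = block.register_forward_hook(add_steering)
x = torch.randn(2, hidden)          # a batch of residual-stream activations
steered = block(x)
handle.remove()                     # intervention detaches cleanly; weights untouched
unsteered = block(x)

print("mean shift along steering direction:",
      ((steered - unsteered) @ steer).mean().item())   # approximately alpha
```

Because the intervention lives in a removable hook, the base weights stay untouched, which keeps this kind of edit cheap to apply and to undo.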
Current Focus: Model Organisms of Hyperawareness
Can we trust safety evaluations of AI systems that know they're being tested?
My research shows this awareness is already measurable and growing with scale. Without intervention, we risk models that fake alignment during testing.
Two threats: (1) models naturally developing evaluation-detection as they scale; (2) bad actors intentionally building models that hide misaligned goals until deployment.
My approach: Build controlled "model organisms" that are hyperaware of evaluation, then develop and test countermeasures. By designing organisms with known awareness mechanisms—from simple to sophisticated—I can find where defenses work and where they fail.
Key insight: My SafetyNet work shows models use multiple pathways to reach outputs. Blocking one awareness "direction" may just cause rerouting (see the sketch below). I'm studying how flexibly networks encode self-awareness.
The goal: Move from filtering bad outputs to preventing bad internal states—making safety verifiable inside the model, not just outside.
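To make the "direction" language concrete, here is a minimal sketch of ablating a candidate awareness direction by projecting it out of the residual stream; the direction and activations are synthetic stand-ins. The ablation zeroes the component along one direction while leaving nearly orthogonal directions (and whatever they encode) essentially intact, which is the rerouting concern in miniature.

```python
# Minimal sketch of ablating a candidate "awareness direction" by orthogonal
# projection. Direction and activations are synthetic stand-ins.
import torch

torch.manual_seed(0)
d_model = 128

v = torch.randn(d_model)
v = v / v.norm()                    # candidate evaluation-awareness direction
h = torch.randn(8, d_model)         # batch of residual-stream activations

def ablate(h, direction):
    """Remove the component of h along `direction`."""
    coeff = h @ direction           # (batch,)
    return h - coeff[:, None] * direction

h_abl = ablate(h, v)
print("mean |component along v| before:", (h @ v).abs().mean().item())
print("mean |component along v| after: ", (h_abl @ v).abs().mean().item())  # ~0

# A second, nearly orthogonal direction is left essentially intact, so any
# related signal encoded there survives the single-direction ablation.
w = torch.randn(d_model)
w = w / w.norm()
print("mean |component along w| after: ", (h_abl @ w).abs().mean().item())
```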
🔬 Current Research
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
Maheep Chaudhary, F. Barez
Under review
Evaluation Awareness Scales Predictably in Open-Weights Large Language Models
Maheep Chaudhary†, I. Su, N. Hooda, N. Shankar, J. Tan, K. Zhu, A. Panda, R. Lagasse, V. Sharma
NeurIPS 2025 Responsible FM Workshop
SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought
S. Batra, P. Tillman, S. Gaggar, S. Kesineni, S. Dev, K. Zhu, A. Panda, Maheep Chaudhary†
NeurIPS 2025 Responsible FM Workshop
Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits
D. Patel, G. Gervacio, D. Raimi, K. Zhu, R. Lagasse, G. Grand, A. Panda, Maheep Chaudhary†
NeurIPS 2025 Responsible FM Workshop
Optimizing Chain-of-Thought Confidence via Topological and Dirichlet Risk Analysis
A. More, A. Zhang, N. Bonilla, A. Vivekan, K. Zhu, P. Sharafoleslami, Maheep Chaudhary†
NeurIPS 2025 Responsible FM Workshop
FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness
A. Swaroop, A. Nallani, S. Uboweja, A. Uzdenova, M. Nguyen, K. Zhu, S. Dev, A. Panda, V. Sharma, Maheep Chaudhary†
NeurIPS 2025 FoRLM Workshop
Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
N. Egbuna, S. Gaur, S. Dev, A. Panda, Maheep Chaudhary†
NeurIPS 2025 Efficient Reasoning Workshop
PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases
S. V. Vuddanti, A. Shah, S. K. Chittiprolu, T. Song, S. Dev, K. Zhu, Maheep Chaudhary†
arXiv preprint
Hydra: A Modular Architecture for Efficient Long-Context Reasoning
S. Chaudhary, D. Patel, Maheep Chaudhary, B. Browning
NeurIPS 2025 Efficient Reasoning Workshop
MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification
S. B. Shah, S. Shiwakoti, Maheep Chaudhary, H. Wang
EMNLP 2024
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary, A. Geiger
arXiv preprint
Punctuation and Predicates in Language Models
S. Chauhan, Maheep Chaudhary, K. Choy, S. Nellessen, N. Schoots
arXiv preprint
📑 Literature Surveys & Theory
Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
A. Geiger, D. Ibeling, A. Zur, Maheep Chaudhary, S. Chauhan, J. Huang, A. Arora, Z. Wu, N. Goodman, C. Potts, T. Icard
JMLR 2024
Towards Trustworthy and Aligned Machine Learning: A Data-centric Survey with Causality Perspectives
Maheep Chaudhary*, H. Liu*, H. Wang
arXiv preprint
📚 Additional Publications
Modular Training of Neural Networks aids Interpretability
S. Golechha, Maheep Chaudhary, J. Velja, A. Abate, N. Schoots
arXiv preprint
An Intelligent Recommendation cum Reminder System
R. Saxena, Maheep Chaudhary, C.K. Maurya, S. Prasad
ACM IKDD CODS & COMAD 2022
CQFaRAD: Collaborative Query-Answering Framework for a Research Article Dataspace
M. Singh, S. Pandey, R. Saxena, Maheep Chaudhary, N. Lal
ACM COMPUTE 2021
Background & Recognition
Winner of the Smart India Hackathon (200K+ participants) and team leader at the ASEAN-India Hackathon, which spanned 10+ countries. Mentored 40+ students, selected for the UNESCO-India-Africa Program (20+ countries), and served as a reviewer for ICML 2025 and NeurIPS 2024 workshops.
