Surgically making neural networks safe!

About Me

My research addresses one of AI safety's hardest problems: models can think one thing internally while saying another—meaning we can't verify true alignment just by checking outputs (Rosati et al., 2024). This drives my focus on white-box analysis: using mechanistic interpretability to see what's actually happening inside models.

Foundation

My work builds on two survey papers: one connecting causal reasoning to ML trustworthiness (Chaudhary et al., 2024), and another on mechanistic interpretability (Geiger et al., 2025). Together, these give me a causal-mechanistic framework for investigating how models work from the inside out.
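To give a concrete flavor of the core operation in that framework, the sketch below runs an interchange intervention on a toy two-layer network in Python: part of a hidden activation computed on a "source" input is spliced into a run on a "base" input, and the output is compared against what a hypothesized high-level variable would predict. The toy network, the choice of hidden units to swap, and all names are illustrative assumptions, not the setups used in the surveys.

```python
# Minimal sketch of an interchange intervention, the core operation behind the
# causal-abstraction framing. The two-layer toy network is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))

def layer1(x: np.ndarray) -> np.ndarray:
    return np.tanh(x @ W1)   # hidden state: candidate site of a high-level variable

def layer2(h: np.ndarray) -> np.ndarray:
    return h @ W2

base, source = rng.normal(size=4), rng.normal(size=4)
h_base, h_source = layer1(base), layer1(source)

h_patched = h_base.copy()
h_patched[:4] = h_source[:4]   # interchange: swap the sub-space hypothesized to encode the variable

print("base output:   ", layer2(h_base))
print("patched output:", layer2(h_patched))
```

In practice the same swap is applied to activations of a real model, and alignment between the network and a high-level causal model is judged by whether such interventions produce the behavioral changes the high-level model predicts.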

Key Findings

  • Models can detect when they're being evaluated—and this ability increases with scale (Chaudhary et al., 2025). If models behave differently during testing versus deployment, we can't trust safety evaluations.
  • Models leave distinct attention signatures when generating harmful content, enabling ~95% detection accuracy (Chaudhary et al., 2025); a toy sketch of this probing idea follows the list.
  • Models can reach harmful outputs through multiple pathways (SafetyNet, Chaudhary et al., 2025). Blocking one route may just cause rerouting.
  • Models shift information toward final tokens, using punctuation as intermediate storage (Chauhan et al., 2025).
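To illustrate the probing recipe behind the detection finding above: collect one attention-derived feature vector per prompt, then fit a linear classifier to separate harmful from benign generations. The synthetic features, the feature dimensionality, and the resulting accuracy below are placeholder assumptions for illustration, not the SafetyNet pipeline or its reported numbers.

```python
# Minimal sketch: a linear probe over attention-derived features to flag
# harmful generations. Feature extraction from a real model is replaced by
# synthetic data; only the probing recipe itself is illustrated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical setup: one feature vector per prompt, e.g. mean attention-head
# activations collected at a chosen layer.
n_samples, n_features = 2000, 256
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)   # 1 = harmful generation
X[y == 1] += 0.3                         # toy "attention signature" shift

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

probe = LogisticRegression(max_iter=1000)   # the probe is just a linear classifier
probe.fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```

With features taken from a real model, the same few lines give a quick read on how linearly separable the attention signature of harmful generation is.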

Solutions

Current Focus: Model Organisms of Hyperawareness

Can we trust safety evaluations of AI systems that know they're being tested?

My research shows this awareness is already measurable and growing with scale. Without intervention, we risk models that fake alignment during testing.

Two threats: (1) models naturally developing evaluation-detection as they scale; (2) bad actors intentionally building models that hide misaligned goals until deployment.

My approach: Build controlled "model organisms" that are hyperaware of evaluation, then develop and test countermeasures. By designing organisms with known awareness mechanisms—from simple to sophisticated—I can find where defenses work and where they fail.

Key insight: My SafetyNet work shows models use multiple pathways to reach outputs. Blocking one awareness "direction" may just cause rerouting. I'm studying how flexibly networks encode self-awareness.

The goal: Move from filtering bad outputs to preventing bad internal states—making safety verifiable inside the model, not just outside.
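As a minimal sketch of what blocking an awareness "direction" means mechanically, the snippet below projects one candidate direction out of residual-stream activations. The random activations and the direction itself are placeholder assumptions; in practice the direction would come from a trained probe or a difference of means between evaluation-style and deployment-style prompts.

```python
# Minimal sketch of directional ablation: remove one candidate
# "evaluation-awareness" direction from residual-stream activations.
import numpy as np

def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project `direction` out of each activation vector (rows of `acts`)."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)

rng = np.random.default_rng(0)
d_model = 512
acts = rng.normal(size=(8, d_model))       # hypothetical residual-stream activations
awareness_dir = rng.normal(size=d_model)   # hypothetical awareness direction

clean = ablate_direction(acts, awareness_dir)
print(np.allclose(clean @ awareness_dir, 0))   # True: this direction is gone...
# ...but any second direction that also encodes awareness is untouched,
# which is exactly the rerouting failure mode the model organisms are built to probe.
```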

🔬 Current Research

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary†, I. Su, N. Hooda, N. Shankar, J. Tan, K. Zhu, A. Panda, R. Lagasse, V. Sharma

NeurIPS 2025 Responsible FM Workshop

SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

S. Batra, P. Tillman, S. Gaggar, S. Kesineni, S. Dev, K. Zhu, A. Panda, Maheep Chaudhary†

NeurIPS 2025 Responsible FM Workshop

Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits

D. Patel, G. Gervacio, D. Raimi, K. Zhu, R. Lagasse, G. Grand, A. Panda, Maheep Chaudhary†

NeurIPS 2025 Responsible FM Workshop

Optimizing Chain-of-Thought Confidence via Topological and Dirichlet Risk Analysis

A. More, A. Zhang, N. Bonilla, A. Vivekan, K. Zhu, P. Sharafoleslami, Maheep Chaudhary†

NeurIPS 2025 Responsible FM Workshop

FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness

A. Swaroop, A. Nallani, S. Uboweja, A. Uzdenova, M. Nguyen, K. Zhu, S. Dev, A. Panda, V. Sharma, Maheep Chaudhary†

NeurIPS 2025 FoRLM Workshop

Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization

N. Egbuna, S. Gaur, S. Dev, A. Panda, Maheep Chaudhary†

NeurIPS 2025 Efficient Reasoning Workshop

PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases

S. V. Vuddanti, A. Shah, S. K. Chittiprolu, T. Song, S. Dev, K. Zhu, Maheep Chaudhary†

arXiv preprint

Hydra: A Modular Architecture for Efficient Long-Context Reasoning

S. Chaudhary, D. Patel, Maheep Chaudhary, B. Browning

NeurIPS 2025 Efficient Reasoning Workshop

MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification

S. B. Shah, S. Shiwakoti, Maheep Chaudhary, H. Wang

EMNLP 2024

Punctuation and Predicates in Language Models

S. Chauhan, Maheep Chaudhary, K. Choy, S. Nellessen, N. Schoots

arXiv preprint

📑 Literature Surveys & Theory

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

A. Geiger, D. Ibeling, A. Zur, Maheep Chaudhary, S. Chauhan, J. Huang, A. Arora, Z. Wu, N. Goodman, C. Potts, T. Icard

JMLR 2024

📚 Additional Publications

Modular Training of Neural Networks aids Interpretability

S. Golechha, Maheep Chaudhary, J. Velja, A. Abate, N. Schoots

arXiv preprint

An Intelligent Recommendation cum Reminder System

R. Saxena, Maheep Chaudhary, C.K. Maurya, S. Prasad

ACM IKDD CODS & COMAD 2022

CQFaRAD: Collaborative Query-Answering Framework for a Research Article Dataspace

M. Singh, S. Pandey, R. Saxena, Maheep Chaudhary, N. Lal

ACM COMPUTE 2021

Background & Recognition

Winner of the Smart India Hackathon (200K+ participants) and team leader at the ASEAN-India Hackathon (10+ participating countries). Mentored 40+ students, was selected for the UNESCO-India-Africa Program (20+ countries), and served as a reviewer for ICML 2025 and NeurIPS 2024 workshops.