Publications

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Maheep Chaudhary, and Atticus Geiger, 2024

A popular new method in mechanistic inter- pretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analy- sis. However, the body of evidence on whether SAE feature spaces are useful for causal analy- sis is underdeveloped. In this work, we use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GPT- 2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We evaluate four open-source SAEs for GPT-2 small against each other, with neurons serving as a baseline, and linear features learned via distributed align- ment search (DAS) serving as a skyline. For each, we learn a binary mask to select features that will be patched to change the country of a city without changing the continent, or vice versa. Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline. We release code here: github.com/MaheepChaudhary/SAE-Ravel

Download here

Causal abstraction for faithful model interpretation

Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard, 2024

Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field con- cerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) provid- ing a flexible, yet precise formalization for the core concepts of modular features, polysemantic neurons, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability method- ologies in the common language of causal abstraction, namely activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse au- toencoders, differential binary masking, distributed alignment search, and activation steering.

Download here

Towards trustworthy and aligned machine learning: A data-centric survey with causality perspectives

Haoyang Liu*, Maheep Chaudhary∗, and Haohan Wang, 2023

The importance of trustworthy machine learning is a critical topic, covering robustness, security, interpretability, and fairness. Over the last decade, various methods have emerged to address these challenges. This survey systematically reviews advancements from a data centric perspective, emphasizing the limitations of traditional empirical risk minimization (ERM) in handling data related challenges. Interestingly, despite developing independently, these methods across trustworthy machine learning converge, with Pearl causality hierarchy providing a unifying framework. The survey presents a unified language and mathematical vocabulary, connecting methods in robustness, adversarial robustness, interpretability, and fairness. This approach fosters a cohesive understanding of the field. The survey also explores the trustworthiness of large pretrained models, connecting techniques like fine tuning, parameter efficient fine tuning, prompting, and reinforcement learning with human feedback to standard ERM. This connection extends the principled understanding of trustworthy methods to these models, laying the groundwork for future approaches. The survey reviews existing methods and concludes with a summary of applications and potential future aspects. For more information, please visit http://trustai.one.

Download here

An Intelligent Recommendation-cum-Reminder System

Rohan Saxena, Maheep Chaudhary, Chandresh Maurya, and Shitala Prasad, 2022

Intelligent recommendation and reminder systems are the need of the fast-pacing life. Current intelligent systems such as Siri, Google Assistant, Microsoft Cortona, etc., have limited capability. For example, if you want to wake up at 6 am because you have an upcoming trip, you have to set the alarm manually. Besides, these systems do not recommend or remind what else to carry, such as carrying an umbrella during a likely rain. The present work proposes a system that takes an email as input and returns a recommendation-cum-reminder list. As a first step, we parse the emails, recognize the entities using named entity recognition (NER). In the second step, information retrieval over the web is done to identify nearby places, climatic conditions, etc. Imperative sentences from the reviews of all places are extracted and passed to the object extraction module. The main challenge lies in extracting the objects (items) of interest from the review. To solve it, a modified Machine Reading Comprehension-NER (MRC-NER) model is trained to tag objects of interest by formulating annotation rules as a query. The objects so found are recommended to the user one day in advance. The final reminder list of objects is pruned by our proposed model for tracking objects kept during the ”packing activity.” Eventually, when the user leaves for the event/trip, an alert is sent containing the reminding list items. Our approach achieves superior performance compared to several baselines by as much as 30% on recall and 10% on precision.

Download here

CQFaRAD: Collaborative Query-Answering Framework for a Research Article Dataspace

, 2021

Dataspace systems cope with the problem of integrating a variety of data based on its structures and semantics such as structured, semi-structured, and unstructured data, and returns the best-effort or approximate answers to their users. The existing works on query answering in a dataspace system are contentbased and paid attention to return the best answers to the users without taking care of their preferences. This paper aims to consider not only the content-based information but also the users’ preferences while answering the users’ queries. Therefore, we present a Collaborative Query-Answering Framework for a Research Article Dataspace (CQFaRAD) that helps to efficiently answer the users’ queries and returns more prominent answers to them. In this work, we present a collaborative approach that adopts the advantages of existing content-based and users’ preferences-based approaches. To achieve this task, we use the BERT model to represent our dataspace and users’ query. We have validated our proposed approach on the research papers dataset available on Kaggle. The experimental results show that our approach works fairly well to return relevant information to the users

Download here