Robust Counterfactual Explanations Paper List

Generally sorted by date.


  • On the Robustness of Interpretability Methods

    • Author(s)

      • David Alvarez-Melis, Tommi S. Jaakkola

    • Publication

      • Presented at the 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018)

    • Date

      • 21 Jun 2018

    • Link

    • Abstract

      • We argue that robustness of explanations—i.e., that similar inputs should give rise to similar explanations—is a key desideratum for interpretability. We introduce metrics to quantify robustness and demonstrate that current methods do not perform well according to these metrics. Finally, we propose ways that robustness can be enforced on existing interpretability approaches.

    • Comment

      • Robustness of explanations: similar inputs should yield similar explanations.
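The robustness notion in this entry (similar inputs, similar explanations) can be measured as a local Lipschitz-style ratio between the change in the explanation and the change in the input. Below is a minimal sketch, assuming a sampling-based estimate that only gives a lower bound on the true local maximum. The function `local_lipschitz_estimate`, its parameters, and the toy explainer are hypothetical names for illustration, not the authors' code.

```python
import numpy as np

def local_lipschitz_estimate(explain, x, radius=0.1, n_samples=200, seed=0):
    """Sampled lower bound on max ||e(x) - e(x')|| / ||x - x'|| for ||x - x'|| <= radius."""
    rng = np.random.default_rng(seed)
    base = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        # Draw a random unit direction and a step length inside the ball.
        direction = rng.normal(size=x.shape)
        direction /= np.linalg.norm(direction)
        step = radius * rng.uniform(0.05, 1.0)  # keep the step away from zero
        x_pert = x + step * direction
        ratio = np.linalg.norm(explain(x_pert) - base) / step
        worst = max(worst, ratio)
    return worst

# Toy explainer (assumption): gradient of f(x) = sum(x**2), i.e. 2x.
def toy_gradient_explainer(x):
    return 2.0 * x

x0 = np.array([1.0, -2.0, 0.5])
estimate = local_lipschitz_estimate(toy_gradient_explainer, x0)
print(estimate)
```

For the toy explainer every sampled ratio equals 2 up to rounding, so the estimate matches its Lipschitz constant. For a real attribution method, a large estimate signals that nearby inputs receive very different explanations.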


  • Interpretation of Neural Networks is Fragile

    • Author(s)

      • Amirata Ghorbani, Abubakar Abid, James Zou

    • Publication

      • AAAI 2019 (CCF A)

    • Date

      • 6 Nov 2018

    • Link

    • Abstract

      • In order for machine learning to be deployed and trusted in many applications, it is crucial to be able to reliably explain why the machine learning algorithm makes certain predictions. For example, if an algorithm classifies a given pathology image to be a malignant tumor, then the doctor may need to know which parts of the image led the algorithm to this classification. How to interpret black-box predictors is thus an important and active area of research. A fundamental question is: how much can we trust the interpretation itself? In this paper, we show that interpretation of deep learning predictions is extremely fragile in the following sense: two perceptively indistinguishable inputs with the same predicted label can be assigned very different interpretations. We systematically characterize the fragility of several widely-used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10. Our experiments show that even small random perturbation can change the feature importance and new systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g. influence functions) are similarly fragile. Our analysis of the geometry of the Hessian matrix gives insight on why fragility could be a fundamental challenge to the current interpretation approaches.

    • Comment

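      • Current black-box interpretation methods are fragile: they are easily affected by adversarial examples and even by random perturbations.

      • Could this paper serve as a template for systematically characterizing the fragility of counterfactual explanations (CEs), i.e., for systematically characterizing the fragility of several widely used counterfactual explanation methods?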
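One way to quantify the fragility described in this entry is to compare the most important features before and after a perturbation. The sketch below computes a top-k intersection score between two importance vectors. Ranking by absolute magnitude is an assumption made here, and `top_k_intersection` is a hypothetical helper rather than code from the paper.

```python
import numpy as np

def top_k_intersection(importance_a, importance_b, k):
    """Fraction of the k most important features shared by two importance vectors."""
    # Rank features by absolute importance (an assumption of this sketch).
    top_a = set(np.argsort(-np.abs(importance_a))[:k].tolist())
    top_b = set(np.argsort(-np.abs(importance_b))[:k].tolist())
    return len(top_a & top_b) / k

# Hypothetical importance scores for the original and the perturbed input.
original = np.array([0.9, 0.1, 0.5, 0.05])
perturbed = np.array([0.1, 0.9, 0.5, 0.05])
overlap = top_k_intersection(original, perturbed, k=2)
print(overlap)
```

A score of 1 means the top-k features are unchanged, and a score near 0 means the interpretation changed almost entirely even though the predicted label may be the same.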


  • Explaining Explanations in AI

    • Author(s)

      • Brent Mittelstadt, Chris Russell, Sandra Wachter

    • Publication

      • FAT* ’19: Proceedings of the Conference on Fairness, Accountability, and Transparency

    • Date

      • Jul 2019

    • Link

    • Abstract

      • Recent work on interpretability in machine learning and AI has focused on the building of simplified models that approximate the true criteria used to make decisions. These models are a useful pedagogical device for teaching trained professionals how to predict what decisions will be made by the complex system, and most importantly how the system might break. However, when considering any such model it’s important to remember Box’s maxim that “All models are wrong but some are useful.” We focus on the distinction between these models and explanations in philosophy and sociology. These models can be understood as a “do it yourself kit” for explanations, allowing a practitioner to directly answer “what if questions” or generate contrastive explanations without external assistance. Although a valuable ability, giving these models as explanations appears more difficult than necessary, and other forms of explanation may not have the same trade-offs. We contrast the different schools of thought on what makes an explanation, and suggest that machine learning might benefit from viewing the problem more broadly.


  • The Dangers of Post-hoc Interpretability: Unjustified Counterfactual Explanations

    • Author(s)

      • Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Xavier Renard, Marcin Detyniecki (Sorbonne Université, France)

    • Publication

      • IJCAI-19 (CCF A)

    • Date

      • Jul 2019

    • Link

    • Abstract

      • Post-hoc interpretability approaches have been proven to be powerful tools to generate explanations for the predictions made by a trained black-box model. However, they create the risk of having explanations that are a result of some artifacts learned by the model instead of actual knowledge from the data. This paper focuses on the case of counterfactual explanations and asks whether the generated instances can be justified, i.e. continuously connected to some ground-truth data. We evaluate the risk of generating unjustified counterfactual examples by investigating the local neighborhoods of instances whose predictions are to be explained and show that this risk is quite high for several datasets. Furthermore, we show that most state of the art approaches do not differentiate justified from unjustified counterfactual examples, leading to less useful explanations.

    • Code

    • Comment

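      • Post-hoc explanations of black-box models are learned from the model rather than from the data, which leads to unjustified explanations (i.e., explanations inconsistent with the real data).

      • Laugel et al. argue that a counterfactual explanation is unjustified if it stems from artifacts of the classifier's non-robustness rather than from the training data. They define a justified explanation as one connected to training data by a continuous set of data points, a property termed ε-chainability.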

    • Report
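The connectedness idea in the comment above can be approximated on a finite sample: a candidate counterfactual counts as justified if a chain of points, all predicted in the same class and each within distance eps of the next, links it to at least one ground-truth instance. The following is a minimal sketch under that assumption. `is_epsilon_justified` and its arguments are hypothetical, and the breadth-first search is a stand-in for the paper's own procedure.

```python
from collections import deque
import numpy as np

def is_epsilon_justified(candidate, same_class_points, is_ground_truth, eps):
    """Return True if an eps-chain links the candidate to a ground-truth point.

    same_class_points: points the classifier assigns to the candidate's class.
    is_ground_truth:   one flag per point, True for real training instances.
    """
    points = np.vstack([candidate, same_class_points])  # index 0 is the candidate
    visited = np.zeros(len(points), dtype=bool)
    visited[0] = True
    queue = deque([0])
    while queue:
        i = queue.popleft()
        if i > 0 and is_ground_truth[i - 1]:
            return True
        # Enqueue every unvisited point within eps of the current point.
        dists = np.linalg.norm(points - points[i], axis=1)
        for j in np.flatnonzero((dists <= eps) & ~visited):
            visited[j] = True
            queue.append(int(j))
    return False

candidate = np.array([0.0, 0.0])
samples = np.array([[0.5, 0.0], [1.0, 0.0], [5.0, 5.0]])
flags = [False, True, False]
print(is_epsilon_justified(candidate, samples, flags, eps=0.6))
print(is_epsilon_justified(candidate, samples, flags, eps=0.4))
```

With eps=0.6 the candidate reaches the ground-truth point through the intermediate sample, while eps=0.4 breaks the chain. The outcome depends on how densely the region is sampled, so a negative answer from this sketch is only suggestive.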


  • Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling

    • Author(s)

      • Tsu-Jui Fu, Xin Eric Wang, Matthew Peterson, Scott Grafton, Miguel Eckstein, William Yang Wang

    • Publication

      • ECCV 2020: European Conference on Computer Vision (CCF B, Spotlight)

    • Date

      • 2019

    • Link

    • Abstract

      • Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal by grounding natural language instructions to the visual surroundings. One of the problems of the VLN task is data scarcity since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. In this paper, we explore the use of counterfactual thinking as a human-inspired data augmentation method that results in robust models. Counterfactual thinking is a concept that describes the human propensity to create possible alternatives to life events that have already occurred. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data. In particular, we present a model-agnostic adversarial path sampler (APS) that learns to sample challenging paths that force the navigator to improve based on the navigation performance. APS also serves to do pre-exploration of unseen environments to strengthen the model’s ability to generalize. We evaluate the influence of APS on the performance of different VLN baseline models using the room-to-room dataset (R2R). The results show that the adversarial training process with our proposed APS benefits VLN models under both seen and unseen environments. And the pre-exploration process can further gain additional improvements under unseen environments.


  • Robustness in machine learning explanations: does it matter?

    • Author(s)

      • Leif Hancox-Li

    • Publication

      • FAT* ’20: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency

    • Date

      • 27 January 2020

    • Link

    • Abstract

      • The explainable AI literature contains multiple notions of what an explanation is and what desiderata explanations should satisfy. One implicit source of disagreement is how far the explanations should reflect real patterns in the data or the world. This disagreement underlies debates about other desiderata, such as how robust explanations are to slight perturbations in the input data. I argue that robustness is desirable to the extent that we’re concerned about finding real patterns in the world. The import of real patterns differs according to the problem context. In some contexts, non-robust explanations can constitute a moral hazard. By being clear about the extent to which we care about capturing real patterns, we can also determine whether the Rashomon Effect is a boon or a bane.


  • CERTIFAI: A Common Framework to Provide Explanations and Analyse the Fairness and Robustness of Black-box Models

    • Author(s)

      • Shubham Sharma, Jette Henderson, Joydeep Ghosh

    • Publication

      • AIES ’20: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society

    • Date

      • 07 February 2020

    • Link

    • Abstract

      • Concerns within the machine learning community and external pressures from regulators over the vulnerabilities of machine learning algorithms have spurred on the fields of explainability, robustness, and fairness. Often, issues in explainability, robustness, and fairness are confined to their specific sub-fields and few tools exist for model developers to use to simultaneously build their modeling pipelines in a transparent, accountable, and fair way. This can lead to a bottleneck on the model developer’s side as they must juggle multiple methods to evaluate their algorithms. In this paper, we present a single framework for analyzing the robustness, fairness, and explainability of a classifier. The framework, which is based on the generation of counterfactual explanations through a custom genetic algorithm, is flexible, model-agnostic, and does not require access to model internals. The framework allows the user to calculate robustness and fairness scores for individual models and generate explanations for individual predictions which provide a means for actionable recourse (changes to an input to help get a desired outcome). This is the first time that a unified tool has been developed to address three key issues pertaining towards building a responsible artificial intelligence system.


  • Adversarial Robustness on In- and Out-Distribution Improves Explainability

    • Author(s)

      • Maximilian Augustin, Alexander Meinke, Matthias Hein

    • Publication

      • ECCV 2020: European Conference on Computer Vision (CCF B)

    • Date

      • 2020

    • Link

    • Abstract

      • Neural networks have led to major improvements in image classification but suffer from being non-robust to adversarial changes, unreliable uncertainty estimates on out-distribution samples and their inscrutable black-box decisions. In this work we propose RATIO, a training procedure for Robustness via Adversarial Training on In- and Out-distribution, which leads to robust models with reliable and robust confidence estimates on the out-distribution. RATIO has similar generative properties to adversarial training so that visual counterfactuals produce class specific features. While adversarial training comes at the price of lower clean accuracy, RATIO achieves state-of-the-art l2-adversarial robustness on CIFAR10 and maintains better clean accuracy.


  • No Feature Is An Island: Adaptive Collaborations Between Features Improve Adversarial Robustness

    • Author(s)

      • Yufeng Zhang, Yunan Zhang, ChengXiang Zhai

    • Publication

      • ICLR 2021 (Submission Withdrawn by the Authors)

    • Date

      • 06 Mar 2021

    • Link

    • Abstract

      • To classify images, neural networks extract features from raw inputs and then sum them up with fixed weights via the fully connected layer. However, the weights are fixed despite the input types. Such fixed prior limits networks’ flexibility in adjusting feature reliance, which in turn enables attackers to flip networks’ predictions by corrupting the most brittle features whose value would change drastically by minor perturbations. Inspired by the analysis, we replace the original fixed fully connected layer by dynamically calculating the posterior weight for each feature according to the input and connections between them. Also, a counterfactual baseline is integrated to precisely characterize the credit of each feature’s contribution to the robustness and generality of the model. We empirically demonstrate that the proposed algorithm improves both standard and robust error against several strong attacks across various major benchmarks. Finally, we theoretically prove the minimal structure requirement for our framework to improve adversarial robustness in a fairly simple and natural setting.

    • One-sentence Summary

      • We find dynamic feature weighting can improve adversarial robustness and formulate our algorithm as a cooperative game.


  • Plausible Counterfactuals: Auditing Deep Learning Classifiers with Realistic Adversarial Examples

    • Author(s)

      • Alejandro Barredo-Arrieta, Javier Del Ser

    • Publication

      • IJCNN 2020: International Joint Conference on Neural Networks (CCF C)

    • Date

      • Mar 2020

    • Link

    • Abstract

      • The last decade has witnessed the proliferation of Deep Learning models in many applications, achieving unrivaled levels of predictive performance. Unfortunately, the black-box nature of Deep Learning models has posed unanswered questions about what they learn from data. Certain application scenarios have highlighted the importance of assessing the bounds under which Deep Learning models operate, a problem addressed by using assorted approaches aimed at audiences from different domains. However, as the focus of the application is placed more on non-expert users, it results mandatory to provide the means for him/her to trust the model, just like a human gets familiar with a system or process: by understanding the hypothetical circumstances under which it fails. This is indeed the angular stone for this research work: to undertake an adversarial analysis of a Deep Learning model. The proposed framework constructs counterfactual examples by ensuring their plausibility, e.g. there is a reasonable probability that a human could generate them without resorting to a computer program. Therefore, this work must be regarded as valuable auditing exercise of the usable bounds a certain model is constrained within, thereby allowing for a much greater understanding of the capabilities and pitfalls of a model used in a real application. To this end, a Generative Adversarial Network (GAN) and multi-objective heuristics are used to furnish a plausible attack to the audited model, efficiently trading between the confusion of this model, the intensity and plausibility of the generated counterfactual. Its utility is showcased within a human face classification task, unveiling the enormous potential of the proposed framework.


  • DACE: Distribution-Aware Counterfactual Explanation by Mixed-Integer Linear Optimization

    • Author(s)

      • Kentaro Kanamori, Takuya Takagi, Ken Kobayashi, and Hiroki Arimura

    • Publication

      • IJCAI-20 (CCF A)

    • Date

      • June 2020

    • Link

    • Abstract

      • Counterfactual Explanation (CE) is one of the posthoc explanation methods that provides a perturbation vector so as to alter the prediction result obtained from a classifier. Users can directly interpret the perturbation as an “action” for obtaining their desired decision results. However, an action extracted by existing methods often becomes unrealistic for users because they do not adequately care about the characteristics corresponding to the empirical data distribution such as feature-correlations and outlier risk. To suggest an executable action for users, we propose a new framework of CE for extracting an action by evaluating its reality on the empirical data distribution. The key idea of our proposed method is to define a new cost function based on the Mahalanobis’ distance and the local outlier factor. Then, we propose a mixed-integer linear optimization approach to extracting an optimal action by minimizing our cost function. By experiments on real datasets, we confirm the effectiveness of our method in comparison with existing methods for CE.
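To illustrate why a Mahalanobis-based cost respects feature correlations, the sketch below compares two actions of equal Euclidean length on correlated data. It computes the plain Mahalanobis distance only. It does not reproduce the paper's cost function, its local outlier factor term, or its mixed-integer formulation, and `mahalanobis_cost` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def mahalanobis_cost(x, x_cf, data):
    """Mahalanobis distance between x and x_cf under the empirical covariance of data."""
    cov = np.cov(data, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against a singular covariance
    delta = x_cf - x
    return float(np.sqrt(delta @ cov_inv @ delta))

# Synthetic data (assumption): two strongly correlated features.
rng = np.random.default_rng(0)
feature_1 = rng.normal(size=500)
feature_2 = feature_1 + 0.1 * rng.normal(size=500)
data = np.column_stack([feature_1, feature_2])

x = np.zeros(2)
cost_along = mahalanobis_cost(x, np.array([1.0, 1.0]), data)     # follows the correlation
cost_against = mahalanobis_cost(x, np.array([1.0, -1.0]), data)  # contradicts it
print(cost_along, cost_against)
```

Both actions have the same Euclidean norm, but the one that moves the features in opposite directions is far more expensive, which is the behavior a distribution-aware cost is meant to capture.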


  • Adversarial Learning for Counterfactual Fairness

    • Author(s)

      • Vincent Grari, Sylvain Lamprier, Marcin Detyniecki

    • Publication

      • eprint arXiv:2008.13122

    • Date

      • 2020

    • Link

    • Abstract

      • In recent years, fairness has become an important topic in the machine learning research community. In particular, counterfactual fairness aims at building prediction models which ensure fairness at the most individual level. Rather than globally considering equity over the entire population, the idea is to imagine what any individual would look like with a variation of a given attribute of interest, such as a different gender or race for instance. Existing approaches rely on Variational Auto-encoding of individuals, using Maximum Mean Discrepancy (MMD) penalization to limit the statistical dependence of inferred representations with their corresponding sensitive attributes. This enables the simulation of counterfactual samples used for training the target fair model, the goal being to produce similar outcomes for every alternate version of any individual. In this work, we propose to rely on an adversarial neural learning approach, that enables more powerful inference than with MMD penalties, and is particularly better fitted for the continuous setting, where values of sensitive attributes cannot be exhaustively enumerated. Experiments show significant improvements in term of counterfactual fairness for both the discrete and the continuous settings.


  • Counterfactual Explanations & Adversarial Examples – Common Grounds, Essential Differences, and Potential Transfers

    • Author(s)

      • Timo Freiesleben

    • Publication

      • eprint arXiv:2009.05487

    • Date

      • September 2020

    • Link

    • Abstract

      • The same optimization problem underlies counterfactual explanations (CEs) and adversarial examples (AEs). While this is well known, the relationship between the two at the conceptual level remains unclear. The present paper provides exactly the missing conceptual link. We compare CEs and AEs with respect to their philosophical basis, aims, and modeling techniques. We argue that CEs are a more general object-class than AEs. In particular, we introduce the conceptual distinction between feasible and contesting CEs and show that AEs correspond to the latter.


  • Semi-supervised counterfactual explanations

    • Author(s)

      • Surya Shravan Kumar Sajja, Sumanta Mukherjee, Satyam Dwivedi, Vikas C. Raykar

    • Publication

      • ICLR 2021 (Rejected)

    • Date

      • 28 Sept 2020

    • Link

    • Abstract

      • Counterfactual explanations for machine learning models are used to find minimal interventions to the feature values such that the model changes the prediction to a different output or a target output. A valid counterfactual explanation should have likely feature values. Here, we address the challenge of generating counterfactual explanations that lie in the same data distribution as that of the training data and more importantly, they belong to the target class distribution. This requirement has been addressed through the incorporation of auto-encoder reconstruction loss in the counterfactual search process. Connecting the output behavior of the classifier to the latent space of the auto-encoder has further improved the speed of the counterfactual search process and the interpretability of the resulting counterfactual explanations. Continuing this line of research, we show further improvement in the interpretability of counterfactual explanations when the auto-encoder is trained in a semi-supervised fashion with class tagged input data. We empirically evaluate our approach on several datasets and show considerable improvement in terms of several metrics.


  • Semantics and explanation: why counterfactual explanations produce adversarial examples in deep neural networks

    • Author(s)

      • Kieran Browne, Ben Swift (Australian National University)

    • Publication

      • Preprint submitted to Artificial Intelligence (CCF A, Under Review)

    • Date

      • 2020

    • Link

    • Abstract

      • Recent papers in explainable AI have made a compelling case for counterfactual modes of explanation. While counterfactual explanations appear to be extremely effective in some instances, they are formally equivalent to adversarial examples. This presents an apparent paradox for explainability researchers: if these two procedures are formally equivalent, what accounts for the explanatory divide apparent between counterfactual explanations and adversarial examples? We resolve this paradox by placing emphasis back on the semantics of counterfactual expressions. Producing satisfactory explanations for deep learning systems will require that we find ways to interpret the semantics of hidden layer representations in deep neural networks.

    • Comment

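      • Counterfactual explanations and adversarial examples are formally identical, so what accounts for the difference in their explanatory value? The answer has to be sought at the semantic level.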


  • Robust Explanations for Private Support Vector Machines

    • Author(s)

      • Rami Mochaourab, Sugandh Sinha, Stanley Greenstein, Panagiotis Papapetrou

    • Publication

      • arXiv e-print

    • Date

      • February 2021

    • Link

    • Abstract

      • We consider counterfactual explanations for private support vector machines (SVM), where the privacy mechanism that publicly releases the classifier guarantees differential privacy. While privacy preservation is essential when dealing with sensitive data, there is a consequent degradation in the classification accuracy due to the introduced perturbations in the classifier weights. For such classifiers, counterfactual explanations need to be robust against the uncertainties in the SVM weights in order to ensure, with high confidence, that the classification of the data instance to be explained is different than its explanation. We model the uncertainties in the SVM weights through a random vector, and formulate the explanation problem as an optimization problem with probabilistic constraint. Subsequently, we characterize the problem’s deterministic equivalent and study its solution. For linear SVMs, the problem is a convex second-order cone program. For non-linear SVMs, the problem is non-convex. Thus, we propose a sub-optimal solution that is based on the bisection method. The results show that, contrary to non-robust explanations, the quality of explanations from the robust solution degrades with increasing privacy in order to guarantee a prespecified confidence level for correct classifications.


  • Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties

    • Author(s)

      • Lisa Schut, Oscar Key, Rory McGrath, Luca Costabello, Bogdan Sacaleanu, Medb Corcoran, Yarin Gal

    • Publication

      • Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021 (CCF C)

    • Date

      • Mar 2021

    • Link

    • Abstract

      • Counterfactual explanations (CEs) are a practical tool for demonstrating why machine learning classifiers make particular decisions. For CEs to be useful, it is important that they are easy for users to interpret. Existing methods for generating interpretable CEs rely on auxiliary generative models, which may not be suitable for complex datasets, and incur engineering overhead. We introduce a simple and fast method for generating interpretable CEs in a white-box setting without an auxiliary model, by using the predictive uncertainty of the classifier. Our experiments show that our proposed algorithm generates more interpretable CEs, according to IM1 scores, than existing methods. Additionally, our approach allows us to estimate the uncertainty of a CE, which may be important in safety-critical applications, such as those in the medical domain.

    • Code


  • Individual Explanations in Machine Learning Models: A Survey for Practitioners

    • Author(s)

      • Alfredo Carrillo, Luis F. Cantú, Alejandro Noriega

    • Publication

      • eprint arXiv:2104.04144

    • Date

      • April 2021

    • Link

    • Abstract

      • In recent years, the use of sophisticated statistical models that influence decisions in domains of high societal relevance is on the rise. Although these models can often bring substantial improvements in the accuracy and efficiency of organizations, many governments, institutions, and companies are reluctant to their adoption as their output is often difficult to explain in human-interpretable ways. Hence, these models are often regarded as black-boxes, in the sense that their internal mechanisms can be opaque to human audit. In real-world applications, particularly in domains where decisions can have a sensitive impact–e.g., criminal justice, estimating credit scores, insurance risk, health risks, etc.–model interpretability is desired. Recently, the academic literature has proposed a substantial amount of methods for providing interpretable explanations to machine learning models. This survey reviews the most relevant and novel methods that form the state-of-the-art for addressing the particular problem of explaining individual instances in machine learning. It seeks to provide a succinct review that can guide data science and machine learning practitioners in the search for appropriate methods to their problem domain.


  • On the Connections between Counterfactual Explanations and Adversarial Examples

    • Author(s)

      • Martin Pawelczyk, Shalmali Joshi, Chirag Agarwal, Sohini Upadhyay, Himabindu Lakkaraju (Harvard University)

    • Publication

      • arXiv e-print

    • Date

      • 2021

    • Link

    • Abstract

      • Counterfactual explanations and adversarial examples have emerged as critical research areas for addressing the explainability and robustness goals of machine learning (ML). While counterfactual explanations were developed with the goal of providing recourse to individuals adversely impacted by algorithmic decisions, adversarial examples were designed to expose the vulnerabilities of ML models. While prior research has hinted at the commonalities between these frameworks, there has been little to no work on systematically exploring the connections between the literature on counterfactual explanations and adversarial examples. In this work, we make one of the first attempts at formalizing the connections between counterfactual explanations and adversarial examples. More specifically, we theoretically analyze salient counterfactual explanation and adversarial example generation methods, and highlight the conditions under which they behave similarly. Our analysis demonstrates that several popular counterfactual explanation and adversarial example generation methods such as the ones proposed by Wachter et al. and Carlini and Wagner (with mean squared error loss), and C-CHVAE and natural adversarial examples by Zhao et al. are equivalent. We also bound the distance between counterfactual explanations and adversarial examples generated by Wachter et al. and DeepFool methods for linear models. Finally, we empirically validate our theoretical findings using extensive experimentation with synthetic and real world datasets.
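The linear-model case mentioned in this abstract has a simple geometric reading: for a linear binary classifier sign(w·x + b), the closest point in L2 distance with a flipped prediction lies along the normal vector w, just across the hyperplane. The sketch below computes that point. It is an illustration of the geometry only, not the paper's analysis, and `linear_counterfactual` and the `margin` parameter are hypothetical.

```python
import numpy as np

def linear_counterfactual(x, w, b, margin=1e-6):
    """Smallest L2 move that flips sign(w.x + b), overshooting the boundary by `margin`."""
    score = w @ x + b
    # Step along the normal w until the score changes sign.
    step = (score + np.sign(score) * margin) / (w @ w)
    return x - step * w

# Hypothetical linear classifier and input.
w = np.array([1.0, 2.0])
b = -1.0
x = np.array([3.0, 1.0])

x_cf = linear_counterfactual(x, w, b)
distance = np.linalg.norm(x_cf - x)
print(w @ x + b, w @ x_cf + b, distance)
```

The distance moved is approximately |w·x + b| / ||w||, the distance from x to the decision boundary. Methods that add sparsity or plausibility terms to the objective will generally return a different, more distant point.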