Seth Neel University of Pennsylvania EC 2018: MD4SG Workshop


Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness
Seth Neel, University of Pennsylvania
EC 2018: MD4SG Workshop, Cornell University, June 22, 2018
Joint work with Michael Kearns, Aaron Roth, and Steven Wu
To appear at ICML 2018; pre-print available on arXiv

Can we design fair ML algorithms? What does fairness mean?

Statistical Fairness Notions
Protected groups are defined by protected features (race, gender, age, …). Running example: binary classification for giving out loans.
- Statistical parity [Dwork et al. 2012]: equal acceptance rates across groups (ignores creditworthiness)
- Equalized odds [Hardt et al. 2016]: equal false positive and false negative rates across groups (accounts for creditworthiness)
- Calibration [Kleinberg et al. 2016]: equal positive predictive values (PPV) across groups, where PPV = Pr[ y = 1 | h(x) = 1 ]
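In symbols (a sketch; writing a for the protected group attribute, with a1, a2 two groups, and h for the classifier, conditioning as assumed here), the three notions require:

```latex
% Statistical parity: equal acceptance rates across groups
\Pr[h(x)=1 \mid a=a_1] \;=\; \Pr[h(x)=1 \mid a=a_2]
% Equalized odds: equal error rates, conditioned on the true label
\Pr[h(x)=1 \mid y,\, a=a_1] \;=\; \Pr[h(x)=1 \mid y,\, a=a_2]
  \quad \text{for } y \in \{0,1\}
% Calibration: equal positive predictive values
\Pr[y=1 \mid h(x)=1,\, a=a_1] \;=\; \Pr[y=1 \mid h(x)=1,\, a=a_2]
```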

Interpolating group fairness and individual fairness
- Problem: one can achieve group fairness while discriminating against structured subgroups, e.g. "disabled Hispanic women over the age of 50" (a conjunction of protected features)
- There is no reason to expect standard fairness notions to prevent this
- But it is also "impossible" to protect arbitrary subgroups (i.e., individuals)
- Our focus: computationally and statistically identifiable subgroups, e.g. subgroups defined by conjunctions over the protected features

Finer-Grained Subgroups?
No promises are made to individuals or finer-grained subgroups.
[Figure: individuals labeled Blue/Green and Male/Female; a legend highlights the accepted individuals]

Binary Classification Formulation
- n samples (individuals) (x, x', y) ~ P; the features x are "protected", x' are the remaining features
- A model/decision algorithm D makes a prediction/decision D(x, x')
- g(x) is the characteristic function of a subgroup: g(x) = 1 indicates that (x, x', y) belongs to the subgroup g
- G: a rich but limited class of subgroups over the protected features, e.g. G = conjunctions, linear threshold functions, decision lists, …

Statistical Fairness Notions
- The VC dimension of G lets us bound the generalization error of subgroup fairness estimates
- Related: [Hebert-Johnson et al. 2018] for calibration
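The quantitative definition on this slide did not survive transcription. As a sketch of the false-positive version (the exact weighting in the paper may differ), γ-FP subgroup fairness w.r.t. G requires, for every g ∈ G:

```latex
% Every subgroup g in G, weighted by its mass among the true negatives,
% must have a false-positive rate close to the overall one.
\Pr[g(x)=1,\ y=0]\cdot
  \Bigl|\,\Pr[D=1 \mid y=0] \;-\; \Pr[D=1 \mid g(x)=1,\ y=0]\,\Bigr|
  \;\le\; \gamma
  \qquad \forall\, g \in G
```

The mass term Pr[g(x)=1, y=0] discounts vanishingly small subgroups, which is what makes the definition statistically estimable from a finite sample.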

Auditing and Learning
- Auditing for γ-fairness w.r.t. G: given sample access to (x, D(x, x'), y) induced by a black-box algorithm D, decide whether D is γ-fair w.r.t. G, or output a violated subgroup g in the class G
- Learning a γ-fair classifier w.r.t. G: given a hypothesis class H over the features (x, x'), find a distribution D over H that satisfies γ-fairness

Hardness of Auditing
Theorem (informal): Auditing an arbitrary D for γ-(SP & FP) fairness w.r.t. G is computationally equivalent to weak agnostic learning of G.
- Weak agnostic learning of G [Kearns et al. 1994; Kalai et al. 2008]: given a dataset of (x, y) pairs drawn from P, whenever the best classifier in G has accuracy at least 1/2 + γ, find a function g in G with accuracy at least 1/2 + γ − ε, for some ε ≤ γ
- The labels y are not promised to be generated by any g in G
- Intuition: if some g in G can predict the decisions D(x, x') using only the protected features x, then D discriminates on the subgroup defined by g
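This intuition suggests a simple auditing heuristic, sketched below (an illustration, not the paper's algorithm; `audit_fp` and the threshold-at-base-rate rule are assumptions of this sketch): among the true negatives, fit a linear threshold subgroup over the protected features that predicts D's decisions, and report its mass-weighted false-positive-rate disparity.

```python
import numpy as np

def audit_fp(x_protected, decisions, labels):
    """Heuristic false-positive audit: learn a linear-threshold subgroup g
    over the protected features that predicts D's decisions on the
    negatives (y = 0), then report that subgroup's weighted FP disparity."""
    neg = labels == 0
    X = np.column_stack([x_protected[neg], np.ones(neg.sum())])  # intercept
    d = decisions[neg].astype(float)
    # Regress decisions on protected features; threshold at the base rate.
    w, *_ = np.linalg.lstsq(X, d, rcond=None)
    in_g = X @ w > d.mean()
    fp_overall = d.mean()
    fp_subgroup = d[in_g].mean() if in_g.any() else fp_overall
    weight = in_g.mean()  # subgroup mass among the negatives
    # Larger value => stronger evidence of a fairness violation.
    return weight * abs(fp_subgroup - fp_overall)
```

On a toy dataset where the decision among negatives is exactly the protected bit, the audit recovers that subgroup; when decisions are independent of the protected bit, it reports (near) zero disparity.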

Bad News / Good News
Bad news, theoretically:
- auditing even for simple structured classes G (e.g. conjunctions, halfspaces) is computationally intractable in the worst case
- learning over many classes H (even without fairness constraints) is also NP-hard in the worst case
Potentially good news, in practice:
- there exist powerful heuristics for agnostic learning/ERM problems: SVMs, logistic regression, neural nets, boosting, …

Learning for Subgroup Fairness
- Goal: find the optimal fair (randomized) classifier in H
- Main result: a reduction to a sequence of cost-sensitive classification (CSC) problems (generalizing [Agarwal et al. 2017] to many subgroups)
- Key idea: simulate a zero-sum game between an Auditor and a Learner

Cost-Sensitive Classification (CSC)
The CSC problem is equivalent to agnostic learning [Zadrozny et al. 2003].
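The formal statement was an image in the original slide; the standard formulation of a CSC instance over H (notation assumed) is:

```latex
% Given samples (x_i, c_i^0, c_i^1), where c_i^b is the cost of
% predicting b on example i, find the hypothesis minimizing total cost:
\hat{h} \;=\; \operatorname*{arg\,min}_{h \in H}\;
  \sum_{i=1}^{n}\Bigl[\, h(x_i)\, c_i^1 \;+\; \bigl(1-h(x_i)\bigr)\, c_i^0 \,\Bigr]
```

Ordinary classification is the special case where the costs encode 0/1 loss against a label.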

Fair Empirical Risk Minimization Problem Let P be the empirical distribution over the n data points
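The optimization program itself was an image on the slide; schematically (a sketch with assumed names: err(p) is the classification error of a randomized classifier p over H, and Φ(p, g) the fairness violation on subgroup g), fair ERM is:

```latex
\min_{p \,\in\, \Delta(H)} \;\operatorname{err}(p)
\qquad \text{subject to} \qquad
\Phi(p, g) \;\le\; \gamma \quad \forall\, g \in G,
```

where Δ(H) denotes the set of probability distributions over H — one linear constraint per subgroup in G.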

From Lagrangian to Zero-Sum Games
The optimal solution to fair ERM is the minimax equilibrium of a zero-sum game: the Learner controls D vs. the Auditor controls λ.
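In schematic notation (assumed names: err(p) is the error of a randomized classifier p over H, Φ(p, g) the fairness violation on subgroup g; the paper additionally bounds the multipliers), the Lagrangian attaches a multiplier λ_g to each subgroup constraint:

```latex
\mathcal{L}(p,\lambda) \;=\; \operatorname{err}(p)
  \;+\; \sum_{g \in G} \lambda_g\,\bigl(\Phi(p,g) - \gamma\bigr),
\qquad
\min_{p \in \Delta(H)} \;\max_{\lambda \ge 0}\; \mathcal{L}(p,\lambda)
```

The min player (Learner) picks p, the max player (Auditor) picks λ, and the fair ERM optimum is the value of this zero-sum game.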

Best Response as CSC
Lemma: The best response for both the Learner and the Auditor can be computed by solving a cost-sensitive classification problem over H and G, respectively.
How do we compute the equilibrium using only these CSC oracles?

Main Theoretical Result
Theorem (informal): Given access to CSC oracles for the classes H and G, our algorithm runs in polynomial time and outputs a randomized classifier D that is γ-fair w.r.t. all subgroups in G.
Is this useful?

Empirical Evaluation
Solving the zero-sum game using Fictitious Play:
- Each player, in each round, plays a best response against the opponent's empirical play so far
- Avoids repeated sampling from FTPL distributions
- Only guarantees asymptotic convergence
- In practice, has the merit of simplicity and faster per-step computation
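For intuition, fictitious play on a small zero-sum matrix game looks like the following (a generic sketch, not the paper's fair-ERM dynamics; the payoff matrix is illustrative):

```python
import numpy as np

def fictitious_play(A, rounds=5000):
    """Fictitious play for the zero-sum game max_p min_q p^T A q.
    Each round, each player best-responds to the opponent's
    empirical mixed strategy so far."""
    m, n = A.shape
    row_counts = np.zeros(m)  # how often each row action was played
    col_counts = np.zeros(n)
    row_counts[0] += 1  # arbitrary initial plays
    col_counts[0] += 1
    for _ in range(rounds):
        # Row (maximizing) player best-responds to the column empirical mix.
        row_counts[np.argmax(A @ (col_counts / col_counts.sum()))] += 1
        # Column (minimizing) player best-responds to the row empirical mix.
        col_counts[np.argmin((row_counts / row_counts.sum()) @ A)] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# Matching pennies: the unique equilibrium is (1/2, 1/2) for both players.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
p, q = fictitious_play(A)
```

The empirical frequencies p and q converge to the equilibrium mixed strategies (Robinson's theorem for zero-sum games), though only asymptotically, mirroring the convergence caveat above.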

Dataset
- Communities and Crime dataset: census and other data on ~2,000 U.S. communities
- Target prediction: high vs. low violent crime rate
- 122 features total; 18 are protected (racial groups, incomes, police)
- Both H (d = 122) and G (d = 18) are linear threshold functions
- CSC oracles implemented with thresholded linear regression
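A thresholded-linear-regression CSC oracle can be sketched as follows (an assumed implementation of the standard reduction, not the authors' exact code): since predicting 1 on a point is better exactly when c^0 − c^1 > 0, regress that cost difference on the features and predict 1 wherever the fitted value is positive.

```python
import numpy as np

def csc_oracle(X, cost0, cost1):
    """Thresholded linear regression as a heuristic CSC oracle.
    Returns a linear threshold function approximately minimizing
    sum_i [h(x_i) * cost1_i + (1 - h(x_i)) * cost0_i]."""
    Xb = np.column_stack([X, np.ones(len(X))])  # intercept term
    # Pointwise, predicting 1 is better exactly when cost0 - cost1 > 0,
    # so fit that cost difference and threshold the fitted values at 0.
    w, *_ = np.linalg.lstsq(Xb, cost0 - cost1, rcond=None)
    return lambda Z: (np.column_stack([Z, np.ones(len(Z))]) @ w > 0).astype(int)

# Tiny example: predicting 1 is cheap exactly when the feature is positive.
X = np.array([[2.0], [1.0], [-1.0], [-2.0]])
cost0 = np.array([1.0, 1.0, 0.0, 0.0])  # cost of predicting 0
cost1 = np.array([0.0, 0.0, 1.0, 1.0])  # cost of predicting 1
h = csc_oracle(X, cost0, cost1)
```

This is a heuristic, not an agnostic learner with worst-case guarantees, which is exactly the "good news in practice" trade-off described earlier.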

Convergence for different input γ’s

Pareto Curves: (error, γ)
Across a range of datasets, unfairness starts at around 0.02-0.03; we are able to drive it to near 0 with only a 3-6% increase in error.

Flattening the Discrimination Heatmap

Subgroup Fairness!

Does fairness to marginal subgroups achieve finer subgroup fairness? No!
