What does it mean for an algorithm to be fair
In 2014 the White House commissioned a 90-day study that culminated in a report (pdf) on the state of “big data” and related technologies. The authors give many recommendations, including this central warning.
Warning: algorithms can facilitate illegal discrimination!
Here’s a not-so-imaginary example of the problem. A bank wants people to take loans with high interest rates, and it also serves ads for these loans. A modern idea is to use an algorithm to decide, based on the sliver of known information about a user visiting a website, which advertisement to present that gives the largest chance of the user clicking on it. There’s one problem: these algorithms are trained on historical data, and poor uneducated people (often racial minorities) have a historical trend of being more likely to succumb to predatory loan advertisements than the general population. So an algorithm that is “just” trying to maximize clickthrough may also be targeting black people, de facto denying them opportunities for fair loans. Such behavior is illegal.

On the other hand, even if algorithms are not making illegal decisions, by training algorithms on data produced by humans, we naturally reinforce prejudices of the majority. This can have negative effects, like Google’s autocomplete finishing “Are transgenders” with “going to hell?” Even if this is the most common question being asked on Google, and even if the majority think it’s morally acceptable to display this to users, this shows that algorithms do in fact encode our prejudices. People are slowly coming to realize this, to the point where it was recently covered in the New York Times.
There are many facets to the algorithm fairness problem, one that has not even been widely acknowledged as a problem, despite the Times article. The message has been echoed by machine learning researchers but mostly ignored by practitioners. In particular, “experts” continually make ignorant claims such as, “equations can’t be racist,” and the following quote from the above-linked article about how the Chicago Police Department has been using algorithms to do predictive policing.
Wernick denies that [the predictive policing] algorithm uses “any racial, neighborhood, or other such information” to assist in compiling the heat list [of potential repeat offenders].
Why is this ignorant? Because of the well-known fact that removing explicit racial features from data does not eliminate an algorithm’s ability to learn race. If racial features disproportionately correlate with crime (as they do in the US), then an algorithm which learns race is actually doing exactly what it is designed to do! One needs to be very thorough to say that an algorithm does not “use race” in its computations. Algorithms are not designed in a vacuum, but rather in conjunction with the designer’s analysis of their data. There are two points of failure here: the designer can unwittingly encode biases into the algorithm based on a biased exploration of the data, and the data itself can encode biases due to human decisions made to create it. Because of this, the burden of proof is (or should be!) on the practitioner to guarantee they are not violating discrimination law. Wernick should instead prove mathematically that the policing algorithm does not discriminate.
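To make the point about proxies concrete, here is a minimal sketch using made-up data. Everything in it is an assumption for illustration: the group labels "X" and "Y", the neighborhood names, and the counts are hypothetical and not drawn from any real dataset or from the Chicago system. The "algorithm" never sees the group column, yet a single correlated feature lets it recover group membership most of the time.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-built records of the form (group, neighborhood).
# The counts are invented so that neighborhood correlates with group.
records = (
    [("X", "north")] * 90 + [("Y", "north")] * 10
    + [("X", "south")] * 20 + [("Y", "south")] * 80
)


def majority_group_by_feature(rows):
    """Map each feature value to the group most often seen with it."""
    counts = defaultdict(Counter)
    for group, feature in rows:
        counts[feature][group] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}


def proxy_accuracy(rows):
    """How often the group can be guessed from the feature alone."""
    guess = majority_group_by_feature(rows)
    correct = sum(1 for group, feature in rows if guess[feature] == group)
    return correct / len(rows)


# The guesser only looks at the neighborhood, never at the group itself.
accuracy = proxy_accuracy(records)
print(f"group recovered from neighborhood alone: {accuracy:.0%}")
```

With these invented counts the guess is right for 170 of the 200 records. Real data has many such correlated features at once, which is why dropping the explicit column is not enough to claim an algorithm does not "use race."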
While that viewpoint is idealistic, it’s a bit naive because there is no accepted definition of what it means for an algorithm to be fair. In fact, from a precise mathematical standpoint, there isn’t even a precise legal definition of what it means for any practice to be fair. In the US the existing legal theory is called disparate impact, which states that a practice can be considered illegal discrimination if it has a “disproportionately adverse” effect on members of a protected group. Here “disproportionate” is precisely defined by the 80% rule, but this is somehow not enforced as stated. As with many legal issues, laws are broad assertions that are challenged on a case-by-case basis. In the case of fairness, the legal decision usually hinges on whether an individual was treated unfairly, because the individual is the one who files the lawsuit. Our understanding of the law is cobbled together, essentially through anecdotes slanted by political agendas. A mathematician can’t make progress with that. We want the mathematical essence of fairness, not something that can be interpreted depending on the court majority.
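For reference, the 80% rule (also called the four-fifths rule) is usually stated as follows: if the rate at which a protected group receives a favorable outcome is less than 80% of the rate for the most favored group, that is taken as evidence of adverse impact. The sketch below computes that ratio. The function names and the sample decisions are my own assumptions for illustration, and passing this check is a statistical screen, not a legal determination.

```python
def selection_rate(outcomes):
    """Fraction of favorable outcomes (1 = favorable, 0 = not)."""
    return sum(1 for o in outcomes if o) / len(outcomes)


def disparate_impact_ratio(protected, reference):
    """Protected group's selection rate divided by the reference group's."""
    return selection_rate(protected) / selection_rate(reference)


def violates_80_percent_rule(protected, reference, threshold=0.8):
    """True if the ratio falls below the threshold (0.8 by default)."""
    return disparate_impact_ratio(protected, reference) < threshold


# Hypothetical decisions: 30 of 100 approved vs. 50 of 100 approved.
protected = [1] * 30 + [0] * 70
reference = [1] * 50 + [0] * 50

ratio = disparate_impact_ratio(protected, reference)
print(f"ratio = {ratio:.2f}, flagged = {violates_80_percent_rule(protected, reference)}")
```

Here the ratio is 0.30 / 0.50 = 0.60, which is below 0.8, so the hypothetical practice would be flagged. Note that this is a statement about groups; it says nothing about whether any particular individual was treated unfairly.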
The problem is exacerbated for data mining because the practitioners often demonstrate a poor understanding of statistics, the management doesn’t understand algorithms, and almost everyone is lulled into a false sense of security via abstraction (remember, “equations can’t be racist”). Experts in discrimination law aren’t trained to audit algorithms, and engineers aren’t trained in social science or law. The speed with which research becomes practice far outpaces the speed at which anyone can keep up. This is especially true at places like Google and Facebook, where teams of in-house mathematicians and algorithm designers bypass the delay between academia and industry.
And perhaps the worst part is that even the world’s best mathematicians and computer scientists don’t know how to interpret the output of many popular learning algorithms. This isn’t just a problem that stupid people aren’t listening to smart people, it’s that everyone is “stupid.” A more politically correct way to say it: transparency in machine learning is a wide open problem. Take, for example, deep learning. A far-removed adaptation of neuroscience to data mining, deep learning has become the flagship technique spearheading modern advances in image tagging, speech recognition, and other classification problems.
A typical example of how a deep neural network learns to tag images. Image source: http://engineering.flipboard.com/2015/05/scaling-convnets/
The picture above shows how low-level “features” (which essentially boil down to simple numerical combinations of pixel values) are combined in a “neural network” into more complicated image-like structures. The claim that these features represent natural concepts like “cat” and “horse” has fueled the public attention on deep learning for years. But looking at the above, is there any reasonable way to say whether these are encoding “discriminatory information”? Not only is this an open question, but we don’t even know what kinds of problems deep learning can solve! How can we understand to what extent neural networks can encode discrimination if we don’t have a deep understanding of why a neural network is good at what it does?
What makes this worse is that there are only about ten people in the world who understand the practical aspects of deep learning well enough to achieve record results with it. This means they spent a ton of time tinkering with the model to make it domain-specific, and nobody really knows whether the subtle differences between the top models correspond to genuine advances or slight overfitting or luck. Who is to say whether the fiasco with Google tagging images of black people as apes was caused by the data or the deep learning algorithm or by some obscure tweak made by the designer? I doubt even the designer could tell you with any certainty.
Opacity and a lack of interpretability are the rule more than the exception in machine learning. Celebrated techniques like Support Vector Machines, Boosting, and recent popular “tensor methods” are all highly opaque. This means that even if we knew what fairness meant, it is still a challenge (though one we’d be suited for) to modify existing algorithms to become fair. But with recent success stories in theoretical computer science connecting security, trust, and privacy, computer scientists have started to take up the call of nailing down what fairness means, and how to measure and enforce fairness in algorithms. There is now a yearly workshop called Fairness, Accountability, and Transparency in Machine Learning (FAT-ML, an awesome acronym), and some famous theory researchers are starting to get involved, as are social scientists and legal experts. Full disclosure, two days ago I gave a talk as part of this workshop on modifications to AdaBoost that seem to make it more fair. More on that in a future post.
From our perspective, we the computer scientists and mathematicians, the central obstacle is still that we don’t have a good definition of fairness.
In the next post I want to get a bit more technical. I’ll describe the parts of the fairness literature I like (which will be biased), I’ll hypothesize about the tension between statistical fairness and individual fairness, and I’ll entertain ideas on how someone designing a controversial algorithm (such as a predictive policing algorithm) could maintain transparency and accountability over its discriminatory impact. In subsequent posts I want to explain in more detail why it seems so difficult to come up with a useful definition of fairness, and to describe some of the ideas I and my coauthors have worked on.
Until then!