What does it mean for an algorithm to be fair?

In 2014 the White House commissioned a 90-day study that culminated in a report (pdf) on the state of “big data” and related technologies. The authors give many recommendations, including this central warning.

Warning: algorithms can facilitate illegal discrimination!

Here’s a not-so-imaginary example of the problem. A bank wants people to take loans with high interest rates, and it also serves ads for these loans. A modern idea is to use an algorithm to decide, based on the sliver of known information about a user visiting a website, which advertisement to present so as to maximize the chance the user clicks on it. There’s one problem: these algorithms are trained on historical data, and poor uneducated people (often racial minorities) have a historical trend of being more likely to succumb to predatory loan advertisements than the general population. So an algorithm that is “just” trying to maximize clickthrough may also be targeting black people, de facto denying them opportunities for fair loans. Such behavior is illegal.

On the other hand, even if algorithms are not making illegal decisions, by training algorithms on data produced by humans, we naturally reinforce prejudices of the majority. This can have negative effects, like Google’s autocomplete finishing “Are transgenders” with “going to hell?” Even if this is the most common question being asked on Google, and even if the majority think it’s morally acceptable to display this to users, this shows that algorithms do in fact encode our prejudices. People are slowly coming to realize this, to the point where it was recently covered in the New York Times.

There are many facets to the algorithmic fairness problem, and it has not even been widely acknowledged as a problem, despite the Times article. The message has been echoed by machine learning researchers but mostly ignored by practitioners. In particular, “experts” continually make ignorant claims such as “equations can’t be racist,” as in the following quote from the above-linked article about how the Chicago Police Department has been using algorithms to do predictive policing.

Wernick denies that [the predictive policing] algorithm uses “any racial, neighborhood, or other such information” to assist in compiling the heat list [of potential repeat offenders].

Why is this ignorant? Because of the well-known fact that removing explicit racial features from data does not eliminate an algorithm’s ability to learn race. If race disproportionately correlates with crime in the data (as it does in the US), then an algorithm that implicitly reconstructs race from other features is doing exactly what it is designed to do: find whatever patterns best predict its target. One needs to be very thorough to say that an algorithm does not “use race” in its computations. Algorithms are not designed in a vacuum, but rather in conjunction with the designer’s analysis of their data. There are two points of failure here: the designer can unwittingly encode biases into the algorithm based on a biased exploration of the data, and the data itself can encode biases due to human decisions made to create it. Because of this, the burden of proof is (or should be!) on the practitioner to guarantee they are not violating discrimination law. Wernick should instead prove mathematically that the policing algorithm does not discriminate.
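To make the point concrete, here is a minimal sketch using entirely synthetic data and hypothetical feature names (zip_code_region, income): a model that is never shown a race column can still recover race from correlated proxy features. The scikit-learn logistic regression is just a stand-in for whatever classifier a practitioner might deploy.

```python
# Hypothetical illustration: race is never a feature, yet it is recoverable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# The protected attribute, which the model never sees directly.
race = rng.integers(0, 2, size=n)

# "Race-blind" features that nevertheless correlate with race, the way
# zip code and income do in segregated housing markets (synthetic numbers).
zip_code_region = race + rng.normal(0, 0.3, size=n)
income = rng.normal(50 - 10 * race, 10, size=n)

X = np.column_stack([zip_code_region, income])  # no race column anywhere

# Try to predict race from the supposedly neutral features.
clf = LogisticRegression().fit(X, race)
print("accuracy recovering race:", clf.score(X, race))  # far above the 50% baseline
```

Any model trained on these same features to predict clickthrough or recidivism has access to essentially the same signal, whether or not anyone intended it to.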

While that viewpoint is idealistic, it’s a bit naive because there is no accepted definition of what it means for an algorithm to be fair. In fact, from a mathematical standpoint, there isn’t even a precise legal definition of what it means for any practice to be fair. In the US the existing legal theory is called disparate impact, which states that a practice can be considered illegal discrimination if it has a “disproportionately adverse” effect on members of a protected group. Here “disproportionate” is nominally made precise by the 80% rule, but even this is not enforced as stated. As with many legal issues, laws are broad assertions that are challenged on a case-by-case basis. In the case of fairness, the legal decision usually hinges on whether an individual was treated unfairly, because the individual is the one who files the lawsuit. Our understanding of the law is cobbled together, essentially through anecdotes slanted by political agendas. A mathematician can’t make progress with that. We want the mathematical essence of fairness, not something whose interpretation depends on the court majority.
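For reference, the 80% rule itself is simple arithmetic: compare the rate of favorable outcomes in the protected group to the rate in the most-favored group. Here is a hedged sketch with made-up hiring decisions:

```python
# Four-fifths (80%) rule sketch; the decisions below are invented for illustration.
def selection_rate(decisions):
    """Fraction of favorable (1) outcomes among 0/1 decisions."""
    return sum(decisions) / len(decisions)

def passes_80_percent_rule(protected, reference):
    """True if the protected group's selection rate is at least 80% of the
    reference group's selection rate."""
    return selection_rate(protected) >= 0.8 * selection_rate(reference)

group_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 70% favorable outcomes
group_b = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]  # 30% favorable outcomes

print(passes_80_percent_rule(group_b, group_a))  # False: 0.3 < 0.8 * 0.7
```

Even this tidy threshold says nothing about why the rates differ, which is exactly where the legal and mathematical questions begin.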

The problem is exacerbated for data mining because the practitioners often demonstrate a poor understanding of statistics, the management doesn’t understand algorithms, and almost everyone is lulled into a false sense of security via abstraction (remember, “equations can’t be racist”). Experts in discrimination law aren’t trained to audit algorithms, and engineers aren’t trained in social science or law. The speed with which research becomes practice far outpaces the speed at which anyone can keep up. This is especially true at places like Google and Facebook, where teams of in-house mathematicians and algorithm designers bypass the delay between academia and industry.

And perhaps the worst part is that even the world’s best mathematicians and computer scientists don’t know how to interpret the output of many popular learning algorithms. This isn’t just a problem of stupid people not listening to smart people; it’s that everyone is “stupid.” A more politically correct way to say it: transparency in machine learning is a wide open problem. Take, for example, deep learning. A far-removed adaptation of neuroscience to data mining, deep learning has become the flagship technique spearheading modern advances in image tagging, speech recognition, and other classification problems.

A typical example of how a deep neural network learns to tag images. Image source: http://engineering.flipboard.com/2015/05/scaling-convnets/

The picture above shows how low level “features” (which essentially boil down to simple numerical combinations of pixel values) are combined in a “neural network” to more complicated image-like structures. The claim that these features represent natural concepts like “cat” and “horse” has fueled the public attention on deep learning for years. But looking at the above, is there any reasonable way to say whether these are encoding “discriminatory information”? Not only is this an open question, but we don’t even know what kinds of problems deep learning can solve! How can we understand to what extent neural networks can encode discrimination if we don’t have a deep understanding of why a neural network is good at what it does?
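To see why the question is hard even to pose, here is a bare-bones numpy sketch of what one layer’s “features” actually are: weighted combinations of pixel values passed through a nonlinearity. The weights here are random stand-ins for learned ones; nothing about the output numbers announces what concept, benign or discriminatory, they encode.

```python
# Toy single-layer "feature extraction" with random weights (illustration only).
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((8, 8))             # a tiny grayscale "image"
pixels = image.reshape(-1)             # flatten to 64 raw pixel values

W = rng.normal(0, 0.1, size=(16, 64))  # 16 filters; in practice these are learned
features = np.maximum(0, W @ pixels)   # ReLU of linear combinations of pixels

print(features)  # 16 opaque numbers: which, if any, encode "cat", or race?
```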

What makes this worse is that there are only about ten people in the world who understand the practical aspects of deep learning well enough to achieve record results with it. This means they spent a ton of time tinkering with the model to make it domain-specific, and nobody really knows whether the subtle differences between the top models correspond to genuine advances, slight overfitting, or luck. Who is to say whether the fiasco with Google tagging images of black people as apes was caused by the data, the deep learning algorithm, or some obscure tweak made by the designer? I doubt even the designer could tell you with any certainty.

Opacity and a lack of interpretability are the rule more than the exception in machine learning. Celebrated techniques like Support Vector Machines, Boosting, and recent popular “tensor methods” are all highly opaque. This means that even if we knew what fairness meant, it would still be a challenge (though one we’d be well suited for) to modify existing algorithms to become fair. But with recent success stories in theoretical computer science connecting security, trust, and privacy, computer scientists have started to take up the call of nailing down what fairness means, and how to measure and enforce fairness in algorithms. There is now a yearly workshop called Fairness, Accountability, and Transparency in Machine Learning (FAT-ML, an awesome acronym), and some famous theory researchers are starting to get involved, as are social scientists and legal experts. Full disclosure, two days ago I gave a talk as part of this workshop on modifications to AdaBoost that seem to make it more fair. More on that in a future post.

From our perspective as computer scientists and mathematicians, the central obstacle is still that we don’t have a good definition of fairness.

In the next post I want to get a bit more technical. I’ll describe the parts of the fairness literature I like (which will be biased), I’ll hypothesize about the tension between statistical fairness and individual fairness, and I’ll entertain ideas on how someone designing a controversial algorithm (such as a predictive policing algorithm) could maintain transparency and accountability over its discriminatory impact. In subsequent posts I want to explain in more detail why it seems so difficult to come up with a useful definition of fairness, and to describe some of the ideas I and my coauthors have worked on.

Until then!

 
 
