The picture above is funny.

But for me it is also one of those examples that make me sad about the outlook for AI and for Computer Vision. What would it take for a computer to understand this image as you or I do? I challenge you to think explicitly of all the pieces of knowledge that have to fall in place for it to make sense. Here is my short attempt:

  • You recognize it is an image of a bunch of people and you understand they are in a hallway
  • You recognize that there are 3 mirrors in the scene so some of those people are "fake" replicas from different viewpoints.
  • You recognize Obama from the few pixels that make up his face. It helps that he is in his suit and that he is surrounded by other people with suits.
  • You recognize that there's a person standing on a scale, even though the scale occupies only very few white pixels that blend with the background. But, you've used the person's pose and knowledge of how people interact with objects to figure it out.
  • You recognize that Obama has his foot positioned just slightly on top of the scale. Notice the language I'm using: It is in terms of the 3D structure of the scene, not the position of the leg in the 2D coordinate system of the image.
  • You know how physics works: Obama is leaning in on the scale, which applies a force on it. Scale measures force that is applied on it, that's how it works => it will over-estimate the weight of the person standing on it.
  • The person measuring his weight is not aware of Obama doing this. You derive this because you know his pose, you understand that the field of view of a person is finite, and you understand that he is not very likely to sense the slight push of Obama's foot.
  • You understand that people are self-conscious about their weight. You also understand that he is reading off the scale measurement, and that shortly the over-estimated weight will confuse him because it will probably be much higher than what he expects. In other words, you reason about implications of the events that are about to unfold seconds after this photo was taken, and especially about the thoughts and how they will develop inside people's heads. You also reason about what pieces of information are available to people.
  • There are people in the back who find the person's imminent confusion funny. In other words you are reasoning about state of mind of people, and their view of the state of mind of another person. That's getting frighteningly meta.
  • Finally, the fact that the perpetrator here is the president makes it maybe even a little more funnier. You understand what actions are more or less likely to be undertaken by different people based on their status and identity.

I could go on, but the point here is that you've used a HUGE amount of information in that half second when you look at the picture and laugh. Information about the 3D structure of the scene, confounding visual elements like mirrors, identities of people, affordances and how people interact with objects, physics (how a particular instrument works,  leaning and what that does), people, their tendency to be insecure about weight, you've reasoned about the situation from the point of view of the person on the scale, what he is aware of, what his intents are and what information is available to him, and you've reasoned about people reasoning about people. You've also thought about the dynamics of the scene and made guesses about how the situation will unfold in the next few seconds visually, how it will unfold in the thoughts of people involved, and you reasoned about how likely or unlikely it is for people of particular identity/status to carry out some action. Somehow all these things come together to "make sense" of the scene.

It is mind-boggling that all of the above inferences unfold from a brief glance at a 2D array of R,G,B values. The core issue is that the pixel values are just a tip of a huge iceberg and deriving the entire shape and size of the icerberg from prior knowledge is the most difficult task ahead of us. How can we even begin to go about writing an algorithm that can reason about the scene like I did? Forget for a moment the inference algorithm that is capable of putting all of this together; How do we even begin to gather data that can support these inferences (for example how a scale works)? How do we go about even giving the computer a chance?

Now consider that the state of the art techniques in Computer Vision are tested on things like Imagenet (task of assigning 1-of-k labels for entire images), or Pascal VOC detection challenge (+ include bounding boxes). There is also quite a bit of work on pose estimation, action recognition, etc., but it is all specific, disconnected, and only half works. I hate to say it but the state of CV and AI is pathetic when we consider the task ahead, and when we think about how we can ever go from here to there. The road ahead is long, uncertain and unclear.

I've seen some arguments that all we need is lots more data from images, video, maybe text and run some clever learning algorithm: maybe a better objective function, run SGD, maybe anneal the step size, use adagrad, or slap an L1 here and there and everything will just pop out. If we only had a few more tricks up our sleeves! But to me, examples like this illustrate that we are missing many crucial pieces of the puzzle and that a central problem will be as much about obtaining the right training data in the right form to support these inferences as it will be about making them.

Thinking about the complexity and scale of the problem further, a seemingly inescapable conclusion for me is that we may also need embodiment, and that the only way to build computers that can interpret scenes like we do is to allow them to get exposed to all the years  of (structured, temporally coherent) experience we have,  ability to interact with the world, and some magical active learning/inference architecture that I can barely even imagine when I think backwards about what it should be capable of.

In any case, we are very, very far and this depresses me. What is the way forward? :( Maybe I should just do a startup. I have a really cool idea for a mobile local social iPhone app.

from: http://karpathy.github.io/2012/10/22/state-of-computer-vision/

计算机视觉和人工智能的状态:我们已经走得很远了 The state of Computer Vision and AI: we are really, really far away.的更多相关文章

  1. ACM一年记,总结报告(希望自己可以走得很远)

    一. 知识点梳理 (一) 先从工具STL说起: 容器学习了:stack,queue,priority_queue,set/multiset,map/multimap,vector. 1.stack: ...

  2. 2019年度【计算机视觉&机器学习&人工智能】国际重要会议汇总

    简介 每年全世界都会举办很多计算机视觉(Computer Vision,CV). 机器学习(Machine Learning,ML).人工智能(Artificial Intelligence ,AI) ...

  3. 在计算机视觉与人工智能领域,顶级会议比SCI更重要(内容转)

    很多领域,SCI是王道,尤其在中国,在教师科研职称评审和学生毕业条件中都对SCI极为重视,而会议则充当了补充者的身份.但是在计算机领域,尤其是人工智能与机器学习领域里,往往研究者们更加青睐于会议 我无 ...

  4. paper 156:专家主页汇总-计算机视觉-computer vision

    持续更新ing~ all *.files come from the author:http://www.cnblogs.com/findumars/p/5009003.html 1 牛人Homepa ...

  5. 边缘计算、区块链、5G,哪个能走的更远

    频繁出现的新词汇5G.区块链.边缘计算,这些都代表了什么,又能给我们的生活带来什么巨大的改变么?抉择之时已至,能够走向未来的真的只有一个吗? "没有什么能够阻挡,你对自由的向往....&qu ...

  6. 计算机视觉中的边缘检测Edge Detection in Computer Vision

    计算机视觉中的边缘检测   边缘检测是计算机视觉中最重要的概念之一.这是一个很直观的概念,在一个图像上运行图像检测应该只输出边缘,与素描比较相似.我的目标不仅是清晰地解释边缘检测是怎样工作的,同时也提 ...

  7. AI-Azure上的认知服务之Computer Vision(计算机视觉)

    使用 Azure 的计算机视觉服务,开发人员可以访问用于处理图像并返回信息的高级算法. 主要包含如下高级算法: 标记视觉特性Tag visual features 检测对象Detect objects ...

  8. 如何创建Azure Face API和计算机视觉Computer Vision API

    在人工智能技术飞速发展的当前,利用技术手段实现人脸识别.图片识别已经不是什么难事.目前,百度.微软等云计算厂商均推出了人脸识别和计算机视觉的API,其优势在于不需要搭建本地环境,只需要通过网络交互,就 ...

  9. 【29】带你了解计算机视觉(Computer vision)

    计算机视觉(Computer vision) 计算机视觉是一个飞速发展的一个领域,这多亏了深度学习. 深度学习与计算机视觉可以帮助汽车,查明周围的行人和汽车,并帮助汽车避开它们. 还使得人脸识别技术变 ...

随机推荐

  1. SQL Server 本地语言版本

    要一些实验是往往喜欢使用英文的Windows 以及SQL Server ,但有时需要使用中文的环境方便理解.中文的SQL Server 不能被安装在英文的Windows 系统上. 根据文档可得知以下兼 ...

  2. IE中出现 "Stack overflow at line" 错误的解决方法

    在做网站时遇到一个问题,网站用的以前的程序,在没有改过什么程序的情况下,页面总是提示Stack overflow at line 0的错误,而以前的网站都正常没有出现过这种情况,在网上找了一下解决办法 ...

  3. Mysql数据库中的计数器表实时更新

    如果某个应用中存在计数器,例如网站的总访问量.用户的粉丝数.文件下载数等等.如果相关应用在Mysql数据库的表中保存计数器,在更新计数器的时候可能会碰到并发问题.例如在web应用中,记录网站的点击次数 ...

  4. vs2010旗舰版后,运行调试一个项目时调试不了,提示的是:无法使用“pc”附加到应用程序“webdev.webserver40.exe(PID:2260”

    具体问题描述: vs2010旗舰版后,运行调试一个项目时调试不了,能编译,按ctrl+f5 可以运行,但是就是调试就不行,提示的是:无法使用“pc”附加到应用程序“webdev.webserver40 ...

  5. How to: Add SharePoint 2010 Search Web Parts to Web Part Gallery for Upgraded Site Collections

    When you upgrade to Microsoft SharePoint Server 2010, some of the new SharePoint Enterprise Search W ...

  6. Daily Scrum 11.11

    摘要:本次会议继续讨论程序的问题以及单元测试和集成测试,本次测试为1.02版本.本次的Task列表如下: Task列表 出席人员 Today's Task Tomorrow's Task 刘昊岩  t ...

  7. Thinkphp中路由Url获取的使用方法

    Thinkphp是一个体系较为完整的框架,很多地方比国外的框架功能都全,唯一不喜之处是性能,和传说中的.NET有点像. Thinkphp提供较全url处理体系,通过同一规则实现Url的路由和Url生成 ...

  8. 牛客最近来了一个新员工Fish,每天早晨总是会拿着一本英文杂志,写些句子在本子上。同事Cat对Fish写的内容颇感兴趣,有一天他向Fish借来翻看,但却读不懂它的意思。例如,“student. a am I”。后来才意识到,这家伙原来把句子单词的顺序翻转了,正确的句子应该是“I am a student.”。Cat对一一的翻转这些单词顺序可不在行,你能帮助他么?

    // test20.cpp : 定义控制台应用程序的入口点. // #include "stdafx.h" #include<iostream> #include< ...

  9. 如何通过CSS3实现背景图片色彩的梯度渐变

    随着网站的越来越普及,CSS3和HTML5必将成为网站前端发展的主流方向,特别是在移动端,高端浏览器给前端工程师们带来了无以言表的体验. 通俗的来讲,CSS3可以说是CSS技术的升级版本,CSS3语言 ...

  10. PHP之session相关实例教程与经典代码

    ·php 中cookie和session的用法比较 ·phpmyadmin报错:Cannot start session without errors问题 ·php中cookie与session应用学 ...