Pushing state-of-the-art in 3D content understanding
2019-10-31 06:34:08
This blog is copied from: https://ai.facebook.com/blog/pushing-state-of-the-art-in-3d-content-understanding/
In order to interpret the world around us, AI systems must understand visual scenes in three dimensions. This need extends beyond robotics, navigation, and even augmented reality applications. Even with 2D photos and videos, the scenes and objects depicted are themselves three-dimensional, of course, and truly intelligent content-understanding systems must be able to recognize the geometry of a cup’s handle when it’s being rotated in a video, or identify which objects are in the foreground and background of a photo.
Today, we’re sharing details on several new Facebook AI research projects that advance the state of the art in 3D image understanding in different but complementary ways. This work, which is being presented at the International Conference on Computer Vision (ICCV) in Seoul, addresses a variety of use cases and circumstances, with different types and amounts of training data and inputs.
Mesh R-CNN is a novel, state-of-the-art method to predict the most accurate 3D shapes in a wide range of real-world 2D images. This method, which leverages our general Mask R-CNN framework for object instance segmentation, can detect even complex objects, such as the legs of a chair or overlapping furniture.
Using an alternative and complementary approach to Mesh R-CNN, termed C3DPO, we’re the first to achieve a successful large-scale 3D reconstruction of nonrigid shapes on three benchmarks for more than 14 object categories by interpreting 3D geometry. We achieve this using only 2D keypoints and zero 3D annotations.
We’ve introduced a novel method to learn associations between images and 3D shapes while significantly reducing the need for annotated training examples. This brings us closer to self-supervised systems that can create 3D representations for more kinds of objects.
We’ve developed a novel technique, called VoteNet, to perform object detection when 3D input from LIDAR or other sensors is available. While most traditional systems for this task depend on 2D image signals, ours is based purely on 3D point clouds and achieves higher precision than prior work.
This research builds on recent advances in using deep learning to predict and localize objects in an image, as well as new tools and architectures for 3D shape understanding, like voxels, point clouds, and meshes. The field of computer vision extends to a wide range of tasks, but 3D understanding will play a central role in advancing the ability of AI systems to more closely understand, interpret, and operate in the real world.
Achieving state-of-the-art in predicting 3D shapes of unconstrained, obstructed objects
Perception systems like Mask R-CNN are powerful and versatile tools for understanding images. But because they make predictions in 2D, they ignore the 3D structure of the world. Leveraging the advances in 2D perception, we designed a 3D object reconstruction model that predicts 3D object shapes from unconstrained real-world images with a range of optical challenges, including objects with occlusion, clutter, and diverse topologies. Adding a third dimension to object detection systems that are robust against such complexities requires stronger engineering capabilities, and current engineering frameworks have hindered progress in this area.
To address these challenges, we augmented Mask R-CNN’s 2D object segmentation system with a mesh prediction branch, and we built Torch3d, a PyTorch library with highly optimized 3D operators, to implement the system. Mesh R-CNN uses Mask R-CNN to detect and classify the various objects in an image. It then infers 3D shapes with a novel mesh predictor, which takes a hybrid approach of voxel prediction followed by mesh refinement. This two-step process enables us to achieve better results than prior work for predicting fine-grained 3D structures. Torch3d helps make this possible by enabling efficient, flexible, and modular implementation of complex operations, like chamfer distance, differentiable mesh sampling, and a differentiable renderer.
We use Detectron2 to implement the resulting system, which uses RGB images as input in order to both detect objects and predict 3D shapes. Similar to Mask R-CNN’s use of supervised learning for strong 2D perception, our novel approach learns 3D prediction using fully supervised learning with pairs of images and meshes. For training, we use the Pix3D data set, composed of 10,000 pairs of images and meshes, which is significantly smaller than typical 2D benchmarks, which contain hundreds of thousands of images and object annotations.
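To make one of the operations mentioned here concrete, the following is a minimal pure-Python sketch of a symmetric chamfer distance between two 3D point sets. It is not Torch3d’s implementation: the function names, the use of squared Euclidean distance, and the choice to sum the two directional means are assumptions made for illustration, and conventions differ between papers.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _one_sided(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean squared distance from each point in src to its nearest point in dst."""
    return sum(min(_sq_dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric chamfer distance (assumed convention: sum of both directions)."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return _one_sided(a, b) + _one_sided(b, a)


# Hypothetical point sets, e.g. sampled from a predicted and a target mesh.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(chamfer_distance(predicted, target))
```

This brute-force version compares every pair of points, so it is only suitable for small sets; an optimized library would typically use batched tensor operations or a spatial index instead.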
We evaluated Mesh R-CNN on two data sets and achieved strong results on both. On the Pix3D data set, Mesh R-CNN is the first system to be able to jointly detect objects of all categories and estimate their full 3D shape across diverse, cluttered, and occluded scenes of furniture. Previous work focused on evaluating models that were trained on perfectly cropped, unoccluded image segments. And on the ShapeNet data set, our hybrid approach of voxel prediction and mesh refinement outperforms prior work by a 7 percent relative margin.
Accurately predicting and reconstructing the shapes of unconstrained scenes in the real world is an important step toward enhancing new experiences, like virtual reality and other forms of telepresence. Still, gathering annotated data for 3D images is substantially more complex and time-consuming than doing so for 2D images, which is why data sets for 3D shape prediction have lagged compared with their 2D counterparts. We’re therefore exploring different approaches to leveraging both supervised and self-supervised learning for reconstructing objects in 3D.
Read the full paper on Mesh R-CNN here.
Reconstructing 3D object categories with 2D keypoints
For scenarios where meshes and corresponding images are not available for training and full reconstruction of static objects or scenes is not necessary, we’ve developed an alternative approach. Our new C3DPO (Canonical 3D Pose Networks) system builds reconstructions of 3D keypoint models and achieves state-of-the-art reconstruction results using the more widely accessible and abundant 2D keypoint supervision. C3DPO helps us understand the 3D geometry of objects in a weakly supervised fashion suitable for large-scale deployment.

2D keypoints, which track specific parts of the object category (e.g., human joints or bird wings), provide a complete set of cues about the object geometry and its deformations, or viewpoint changes. The resulting 3D keypoints are useful, for instance, in modeling 3D faces and full-body meshes for more lifelike avatar graphics in VR. Similar to Mesh R-CNN, C3DPO reconstructs 3D objects using unconstrained images with occlusions and missing values.
C3DPO is the first method capable of reconstructing data sets consisting of hundreds of thousands of images with several thousand 2D keypoints. We achieve state-of-the-art reconstruction accuracy on three different data sets for more than 14 diverse nonrigid object categories. And we’ve made the code for this work available here.
Our model has two important innovations. First, given a set of monocular 2D keypoints, our new 3D reconstruction network predicts the parameters of the corresponding camera viewpoint as well as the 3D keypoint locations in a canonical orientation. Second, we introduce a novel regularization technique termed canonicalization, which consists of a second auxiliary deep network that learns alongside the 3D reconstruction network. This technique addresses the ambiguity that comes with factorizing 3D viewpoint and shape. These two innovations enable us to capture much better statistical models of the data than is possible with traditional approaches.
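The ambiguity between viewpoint and shape can be illustrated with a small pure-Python sketch. It assumes an orthographic camera and rotations about a single axis, and every name in it (`rot_y`, `apply`, `project`, the sample keypoints) is hypothetical rather than taken from the C3DPO code. It shows that two different (viewpoint, shape) pairs can explain exactly the same 2D keypoints.

```python
import math
from typing import List

Vec3 = List[float]


def rot_y(angle: float) -> List[Vec3]:
    """3x3 rotation matrix about the y-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def apply(rot: List[Vec3], points: List[Vec3]) -> List[Vec3]:
    """Rotate each 3D point by the matrix rot."""
    return [[sum(r * x for r, x in zip(row, p)) for row in rot] for p in points]


def project(rot: List[Vec3], shape: List[Vec3]) -> List[List[float]]:
    """Rotate keypoints into the camera frame, then drop depth (orthographic)."""
    return [p[:2] for p in apply(rot, shape)]


# Hypothetical canonical 3D keypoints observed from one camera viewpoint.
shape = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.5], [0.0, 2.0, 1.0]]
keypoints_2d = project(rot_y(0.3), shape)

# A different factorization: rotate the shape by 0.5 rad and
# counter-rotate the camera by the same amount.
alt_shape = apply(rot_y(0.5), shape)
alt_keypoints_2d = project(rot_y(0.3 - 0.5), alt_shape)

# Largest coordinate difference between the two sets of 2D observations.
max_gap = max(
    abs(u - v)
    for p, q in zip(keypoints_2d, alt_keypoints_2d)
    for u, v in zip(p, q)
)
print(max_gap)
```

The two projections agree up to floating-point rounding even though `alt_shape` differs from `shape`, so 2D keypoints alone cannot tell the two explanations apart. A regularizer such as the canonicalization described above is one way to make the network settle on a single orientation per shape.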
Such reconstructions were previously unachievable mainly because of memory constraints with the previous matrix-factorization-based methods, which, unlike our deep network, cannot operate in a “minibatch” regime. Previous methods addressed the modeling of deformations by leveraging multiple simultaneous images and establishing correspondences between instantaneous 3D reconstructions, which requires hardware that’s mostly found in special labs. The efficiencies introduced by C3DPO make it possible to enable 3D reconstruction in cases where employing hardware for 3D capture isn’t feasible, such as with large-scale objects like airplanes. Read the full paper on C3DPO here.
Learning pixel-to-surface mappings from image collections
We take a step further toward reducing the supervision required for developing 3D understanding for generic classes of objects. We introduce an approach that can leverage unannotated image collections with approximate automatic instance segmentations. Instead of explicitly predicting the 3D structure underlying an image, we tackle a complementary task of mapping pixels in an image to the surface of a category-level template for 3D shapes.
Not only does this mapping allow us to understand the image in the context of a category-level 3D shape, but it also gives us the ability to generalize correspondences between objects of the same class or category. For instance, when we see the highlighted beak of the bird in the left image, we can easily locate the corresponding point in the image on the right.

This is possible because we intuitively understand the shared 3D structure across these instances. Our novel approach of mapping pixels of images to a canonical 3D surface enables our learned system to have this capability as well. When evaluating our approach by measuring its accuracy of transferring correspondences across instances, we achieved results that are about twice as accurate as previous self-supervised methods that did not leverage the underlying 3D structure of the task.
Our key insight – which allows learning with significantly less supervision – is that mapping from pixel to 3D surface can be paired with the inverse operation (going from 3D to pixel) in order to complete a cycle. Our novel approach operationalizes this and can learn using only unannotated, free, publicly available image collections with approximate segmentations from a detection method. Our resulting system can be used off the shelf, applied generally alongside other methods of top-down 3D prediction to provide a complementary pixelwise 3D understanding, and we’ve released the code here.
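The cycle idea can be sketched in a few lines of pure Python. In the real system both directions are learned or estimated; here they are hand-written toy functions, and all names (`cycle_loss`, `camera`, `good_mapping`, `bad_mapping`) as well as the flat "template" are hypothetical stand-ins chosen only to show what the consistency check measures.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[float, float]
Point3 = Tuple[float, float, float]


def cycle_loss(
    pixels: List[Pixel],
    pixel_to_surface: Callable[[Pixel], Point3],
    surface_to_pixel: Callable[[Point3], Pixel],
) -> float:
    """Mean squared distance between each pixel and its round trip through 3D."""
    total = 0.0
    for p in pixels:
        q = surface_to_pixel(pixel_to_surface(p))
        total += (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return total / len(pixels)


# Toy setup (assumption): the template surface is the plane z = 0 and the
# camera is an orthographic projection with a fixed 2D offset.
def camera(point: Point3) -> Pixel:
    return (point[0] + 10.0, point[1] + 20.0)


def good_mapping(pixel: Pixel) -> Point3:
    """Maps each pixel to the surface point that projects back onto it."""
    return (pixel[0] - 10.0, pixel[1] - 20.0, 0.0)


def bad_mapping(pixel: Pixel) -> Point3:
    """Maps each pixel to a surface point that is off by 3 units in y."""
    return (pixel[0] - 10.0, pixel[1] - 17.0, 0.0)


pixels = [(12.0, 25.0), (30.0, 40.0)]
print(cycle_loss(pixels, good_mapping, camera))
print(cycle_loss(pixels, bad_mapping, camera))
```

A mapping that is consistent with the camera gives zero loss, while an inconsistent one is penalized. This is why the round trip can act as a training signal without any manual correspondence annotations.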
As demonstrated by the consistency of the colors of the cars that are moving in the video above, our system yields an invariant pixelwise embedding for objects undergoing motion and rotation. This consistency extends beyond a specific instance and can be useful in scenarios where we need to understand the commonalities across objects.

For instance, if we train a system to learn the correct place to sit on a chair or where to grasp a mug, our representation can be useful the next time the system needs to understand where to sit on a different chair or how to grasp another mug. Such tasks can not only help deepen our understanding of traditional 2D images and video content, but also enhance AR/VR experiences by transferring representations of objects. Read more about canonical surface mapping here.
Improving the fundamentals of object detection in current 3D systems

As leading-edge technologies, like autonomous agents and systems to scan 3D spaces, continue to advance, we need to push forward the mechanisms for detecting objects when 3D data is readily available. In these cases, a 3D scene understanding system needs to know what objects are in a scene and where they are in order to support high-level tasks like navigation. We’ve improved upon existing systems by constructing VoteNet, a highly accurate end-to-end 3D object detection network tailored for point clouds, which was nominated for the Best Paper Award at ICCV 2019. Unlike traditional systems for this task, which depend on 2D image signals, ours is one of the first systems based purely on 3D point clouds. This approach is more efficient and achieves much higher recognition precision than previous works.
Our model, which we’ve open-sourced here, achieves state-of-the-art 3D detection, outperforming all previous methods for 3D object detection by at least 3.7 and 18.4 mAP (mean average precision) on SUN RGB-D and ScanNet, respectively. VoteNet outperforms previous methods by using only geometric information, without relying on standard color images.
VoteNet has a simple design, compact model size, and high efficiency, with a speed of about 100 milliseconds for a full scene and a smaller memory footprint than previous methods designed for research. Our algorithm takes in 3D point clouds from depth cameras and returns 3D bounding boxes of objects with their semantic classes.

We introduce a voting mechanism that’s inspired by the classical Hough voting algorithm. Using this method, we essentially generate new points that lie close to object centers, and these points can then be grouped and aggregated to generate box proposals. The voting itself is learned through deep neural networks: a set of 3D seed points vote for object centers in order to recover where the objects are and what they are.
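The vote, group, and aggregate steps can be sketched in pure Python as follows. This is only an illustration of the idea, not VoteNet’s implementation: in VoteNet the offsets and the final boxes are predicted by trained networks, whereas here the offsets are written by hand, the grouping is a simple greedy radius check, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, ...]


def cast_votes(seeds: List[Point3], offsets: List[Point3]) -> List[Point3]:
    """Each seed point votes for an object center by adding its offset."""
    return [tuple(s + o for s, o in zip(seed, off)) for seed, off in zip(seeds, offsets)]


def group_votes(votes: List[Point3], radius: float) -> List[List[Point3]]:
    """Greedy grouping: a vote joins the first cluster whose first vote is within radius."""
    clusters: List[List[Point3]] = []
    for v in votes:
        for cluster in clusters:
            if math.dist(v, cluster[0]) <= radius:
                cluster.append(v)
                break
        else:
            clusters.append([v])
    return clusters


def centroid(points: List[Point3]) -> Point3:
    """Aggregate a cluster of votes into one center estimate."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


# Hypothetical seed points lying on the surfaces of two objects.
seeds = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 2.0, 0.0)]
# Hand-written offsets standing in for the learned voting module.
offsets = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]

votes = cast_votes(seeds, offsets)
centers = [centroid(c) for c in group_votes(votes, radius=0.5)]
print(centers)
```

Surface points are usually far from an object’s center, so moving them toward the center before grouping makes the clusters much tighter. Each aggregated cluster would then feed a proposal step that predicts a box and a semantic class.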
As the use of 3D scanners grows in the real world — already common in applications from autonomous vehicles to biomedicine — it’s important for us to be able to achieve semantic understanding of the 3D content by localizing and classifying objects of a 3D scene. Supplementing 2D cameras with more advanced depth camera sensors for 3D recognition allows us to capture a more robust view of any given scene. With VoteNet, systems can better recognize major objects in a scene, supporting tasks like placing a virtual object, or navigation and LiveMap construction.
Developing systems with richer understanding of the real world
3D computer vision has many open research questions, and we are experimenting with multiple problem statements, techniques, and methods of supervision as we explore the best way to push the field forward as we did for 2D understanding. As the digital world adapts and shifts to use products like 3D Photos and immersive AR and VR experiences, we need to keep pushing sophisticated systems to more accurately understand and interact with objects in a visual scene.
It’s also part of Facebook AI’s long-term goal of developing AI systems that understand and interact with the real world as humans do. We have been creating scientific breakthroughs across a broad range of capabilities focused on narrowing the gap between physical and virtual spaces. Our latest 3D-focused research can also help improve and better populate 3D objects in Facebook AI’s simulation platform, which is important for training virtual agents to operate in the real world. In the same way that robotics pushes us to address complex challenges that come from conducting experiments in the physical world, where conditions are more unpredictable, 3D research is important for teaching systems how to understand all viewpoints of objects, even when they’re occluded, hidden, or have other optical challenges.
When combined with other senses, like tactile sensing and natural language understanding, AI systems, such as virtual assistants, can function in a way that’s more seamless and useful. Collectively, this leading-edge research helps us move one step closer to building AI systems that can more intuitively understand three dimensions in the same way that humans do.
The research papers described in this blog post are being presented at ICCV 2019, along with other new work in computer vision, including:
SlowFast, a method for extracting information from video using input at two different frame rates.
TensorMask, an alternative method of object segmentation using the dense, sliding-window technique.
Written by
Georgia Gkioxari
Research Scientist
Shubham Tulsiani
Research Scientist
David Novotny
Research Scientist