Pushing state-of-the-art in 3D content understanding

2019-10-31 06:34:08

This blog is copied from: https://ai.facebook.com/blog/pushing-state-of-the-art-in-3d-content-understanding/

In order to interpret the world around us, AI systems must understand visual scenes in three dimensions. This need extends beyond robotics, navigation, and even augmented reality applications. Even with 2D photos and videos, the scenes and objects depicted are themselves three-dimensional, of course, and truly intelligent content-understanding systems must be able to recognize the geometry of a cup’s handle when it’s being rotated in a video, or identify which objects are in the foreground and background of a photo.

Today, we’re sharing details on several new Facebook AI research projects that advance the state of the art in 3D image understanding in different but complementary ways. This work, which is being presented at the International Conference on Computer Vision (ICCV) in Seoul, addresses a variety of use cases and circumstances, with different types and amounts of training data and inputs.

  • Mesh R-CNN is a novel, state-of-the-art method for predicting accurate 3D shapes of objects in a wide range of real-world 2D images. This method, which leverages our general Mask R-CNN framework for object instance segmentation, can detect even complex objects, such as the legs of a chair or overlapping furniture.

  • Using an alternative and complementary approach to Mesh R-CNN, termed C3DPO, we’re the first to achieve successful large-scale 3D reconstruction of nonrigid shapes, covering more than 14 object categories across three benchmarks. We achieve this using only 2D keypoints and no 3D annotations.

  • We’ve introduced a novel method to learn associations between images and 3D shapes while significantly reducing the need for annotated training examples. This brings us closer to self-supervised systems that can create 3D representations for more kinds of objects.

  • We’ve developed a novel technique, called VoteNet, to perform object detection when 3D input from LIDAR or other sensors is available. While most traditional systems for this task depend on 2D image signals, ours is based purely on 3D point clouds and achieves higher precision than prior work.

This research builds on recent advances in using deep learning to predict and localize objects in an image, as well as new tools and architectures for 3D shape understanding, like voxels, point clouds, and meshes. The field of computer vision extends to a wide range of tasks, but 3D understanding will play a central role in advancing the ability of AI systems to more closely understand, interpret, and operate in the real world.

Achieving state-of-the-art in predicting 3D shapes of unconstrained, obstructed objects

Perception systems like Mask R-CNN are powerful and versatile tools for understanding images. But because they make predictions in 2D, they ignore the 3D structure of the world. Leveraging the advances in 2D perception, we designed a 3D object reconstruction model that predicts 3D object shapes from unconstrained real-world images with a range of optical challenges, including occlusion, clutter, and diverse topologies. Adding a third dimension to object detection systems that are robust to such complexities requires stronger engineering tools, and the limitations of existing frameworks have hindered progress in this area.

 
[Video] Mesh R-CNN takes an input image, predicts object instances in that image, and infers their 3D shapes. To capture diversity in geometries and topologies, it first predicts coarse voxels, which are then refined into accurate mesh predictions.

To address these challenges, we augmented Mask R-CNN’s 2D object segmentation system with a mesh prediction branch, and we built Torch3d, a PyTorch library with highly optimized 3D operators, in order to implement the system. Mesh R-CNN uses Mask R-CNN to detect and classify the various objects in an image. It then infers 3D shapes with a novel mesh predictor, which uses a hybrid approach: coarse voxel prediction followed by mesh refinement. This two-step process enables us to achieve more accurate results than prior work when predicting fine-grained 3D structures. Torch3d helps make this possible by enabling efficient, flexible, and modular implementation of complex operations, like chamfer distance, differentiable mesh sampling, and a differentiable renderer.
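As a purely illustrative sketch of this hybrid design, the following PyTorch modules mimic the two stages: a voxel branch that predicts coarse occupancy from region features, and a refinement stage that nudges mesh vertices using neighbor-pooled features. Module names, layer sizes, and the simplified graph convolution are assumptions for exposition, not the actual Mesh R-CNN code.

```python
import torch
import torch.nn as nn

class VoxelBranch(nn.Module):
    """Toy voxel branch: predicts a coarse D x H x W occupancy grid
    from an RoI feature map (one output channel per depth slice)."""
    def __init__(self, in_channels, depth=24):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, depth, kernel_size=1),
        )

    def forward(self, roi_features):          # (N, C, H, W)
        return self.net(roi_features)         # (N, D, H, W) occupancy logits

class MeshRefinementStage(nn.Module):
    """Toy refinement stage: pool features from each vertex's neighbors
    (a simplified graph convolution) and predict a per-vertex offset."""
    def __init__(self, feat_dim):
        super().__init__()
        self.gconv = nn.Linear(feat_dim + 3, feat_dim)
        self.offset = nn.Linear(feat_dim, 3)

    def forward(self, verts, vert_feats, neighbors):
        # verts: (V, 3); vert_feats: (V, F); neighbors: (V, K) vertex indices
        x = torch.relu(self.gconv(torch.cat([vert_feats, verts], dim=-1)))
        x = x[neighbors].mean(dim=1)          # aggregate neighbor features
        return verts + self.offset(x), x      # moved vertices, new features
```

In the real system, vertex features are also pooled from the image backbone and several refinement stages are chained; this sketch only conveys the coarse-to-fine structure.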

We use Detectron2 to implement the resulting system, which uses RGB images as input in order to both detect objects and predict 3D shapes. Similar to Mask R-CNN’s use of supervised learning for strong 2D perception, our novel approach learns 3D prediction using fully supervised learning with pairs of images and meshes. For training, we use the Pix3D data set, composed of 10,000 pairs of images and meshes, which is significantly smaller than the 2D benchmarks that typically contain hundreds of thousands of images and object annotations.
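Torch3d was subsequently open-sourced as PyTorch3D. Assuming that library's current API, a minimal sketch of the mesh supervision might look like this, pairing differentiable mesh sampling with a chamfer distance on the sampled point sets:

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_edge_loss

# Toy predicted and ground-truth meshes (one triangle each).
faces = torch.tensor([[0, 1, 2]])
pred_verts = torch.tensor(
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], requires_grad=True)
gt_verts = pred_verts.detach() + 0.1

pred_mesh = Meshes(verts=[pred_verts], faces=[faces])
gt_mesh = Meshes(verts=[gt_verts], faces=[faces])

# Differentiable mesh sampling: draw points on both surfaces, then
# compare the point sets with a chamfer distance.
pred_pts = sample_points_from_meshes(pred_mesh, num_samples=1000)
gt_pts = sample_points_from_meshes(gt_mesh, num_samples=1000)
loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)

# An edge-length regularizer keeps refined meshes well behaved.
loss = loss_chamfer + 0.1 * mesh_edge_loss(pred_mesh)
loss.backward()  # gradients flow back to the predicted vertices
```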

We evaluated Mesh R-CNN on two data sets and achieved strong results on both. On the Pix3D data set, Mesh R-CNN is the first system to be able to jointly detect objects of all categories and estimate their full 3D shape across diverse, cluttered, and occluded scenes of furniture. Previous work focused on evaluating models that were trained on perfectly cropped, unoccluded image segments. And on the ShapeNet data set, our hybrid approach of voxel prediction and mesh refinement outperforms prior work by a 7 percent relative margin.

 
[Video] System overview of Mesh R-CNN. We augment Mask R-CNN with 3D shape inference.

Accurately predicting and reconstructing the shapes of unconstrained scenes in the real world is an important step toward enhancing new experiences, like virtual reality and other forms of telepresence. Still, gathering annotated data for 3D images is substantially more complex and time-consuming than doing so for 2D images, which is why data sets for 3D shape prediction have lagged compared with their 2D counterparts. We’re therefore exploring different approaches to leveraging both supervised and self-supervised learning for reconstructing objects in 3D.

Read the full paper on Mesh R-CNN here.

Reconstructing 3D object categories with 2D keypoints

For scenarios in which meshes and corresponding images are not available for training, and full reconstruction of static objects or scenes is not necessary, we’ve developed an alternative approach. Our new C3DPO (Canonical 3D Pose Networks) system builds reconstructions of 3D keypoint models and achieves state-of-the-art reconstruction results using the more widely accessible and abundant 2D keypoint supervision. C3DPO helps us understand the 3D geometry of objects in a weakly supervised fashion suitable for large-scale deployment.

 
C3DPO generates 3D keypoints from detected 2D keypoints for a range of object categories, accurately differentiating between viewpoint changes and shape deformations.

2D keypoints, which track specific parts of the object category (e.g., human joints or bird wings), provide a complete set of cues about the object’s geometry, its deformations, and viewpoint changes. The resulting 3D keypoints are useful, for instance, in modeling 3D faces and full-body meshes for more lifelike avatar graphics in VR. Similar to Mesh R-CNN, C3DPO reconstructs 3D objects using unconstrained images with occlusions and missing values.

C3DPO is the first method capable of reconstructing data sets consisting of hundreds of thousands of images with several thousand 2D keypoints. We achieve state-of-the-art reconstruction accuracy on three different data sets for more than 14 diverse nonrigid object categories. And we’ve made the code for this work available here.

Our model has two important innovations. First, given a set of monocular 2D keypoints, our new 3D reconstruction network predicts the parameters of the corresponding camera viewpoint as well as the 3D keypoint locations in a canonical orientation. Second, we introduce a novel regularization technique termed canonicalization, which consists of a second auxiliary deep network that learns alongside the 3D reconstruction network. This technique addresses the ambiguity that comes with factorizing 3D viewpoint and shape. These two innovations enable us to capture much better statistical models of the data than is possible with traditional approaches.
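To make the first innovation concrete, here is a minimal sketch of a factorization network of this flavor: from 2D keypoints it predicts shape-basis coefficients and a viewpoint rotation, and it is trained so the reprojected canonical shape matches the input. All names, layer sizes, and the orthographic projection are illustrative assumptions rather than the actual C3DPO code.

```python
import torch
import torch.nn as nn
from pytorch3d.transforms import axis_angle_to_matrix

class FactorizationNet(nn.Module):
    """Toy C3DPO-style network: maps 2D keypoints to shape-basis
    coefficients and a camera rotation over a learned canonical basis."""
    def __init__(self, n_kp, n_basis=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * n_kp, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU())
        self.alpha_head = nn.Linear(256, n_basis)  # shape coefficients
        self.rot_head = nn.Linear(256, 3)          # axis-angle viewpoint
        self.basis = nn.Parameter(0.01 * torch.randn(n_basis, n_kp, 3))

    def forward(self, kp2d):                       # kp2d: (B, n_kp, 2)
        h = self.encoder(kp2d.flatten(1))
        alpha = self.alpha_head(h)                                # (B, K)
        shape = torch.einsum('bk,kni->bni', alpha, self.basis)    # (B, N, 3)
        R = axis_angle_to_matrix(self.rot_head(h))                # (B, 3, 3)
        reproj = torch.bmm(shape, R.transpose(1, 2))[..., :2]     # orthographic
        return reproj, shape

net = FactorizationNet(n_kp=14)
kp2d = torch.randn(8, 14, 2)                 # a batch of 2D keypoints
reproj, shape = net(kp2d)
loss = (reproj - kp2d).abs().mean()          # reprojection loss only;
# the canonicalization network (not shown) additionally penalizes canonical
# shapes that change under random rotations, resolving the viewpoint/shape
# ambiguity described above.
```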

Such reconstructions were previously unachievable, mainly because of the memory constraints of earlier matrix-factorization-based methods, which, unlike our deep network, cannot operate in a minibatch regime. Previous methods addressed the modeling of deformations by leveraging multiple simultaneous images and establishing correspondences between instantaneous 3D reconstructions, which requires hardware that’s mostly found in special labs. The efficiencies introduced by C3DPO make 3D reconstruction possible in cases where deploying hardware for 3D capture isn’t feasible, such as with very large objects like airplanes.

Read the full paper on C3DPO here.

Learning pixel-to-surface mappings from image collections

 
[Video] Our system learns a parameterized convolutional neural network (CNN) that takes an image as input and predicts a per-pixel canonical surface map indicating the corresponding point on the template shape. The similar coloring of the predicted canonical surface mapping between the 2D image and the 3D shape implies correspondence.

We take a step further toward reducing the supervision required for developing 3D understanding for generic classes of objects. We introduce an approach that can leverage unannotated image collections with approximate automatic instance segmentations. Instead of explicitly predicting the 3D structure underlying an image, we tackle a complementary task of mapping pixels in an image to the surface of a category-level template for 3D shapes.

Not only does this mapping allow us to understand the image in the context of a category-level 3D shape, but it also gives us the ability to generalize correspondences between objects of the same class or category. For instance, when people see a highlighted point on one bird’s beak, they can easily locate the corresponding point on another bird.

This is possible because we intuitively understand the shared 3D structure across these instances. Our novel approach of mapping pixels of images to a canonical 3D surface enables our learned system to have this capability as well. When evaluating our approach by measuring its accuracy of transferring correspondences across instances, we achieved results that are about twice as accurate as previous self-supervised methods that did not leverage the underlying 3D structure of the task.
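As an illustration of what that evaluation measures, a canonical surface map reduces correspondence transfer to a nearest-neighbor lookup on the template surface. The helper below is a hypothetical sketch, not the paper's evaluation code:

```python
import torch

def transfer_correspondence(csm_a, csm_b, pixel_a):
    """Transfer a pixel from image A to image B via canonical surface maps.
    csm_a, csm_b: (H, W, 3) per-pixel 3D points on the template surface.
    pixel_a: (row, col) query pixel in image A.
    Returns the (row, col) in image B whose template point is nearest."""
    target = csm_a[pixel_a[0], pixel_a[1]]     # template point under pixel A
    dists = (csm_b - target).norm(dim=-1)      # (H, W) surface distances
    flat_idx = torch.argmin(dists).item()
    return divmod(flat_idx, csm_b.shape[1])    # unflatten to (row, col)
```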

Our key insight, which allows learning with significantly less supervision, is that the mapping from a pixel to the 3D surface can be paired with the inverse operation (going from the 3D surface back to pixels) to complete a cycle. Our novel approach operationalizes this and can learn using only unannotated, free, publicly available image collections with approximate segmentations from a detection method. The resulting system can be used off the shelf, applied alongside other methods of top-down 3D prediction to provide complementary pixelwise 3D understanding, and we’ve released the code here.
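A minimal sketch of that cycle as a training loss follows, assuming a pinhole projection matrix and per-pixel foreground weights from the approximate segmentations; the function and its signature are illustrative, not the released code:

```python
import torch

def cycle_consistency_loss(surface_pts, camera, pixels, mask):
    """2D -> 3D -> 2D cycle as a supervisory signal (illustrative).
    surface_pts: (N, 3) predicted template points for N foreground pixels
    camera:      (3, 4) predicted projection matrix for this image
    pixels:      (N, 2) the pixels' original (x, y) image coordinates
    mask:        (N,) approximate foreground weights from a detector"""
    ones = torch.ones(surface_pts.shape[0], 1)
    proj = torch.cat([surface_pts, ones], dim=1) @ camera.T      # (N, 3)
    reproj = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)           # perspective divide
    return (mask * (reproj - pixels).norm(dim=1)).mean()
```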

As demonstrated by the consistency of the colors of the cars that are moving in the video above, our system yields an invariant pixelwise embedding for objects undergoing motion and rotation. This consistency extends beyond a specific instance and can be useful in scenarios where we need to understand the commonalities across objects.

 
Instead of learning the 2D-to-2D correspondence between two images directly, we learn the 2D-to-3D correspondence and ensure consistency with a 3D-to-2D reprojection; this consistent cycle serves as the supervisory signal for learning the 2D-to-3D correspondence.

For instance, if we train a system to learn the correct place to sit on a chair or where to grasp a mug, our representation can be useful the next time the system needs to understand where to sit on a different chair or how to grasp another mug. Such tasks can not only help deepen our understanding of traditional 2D images and video content, but also enhance AR/VR experiences by transferring representations of objects.

Read more about canonical surface mapping here.

Improving the fundamentals of object detection in current 3D systems

As leading-edge technologies, like autonomous agents and systems that scan 3D spaces, continue to advance, we need to push forward the mechanisms for detecting objects when 3D data is readily available. In these cases, a 3D scene understanding system needs to know what objects are in a scene and where they are in order to support high-level tasks like navigation. We’ve improved upon existing systems by constructing VoteNet, a highly accurate end-to-end 3D object detection network tailored for point clouds, which was nominated for the Best Paper Award at ICCV 2019. Unlike traditional systems for this task, which depend on 2D image signals, ours is one of the first based purely on 3D point clouds. This approach is more efficient and achieves much higher recognition precision than previous work.

Our model, which we’ve open-sourced here, achieves state-of-the-art 3D detection, outperforming all previous methods for 3D object detection by at least 3.7 and 18.4 mAP (mean average precision) on SUN RGB-D and ScanNet, respectively. VoteNet outperforms previous methods by using only geometric information, without relying on standard color images.

VoteNet has a simple design, compact model size, and high efficiency, processing a full scene in about 100 milliseconds with a smaller memory footprint than previous methods. Our algorithm takes in 3D point clouds from depth cameras and returns 3D bounding boxes of objects with their semantic classes.

 
Illustration of the VoteNet architecture for 3D object detection in point clouds.

We introduce a voting mechanism that’s inspired by the classical Hough voting algorithm. Using this method, we essentially generate new points that lie close to object centers, and these points can then be grouped and aggregated to generate box proposals. With this voting scheme, learned through deep neural networks, a set of 3D seed points votes for object centers in order to recover both where objects are and what they are.
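A hypothetical sketch of such a voting layer is below: each seed regresses an offset toward its object's center plus a feature residual. Layer sizes and names are illustrative, not the open-sourced VoteNet code.

```python
import torch
import torch.nn as nn

class VotingModule(nn.Module):
    """Toy Hough-voting layer: each 3D seed point predicts an offset
    toward the center of the object it belongs to, plus a feature delta."""
    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3 + feat_dim))   # xyz offset + feature delta

    def forward(self, seed_xyz, seed_feats):
        # seed_xyz: (B, M, 3) seed positions; seed_feats: (B, M, F)
        out = self.mlp(seed_feats)
        vote_xyz = seed_xyz + out[..., :3]       # votes land near centers
        vote_feats = seed_feats + out[..., 3:]
        return vote_xyz, vote_feats

votes, feats = VotingModule(feat_dim=256)(torch.randn(2, 1024, 3),
                                          torch.randn(2, 1024, 256))
# Votes are then clustered (e.g., farthest point sampling plus local
# grouping) and each cluster is decoded into a 3D box and class scores.
```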

As 3D scanners become more common in the real world, in applications from autonomous vehicles to biomedicine, it’s important for us to achieve semantic understanding of 3D content by localizing and classifying objects in a 3D scene. Supplementing 2D cameras with more advanced depth camera sensors for 3D recognition allows us to capture a more robust view of any given scene. With VoteNet, systems can better recognize the major objects in a scene, supporting tasks like placing a virtual object, navigation, and LiveMap construction.

Developing systems with richer understanding of the real world

3D computer vision has many open research questions, and we are experimenting with multiple problem statements, techniques, and methods of supervision as we explore the best ways to push the field forward, just as we did for 2D understanding. As the digital world shifts toward products like 3D Photos and immersive AR and VR experiences, we need to keep building sophisticated systems that can more accurately understand and interact with objects in a visual scene.

It’s also part of Facebook AI’s long-term goal of developing AI systems that understand and interact with the real world as humans do. We have been creating scientific breakthroughs across a broad range of capabilities focused on narrowing the gap between physical and virtual spaces. Our latest 3D-focused research can also help improve and better populate 3D objects in Facebook AI’s simulation platform, which is important for training virtual agents to operate in the real world. In the same way that robotics pushes us to address complex challenges that come from conducting experiments in the physical world, where conditions are more unpredictable, 3D research is important for teaching systems how to understand all viewpoints of objects, even when they’re occluded, hidden, or affected by other optical challenges.

When combined with other senses, like tactile sensing and natural language understanding, AI systems, such as virtual assistants, can function in a way that’s more seamless and useful. Collectively, this leading-edge research helps us move one step closer to building AI systems that can more intuitively understand three dimensions in the same way that humans do.

The research papers described in this blog post are being presented at ICCV 2019, along with other new work in computer vision, including:

  • SlowFast, a method for extracting information from video using input at two different frame rates.

  • TensorMask, an alternate method of object segmentation using a dense, sliding-window technique.

Written by

Georgia Gkioxari

Research Scientist

Shubham Tulsiani

Research Scientist

David Novotny

Research Scientist
