Meta researchers present MCC, a method for reconstructing a 3D model from a single image. The company sees applications in VR/AR and robotics.
Architectures like the Transformer, combined with huge amounts of training data, have produced impressive language models such as OpenAI’s GPT-3 and, more recently, ChatGPT.
Breakthroughs in natural language processing have brought a key insight: scaling often enables foundation models that outperform previous approaches.
The prerequisites are domain-independent architectures such as Transformers, which can handle different modalities, and self-supervised training on a large corpus of unlabeled data.
These architectures, together with large-scale, category-independent learning, have been applied in areas other than language processing, such as image synthesis or image recognition.
Meta’s MCC brings scale to 3D reconstruction
Meta’s FAIR lab is now demonstrating Multiview Compressive Coding (MCC), a Transformer-based encoder-decoder model that can reconstruct 3D objects from a single RGB-D image.
The researchers see MCC as an important step towards a general-purpose AI model for 3D reconstruction with applications in robotics or AR/VR, where a better understanding of 3D spaces and objects or their visual reconstruction opens up many possibilities.
While other approaches like NeRFs require multiple images or train their models with 3D CAD models or other hard-to-obtain and therefore non-scalable data, Meta relies on reconstructing 3D points from RGB-D images.
Such images with depth information are now readily available thanks to the proliferation of iPhones with depth sensors and simple AI networks that estimate depth from RGB images. According to Meta, the approach therefore scales easily, and large datasets can be produced without much effort in the future.
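How a depth image becomes a set of 3D points can be illustrated with the standard pinhole camera model. The following is a minimal sketch in plain Python, not Meta's code: the function name `unproject_depth`, the toy depth map, and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: List[List[Optional[float]]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Turn a depth map into 3D points in the camera frame (pinhole model).

    depth[v][u] is the distance along the optical axis at pixel column u,
    row v. None or non-positive values mark pixels without a depth reading.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None or z <= 0:
                continue  # skip pixels with no valid depth
            x = (u - cx) * z / fx  # horizontal offset from the principal point
            y = (v - cy) * z / fy  # vertical offset from the principal point
            points.append((x, y, z))
    return points


# Toy 2x2 depth map in meters; the intrinsics are made-up values.
depth_map = [[1.0, 2.0], [None, 4.0]]
cloud = unproject_depth(depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud)
```

Each pixel with a valid depth value yields one 3D point. Attaching the pixel's RGB value to that point would give a colored point cloud of the kind the article describes.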
To demonstrate the benefits of the approach, the researchers train MCC with images and videos containing depth information from different datasets, showing objects or entire scenes from many angles.
During training, some of the available views of each scene or object are withheld from the model and used as the training signal. The approach is similar to the training of language or image models, where parts of the data are also often hidden.
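The idea of withholding views can be sketched as a simple data split. This is an illustrative assumption about how such a split could look, not the procedure from Meta's code: `split_views`, the view names, and the toy points are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def split_views(
    views: Dict[str, List[Point]],
    num_input: int = 1,
    seed: int = 0,
) -> Tuple[List[Point], List[Point]]:
    """Split the views of one scene into visible input and hidden targets."""
    if not 0 < num_input < len(views):
        raise ValueError("need at least one input view and one held-out view")
    names = sorted(views)
    random.Random(seed).shuffle(names)  # seeded so the split is reproducible
    input_names, target_names = names[:num_input], names[num_input:]
    # Points the model is allowed to see.
    seen = [p for name in input_names for p in views[name]]
    # Points from withheld views, used only as supervision targets.
    held_out = [p for name in target_names for p in views[name]]
    return seen, held_out


# Three toy views of one object, each with a single 3D point.
scene = {
    "front": [(0.0, 0.0, 1.0)],
    "left": [(-1.0, 0.0, 0.0)],
    "back": [(0.0, 0.0, -1.0)],
}
seen, held_out = split_views(scene, num_input=1)
print(len(seen), len(held_out))
```

A model trained on such a split receives only `seen` as input and is scored on how well it predicts the geometry in `held_out`, which is what lets it learn about the parts of an object that a single image does not show.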
Meta’s 3D reconstruction shows strong generalizability
In tests, Meta’s AI model outperforms other approaches. The team also claims that MCC can handle categories of objects or entire scenes that it has never seen before.
Moreover, MCC shows the expected scaling characteristics: performance increases dramatically with more training data and more diverse object categories. Given appropriate depth information, iPhone captures, ImageNet images, and DALL-E 2 images can also be reconstructed as 3D point clouds.
“We present MCC, a general-purpose 3D reconstruction model that works for both objects and scenes. We show generalization to challenging settings, including in-the-wild captures and AI-generated images of imaginary objects.
Our results show that a simple point-based method coupled with large-scale, category-independent training is effective. We hope this is a step towards building a general vision system for 3D understanding,” the researchers write.
The quality of the reconstructions is still far from human understanding. However, because MCC is relatively easy to scale up, the approach could improve quickly.
A multimodal variant that allows text-driven synthesis of 3D objects, for example, might just be a matter of time. OpenAI is pursuing a similar approach with Point-E.
Many examples, including 3D models, are available on the MCC project page. The code is available on GitHub.