Meta has released a new version of data2vec: data2vec 2.0.
Update of December 29, 2022:
Meta’s improved learning algorithm for different modalities is significantly faster than its predecessor.
Nearly eleven months after the release of data2vec, Meta's AI division presents an improved version of its multimodal learning algorithm. With data2vec, Meta explains, it is much easier to transfer advances in one area of AI research, such as text comprehension, to other areas, such as image segmentation or translation. Like its predecessor, data2vec 2.0 can process speech, images, and text, but it learns much faster.
Data2vec 2.0 is much more efficient and exceeds the strong performance of the first version, the company said. It achieves roughly the same accuracy as a widely used computer vision algorithm, but is 16 times faster.
Data2vec 2.0 learns contextualized representations
Like its predecessor, data2vec 2.0 predicts contextualized representations of data instead of just the pixels in an image, the words in a passage of text, or the sounds in an audio file.
More precisely, the algorithm learns a word such as "bank" from the complete sentence in which it appears, and so learns more quickly to represent the word's correct meaning in context – for example, as a "financial institution".
Meta suspects that this contextualization is responsible for the algorithm's fast learning. To increase efficiency, the team also relies on a student network that learns from a teacher network, and on a convolutional (CNN) decoder rather than a Transformer decoder.
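Meta also attributes part of the speedup to amortizing the teacher's work: the target representations for a training example are computed once and then reused for several differently masked versions of that example. The sketch below only illustrates that bookkeeping; the "teacher" here is an arbitrary stand-in function, not Meta's model, and the counts are purely schematic.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, SEQ, N_MASKS = 16, 8, 8        # N_MASKS masked variants per example

W_t = rng.normal(scale=0.1, size=(DIM, DIM))  # frozen toy "teacher" weights
teacher_calls = 0

def teacher_targets(x):
    # Stand-in for the expensive teacher forward pass over the complete input.
    global teacher_calls
    teacher_calls += 1
    return np.tanh(x @ W_t)

def masked_variants(x, n):
    # n differently masked copies of the same example (rows zeroed out).
    out = []
    for _ in range(n):
        m = rng.random(len(x)) < 0.5
        xm = x.copy()
        xm[m] = 0.0
        out.append(xm)
    return out

batch = [rng.normal(size=(SEQ, DIM)) for _ in range(4)]

# Naive scheme: rerun the teacher for every masked variant.
teacher_calls = 0
for x in batch:
    for xm in masked_variants(x, N_MASKS):
        targets = teacher_targets(x)    # one teacher pass per variant
naive_calls = teacher_calls

# Amortized scheme (as described for data2vec 2.0): compute the targets
# once per example and reuse them for all masked variants.
teacher_calls = 0
for x in batch:
    targets = teacher_targets(x)        # one teacher pass per example
    for xm in masked_variants(x, N_MASKS):
        pass                            # the student would train against `targets` here

amortized_calls = teacher_calls
print(naive_calls, amortized_calls)     # 32 vs. 4 teacher passes
```

With eight masked variants per example, the teacher runs eight times less often, which is the kind of amortization Meta describes as one source of data2vec 2.0's efficiency.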
Meta hopes that more efficient algorithms like data2vec 2.0 will lead to machines that can understand extremely complex data like the contents of an entire movie.
Examples and code are available on GitHub.
Original article from January 22, 2022:
Meta presents a learning algorithm that enables self-supervised learning for different modalities and tasks.
Most AI systems still learn in a supervised way from labeled data. But the successes of self-supervised learning in large language models such as GPT-3 and, more recently, in image analysis systems such as Meta's SEER or Google's Vision Transformer clearly show that AI systems that autonomously learn the structure of language or images are more flexible and powerful.
However, until now, researchers have needed different training regimes for different modalities, and these are not compatible with each other: GPT-3 completes sentences during training, a vision transformer segments images, and a speech recognition model predicts missing sounds. These systems therefore work with different types of data – sometimes pixels, sometimes words, sometimes audio waveforms. This discrepancy means that research advances for one type of algorithm do not automatically transfer to another.
Meta's data2vec handles different modalities
Researchers at Meta AI Research are now introducing a unified learning algorithm that can be used to train an AI system on images, text, or spoken language. The algorithm is called "data2vec", a reference to the word2vec algorithm, which served as the basis for the development of large language models. Data2vec unifies the training process across the three modalities and matches the benchmark performance of the existing alternatives for the individual modalities.
Data2vec circumvents the need for different training regimens for different modalities with two networks working together. The so-called teacher network first computes an internal representation of, say, an image of a dog; such internal representations are the activations inside the neural network, not its weights. Next, the researchers mask part of the dog image and have the student network compute an internal representation of the masked image as well.
The student network, however, must predict the representation of the complete image. Instead of simply training on more images, as the Vision Transformer does, the student network learns to predict the representations of the teacher network.
Since the teacher was able to process the complete image, over many training passes the student network learns better and better to predict the teacher's representations, and thus the complete images.
Because the student network does not directly predict the pixels in the image, but rather the teacher network's representations, from which the pixels can then be reconstructed, the same method works for other data such as speech or text. This intermediate step of predicting representations makes data2vec suitable for all modalities.
Data2vec aims to help AI learn more generally
Fundamentally, the researchers are interested in more general learning: AI should be able to learn many different tasks, even ones that are completely foreign to it. "We want a machine to not only recognize the animals shown in its training data, but also to be able to adapt to new creatures if we tell it what they look like," Meta's team said. The researchers follow the vision of Meta's AI chief Yann LeCun, who in spring 2021 called self-supervised learning the "dark matter of intelligence".
Meta is not alone in its efforts to enable self-supervised learning for multiple modalities. In March 2021, Deepmind released Perceiver, a Transformer model that can process image, audio, video, and point cloud data. However, it was still trained in a supervised manner.
Then, in August 2021, Deepmind introduced Perceiver IO, an improved variant that can generate a wide variety of outputs from different input data, making it suitable for speech processing, image analysis, or understanding multimodal data such as video. However, Perceiver IO still uses different training regimens for different modalities.
Meta's researchers are now planning further improvements and may consider combining the data2vec learning method with Deepmind's Perceiver IO. Pre-trained data2vec models are available on Meta's GitHub.
Learn more about Artificial Intelligence:
- Really smart AI – three things missing according to Google’s AI chief
- Moffet AI: AI chip startup receives $1 million investment
- History of robots: from Heron to Spot to the future of AI