European companies show the first result of their cooperation: an 80% lighter Aleph Alpha language model.
Large language models like OpenAI’s GPT-3 or Google’s PaLM have over a hundred billion parameters. Even with new insights into the role of training data from DeepMind’s Chinchilla, even larger models are to be expected.
In fact, language models with 1.6 trillion parameters, such as Google’s Switch Transformer, already exist, but they rely on sparse modeling, in Google’s case specifically on a Mixture-of-Experts Transformer architecture.
Whereas with GPT-3, for example, all parts of the neural network are involved in every processing step, sparse models such as the Switch Transformer use methods in which only the parts of the network relevant to the task become active. This greatly reduces the computing power required per query.
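The routing idea behind such Mixture-of-Experts models can be illustrated with a toy sketch. The following NumPy snippet (an illustration, not Google’s implementation; all dimensions and weights are made up) shows Switch-Transformer-style top-1 routing, where each token activates exactly one of several expert feed-forward blocks, so only a fraction of the parameters do work per token:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 4 experts, hidden size 8, 5 tokens.
NUM_EXPERTS, HIDDEN, TOKENS = 4, 8, 5

# Each "expert" is a small feed-forward weight matrix.
experts = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN))
router = rng.standard_normal((HIDDEN, NUM_EXPERTS))  # routing weights

tokens = rng.standard_normal((TOKENS, HIDDEN))

# Switch-style top-1 routing: each token picks exactly one expert,
# so only 1/NUM_EXPERTS of the expert parameters run per token.
logits = tokens @ router
choice = logits.argmax(axis=1)

outputs = np.empty_like(tokens)
for i, t in enumerate(tokens):
    outputs[i] = t @ experts[choice[i]]  # only the chosen expert computes

print(choice)          # which expert each token was routed to
print(outputs.shape)
```

A dense model would instead multiply every token through all expert matrices; here, three quarters of the expert weights stay idle for any given token.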
European AI collaboration shows first results
Google uses sparse modeling in the Switch Transformer to scale up language models. Conversely, it can also be used to train smaller networks that perform similarly to larger models.
That’s exactly what AI chipmaker Graphcore and AI startup Aleph Alpha have now done. The two European AI companies announced a collaboration in June 2022 which aims, among other things, to develop large European AI models. Germany’s Aleph Alpha recently launched Europe’s fastest commercial AI data center.
Aleph Alpha CEO Jonas Andrulis highlighted the advantages of Graphcore hardware for sparse modeling last summer: “The Graphcore IPU provides a new opportunity to evaluate advanced technological approaches such as conditional sparsity. These architectures will undoubtedly play a role in future Aleph Alpha research.”
Graphcore and Aleph Alpha present lightweight Luminous language model
The two companies managed to slim down Aleph Alpha’s “Luminous Base” language model from 13 billion to 2.6 billion parameters. They also showed the lite variant running Lumi, a “conversational add-on” for Luminous.
Sparse modeling removed nearly 80% of the model’s weights while preserving most of its capabilities, according to the press release.
The new model uses pointwise sparse matrix multiplications supported by Graphcore’s Intelligence Processing Unit (IPU) and requires only 20% of the computing power and 44% of the memory of the original model, the companies said.
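How removing weights translates into savings can be sketched with simple magnitude pruning. The snippet below is a toy illustration only (the companies have not published their method; matrix sizes and the pruning criterion are assumptions): it zeroes the 80% of weights with the smallest magnitude, so only the remaining non-zeros need to be stored and multiplied:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense weight matrix standing in for one layer of the original model.
W = rng.standard_normal((64, 64))
x = rng.standard_normal(64)

# Magnitude pruning: zero out the 80% of weights with smallest absolute value.
SPARSITY = 0.8
threshold = np.quantile(np.abs(W), SPARSITY)
mask = np.abs(W) >= threshold
W_sparse = W * mask

kept = mask.mean()  # fraction of weights remaining, roughly 0.2
print(f"weights kept: {kept:.0%}")

# A sparse format stores only (value, index) pairs for the non-zeros,
# and a sparse matmul multiplies only those entries.
values = W_sparse[mask]
print("dense entries:", W.size, "| sparse non-zeros:", values.size)

# The pruned layer still approximates the dense layer's output.
y_dense, y_pruned = W @ x, W_sparse @ x
```

In practice the pruned model is then fine-tuned so the remaining weights compensate for the removed ones, which is what lets most of the capability survive.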
The small size allows the 2.6-billion-parameter model to be kept entirely in the ultra-fast on-chip memory of a Graphcore IPU-POD16 Classic for maximum performance. The model also requires 38% less energy.
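A back-of-envelope check makes the on-chip claim plausible. The figures below are illustrative assumptions, not official benchmark data: FP16 storage for the weights, and roughly 0.9 GB of In-Processor Memory per GC200 IPU, of which an IPU-POD16 contains sixteen:

```python
# Rough estimate: do 2.6B FP16 parameters fit in a POD16's on-chip memory?
params = 2.6e9
bytes_per_param = 2  # assuming FP16 weights
model_gb = params * bytes_per_param / 1e9
print(f"model weights: {model_gb:.1f} GB")

# Assumption: ~0.9 GB In-Processor Memory per IPU, 16 IPUs per POD16.
pod_gb = 16 * 0.9
print(f"IPU-POD16 on-chip memory: {pod_gb:.1f} GB")
print("fits on-chip:", model_gb <= pod_gb)
```

The original 13-billion-parameter model, at roughly five times the size, would not fit under the same assumptions, which is why the pruned variant gains so much from avoiding off-chip memory traffic.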
“Sparsification” central to the next generation of AI models
For the next generation of AI models, “sparsification” will be key, the companies said. It would allow specialized sub-models to handle selected knowledge more effectively.
“This breakthrough in sparsification is impacting the commercial potential of AI companies like Aleph Alpha, enabling them to deliver high-performance AI models to customers with minimal compute requirements,” the press release adds.
Google is also following this path. In October 2021, AI chief Jeff Dean spoke for the first time about the search giant’s AI future: Pathways is one day to become a kind of general-purpose AI system, and it relies on sparse modeling as a central element.