Big AI models could soon get even bigger, much faster


Google introduces a new method that improves Mixture-of-Experts models and cuts their training convergence time in half.

Scaling model size, training data, and other factors has driven major advances in AI research, for example in natural language processing and in image analysis and generation. Researchers have repeatedly demonstrated a direct relationship between scale and model quality.

Ever larger models with hundreds of billions or even trillions of parameters are therefore being developed. To make training such gigantic networks more efficient, some AI companies rely on so-called sparse models.

These models activate only parts of their network for each input, for example to process a single token. Densely trained models like GPT-3, by contrast, activate the entire network for every processing step.


With its Pathways project, Google is pursuing a vision of artificial intelligence that can learn new tasks on the fly and process many modalities. A core element of Pathways is scale, and therefore sparse modeling. In a new paper, Google demonstrates an advance that dramatically improves the training of the Mixture-of-Experts architecture often used in sparse models.

Google has been studying MoE architectures for over two years

In August 2020, Google introduced GShard, a method for parallelizing AI computations. For the first time, it made it possible to train a sparsely activated Mixture-of-Experts model (MoE-Transformer) with 600 billion parameters.

In a standard Transformer block, a single feed-forward network processes every token. In an MoE-Transformer, there are several such networks, the eponymous experts. Instead of passing all tokens through one shared network, each expert processes only certain tokens.
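To make the dispatch concrete, here is a minimal, illustrative sketch (not Google's actual code) of an MoE feed-forward layer in which each token is processed only by the expert it was assigned to:

```python
import numpy as np

def moe_ffn(tokens, experts, assignments):
    """Minimal MoE feed-forward sketch: each token is processed only
    by its assigned expert network, instead of one shared feed-forward
    network. All names and shapes here are illustrative."""
    out = np.empty_like(tokens)
    for e, ffn in enumerate(experts):
        mask = assignments == e
        if mask.any():
            out[mask] = ffn(tokens[mask])   # expert e sees only its tokens
    return out

# Four toy "expert" feed-forward networks (fixed random ReLU projections).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 16)) for _ in range(4)]
experts = [lambda x, W=W: np.maximum(x @ W, 0) for W in weights]
tokens = rng.standard_normal((8, 16))       # 8 tokens, hidden size 16
assignments = rng.integers(0, 4, size=8)    # which expert handles each token
y = moe_ffn(tokens, experts, assignments)
```

In a real MoE-Transformer the assignments come from a learned router and the experts run in parallel on different accelerators; the loop above only illustrates the per-expert dispatch.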

In the MoE-Transformer built with GShard, two experts usually process each token. The intuition: the model cannot learn useful routing if it cannot compare an expert's output with that of at least one other expert.

In Switch Transformer, a router assigns tokens to expert networks. | Image: Google

In January 2021, Google researchers presented the 1.6-trillion-parameter Switch Transformer, also a sparsely activated MoE-Transformer. It has one crucial difference: instead of two or more expert networks per token, the router forwards each token to only one network at a time. Google compares this process to a switch, hence the model's name.
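Both variants described so far, top-2 routing (GShard) and top-1 routing (Switch Transformer), are instances of the same token-choice scheme and can be sketched with one routine. The names and shapes below are illustrative assumptions, not code from the papers:

```python
import numpy as np

def token_choice_routing(tokens, router_weights, k):
    """Token-choice routing sketch: each token independently picks its
    top-k experts (k=2 in GShard, k=1 in Switch Transformer)."""
    # Router scores: one logit per (token, expert) pair.
    logits = tokens @ router_weights                 # (num_tokens, num_experts)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    # Each token selects the k experts with the highest probability.
    chosen = np.argsort(-probs, axis=-1)[:, :k]      # (num_tokens, k)
    return chosen, probs

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))            # 8 tokens, hidden size 16
router_weights = rng.standard_normal((16, 4))    # router for 4 experts
top2, _ = token_choice_routing(tokens, router_weights, k=2)  # GShard-style
top1, _ = token_choice_routing(tokens, router_weights, k=1)  # Switch-style
```

The key property to notice: the number of experts per token is fixed, but nothing constrains how many tokens land on each expert, which is the imbalance problem discussed below.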



In that work, Google showed that the Switch Transformer trains faster and performs better than previous approaches.

Conventional MoE architectures tend to be unbalanced

Now Google has published a new paper that further refines the MoE approach. According to the authors, existing variants like Switch Transformer have some drawbacks: certain expert networks may receive most of the tokens during training, so not all experts get enough training signal.

This leads to a load imbalance in which overloaded experts simply drop tokens that exceed their fixed buffer capacity, so as not to run out of memory. In practice, this leads to poorer results.

In addition, the latency of the whole system is determined by the most heavily loaded expert. A load imbalance therefore also erodes some of the advantages of parallelization.

It would also be useful if an MoE model could flexibly allocate its computation based on the complexity of the input. So far, every token has always been assigned the same number of experts: two in GShard, one in Switch Transformer.

Google demonstrates Mixture-of-Experts with expert choice routing

Google identifies the routing strategy itself as the cause of these drawbacks. Conventional MoE models use token-choice routing, in which each token independently selects a fixed number of experts.

In its new work, Google inverts this choice: in so-called expert choice routing, each expert network selects a number of tokens. This makes routing more flexible and lets it adapt to the complexity of the input tokens.

With expert choice routing, the router assigns different tokens to expert networks. | Image: Google

According to Google, the expert choice routing method achieves perfect load balancing despite its simplicity. It also allows the model's computation to be allocated more flexibly, since each token can be processed by a varying number of experts.
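A rough sketch of the inverted selection, again with illustrative names rather than Google's implementation: each expert picks a fixed number of tokens, so by construction no expert can be overloaded, while a given token may be picked by zero, one, or several experts:

```python
import numpy as np

def expert_choice_routing(tokens, router_weights, capacity):
    """Expert-choice routing sketch: instead of each token picking
    experts, each expert picks its top-`capacity` tokens. Load is
    perfectly balanced by construction. Illustrative names only."""
    logits = tokens @ router_weights                 # (num_tokens, num_experts)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    # Each expert (a column of probs) picks its `capacity` best tokens.
    chosen = np.argsort(-probs, axis=0)[:capacity, :]  # (capacity, num_experts)
    return chosen

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))            # 8 tokens, hidden size 16
router_weights = rng.standard_normal((16, 4))    # router for 4 experts
picked = expert_choice_routing(tokens, router_weights, capacity=2)
# Every expert processes exactly `capacity` tokens: no overload, no drops.
```

This is why the method balances load "despite its simplicity": the balance is not learned or regularized, it falls out of the selection direction.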

In a comparison with Switch Transformer and GShard, Google shows that the new method converges in training more than twice as fast. With the same computational budget, it also achieves better results when fine-tuning on eleven selected tasks from the GLUE and SuperGLUE benchmarks. Despite a lower activation cost, the method also outperforms the densely trained T5 model on seven of the eleven tasks.

The team also shows that expert choice routing assigns most tokens to one or two experts, 23 percent to three or four, and only about 3 percent to more than four experts. According to the researchers, this supports the hypothesis that expert choice routing learns to assign a variable number of experts to each token.

"Our expert-choice routing approach enables heterogeneous MoE with simple algorithmic innovations. We hope this will lead to further advancements in this area, both at the application and system level," the researchers write.
