Galactica is an open source language model for scientific advancement


Summary

The Galactica large language model (LLM) was trained on millions of pieces of academic content. It is designed to help the research community better manage the “information explosion”.

Galactica was developed by Meta AI in collaboration with Papers with Code. The team identified information overload as a major obstacle to scientific progress. “Researchers are buried under a mass of papers, increasingly unable to distinguish between the meaningful and the inconsequential.”

Galactica is designed to help sort through scientific information. It was trained on 48 million papers, textbooks and lecture notes, millions of compounds and proteins, scientific websites, encyclopedias and more from the “NatureBook” dataset.

Language models as a new search interface

Galactica can store, combine and reason about scientific content, the research team explains. On reasoning benchmarks, it clearly outperforms larger language models: 41.3 percent versus Chinchilla’s 35.7 percent on mathematical MMLU, and 20.4 percent versus PaLM 540B’s 8.8 percent on MATH.


On technical knowledge probes such as LaTeX equations, Galactica outperforms GPT-3 with 68.2 percent versus 49.0 percent. Galactica also sets new records on the biomedical question-answering benchmarks PubMedQA (77.6 percent) and MedMCQA (52.9 percent).

Image: Galactica/Meta AI

Additionally, Galactica beats the large open-source language models BLOOM and OPT-175B on the BIG-bench benchmark for general language tasks, despite not being optimized for them. According to the team, the generated texts are also significantly less toxic than those of other open-source language models.

“We suspect this result reflects the higher quality of the Galactica corpus, stemming from the fact it is curated and primarily academic text,” the team writes. “Previous open LLM efforts likely focused too much on scale goals and underinvested in data filtering.”


Galactica generates less toxic content than other major language models. | Image: Galactica/Meta AI

As specific applications, the Galactica team mentions generating literature reviews, wiki articles and lecture notes on scientific topics, as well as answering scientific questions, including with citations.

When asked what a “transformer network” is, Galactica generates the following brief explanation with bibliographic references, including links to articles.

Galactica can explain scientific terms and provide citations. | Image: Meta AI / Galactica
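For readers who want to try this themselves, the snippet below is a minimal sketch of prompting a Galactica checkpoint for such an explanation. It assumes the facebook/galactica-1.3b weights published on Hugging Face and the standard transformers API; the prompt wording and generation settings are illustrative, not Meta AI’s own setup.

```python
# Minimal sketch: prompting a Galactica checkpoint for a short explanation.
# Assumes the facebook/galactica-1.3b weights on Hugging Face; prompt and
# decoding settings are illustrative, not Meta AI's official configuration.
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

prompt = "What is a Transformer network?\n\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy decoding keeps the example deterministic; real use may prefer sampling.
outputs = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```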

The model also offers a kind of paper search: you describe the content of a paper and it returns a possibly matching one. It can also look up specific mathematical formulas, describe them in natural language, or suggest citations. For citation suggestions, however, accuracy only reaches between 36.6 and 69.1 percent depending on the test dataset, and the model shows a bias towards well-known papers.

Paper search. | Image: Galactica/Meta AI
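The citation-suggestion behavior relies on a special reference token in Galactica’s prompt format. The sketch below, again assuming the Hugging Face checkpoint, shows the general idea: ending a prompt with the [START_REF] marker cues the model to complete it with a reference. This mirrors the prompt format described in the paper rather than an official search API.

```python
# Sketch: citation suggestion via Galactica's [START_REF] prompt token.
# Assumes the facebook/galactica-1.3b checkpoint on Hugging Face.
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

# Ending the prompt with [START_REF] asks the model to emit a citation.
prompt = "The Transformer architecture [START_REF]"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(outputs[0]))
```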

Lots of room for improvement

“We believe these results demonstrate the potential of language models as a novel interface for science,” the researchers write. Galactica, they say, is just the first leg of that journey.

In their paper, the team outlines many opportunities for improvement, including the use of additional academic sources that are not publicly available and multimodal training with non-text data such as protein models.

Demo video for Galactica. | Video: Galactica / Meta AI

“Taken together, we believe there is strong potential for language models to support knowledge tasks that are currently human specialties,” the researchers write. They describe their ultimate vision as a single neural network for all scientific tasks, acting as the “next interface” for accessing knowledge.

In total, the team trained five Galactica models ranging from 125 million to 120 billion parameters. According to the team, Galactica’s performance improves smoothly with model size. All models are open source and freely available on GitHub.
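The GitHub release also ships a small Python package for loading the released model sizes. The snippet below is a rough sketch based on that package’s documented usage; the package name galai, the load_model function and the size names (“mini” through “huge”) are taken from the repository’s README and should be treated as assumptions if the project has since changed.

```python
# Sketch: loading one of the five released Galactica sizes via the galai
# package from the GitHub release. Size names follow the repository README
# ("mini", "base", "standard", "large", "huge") and are assumptions here.
import galai as gal

# "mini" is the 125M-parameter model; "huge" would be the 120B-parameter one.
model = gal.load_model("mini")

# Free-form generation from a scientific prompt (illustrative prompt).
print(model.generate("The reason why Transformers replaced RNNs is"))
```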
