Sep. 20—One of the biggest challenges with artificial intelligence today is the quality of data. Many models were trained on the internet, full of falsehoods and lies. This is particularly a problem in science, where clean data is paramount — the exact problem that Sarah Dreier, an assistant professor of political science at the University of New Mexico, is going to be working on.
Dreier is joining a team of researchers hoping to develop truly open AI models that will accelerate scientific discovery. The work is being done as part of the Open Multimodal AI Infrastructure to Accelerate Science, or OMAI, project. While it is a “tough task,” the project aims to create “more transparent, more open and more flexible” AI models, Dreier said.
“The engineers (who) are training these models, they don’t know what the data is,” Dreier said. “They’re not reading unfathomably large amounts of text to feed into their model.”
Led by the Allen Institute for AI, the $152 million project will create a fully open suite of advanced AI models designed to support the U.S. scientific community. It is also meant to support the broader action plan set forth by the White House to ensure the country produces leading models that enhance its footing for global AI dominance. The U.S. National Science Foundation and Nvidia Corp. are funding the project with $75 million and $77 million, respectively.
The use of AI has gained widespread popularity in recent years, with companies such as OpenAI and Google touting hundreds of millions of monthly users, and with industries rapidly evolving thanks to the technology’s application in areas such as automation and product innovation.
While the multimillion-dollar project includes a group of computer science-focused investigators, Dreier is the only social scientist involved. Using a $600,000 funding allocation, Dreier is aiding the lead investigator in the curation process by thinking “expansively” about different kinds of data with which they can train their open models, specifically those relevant to tasks scientists would need the technology for, like analyzing research or generating code.
The OMAI project is led by Noah Smith, senior director of natural language processing research at the Allen Institute and professor of machine learning at the University of Washington.
“This funding will provide critical infrastructure — advanced computing systems, open-source models, and tools — that will enable researchers across partner universities, including the University of New Mexico, to accelerate breakthroughs in fields ranging from energy to biology,” Smith wrote to the Journal.
Large language models are a category of machine learning models trained on immense amounts of data, making them capable of understanding and generating natural language and performing a wide range of tasks.
Smith said many large language models are “closed,” meaning the data and tools used to train them are kept private. This limits the technology’s outputs and, he said, creates a real barrier for science because it restricts a user’s ability to inspect, adapt or build upon the models.
“Open models are essential for transparency, reproducibility, and collaboration — the core of how scientific progress happens,” Smith said.
Along with Dreier, other researchers working on the five-year project grant include UW computer science and engineering associate professor Hanna Hajishirzi; University of Hawai’i at Hilo computer science associate professor Travis Mandel; and University of New Hampshire computer science assistant professor Samuel Carton.
“Obviously, as a social scientist, I’m going to be thinking most immediately (about) the kinds of data that could be useful to political scientists, sociologists — the kinds of data that we would, in theory, want our large language models to be trained on if we were going to rely on those models in order to help us with our scientific pipeline,” Dreier said.
Dreier was a post-doctoral research fellow in Smith’s lab at UW before coming to UNM. She said the pair has continued working together since she left, specifically on tasks related or similar to the OMAI project.
Smith said developing fully open AI models tackles two intertwined challenges: advancing the science of AI and applying the technology to accelerate discoveries in the broader fields of science and engineering. Overall, he said the Allen Institute aims to help those working in AI research to advance the field in a more “transparent and trustworthy” direction.
“Our models will help scientists in other fields to be able to process and analyze vast amounts of research, generate code and visualizations, and connect new insights to past discoveries,” Smith said. “In practice, that means faster breakthroughs in areas like materials science, protein function prediction, and energy research.”