Unsupervised Learnings of Protein Large Language Models for Make Benefit Glorious Sector of Biotech

Protein language models were nurtured by unlikely parents: corporations. Now that they have come of age, they have been forced to strike out on their own. A common pitfall for biotechnology platforms is attempting to solve as many problems as possible, all at once, while in reality solving none. Whether these fledgling protein LLM companies will learn from the mistakes of their industry predecessors remains to be seen.

Like many in the protein design space, I have been both amused and confused by the development of AI-based protein design tools by big-name tech companies. First there was Google DeepMind's AlphaFold, a revolutionary tool that predicts a protein's structure from its sequence alone [1]. For the first time, a computational tool could predict the shape of a protein, an important determinant of its function, with accuracy comparable to expensive and time-consuming experimental methods. Google had both ample motivation, considering the general importance of protein structure prediction to biology and biology-adjacent fields, and a track record of dabbling in weird side quests, namely DeepMind's previous work creating AlphaGo, AlphaZero, and AlphaStar. Compared to an AI that plays StarCraft, a protein structure prediction tool did not seem too bizarre.

It was not clear to me that a trend was emerging. But when Meta, Salesforce, and ByteDance published their own models [2,3,4], it became apparent something was afoot, particularly given that all three groups created protein language models (think ChatGPT, but for protein sequences). Companies are incentivized to do things that result in profit, so what was in it for them? Are big tech companies doing this out of a sense of corporate responsibility, feigned corporate responsibility, a fear of missing out, a hubristic notion that they can "solve biology," all of the above, or none of the above? Is ByteDance trying to design proteins to do a TikTok dance? Admittedly, I still have no clue. But recent developments in the space provide some leads and, more importantly, a glimpse into the biotechnology landscape.

By "Albin Hartwig"
An understanding of how these models work can answer a key question: Why are big tech companies building state-of-the-art protein language models rather than research universities or other actors? In short, corporations are uniquely positioned thanks to in-house expertise and infrastructure. It is not that university labs are unable to create protein language models; in fact, important deep learning models, such as ProteinMPNN and ProtGPT2, were developed and trained in academic labs [5,6]. Many protein large language models (LLMs) are trained using the same transformer architecture behind the recent explosion in chatbots like ChatGPT, which can converse with humans, respond to queries, and write computer code. Instead of text, protein sequences are used to train these models. While a transformer is made up of many components, self-attention is the most critical, enabling the model to learn long-range dependencies in the data. In the case of protein sequence data, understanding these dependencies is vital to decoding protein structure: amino acids far apart in the sequence that covary with one another are likely to sit close together in the folded 3D structure. Unlike AlphaFold, these amino acid contacts can be learned by scaling up the number of parameters in a transformer model, both speeding up the time it takes to predict a protein structure and simplifying the underlying prediction algorithm [2]. Once trained, these protein LLMs can be sampled to generate sequences designed to fold or function like specific types of natural proteins while sharing little sequence similarity with them [6,7].
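As a rough illustration, and not any particular model's implementation, a single self-attention head over a toy embedded protein sequence can be sketched in a few lines of NumPy. The key property is that every residue attends to every other residue, which is what lets these models pick up long-range covariation; the sequence length, embedding size, and random weights below are arbitrary placeholders:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """A single attention head: every residue attends to every other residue."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])        # pairwise position scores
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over positions
    return weights @ v                             # context-aware embeddings

rng = np.random.default_rng(0)
seq_len, dim = 8, 16                     # toy example: 8 residues, 16-dim embeddings
x = rng.normal(size=(seq_len, dim))      # embedded amino acid sequence
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (8, 16): same sequence length, now context-aware
```

In a real protein LLM, many such heads are stacked in many layers, and the weight matrices are learned from millions of natural sequences rather than drawn at random.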
As it turns out, the relationships between amino acids in a protein sequence can be represented in much the same way as the relationships between words in human sentences, lowering the barrier to entry for companies that already have an incentive to develop large language models for other tasks. Training cost is another obstacle these companies can largely remove. ProtGPT2, developed at the University of Bayreuth, required 128 NVIDIA A100s training over four days. At on-demand prices current as of this writing, it would cost roughly $43,000 to train, not to mention the difficulty of obtaining A100 instances given the current high demand for GPUs [6]. Having in-house capacity to scale up these models sidesteps these concerns.
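That figure is easy to sanity-check. Assuming an illustrative on-demand rate of about $3.50 per A100-hour (actual rates vary by provider and over time), the back-of-the-envelope arithmetic lands near the cited cost:

```python
# Back-of-the-envelope check of the ProtGPT2 training cost cited above.
gpus = 128                # NVIDIA A100s used for training
hours = 4 * 24            # four days of training
rate_per_gpu_hour = 3.50  # assumed on-demand $/A100-hour; varies by provider
gpu_hours = gpus * hours
cost = gpu_hours * rate_per_gpu_hour
print(f"{gpu_hours} GPU-hours ≈ ${cost:,.0f}")  # 12288 GPU-hours ≈ $43,008
```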
Nevertheless, these factors only explain why development is relatively easy for large tech companies, not why they would want to build protein LLMs in the first place. One reason may simply be curiosity. Many of the authors have ties to academia. Quanquan Gu, director of AI research at ByteDance and a co-creator of ByteDance's protein LLM, LM-Design, is an associate professor at UCLA. Alex Rives, former scientific lead of the Meta Evolutionary Scale Modeling (ESM) team that developed Meta's protein LLM, has an offer to join the Broad Institute of MIT and Harvard (more on this later) [8]. But as 2022 and 2023 rolled on, the fates of each project, and of their associated teams, diverged.

The first break came in mid-2022, when Ali Madani, who led the machine learning research initiatives at Salesforce (including the effort that produced Salesforce's protein LLM, ProGen), launched a separate company. Madani's Profluent Bio aims to commercialize its protein design LLM for novel protein design [9]. Next came the disbanding of the ESM team at Meta. "Meta has tried to align its research strategy to understand more how to create advanced intelligence that can help Meta as a business, rather than just some curiosity projects," according to a former scientist and manager who was part of the ESM team [10]. Alex Rives, whom I mentioned earlier, went on to found EvolutionaryScale with a founding staff of eight former ESM members (though there is speculation he may leave to take the aforementioned faculty position, given his role as interim CEO). In the meantime, Isomorphic Labs, which spun off from DeepMind in 2021, continues to work closely with DeepMind to use AlphaFold to speed up drug discovery [8,12]. While their specific paths differed, each of the three aforementioned groups ended up becoming separate businesses, with the accompanying pressure to launch a product and make money in their own right.

Screenshot of the "Get to Know Ginkgo" promotional video featuring a Tyrannosaurus rex for no apparent reason.
However, each company risks making a common mistake in the biotechnology sector, one especially common among platform technologies: a lack of sufficient focus on a specific problem or product.
"We're just a biological speculation / Sittin' here, vibratin' / And we don't know what we're vibratin' about"

Drew Endy opened with these lines from "Biological Speculation" by Funkadelic in a 2014 TED Talk meant to advocate for a better genetically engineered future [12]. That same year, Ginkgo Bioworks became the first biotechnology startup to join Y Combinator under CEO Jason Kelly, a former student of Endy's [13]. Almost 10 years later, the company still does not know what it's vibratin' about.

In 2021, Ginkgo Bioworks was sued following a report from activist investment firm Scorpion Capital claiming it misrepresented revenue statements after its acquisition by the special purpose acquisition company (SPAC) Soaring Eagle Acquisition Corp., which took Ginkgo Bioworks public [14,15]. At the heart of the allegations was the claim that Ginkgo spun out companies, such as Allonnia, Joyn Bio, and Motif Foodworks, which then became its customers. The investments in these companies would then be used to pay for Ginkgo's biofoundry services. In short, the net effect was that Ginkgo paid companies to be its customers, artificially inflating its revenues. After the initial report from Scorpion Capital in 2021, Ginkgo Bioworks was trading at about $9.50 per share [15]. At the time of this writing, two years later, it is down to $1.53 per share.

I am skeptical that simple greed alone explains why Ginkgo misrepresented revenues. Instead, the fraud seems meant to conceal a critical weakness in the business, one that I hear whenever Jason Kelly pitches biotech startups on forgetting about building their own labs and contracting their work out to Ginkgo, which is essentially a platform company without customers. The lack of focus is on display everywhere, from the company's website tagline ("What if you could grow anything") [a] to its promotional videos [b]. While diversification might be acceptable in mature markets for companies with a proven track record, Ginkgo neither has a proven track record nor operates in mature markets. What Ginkgo does have is hype, distracting institutional and lay investors alike with grand visions of the future. Ginkgo aims to create probiotics, remediate wastewater, manufacture cannabinoids, ferment food proteins…. Vibratin' about everything in the end means vibratin' about nothing.

[a] www.ginkgobioworks.com/about/
Ginkgo's biofoundry is quite a different platform from a protein LLM, but they share critical similarities. Both are "halfway platforms," in that neither is sufficient to create functional products by itself. And both, crucially, have not done the hard work of applying the platform toward a market-vetted product. One example of a company that has is Neoleukin Therapeutics, spun out of the University of Washington's Institute for Protein Design. The Institute under David Baker, with its general expertise in protein design, collaborated with Chris Garcia's group at Stanford, which has specific expertise in cytokine therapeutics, to create a novel designed protein with not only enhanced binding to a cancer-relevant receptor but also secondary properties, such as thermostability and solubility, that are otherwise inaccessible through standard directed evolution methods [16]. Both Ginkgo and the new protein LLM startups in the space would benefit from focus and specificity in their early days, before branching out to fulfill their greater missions, lest they become as extinct as the T. rex.

[b] https://www.youtube.com/watch?v=Lvp2Kw3hjt8