Notes on end-to-end biology

Jose Luis Ricon

2023-01-26; Last updated: 2023-01-26
Wordcount: 6090 | Reading time: 33 min
• Research • AI • Biology •

Summary

Cost reductions in biological data collection and advances in tools to probe ever deeper into biology might soon revolutionize drug discovery. This has been said before (for decades), but what if this time is different?

Initially I wanted to write a longer piece on the broad topic of "Bio and ML" but it started to grow too many threads, getting into predicting ADME, the reproducibility and translatability of animal research, and how optimistic we should be about organoids. Each of these could be its own post. Instead I'll make some high-level points and point to a number of recent writings that collectively express what I wanted to say.

We are far from understanding all of biology, but that's okay

Biology is hard to understand. Human biology is even harder because of ethical considerations around experimentation with human subjects. This makes drug development a really hard problem!

Usually, the way we solve problems is first understanding the domain where the problem is and then thinking of a solution that makes use of that understanding. In drug development, it's understanding the function of genes, proteins, small molecules, and their interactions. Drug development tends to start with the assumption that some biological entity (the target, usually a protein) is involved in a disease, then trying to find ways to modulate that protein that are safe and that can be packaged in a pill. But it doesn't have to be like that; one could in principle take cells (or ideally a whole organism that's diseased), compare them to healthy cells, try a million perturbations, then pick what works best. Not a new idea, this is what is known as phenotypic screening. It may be harder to do but the result is cleaner: rather than asking "will molecule X bind to protein Y" one asks "will this perturbation make the cell healthier" which is closer to what we want (making an organism healthier). A recent commentary from Scannell et al. (2022) is of the same opinion: better initial screening buys you a lot:

In parallel, we suspect that much of the pharmaceutical industry sometimes made the wrong technological trade-offs because it had not understood the quantitative power of predictive validity. It sometimes embraced discovery methods with measurably high throughput and low unit costs, whose benefits were offset by less measurable falls in predictive validity. A clear example is antibacterial R&D. In vivo phenotypic screens of a few hundred compounds, circa 1930, were more productive than target-based screens of ~10^7 compounds in the late 1990s and early 2000s.
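The trade-off between throughput and predictive validity can be made concrete with a toy calculation. This is a minimal sketch under assumptions that are mine, not the commentary's: true compound quality and screen score are standard normal with correlation equal to the screen's predictive validity, and the best score among n compounds is roughly sqrt(2 ln n). The validity values (0.5 and 0.1) are hypothetical and chosen only for illustration.

```python
import math


def expected_true_quality(n_compounds: int, predictive_validity: float) -> float:
    """Rough expected true quality (in standard deviations above the mean)
    of the top-scoring compound in a screen.

    Toy assumptions: quality and screen score are standard normal with
    correlation `predictive_validity`; the best of n standard normal scores
    is approximated by sqrt(2 * ln n).
    """
    if n_compounds < 2:
        raise ValueError("need at least two compounds to pick a best one")
    best_score = math.sqrt(2.0 * math.log(n_compounds))
    # Regression to the mean: only a fraction of the score carries over
    # to true quality, and that fraction is the predictive validity.
    return predictive_validity * best_score


# Hypothetical validity values, not measured ones.
in_vivo_screen = expected_true_quality(n_compounds=300, predictive_validity=0.5)
target_based_screen = expected_true_quality(n_compounds=10**7, predictive_validity=0.1)

print(f"300 compounds, validity 0.5:  {in_vivo_screen:.2f} SD")
print(f"10^7 compounds, validity 0.1: {target_based_screen:.2f} SD")
```

Under these assumptions the small, high-validity screen comes out at about 1.7 standard deviations against about 0.6 for the huge, low-validity one: the number of compounds only enters through a slowly growing square root of a logarithm, while validity enters linearly.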

Once one buys into this one can take it to the next level: Why not train an ML model that predicts efficacy? In theory the triplet (healthy state, diseased state, perturbation) in enough numbers is all one needs. In theory.
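To make the triplet idea concrete, here is a minimal sketch of the non-ML baseline: score each perturbation by how much of the gap between the diseased and healthy state it closes. The three-number "cell states", the compound names, and the Euclidean distance are all assumptions for illustration; a learned model would be trained on many such triplets to predict this score for perturbations that have not been tried.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two cell-state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_perturbations(
    healthy: Vector, diseased: Vector, perturbed: Dict[str, Vector]
) -> List[Tuple[str, float]]:
    """Rank perturbations by the fraction of the diseased-to-healthy gap closed.

    1.0 means the perturbed cells look healthy, 0.0 means no improvement,
    and negative values mean the cells ended up further from healthy.
    """
    gap = distance(diseased, healthy)
    if gap == 0:
        raise ValueError("healthy and diseased states are identical")
    scores = [
        (name, 1.0 - distance(state, healthy) / gap)
        for name, state in perturbed.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical readouts (e.g. three summary features of a cell population).
healthy_state = [1.0, 0.0, 0.5]
diseased_state = [0.0, 1.0, 0.5]
after_perturbation = {
    "compound_A": [0.9, 0.1, 0.5],  # moves most of the way back to healthy
    "compound_B": [0.0, 1.0, 0.5],  # does nothing
    "compound_C": [0.5, 0.5, 2.0],  # partial rescue plus a large off-axis change
}

ranking = rank_perturbations(healthy_state, diseased_state, after_perturbation)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

Note that no target appears anywhere in this sketch: the only inputs are states and perturbations, which is the point of the phenotypic framing.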

Of course, anytime one talks about ML for bio one is reminded of the opinion of industry veterans that have seen ML in bio hype for years (decades?) without much being delivered. Whenever there's a new seemingly breakthrough paper there are many "but-s" that get raised (Has AI discovered a drug? De novo computational generation of antibodies? Protein folding?).

For reflection, here’s a quote about computer-aided drug discovery (CADD), highlighting its importance and impact:

“Drug companies know they simply cannot be without these computer techniques. They make drug design more rational. How? By helping scientists learn what is necessary, on the molecular level, to cure the body, then enabling them to tailor-make a drug to do the job… This whole approach is helping us avoid the blind alleys before we even step into the lab… Pharmaceutical firms are familiar with those alleys. Out of every 8,000 compounds the companies screen for medicinal use, only one reaches the market. The computer should help lower those odds … This means that chemists will not be tied up for weeks, sometimes months, painstakingly assembling test drugs that a computer could show to have little chance of working. The potential saving to the pharmaceutical industry: millions of dollars and thousands of man-hours”

What’s great about this quote is that you can hear its echo in current Silicon Valley tech-solves-biotech pitches, but it was from a Discover magazine article in August 1981 called “Designing Drugs With Computers”. (Four Decades Of Hacking Biotech And Yet Biology Still Consumes Everything, 2017)

Companies that make "Designing drugs with AI" their selling point like Atomwise (2012), Recursion (2013), Schrödinger (1990), or Exscientia (2012) have been around for a while. At least one of them (Schrödinger) has delivered some approved drugs, but the vast majority of drugs developed and approved are still not coming from "throw data at a model and get drugs at the other end". Good thinking and exhaustive experimentation continue to be, to this day, what gets drugs approved, not fancy computational modeling and data alone.

At the same time, at every moment in the history of a field, there is a recurring question: Is this time different? Or is this time like the previous 1000 times?

There's a critique of current work on AI expressed as variations on the argument: "Look, some such systems are impressive as demos. But the people creating the systems have little detailed understanding of how they work or why. And until we have such an understanding we're not really making progress on AI." This argument is then sometimes accompanied by (often rather dogmatic) assertions about what characteristics science "must" have.

I have some instinctive sympathy for such arguments. My original field of physics is full of detailed and often rather satisfying explanations of how things work. So too, of course, are many other fields. And historically new technologies often begin with tinkering and intuitive folk models, but technological progress is then enabled by greatly improved explanations of the underlying phenomena. You can build a sundial with a pretty hazy understanding of the solar system; to build an atomic clock requires a deep understanding of many phenomena.

Work on AI appears to be trying to violate this historic model of improvement. Yes, we're developing what seem to be better and better systems in the tinkering mode. But progress in understanding how those systems work seems to lag far behind. [...]

The underlying thing that's changed is the ease of trying and evaluating systems. If you wanted to develop improved clocks in the past you had to laboriously build actual systems, and then rigorously test them. A single new design might take months or years to build and test. Detailed scientific understanding was important because it helped you figure out which part of the (technological) design space to search in. When each instance of a new technology is expensive, you need detailed explanations which tell you where to search.

By contrast, much progress in AI takes a much more agnostic approach to search. Instead of using detailed explanations to guide the search it uses a combination of: (a) general architectures; (b) trying trillions (or more) of possibilities, guided by simple ideas (like gradient descent) for improvement; and (c) the ability to recognize progress. This is a radically different mode of experimentation, only made possible by the advent of machines which can do extremely rapid symbol manipulation. (The role of explanation in AI, Michael Nielsen's notes)

I'm not the first to notice some similarity between the research aesthetics of studying neural networks and studying biology. Chris Olah is optimistic about some deeper level of understanding of neural networks. I don't know how optimistic to be about that, but I am certainly more optimistic about that than about the interpretability of biological systems; a point made beautifully in Can a biologist fix a radio and Could a neuroscientist understand a microprocessor?. In artificial neural networks one has very simple entities (neurons that obey simple functions, or layers that perform easy-to-understand operations), a perfectly defined wiring diagram for the network, and the ability to run all the experiments we want on the neural net itself. Contrast this with biology, where we have aggregates of squishy bags of molecules (cells) bouncing against each other, each different from the rest (hyaluronic acid is very different from collagen whereas all neurons in an ANN are basically the same), and where the way they interact is not given and has to be studied, often indirectly, as we can't inspect the state of a cell as easily as we could that of an artificial neural network. And of course, to make it even worse, the components of biological systems (cells) behave differently in isolation (in a petri dish) than they do in the context of an organ.

So yes, I continue to be not super optimistic about understanding biology! I have written here and there about what it means to understand something in biology and asked this same question to my Twitter followers in 2021. Even before that, Bert Hubert was already writing in 2019 that maybe biology will never be understood, and that our only hope would then be to "Gather everything we learn into first-class quality databases that might enable computers to make sense of what we have learned".

We can debate what "understanding" means endlessly but I find it more practical to discuss what experiment to do next or what kind of data to gather, and this is driven by what one thinks the road to solving the problems we care about looks like. From the point of view of the task "predicting protein structure from its sequence", one could do experiments where we isolate tiny bits of proteins and study how those fold, and attempt to derive rules to predict this, perhaps understanding some aspects of the process. This does work to some extent: we have learned that proteins do have smaller subcomponents (motifs, or domains). We have also learned that there are different kinds of proteins (globular, disordered, or fibrous), and one can make reasonable guesses about the distribution of polar and apolar residues in a globular protein (apolar amino acids will be found near the center of the protein). But the road to solving protein folding did not involve eventually discovering some sort of Navier-Stokes equations which can be derived from first principles and govern the behavior of the system reasonably well; no, what happened was that a lot of data and massive compute were thrown at the problem and, with some caveats, solved it. Given the nature of the problem, it seems deeply unlikely that we will ever find such simple laws in biological systems except in toy examples.

If one believes this then instead of looking at the problem with our reductionism glasses on, we should take a pair of holistic glasses instead: black box the problem; collect data where we can and let ML figure out the complex pattern of relationships between inputs and outputs.

If you happen to work in an ML-heavy domain outside biology you'll probably be nodding along, but there are some issues with this approach which I'll discuss in a bit.

Protein folding prediction: how useful?

I think for people that are not working in the life sciences, AlphaFold might have changed their view on how optimistic they should be about radically accelerating drug development. For people working in drug discovery, the update might have been small. Both views have something to them: thinking in the short term, indeed solving protein folding doesn't do much to help the current process to find new drugs. In the longer term (and people from outside an industry, and especially those working in tech, might be more likely to take the long/high level view), AlphaFold can indeed be taken as a harbinger of future transformational change.

Mohammed AlQuraishi's commentary on AlphaFold's release back in 2018 continues to be the best summary of the reaction of the protein folding community to DeepMind's monumental achievement. It's a combination of amazement (“What just happened?”)

I don’t think we would do ourselves a service by not recognizing that what just happened presents a serious indictment of academic science. There are dozens of academic groups, with researchers likely numbering in the (low) hundreds, working on protein structure prediction. We have been working on this problem for decades, with vast expertise built up on both sides of the Atlantic and Pacific, and not insignificant computational resources when measured collectively. For DeepMind’s group of ~10 researchers, with primarily (but certainly not exclusively) ML expertise, to so thoroughly route everyone surely demonstrates the structural inefficiency of academic science. This is not Go, which had a handful of researchers working on the problem, and which had no direct applications beyond the core problem itself. Protein folding is a central problem of biochemistry, with profound implications for the biological and chemical sciences. How can a problem of such vital importance be so badly neglected? [...]

What is worse than academic groups getting scooped by DeepMind? The fact that the collective powers of Novartis, Pfizer, etc, with their hundreds of thousands (~million?) of employees, let an industrial lab that is a complete outsider to the field, with virtually no prior molecular sciences experience, come in and thoroughly beat them on a problem that is, quite frankly, of far greater importance to pharmaceuticals than it is to Alphabet. It is an indictment of the laughable “basic research” groups of these companies, which pay lip service to fundamental science but focus myopically on target-driven research that they managed to so badly embarrass themselves in this episode.

With measured caution, even in his later commentary of the improved AlphaFold2 results:

Drug discovery?

I will end this section with the question that gets asked most often about protein structure prediction—will it change drug discovery? Truthfully, in the short term, the answer is most likely no. But it’s complicated.

One important thing to note is that, of the entire drug development pipeline, the early discovery stage is just that, an early stage. Even if crystallography were to become fast and routine, it would still not fundamentally alter the dynamics of drug discovery as it is practiced today, as most of the cost is in the later stages of drug development beyond medicinal chemistry and well into biology and physiology. Reliable protein structure prediction doesn’t change that.

But at the end of that section there is one comment made which might not be noticed at first because it's almost a remark made in passing:

we can imagine a future in which drugs are designed for their polypharmacology, i.e., to modulate multiple protein targets intentionally. This would very much be unlike conventional medicinal chemistry as practiced today where the emphasis is on minimizing off-targets and making highly selective small molecules. Drugs with designed polypharmacology may be able to modulate entire signaling pathways instead of acting on one protein at a time. There have been many fits and starts in this space and there is no reason to believe that a change is imminent, especially because the systems challenges of the equation remain formidable. Wide availability of structures may hasten progress however.

AlQuraishi's comment is a hint of what I think the future of drug discovery will look like: moving beyond the idea of the target to drugging cell or organism state itself.

Drug discovery and self-driving cars

The ultimate goal of the biomedical enterprise (academia, startups, and big pharmas) is the improvement of human health. The proxy goal for that goal is drug discovery (and development) and the proxy for "drug" is, historically, orally available single-target small molecule inhibitors of some protein (or agonists for some receptor). Put differently: until recently, the standard way of thinking about how to address a disease has been:

1. Understand a disease: Have an idea of what's going on, find the pathways involved, examine human genetics to find correlations between genes and disease incidence.
   1. Example: Learning that the mevalonate pathway is involved in cholesterol synthesis, and that LDL cholesterol is a driver of heart disease
2. Find a target: a protein (usually) that is involved in a disease and whose activity can potentially be modulated (turned up or down). It tends to be easier to block the action of a protein than to enhance it. (Example: HMG-CoA reductase)
3. Find a small molecule that delivers the desired effect, e.g. one that binds to the catalytic domain of an enzyme to inhibit it
   1. Usually one tries lots of compounds (high throughput screening), then picks promising ones and tweaks them until one seems to work well.
   2. Here one can also do some ML to speed up e.g. docking calculations
   3. Example: Atorvastatin, which inhibits the action of HMG-CoA reductase, thus reducing cholesterol synthesis and lowering LDL downstream
4. Ensure that said small molecule can be taken orally. There are various rules of thumb here like Lipinski's rules to guess if a molecule will be "druglike".
   1. This is not always the case; some drugs are injected so no need to worry about gut absorption then. Vaccines are the clearest example.
5. Ensure that said small molecule doesn't have side-effects
   1. Example: Statins do have side effects, but they are considered minimal in relation to the benefit of the drug. Nonetheless the search for even safer interventions has led to other LDL-lowering drugs like PCSK9 inhibitors.
   2. Example 2: A gamma secretase inhibitor (for Alzheimer's treatment) caused increased skin cancer, so even if it treated the disease it's not on net worth using
6. Profit!
   1. Example: Lipitor (atorvastatin) generated billions of dollars of revenue for Pfizer over the last 20 years
   2. Example 2: Even when drugs don't end up making it all the way to the clinic (as happened with Sirtris) you can profit too sometimes, what's not to like! /jk
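As an illustration of the kind of rule of thumb used in the oral-availability step, here is a minimal sketch of Lipinski's rule of five. It assumes the four descriptors have already been computed elsewhere (in practice a cheminformatics toolkit would do that); the `Descriptors` class, the function names, and the two example molecules are hypothetical. Allowing at most one violation is a common convention that I am assuming here, not something stated in the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    molecular_weight: float  # daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> List[str]:
    """Return the list of rule-of-five criteria that the molecule breaks."""
    violations = []
    if d.molecular_weight > 500:
        violations.append("molecular weight > 500 Da")
    if d.log_p > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    return violations


def looks_druglike(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: tolerate at most `max_violations` broken criteria."""
    return len(lipinski_violations(d)) <= max_violations


# Made-up descriptor values, not measurements of any real compound.
small_candidate = Descriptors(350.0, 2.5, 2, 5)
bulky_candidate = Descriptors(900.0, 6.2, 7, 14)

print(looks_druglike(small_candidate), lipinski_violations(small_candidate))
print(looks_druglike(bulky_candidate), lipinski_violations(bulky_candidate))
```

This is a heuristic filter for oral absorption only; as the list notes, injected drugs sidestep it entirely.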

You might wonder a few things here:

How do you know what the target should be? First principles thinking, domain knowledge, little experiments here and there.
1. Example: Atorvastatin came from earlier research on other molecules, lovastatin and mevastatin, which in turn were discovered while searching fungal fermentation broths for antimicrobial agents
Why a single target? It's easier to carefully study two entities (a protein and a ligand for it) than to study every molecular entity in a cell in detail
Why restrict ourselves to orally administered drugs? Because these drugs will need to be administered repeatedly (they are small molecules, eventually they get metabolized and excreted), usually at home, and having people injecting themselves daily is considered unfeasible.

As an analogy, consider self-driving cars: traditionally, engineers approached the problem of driving a car by identifying features like painted lines, mapping them to lanes, and then keeping the car within them. The algorithms used for this were simple and understandable, like Canny edge detectors or Hough transforms. The answers to the equivalent questions would have been something like:

How do you know what features to use? First principles thinking, asking domain experts, little experiments here and there.
Why a small number of features, why two lines/a single lane? It's hard to consider very complex scenarios, let's do a single lane at a time.
Why restrict ourselves to driving on sunny days with good visibility? It's already hard to do this!
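To give a flavor of how legible these handcrafted features are, here is a minimal sketch of a Hough transform in pure Python. It assumes edge points have already been extracted (the job of something like a Canny detector) and finds the straight line that most of them agree on. The function name, the one-degree and one-pixel resolution, and the toy points are my own choices for illustration, not a production lane detector.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def hough_strongest_line(
    points: Iterable[Tuple[float, float]]
) -> Tuple[int, int, int]:
    """Return (theta_degrees, rho, votes) for the line with the most votes.

    Lines are parameterized as rho = x * cos(theta) + y * sin(theta).
    Each edge point votes for every (theta, rho) line passing through it,
    with theta sampled every degree and rho rounded to the nearest pixel.
    """
    votes: Counter = Counter()
    for x, y in points:
        for theta_deg in range(180):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    (theta_deg, rho), count = votes.most_common(1)[0]
    return theta_deg, rho, count


# Toy "edge image": ten points on the vertical line x = 5 (a lane marking)
# plus three unrelated clutter points.
lane_marking = [(5, y) for y in range(0, 100, 10)]
clutter = [(20, 3), (40, 70), (13, 55)]

strongest = hough_strongest_line(lane_marking + clutter)
print(strongest)  # (0, 5, 10): theta = 0 degrees, rho = 5, ten votes
```

Every step here can be inspected and explained, which is exactly the property the classical approach trades against coverage of messy real-world scenes.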

With self-driving cars we are now seeing a competition between two approaches: the classical approach just described (now mostly abandoned), where many intermediate, handcrafted representations are modeled explicitly, and the new end-to-end approach (Tesla and comma.ai perhaps being the ones ideologically closest to this), where the car goes straight from perceived pixels to commands issued to the motors and steering system. Visualizations are still provided to the human driver for reassurance, but they are not strictly required for model performance.

This second paradigm replaces an object-level research oriented mindset where one aims to understand the system of interest, with an engineering-heavy mindset where understanding is deprioritized in favor of control. What matters then is instead designing systems able to ingest large quantities of data and models able to distill that into solutions to the problem, and that is an easier task than answering the question "how does one drive" or "what is driving" from first principles.

What would the equivalent of this approach look like for drug discovery?

Self driving cars are easier than drug discovery

While not perfect in all cases yet, it's now possible for a commercially available car to drive itself all the way from SF to LA. What made this possible is largely the same thing that made DeepMind's achievements in Go and Chess possible: lots of data and simulators that are very close to the real domain. The model can be trained on real-world data from the very system it operates in (a car in the real world, with data from the Tesla fleet), enhanced with simulated driving data. The physics of driving a car are understood well enough, and graphics can be made so realistic, that one can train on simulators too!

In biological systems this is not the case: The state of a human body is extremely complex and not yet fully understood. The dynamics of it extend over days (fighting an infection), months (pregnancy), and even decades (for processes like puberty). Measuring this state is also nontrivial; one can collect blood only so often, and measure only so much. Extracting biopsies to access organ state directly is highly invasive, and the perturbations a human is exposed to in the wild are far from what would be required to find new drugs. Natural data is useful to learn about things like exercise and diet (and even that's hard), but we don't go around taking random pills so that one can build models of what random compounds do to us.

Sometimes a variant of this is possible: Because (twins aside) we are all genetically distinct, nature is already running the world's greatest clinical trial in us, and with large enough collections of sequenced genomes it's possible to chart a path towards new drugs.

But genetic lottery aside, short of a large army of willing volunteers, we can't use the actual system (the human body) to experiment with directly at scale, and we can't simulate it yet, so we have to settle for something simpler. We can either test in animals (a whole organism) or we can test in human cells in vitro, or eventually in organoids.

I am very pro in vivo testing: Ultimately yes, we are made up of cells, but there are many different kinds that interact in different ways. If immune rejuvenation is one key part of the future of medicine, we wouldn't know the full extent of that if we just observed that the function of a given type of immune cell can be improved; one has to see what that cell does when placed in the context of an organism where it can now more effectively fight cancer and pathogens.

Some recent commentary on the future of ML for biology

What originally inspired me to write this are the following articles, which I read last year. I got the sense that there was renewed excitement in the field (or it was just a coincidence that these ended up in my reading list) that was worth thinking about.

One idea is, say, training a large language model on the entirety of Scihub and then asking it to solve a particular problem like predicting protein structure, producing drug candidates, or explaining why Alzheimer's actually happens. This has been tried before and the results have been... far from that promise: Albeit trained on less data, this is what happened with Galactica, and this is the current state of ChatGPT and similar state-of-the-art models. Scraping Scihub is feasible, but the results probably won't be that enlightening: We want the models to tell us new things, and so far LLMs tend to be very conservative. But beyond that, there isn't that much data out there. The papers might describe at a high level an experiment that was done and some particular results, but the raw or processed data that was gathered is something one can't get from the paper, or in many cases even from the public internet. Sam Rodriques is right when he says that "I am also skeptical of the ability of even an AI trained on the entire scientific literature to predict drug efficacy for diseases for which [we] have no effective drugs and no understanding of how they work".

Josh Nicholson, who wrote in more detail about how difficult it would be to do this actual thing, is more optimistic. But as he points out, we already have this: ScholarBERT was actually trained on what seems to be all scientific papers (75M of them, 221B tokens, as opposed to 48M papers/88B tokens for Galactica; Scihub has ~80M), and Science remains an unsolved problem. ScholarBERT is a relatively small model (770M parameters) so one can always think that maybe 100x the parameter count would lead to better performance at Solving Science, but I doubt it.

However, the real problem we care about is not producing plausible (given current knowledge) completions to papers. An assistant that has access to the world's scientific knowledge (or its publicly available portion) would be valuable but not that useful, especially if scientists working in a domain already have that knowledge. It would be a different matter if a model generated new hypotheses or proposed new experiments that are unintuitive but promising.

Adam Green, writing in A Future History of Biomedical Progress, expresses the same sentiment I share throughout this essay, going perhaps even further than I would. His essay is the most substantial inspiration for my own:

progress in basic biology research tools has created the potential for accelerating medical progress; however, this potential will not be realized unless we fundamentally rethink our approach to biomedical research. Doing so will require discarding the reductionist, human-legibility-centric research ethos underlying current biomedical research, which has generated the remarkable basic biology progress we have seen, in favor of a purely control-centric ethos based on machine learning. Equipped with this new research ethos, we will realize that control of biological systems obeys empirical scaling laws and is bottlenecked by biocompute. These insights point the way toward accelerating biomedical progress. [...]

One cut on this is how physics-like you think the future of biomedical research is: are there human-comprehensible “general laws” of biomedical dynamics left to discover, or have all the important ones already been discovered? And how lumpy is the distribution of returns to these ideas—will we get another theory on par with Darwin’s?

For instance, RNA polymerases were discovered over 50 years ago, and a tremendous amount of basic biology knowledge has followed from this discovery—had we never discovered them, our knowledge of transcriptional regulation, and therefore biomedical dynamics, would be correspondingly impoverished. Yet when was the last time we made a similarly momentous discovery in basic biology? Might biomedicine be exhausted of grand unifying theories, left only with factoids to discover? Or might these theories and laws be inexpressible in the language of human-legible models?

But in one of the footnotes there's a point where the complications of truly being "end to end" become more obvious:

Insitro et al. are to drug discovery as Waymo et al. are to autonomous vehicles. Just as some think autonomous vehicles will be solved by building high-definition maps of cities and modeling dynamics at the level of individual pedestrian behavior, some think biomedicine will be solved by building high-definition molecular “maps” of diseases and modeling dynamics at the level of individual cellular behavior. Though they are directionally correct in their use of machine learning, they fail to abstract sufficiently.

Green wants to truly "end-to-end" biology. That is, having a system we can ask "make a human healthy" and getting an answer, trained on triplets of (diseased human, perturbation, healthy human). Of course, he admits this is unrealistic because of ethical considerations; so rather he proposes to do this in mice and organoids (as close as possible to the real system) and then try to transfer from there. In the paragraph above he says Cellarity is not going far enough: They are trying to fix cells, but cells are not what we ultimately care about (the whole organism); in his view fixing cells is like learning to recognize traffic cones when building a self-driving car: a hand-engineered feature that is not required if one can end-to-end enough.

I think cells are better models than he thinks, perhaps. Biological systems have the advantage of being built of similar building blocks (all cells work in the same fundamental way), where parts of the system are adjusting themselves to the state of other parts. If you rejuvenate e.g. blood, you can probably have effects elsewhere. If you hit 60% of a tissue with a successfully rejuvenating therapy, chances are you might go beyond that 60% through cell-to-cell signaling. The self-driving analogy shouldn't be cones but rather charging electric cars: Given the task "Driving an electric car across the United States without human intervention" one has to automate driving and charging. The true end to end approach would be to train a joint model to control both the car and the charger (perhaps equipped with one of these). A single neural network that tells the car what to do and same for the charger. But in practice this is unnecessary: You can have a model that drives the car to a spot in the charger and then a simple computer vision based model that controls the charger and gets the car charged. The performance of this split approach wouldn't be inferior to the true end to end solution, and it is easier to train.

Similarly, while on paper the problem of "altering the state of a cell" involves (a) designing what to do to the cell and (b) getting that to the cell, I could imagine solving (a) first, say with a model that predicts what transcription factors to get a cell to express, and then solving (b) by trying to find a way to package that into an AAV or something else. This might not be doable, but then one can pick the next solution from the model that solves (a) and try again. To the extent that the domains are decoupled, one can substitute end-to-end learning with some more trial and error. Ultimately, the question is: Should we put more resources into organoids or better models, ultimately having 'organs on a chip' so that we can collect data to train end-to-end models? Or should we try to develop therapies with the tools we have available right now? My hunch is that the latter approach can still be useful.
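The decoupled search described above (take the design model's ranked proposals, then try delivery on each until one works) can be sketched in a few lines. Everything named here is hypothetical: the cocktail names, the payload sizes, and the `fits_in_vector` check, whose 4.7 kb limit is roughly the commonly cited AAV packaging capacity and should be treated as an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_deliverable(
    ranked_designs: Iterable[str],
    can_deliver: Callable[[str], bool],
) -> Optional[Tuple[int, str]]:
    """Walk the designs from best to worst and return (attempts, design)
    for the first one that passes the delivery check, or None if none does.
    """
    for attempts, design in enumerate(ranked_designs, start=1):
        if can_deliver(design):
            return attempts, design
    return None


# Step (a): a design model's output, best first (made-up names and sizes).
payload_size_kb = {"cocktail_A": 6.1, "cocktail_B": 5.2, "cocktail_C": 3.9}
ranked = ["cocktail_A", "cocktail_B", "cocktail_C"]

# Step (b): a stand-in delivery check; the capacity limit is an assumption.
ASSUMED_CAPACITY_KB = 4.7


def fits_in_vector(design: str) -> bool:
    return payload_size_kb[design] <= ASSUMED_CAPACITY_KB


result = first_deliverable(ranked, fits_in_vector)
print(result)  # (3, 'cocktail_C')
```

The cost of decoupling shows up as the attempt count: the more the two problems interact, the further down the ranked list one has to go, and the more an end-to-end model would have saved.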

One more argument against the need for complete end-to-end is that biology is incredibly 'plug and play'. It's possible to replace or address subsystems of an organism separately. For example one can replace old bone marrow with young bone marrow without having to concurrently replace everything else. One can even implant bits of organs in the right place and those organs will function somewhat. And obviously we have seen many successful drugs being developed by modeling just parts of the whole.

And lastly, as readers of Nintil know, I'm a fan of partial reprogramming. I do think fixing aging goes a long way in extending healthy lives, and aging is, to a large extent, the deterioration of processes that are common to all cell types (like transcription, translation, or autophagy), hence fixing this in vitro and solving systemic delivery seem to, together, go a long way!

Pablo Cordero writes here about unifying all of biology into a large model by thinking of biological knowledge as a graph, masking parts of it, and then predicting those parts from the rest of the data. It's not fully clear how one would go about doing this! I'm no expert in graph neural networks, but certainly the post shares the spirit of "end-to-end biology".
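To make the "mask and predict" idea a bit more concrete, here is a toy sketch. It is not the method from that post and not a graph neural network: it hides one edge of a tiny made-up graph and scores every missing edge by counting shared neighbors, a classic link-prediction heuristic swapped in here for simplicity. All entity names are hypothetical.

```python
from itertools import combinations

# Toy "knowledge graph": undirected edges between made-up entities.
edges = {
    frozenset(pair)
    for pair in [
        ("gene_A", "protein_A"),
        ("gene_A", "pathway_X"),
        ("protein_A", "pathway_X"),
        ("gene_B", "protein_B"),
        ("gene_B", "pathway_X"),
        ("protein_B", "pathway_X"),
        ("protein_A", "protein_B"),
    ]
}

# Mask (hide) one edge; the task is to recover it from the rest of the graph.
masked = frozenset(("protein_A", "pathway_X"))
visible = edges - {masked}

# Build neighbor sets from the visible edges only.
neighbors = {}
for edge in visible:
    u, v = tuple(edge)
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

# Score every pair that is not a visible edge by its number of common neighbors.
scores = {
    frozenset((u, v)): len(neighbors[u] & neighbors[v])
    for u, v in combinations(sorted(neighbors), 2)
    if frozenset((u, v)) not in visible
}

masked_score = scores[masked]
best_score = max(scores.values())
print(masked_score, best_score)
```

In this toy graph the hidden edge ties for the highest score among the candidate pairs. A learned model would replace the neighbor count with a trained scoring function, but the masking and evaluation setup would look much the same.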

Lastly, Jacob Kimmel wrote a really good post last year on representation learning as an extension from the way early molecular biologists worked:

There’s no general solution to modeling complex systems, but the computational sciences offer a tractable alternative to the analytical approach. Rather than beginning with a set of rules and attempting to predict emergent behavior, we can observe the emergent properties of a complex system and build models that capture the underlying rules. We might imagine this as a “top-down” approach to modeling, in contrast to the “bottom-up” approach of the physical tradition.

Whereas analytical modelers working on early structures had only a few experimental measurements to contend with – often just a few X-ray diffraction images – cellular and tissue systems within a complex organism might require orders of magnitude more data to properly describe. If we want to model how transcriptional regulators define cell types, we might need gene expression profiles of many distinct cell types in an organism. If we want to predict how a given genetic change might effect the morphology of a cell, we might similarly require images of cells with diverse genetic backgrounds. It’s simply not tractable for human-scale heuristics to reason through this sort large scale data and extract useful, quantitative rules of the system.

Is this time different?

Current "ML for drug discovery" startups are far from end-to-end. They still find a target the old-fashioned way, and limit themselves to small molecules (as with Relay or Exscientia). Some do go beyond the concept of a target and into phenotypic screening (like Recursion), where there is no initial driving hypothesis behind a drug; instead, the company builds models trained to recognize features of cells that look more or less diseased and then builds relations between the drugs the cells were treated with and the observed change. Recursion hasn't gotten any drug approved yet. Cellarity seems to follow a similar approach, moving away from the idea of a target and towards drugging cell state holistically. I suspect we will see more companies moving in this broad direction.

Why is this changing? The costs of reading and writing DNA are the lowest they have ever been, and the number of cells we can measure per experiment is the highest it has ever been. Just a few days ago a new paper came out reducing the cost of testing genetic perturbations in cells by an order of magnitude. "High-throughput" is now "Ultra-high throughput". Collecting data was never as cheap as it is today.

In parallel with the increasing volumes of data being collected, only very recently have we started to see models that can output predictions on what to do to a population of cells to shift them to a desired state (like PerturbNet, from 2022) or models that can predict the effect of combined genetic perturbations (GEARS, also from 2022), and of course transformers are coming for perturbation prediction as well (scFormer, once again 2022). Thanks to neural networks and progress in representation learning, these models can predict chemical perturbations or gene perturbations alike.

I don't have concrete timelines for when we are going to 'solve biology with ML', but working towards that end seems enormously valuable.
