The man who used machine learning to model COVID treatments
- Hope Reese
An interview with Jeremy Zucker, a principal investigator in the department of computational biology at the Pacific Northwest National Laboratory.
In the late ‘90s, Jeremy Zucker, then a mathematician and computer scientist, turned his focus to “biologically-inspired computing.” He went to work at an AI lab, where he applied his background to examine the ways that biological phenomena, like self-healing structures, could inspire computational models. Now, Zucker is a principal investigator in the department of computational biology at the Pacific Northwest National Laboratory. And he’s been using machine learning-powered causal models — abstract and often complex mathematical representations of how the real world works — to help us better understand biology. His work has modeled everything from how our metabolism functions to how infectious diseases spread.
So, when COVID-19 hit, Zucker was primed to apply his knowledge, and his models, to help unravel how the virus operated. Zucker turned to machine learning to examine different COVID treatments, and to study our immune system’s response to the virus. He was able to make predictions about how different types of COVID-19 treatments would work, which became part of a larger publication called the COVID-19 Disease Map Initiative. His team’s data would make its way to the White House.
We caught up with Zucker to learn how he used AI to figure out the novel coronavirus. We discussed how he applied machine learning to clinical data, the broader scientific effort to understand the body’s response to COVID, and what it all means for the future.
This interview has been edited and condensed for clarity.
Hope Reese: You have a background in biologically-inspired computing — how can this framework help us understand COVID-19?
Jeremy Zucker: It has to do with viral pathogenesis. So, how does the immune system normally respond when it sees some kind of bug? Normally, the interferon [protein] response would be sending alarm bells to nearby cells saying “Hello 911! I have been infected. Close your doors, batten down the hatches! Don’t let this bug infect you, too!”
What coronavirus does is that it dysregulates the immune response. It either overreacts or underreacts at the inappropriate time. It first underreacts because it doesn’t notice that there’s a viral infection. And when it finally realizes there’s an infection, it overreacts. In this case, it was the body’s response to the virus that was killing you, not the virus itself. The virus could sometimes be completely cleared from the body by that point, but the activation of cytokines [small proteins] by the adaptive immune system causes inflammation, which damages the body, which signals the adaptive immune response to activate more cytokines, which causes more inflammation, which does more damage, and so on — a vicious cycle known as a cytokine storm.
How did your machine learning system use data in order to figure out this kind of viral pathogenesis?
We actually submitted a proposal, and it got funded [by the PNNL Mathematics for Artificial Reasoning In Science (MARS) Initiative], before COVID hit. It focused on viral pathogenesis: how to integrate prior knowledge with existing data to get something you couldn’t get from either one alone. There were multiple kinds of data being generated. Clinical data tells you “we have so many patients in the hospital with coronavirus, with this percentage getting severe COVID and this percentage only getting moderate COVID. So many folks are infected, with this percentage being asymptomatic, etc.” Clinical data helps us build epidemiological models to estimate how fast the virus is moving through the population.
But other kinds of data help answer questions about viral pathogenesis, such as: “By what mechanism does the virus actually enter the cell? How is it able to suppress the interferon response? Why does our adaptive immune system overreact? Which inflammatory cytokines are actually getting induced?” Those details make a difference when it comes to finding a solution.
You made a “knowledge graph” to help explain what happens with the virus. What exactly is that?
There’s so much biological knowledge that no one human being can keep it all in their head. So we have what are called ontologies, which are standard vocabularies for describing all the genes and proteins and biological processes that are going on in the body, organized in a structured tree. And the trick is, in the literature, people mention one term from one ontology, and another term from a different ontology. We have to ground those terms. So, when you say SARS-CoV-2 and this person says novel coronavirus, they’re talking about the same thing, so there’s that unification: a grounded term that both refer to. That whole process of figuring out which terms we’re actually referring to is named-entity recognition.
So now we have a bunch of nodes in our network and now we want to relate the terms. So we can say “this gene causes this other gene to be dysregulated.” The knowledge graph is the relationship between the two nodes.
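The grounding and edge-building steps Zucker describes can be sketched in a few lines. This is an illustrative toy, not the project's actual pipeline; the synonym table is hypothetical, though the ontology IDs shown are real identifiers for SARS-CoV-2 and IL-6.

```python
# Illustrative sketch: named-entity recognition maps different surface
# mentions to one canonical ontology term, so knowledge-graph edges
# connect grounded nodes rather than raw strings.
SYNONYMS = {
    "SARS-CoV-2": "NCBITaxon:2697049",
    "novel coronavirus": "NCBITaxon:2697049",
    "IL-6": "HGNC:6018",
    "interleukin 6": "HGNC:6018",
}

def ground(mention: str) -> str:
    """Map a text mention to its canonical ontology identifier."""
    return SYNONYMS[mention]

# The knowledge graph: each edge relates two grounded nodes via a relation.
edges = set()
edges.add((ground("SARS-CoV-2"), "dysregulates", ground("interleukin 6")))
edges.add((ground("novel coronavirus"), "dysregulates", ground("IL-6")))

# After grounding, the two differently-worded statements collapse into one edge.
print(len(edges))  # 1
```

The point of the set here is exactly the unification Zucker mentions: without grounding, the two statements would look like four distinct entities and two distinct facts.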
What happens after you make the knowledge graph?
Once you have that kind of causal knowledge, then you can say, “well, if A causes B, then if I tweak A, I should see an effect on B.” That’s where you can start talking about, “what are the determinants of viral pathogenesis?” or “can we identify targets for medical countermeasures?”
But to answer those kinds of questions, you need a way to integrate data with these models. This is where causal inference comes in.
For example, we saw in the clinical data that patients who were severely ill with COVID and went on to develop a cytokine storm often had a highly induced cytokine named IL-6. But we still didn’t know if high levels of IL-6 caused the cytokine storm, or if the cytokine storm induced high levels of IL-6, or if it was just that severe COVID caused both cytokine storms and IL-6 induction, and there was no direct relationship between IL-6 and cytokine storms at all.
By integrating these data with the causal models, we teased apart those relationships to remove the confounding effect of severe COVID and determined that IL-6 was a necessary, but not sufficient, cause of cytokine storms in severely ill COVID patients. Based on this model, we then predicted that treating a patient with an IL-6 inhibitor called tocilizumab could reduce the likelihood that a severely ill COVID patient would develop a cytokine storm.
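The deconfounding idea here can be illustrated with backdoor adjustment by stratification: estimate the IL-6 effect within each severity stratum, then average across strata. The patient counts below are invented for the sketch; the actual study used real clinical data and more sophisticated causal-inference machinery.

```python
# Toy illustration of removing a confounder (disease severity) by
# stratification. Each record: (severe_covid, high_il6, cytokine_storm),
# with invented counts.
patients = (
    [(1, 1, 1)] * 30 + [(1, 1, 0)] * 10 +  # severe, high IL-6
    [(1, 0, 1)] * 2  + [(1, 0, 0)] * 18 +  # severe, normal IL-6
    [(0, 1, 1)] * 3  + [(0, 1, 0)] * 7 +   # mild, high IL-6
    [(0, 0, 0)] * 30                       # mild, normal IL-6
)

def p_storm(il6: int, severe: int) -> float:
    """P(cytokine storm | IL-6 level) within one severity stratum."""
    group = [p for p in patients if p[0] == severe and p[1] == il6]
    return sum(p[2] for p in group) / len(group)

def adjusted_risk(il6: int) -> float:
    """Backdoor-adjusted storm risk: average strata, weighted by prevalence."""
    total = len(patients)
    return sum(
        p_storm(il6, s) * sum(1 for p in patients if p[0] == s) / total
        for s in (0, 1)
    )

# A positive difference suggests IL-6 raises storm risk even after
# controlling for severity; with these toy numbers it is 0.51.
print(round(adjusted_risk(1) - adjusted_risk(0), 2))
```

Comparing risks within strata is what blocks the "severe COVID causes both" explanation: if severity alone drove the association, the within-stratum difference would vanish.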
Once you had this information, what did you do with the results?
There was a whole marshaling of resources — early on in the pandemic, the White House Office of Science and Technology Policy (OSTP) requested a dataset that would capture all the world’s knowledge of coronavirus in a machine-readable format.
They wanted to use this dataset as a call to action for the nation’s AI experts to develop new text and data mining techniques to help the scientific community answer high-priority scientific questions related to COVID-19. The Semantic Scholar team at AI2, in collaboration with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft, the National Library of Medicine at the National Institutes of Health, and Unpaywall responded to that request, and on March 16, 2020, they released the COVID-19 Open Research Dataset (CORD-19). It originally had 20,000 articles, but expanded to over 350,000 articles over the course of the pandemic.
To support this call to action, a Kaggle competition was formed where different teams of data scientists would collaborate and compete to find these answers using machine learning and natural language processing techniques. I was part of a team called CoronaWhy that had several prize-winning entries.
Had AI ever been used on this kind of problem before?
DARPA has a program called Automating Scientific Knowledge Extraction — or ASKE — where they developed all these AI algorithms that took this unstructured data and then grounded all the scientific jargon into well-defined terms to describe all the genes and proteins and biological processes in the body. So when one paper says that “SARS-CoV-2 inhibits interferon response,” and another paper says “interferon response activates innate immune system” and a third paper says “innate immune system activates the adaptive immune system” and a fourth paper says “adaptive immune system activates inflammatory cytokines” — then a computer can join all these causal fragments into a chain of causal events to conclude that “SARS-CoV-2 dysregulates the activation of inflammatory cytokines.”
Of course, you need human curators to ensure that this chain of reasoning doesn’t just spit out nonsense. Within the COVID-19 Disease Map initiative, [an open-source, large-scale collection of curated computational models and diagrams of molecular mechanisms involved in SARS-CoV-2 infection], human curators would sketch out a pathway based on intuition honed by extensive expertise, and then the algorithms would ground these terms and fill in the details that a human being left out by traversing its vast knowledge graph of causal relationships. Sometimes the humans would correct the machine, and sometimes the machine would correct the humans, but it enabled the coordination of a large group of COVID-19 experts to generate a coherent, comprehensive, causal model of the host response to viral pathogenesis — quickly.
This is the largest consensus model that has been published about COVID-19 so far. And people keep adding to it because we are still learning more about COVID-19.
The big takeaway is that to understand why we react to the virus the way we do and find therapeutics that might soften the blow, the empirical data alone are not sufficient, nor are our AI models, but the answer can sometimes be found in the interaction of the two.