Predicting Viral Outbreaks With Machine Learning
Think about the pandemics that have shaped human history and even our culture. Even in recent times, viral outbreaks such as Zika, Ebola, HIV and smallpox have caught us by surprise. Despite the technology and data available to us today, scientists struggle to predict where and when the next viral outbreak might occur. But this could be about to change, thanks to a new weapon in our arsenal – machine learning.
Since the beginning of our species, humankind has survived many viral outbreaks. These outbreaks start off as epidemics, confined to a certain location. If they spread throughout the world, they are classified as pandemics [1]. Pandemics have claimed millions – perhaps even billions – of lives, through diseases such as the Black Death and Spanish flu.
The H1N1 (‘Swine’ flu) pandemic of 2009 caused alarm due to its resemblance to the Spanish flu virus, sparking fears of another worldwide disaster [2]. Despite the knowledge we possess today, our lives continue to be at the whim of such viral outbreaks.
The Challenges Facing Epidemiologists
For many recent viral outbreaks, scientists actually know a great deal about the viruses themselves. In fact, we have studied viruses such as Zika and Ebola for decades. And yet, each time an outbreak occurs, epidemiologists – scientists who study disease patterns – are left scratching their heads.
Virus emergence is a key area of study for epidemiologists, as the ability to predict viral outbreaks can give us the upper hand in preventing and containing them. Some outbreaks can be anticipated: seasonal flu, for example, can be ‘predicted’, as the virus is more stable during cold winter months. For most viruses, however, numerous obstacles make prediction a difficult task. For a start, it is estimated that just 0.005% of all virus strains have been identified. It becomes much harder to predict epidemics when we don’t know what we might be looking for!
Typically, viral emergence depends on factors that are already difficult to model. Environmental, geographical and even cultural factors play a role in the surfacing and spread of diseases [5]. Furthermore, viruses – especially RNA viruses – tend to have a high mutation rate, and the most infamous example is the human immunodeficiency virus (HIV). Constant changes to its genetic code make the behavior of a virus extremely difficult to predict, as seen in our failure so far to develop a cure or vaccine for HIV [6].
Phylogenetics to the Rescue
To predict the spread of a virus, epidemiologists try to determine its host (or hosts) and its method of transmission to humans. The host – also known as a ‘reservoir’ – is where a virus naturally occurs. The method of transmission involves a vector – an organism that can transmit the virus to humans. The dengue virus, for example, uses monkeys as a reservoir and the Aedes mosquito as its transmission vector. Knowing where and how a virus spreads is useful information that can help to predict future outbreaks.
In terms of research material, a virus doesn’t have much to provide us. It contains only RNA (or DNA, depending on the virus), the genetic information that the virus uses to build proteins. But genes can also tell us about the evolutionary history of a certain virus strain, an area of study known as phylogenetics. By constructing the genetic ‘family tree’ of a virus, we can narrow down possible hosts and vectors based on data from its closest relatives. Though the genetic data is accessible, the sheer amount of it makes processing difficult.
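The idea of narrowing down hosts from a virus's closest relatives can be sketched in a few lines of Python. This is only an illustration, not the method used in any particular study: it assumes the sequences are already aligned to the same length, uses the simple proportion of differing positions as the distance, and every virus name, sequence and host label below is made up.

```python
def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_relative(query: str, references: dict[str, str]) -> str:
    """Return the name of the reference sequence nearest to the query."""
    return min(references, key=lambda name: p_distance(query, references[name]))


# Hypothetical toy data: aligned RNA fragments and their known reservoirs.
references = {
    "virus_A": "AUGGCUAACGUU",
    "virus_B": "AUGGCUAACGAA",
    "virus_C": "CCGGAUUUCGAA",
}
known_hosts = {"virus_A": "bat", "virus_B": "bat", "virus_C": "rodent"}

# An unclassified sequence is compared against every reference, and the
# host of its nearest relative becomes the first candidate to investigate.
query = "AUGGCUAUCGUU"
nearest = closest_relative(query, references)
candidate_host = known_hosts[nearest]
```

A real phylogenetic analysis would build a full tree from many genes and use a proper evolutionary model, but the principle is the same: relatives that sit close together in the tree are a first guess at a shared host.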
However, machine learning – a new weapon in the arsenal of epidemiology – could hold the key. We can ‘train’ machine learning algorithms to find patterns and structure in existing data sets, in order to make predictions about new data. In November 2018, scientists Babayan, Orton and Streicker successfully used machine learning models to predict the reservoir hosts and vectors of viruses from their genomes. Their results were published in the journal Science.
After ‘training’ the algorithm with the genetic information of hundreds of viruses, they were able to correctly assign hosts and vectors to viruses in the original dataset. The algorithm was then made to ‘predict’ the hosts and vectors of viruses not included in the training data. However, results showed only a 58% host identification rate and a 67% vector identification rate.
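An ‘identification rate’ of this kind is simply the fraction of held-out viruses whose label the model gets right. The sketch below shows the general procedure under loud assumptions: the records and feature values are invented, and the ‘model’ is a trivial majority-label baseline standing in for a real classifier, so it says nothing about how the published models were built.

```python
import random


def split_holdout(records, holdout_fraction=0.25, seed=0):
    """Shuffle labelled records and split them into (training, held_out)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def identification_rate(predict, labelled):
    """Fraction of labelled records whose label the predictor gets right."""
    if not labelled:
        raise ValueError("need at least one labelled record")
    correct = sum(predict(features) == label for features, label in labelled)
    return correct / len(labelled)


def train_majority_baseline(training):
    """Toy stand-in for a real model: always predict the most common label."""
    labels = [label for _, label in training]
    most_common = max(sorted(set(labels)), key=labels.count)
    return lambda features: most_common


# Hypothetical records: (feature vector, reservoir host label).
records = [
    ([0.9, 1.1], "bat"), ([1.0, 1.2], "bat"), ([0.8, 1.0], "bat"),
    ([0.4, 1.6], "rodent"), ([0.5, 1.5], "rodent"), ([0.9, 1.3], "bat"),
    ([0.3, 1.7], "rodent"), ([1.1, 0.9], "bat"),
]

# The model only ever sees the training records; the held-out records
# play the role of viruses that were not included in the training data.
training, held_out = split_holdout(records)
model = train_majority_baseline(training)
rate = identification_rate(model, held_out)
```

Keeping the held-out viruses completely separate from training is what makes the resulting rate a fair estimate of how the model might do on a virus it has never seen.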
To further enhance accuracy, another factor was introduced to the system – genomic ‘bias’. The genomes of viruses tend to mimic those of their hosts and vectors, as they need to hide within their cells. Similarities between their codon pairs and dinucleotides add another layer of identification. The algorithm incorporated these genomic biases using so-called gradient boosting machines (GBMs).
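To make the notion of dinucleotide bias concrete, here is a minimal sketch that computes one common measure: how often each pair of neighbouring bases appears, relative to how often it would appear if the bases were arranged at random. The observed-over-expected ratio used here is a widely used convention and is an assumption on our part; the exact definition used in the study may differ. The function name and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_bias(seq: str, alphabet: str = "ACGU") -> dict[str, float]:
    """Observed/expected frequency ratio for every dinucleotide in seq.

    A ratio above 1 means the pair is over-represented compared with what
    the single-base frequencies alone would predict; below 1 means it is
    under-represented.
    """
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    base_counts = Counter(seq)
    pair_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_bases = len(seq)
    n_pairs = len(seq) - 1
    bias = {}
    for x, y in product(alphabet, repeat=2):
        expected = (base_counts[x] / n_bases) * (base_counts[y] / n_bases)
        observed = pair_counts[x + y] / n_pairs
        bias[x + y] = observed / expected if expected > 0 else 0.0
    return bias


# Hypothetical RNA fragment; a real genome would be thousands of bases long.
features = dinucleotide_bias("ACGUACGUACGU")
```

The sixteen ratios form a small numeric ‘fingerprint’ of a genome. Fingerprints like this, computed for both viruses and candidate hosts, are the kind of input a gradient boosting classifier could be trained on.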
Predicting the Next Viral Outbreak
The algorithm was quickly put to work on the genetic data of several strains of the Ebola virus. Prior to the study, bats were thought to be the primary reservoir for all strains of the virus. Surprisingly, the results generated by the algorithm showed that at least two strains – the Bundibugyo and Taï Forest ebolaviruses – utilized a different host. The algorithm assigned a higher probability that these strains originated from primates instead [3,4].
For all its remarkable success, the next crucial step for the machine learning algorithm is to see how it fares against ‘real-life’ data. If an unknown virus presented itself, would it be able to correctly identify its reservoir host and transmission vector?
Although it is not yet possible to predict where and when a virus might pop up, these new models provide useful tools in epidemiology. The huge death tolls of viral outbreaks in the past were mainly due to unforeseen circumstances and poor quarantine practices. Perhaps in the future, we might be able to stop epidemics before they even occur [3].
1. Centers for Disease Control and Prevention (2012). Principles of Epidemiology in Public Health Practice, Third Edition: An Introduction to Applied Epidemiology and Biostatistics.
2. World Health Organization (2010). What is a pandemic?
3. Babayan, S. A., Orton, R. J., & Streicker, D. G. (2018). Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science, 362, 577.
4. Centers for Disease Control and Prevention (2018). Ebola Reservoir Study.
5. Geoghegan, J. L., & Holmes, E. C. (2017). Predicting virus emergence amid evolutionary noise. Open Biology, 7(10), 170189.
6. Cuevas, J. M., Geller, R., Garijo, R., López-Aldeguer, J., & Sanjuán, R. (2015). Extremely High Mutation Rate of HIV-1 In Vivo. PLOS Biology, 13(9).