Predicting Viral Outbreaks With Machine Learning


Think about the pandemics that have shaped human history and even our culture. Even in modern times, viral outbreaks such as smallpox, HIV, Ebola and Zika have caught us by surprise. Despite the technology and data available to us today, scientists struggle to predict where and when the next viral outbreak might occur. But this could be about to change, thanks to a new weapon in our arsenal – machine learning.

Pandemic Alert

Since the beginning of our species, humankind has survived many viral outbreaks. These outbreaks start off as epidemics, confined to a certain location. If they spread throughout the world, we classify them as pandemics [1]. The COVID-19 pandemic, declared in 2020, is the most recent, caused by the SARS-CoV-2 coronavirus.

Pandemics in the past have claimed tens of millions of lives, through diseases such as the Black Death and Spanish flu. The H1N1 (‘Swine’ flu) pandemic of 2009 caused alarm due to its resemblance to the Spanish flu virus, sparking fears that another worldwide disaster could be on the way [2].

Despite the knowledge we possess today, our lives continue to be at the whim of such viral outbreaks.

A hundred years later, scientists are still unable to determine the exact origins of the 1918 Spanish flu pandemic that killed millions.

The Challenges Facing Epidemiologists

For many recent viral outbreaks, scientists actually know a great deal about the viruses themselves. In fact, we have studied viruses such as Zika and Ebola for decades. And yet, each time an outbreak occurs, epidemiologists – scientists who study disease patterns – are left scratching their heads.

Virus emergence is a key area of study for epidemiologists, as the ability to predict viral outbreaks can give us the upper hand in preventing and containing them. Some patterns are well understood: seasonal flu can be ‘predicted’ to a degree, as influenza viruses are more stable during cold winter months. For most other viruses, however, numerous obstacles make prediction a difficult task.

To make matters worse, it is estimated that just 0.005% of all virus strains have been identified. Obviously, it becomes harder to predict epidemics when we don’t know what we should be looking for! On top of that, viral emergence depends on factors that are themselves difficult to model: environmental, geographical and even cultural factors all play a role in the surfacing and spread of diseases [5].

Furthermore, viruses – especially RNA viruses – tend to have a high mutation rate. The most infamous example is the human immunodeficiency virus (HIV). This constant change in the genetic code makes the behavior of a virus extremely difficult to predict, and has so far frustrated efforts to develop a cure or vaccine for HIV [6].

Phylogenetics to the Rescue

To predict the spread of a virus, epidemiologists try to determine its host (or hosts) and its method of transmission into humans. The host – also known as a ‘reservoir’ – is where a virus naturally occurs. The method of transmission involves a vector – an organism that can transmit the virus to humans.

The dengue virus, for example, uses monkeys as a reservoir and the Aedes mosquito as its transmission vector. Knowing where and how a virus spreads is useful information that can help to predict viral outbreaks.

In terms of research material, a virus doesn’t have much to provide us. Its genome consists only of RNA (or DNA, depending on the type of virus), genetic information that the virus uses to build proteins. But genes can also tell us about the evolutionary history of a certain virus strain, an area of study known as phylogenetics.

By constructing the genetic ‘family tree’ of a virus, we can narrow down possible hosts and vectors based on data of its closest relatives. Though the genetic data is accessible, the sheer amount of it makes processing difficult.
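The ‘closest relatives’ idea can be sketched in a few lines of Python. The example below compares short sequences by the three-letter words (k-mers) they contain and ranks known viruses by similarity to an unknown one. The sequences, the virus names and the choice of k = 3 are all invented for illustration; real phylogenetic analyses rely on sequence alignment and tree-building methods rather than this simple distance.

```python
from collections import Counter
from math import sqrt

def kmer_profile(seq, k=3):
    """Count every overlapping k-letter word in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_distance(a, b):
    """1 minus the cosine similarity of two k-mer count profiles."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def closest_relatives(query, known, k=3):
    """Rank known viruses from most to least similar to the query."""
    q = kmer_profile(query, k)
    scored = [(cosine_distance(q, kmer_profile(seq, k)), name)
              for name, seq in known.items()]
    return [name for _, name in sorted(scored)]

# Hypothetical toy sequences, not real viral genomes.
known_viruses = {
    "virus_A": "AUGGCUAGCUAGGCUAACGGAUCC",
    "virus_B": "AUGGCAAGCUUGGCUAACGCAUCC",
    "virus_C": "UUUUAAAACCCCGGGGUUUUAAAA",
}
unknown = "AUGGCUAGCUAGGCUAACGGAUCA"

ranking = closest_relatives(unknown, known_viruses)
print(ranking)
```

If the hosts and vectors of the top-ranked relatives are already known, they become the first candidates to check for the unknown virus.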

Take this Bronn tree diagram. Now imagine sorting it based on the genetic information of tens of thousands of viruses, hosts and vectors.

Machine Learning

Machine learning – a new weapon in the arsenal of epidemiology – could hold the key to handling this volume of data. We can ‘train’ machine learning algorithms to find patterns and structure in existing data sets, in order to make predictions about new data. In November 2018, scientists Babayan, Orton and Streicker reported using machine learning models to predict the reservoir hosts and arthropod vectors of RNA viruses from their genomes. Their results were published in the journal Science [3].

After ‘training’ the algorithm with the genetic information of hundreds of viruses, they were able to correctly assign hosts and vectors to viruses in the original dataset. The algorithm was then made to ‘predict’ the hosts and vectors of viruses not included in the training data. However, results showed only a 58% host identification rate and a 67% vector identification rate.
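To make the idea of testing on viruses ‘not included in the training data’ concrete, here is a minimal sketch of a held-out evaluation. It uses a deliberately simple nearest-neighbour rule rather than the study’s models, and the feature values and host labels are hypothetical placeholders, not real measurements.

```python
def nearest_neighbour_predict(train, features):
    """Return the label of the training example closest to `features`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min((sq_dist(f, features), lab) for f, lab in train)
    return label

def held_out_accuracy(train, test):
    """Fraction of held-out examples whose label is predicted correctly."""
    correct = sum(nearest_neighbour_predict(train, f) == lab for f, lab in test)
    return correct / len(test)

# Hypothetical data: each virus is summarised by two made-up genome
# statistics and labelled with its (assumed known) reservoir host.
train = [
    ((0.1, 0.9), "bat"),
    ((0.2, 0.8), "bat"),
    ((0.9, 0.1), "primate"),
    ((0.8, 0.3), "primate"),
]
# Viruses the model never saw during training.
test = [
    ((0.15, 0.85), "bat"),
    ((0.85, 0.20), "primate"),
    ((0.50, 0.45), "bat"),  # an ambiguous case the simple rule gets wrong
]

accuracy = held_out_accuracy(train, test)
print(f"held-out accuracy: {accuracy:.0%}")
```

The model is scored only on examples it has never seen, which is why held-out accuracy gives a more honest picture than accuracy on the training data itself.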

To further enhance accuracy, another factor was introduced to the system in the form of genomic ‘bias’. The genomes of viruses tend to mimic those of their hosts and vectors, as the viruses need to hide within host and vector cells. Similarities in codon pair and dinucleotide usage add another layer of identification.
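One common way to put a number on dinucleotide bias is to compare how often a pair of bases actually occurs with how often it would occur if the two bases were placed independently. The sketch below computes this observed/expected ratio; whether the study used exactly this measure is an assumption, and the sequence is a made-up toy example.

```python
from collections import Counter

def dinucleotide_bias(seq):
    """Observed/expected ratio for each dinucleotide present in `seq`.

    A ratio near 1 means the pair occurs about as often as the
    frequencies of its two bases would predict; values well above
    or below 1 indicate a bias towards or against that pair.
    """
    base_counts = Counter(seq)
    pair_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_bases = len(seq)
    n_pairs = len(seq) - 1
    bias = {}
    for pair, count in pair_counts.items():
        observed = count / n_pairs
        expected = (base_counts[pair[0]] / n_bases) * (base_counts[pair[1]] / n_bases)
        bias[pair] = observed / expected
    return bias

# Hypothetical toy sequence, not a real genome.
toy_genome = "ACGACGACGACGUUUUAAAA"
bias = dinucleotide_bias(toy_genome)
for pair, ratio in sorted(bias.items()):
    print(pair, round(ratio, 2))
```

Computed for a virus and for candidate hosts, a set of ratios like these can serve as numerical features for a classifier.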

The algorithm incorporated these genomic biases using so-called gradient boosting machines (GBMs), and was then able to identify hosts and vectors with an accuracy of up to 95.9% [3].
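A gradient boosting machine builds its prediction from many small models, each one fitted to the errors left by those before it. The pure-Python sketch below illustrates that principle with one-split decision ‘stumps’ and a squared-error loss. It is a teaching toy under stated assumptions, not the implementation used in the study: the function names, feature values and labels are all hypothetical, and real work would use an established library.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean

def fit_gbm(X, y, n_rounds=20, learning_rate=0.5):
    """Start from the mean label, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate

def gbm_predict(model, row):
    """Score near 1 suggests 'primate', near 0 suggests 'bat' (toy labels)."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)

# Hypothetical features (two made-up genome bias scores) and host labels:
# 0 = bat reservoir, 1 = primate reservoir.
X = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.4), (0.7, 0.8), (0.8, 0.6), (0.9, 0.9)]
y = [0, 0, 0, 1, 1, 1]

model = fit_gbm(X, y)
score_low = gbm_predict(model, (0.15, 0.30))
score_high = gbm_predict(model, (0.85, 0.70))
print(round(score_low, 3), round(score_high, 3))
```

Each round only nudges the prediction by a fraction (the learning rate) of the new stump’s output, which is what lets many weak models add up to a strong one.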

Predicting the Next Viral Outbreak

The algorithm was quickly put to work on the genetic data of several strains of the Ebola virus. Prior to the study, bats were thought to be the primary reservoir for all strains of the virus. Surprisingly, the results showed that at least two strains – the Bundibugyo and Taï Forest ebolaviruses – may use a different host: the algorithm assigned a higher probability that these strains originated from primates instead [3,4].

Until now, bats were suspected to be the primary reservoir of the Ebola virus, but model predictions now question this claim.

For all its remarkable success, the next crucial step for the machine learning algorithm is to see how it fares against ‘real-life’ data. If an unknown virus presented itself, would it be able to correctly identify its reservoir host and transmission vector?

Although it is not yet possible to predict where and when a virus might pop up, these new models provide useful tools in epidemiology. The huge death tolls of viral outbreaks in the past were mainly due to unforeseen circumstances and poor quarantine practices. Perhaps in the future, we might be able to stop epidemics before they even occur [3].


  1. Centers for Disease Control and Prevention (2012). Principles of Epidemiology in Public Health Practice, Third Edition: An Introduction to Applied Epidemiology and Biostatistics.
  2. World Health Organization (2010). What is a pandemic?
  3. Simon A. Babayan, Richard J. Orton, Daniel G. Streicker. (2018). Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science. November 2018. 362, 577.
  4. Centers for Disease Control and Prevention (2018). Ebola Reservoir Study.
  5. Geoghegan, J. L., & Holmes, E. C. (2017). Predicting virus emergence amid evolutionary noise. Open biology, 7(10), 170189.
  6. Cuevas JM, Geller R, Garijo R, López-Aldeguer J, Sanjuán R. (2015). Extremely High Mutation Rate of HIV-1 In Vivo. PLOS Biology, 13(9).
