Mathematics helping to predict the evolution of COVID-19
Updated: Sep 2, 2020
Hello guys! We took a while, but we are back with another paper.
Taking advantage of the pandemic period, I prepared some material to show you what I have been studying while staying at home. As I cannot visit my clients, I am studying and learning new concepts and revisiting old ones that were a little forgotten, and as always I would like to share them with you.
I hope you like the article. As a friend from old times used to say, if you don't like it, just tell me, but if you like it, spread it to the four corners of the world.
PS: You can read the article here on the blog or, if you prefer, download it.
Mathematics helping to predict the evolution of COVID-19
Author: Edson Rui Montoro
The pandemic we are all going through, and whose effects we are feeling first-hand, has disturbed everyone's life, and I was obviously not immune to it. With quarantine my work was badly compromised and staying at home has become our routine.
As I can't stand still, I started studying a little bit of epidemiology, the mathematical models that exist to create the epidemiological curves and I realized that I was a little “rusty” with the concepts of differential calculus, ordinary differential equations, among others.
An epidemic, according to Forattini (2005), is the “name given to the state of incidence or health problem, in addition to what is normally expected within the range of endemicity, in a given area or population group”. Endemicity is the range of variation in the prevalence of the disease or health problem, defined by levels considered normal for a given area or population group.
The development of an epidemic depends on some factors such as the number of susceptible people, the number of infected, the rate of contact between them and the mode of transmission.
According to Greenberg and collaborators (2005), a pandemic is an epidemic of infectious disease that spreads among the population located in a large geographic region, such as a continent, or even the planet. According to the WHO (World Health Organization), a pandemic can start when three conditions are met:
The appearance of a new disease in the population;
The agent infects humans, causing a serious illness;
The agent spreads easily and sustainably among humans.
A disease or condition cannot be considered a pandemic just because it is widespread or kills a large number of people; it must also be infectious. For example, cancer is responsible for a large number of deaths, but it is not considered a pandemic because the disease is not contagious.
The science that studies epidemics is epidemiology, the branch of medicine that assesses the different factors involved in the spread of diseases, their frequency, their mode of distribution, their evolution and the measures needed to prevent them.
This science uses mathematics, as well as other sciences, to evaluate, quantify and make predictions in the most reliable way possible through statistics and mathematical modeling techniques. In epidemiology, mathematical modeling is extremely useful for understanding the propagation mechanisms and for planning control strategies and assessing their impacts.
Bassanezi (2002) states that a mathematical model of a situation is a symbolic representation that starts from the real and involves an abstract mathematical formulation. According to the same author, modeling is the practice of making models. A model is never a completely accurate representation of a physical situation; it is an idealization.
A good model simplifies reality enough to allow mathematical calculations while maintaining sufficient precision for meaningful conclusions. However, it is important to understand the limitations of the model (STEWART, 2015).
The most famous phrase of George Box (Box, G.E.P, 1976), a well-known chemist / statistician in the field of design of experiments in the industrial area, is: "All models are wrong, but some are useful". This may never be as true as during a crisis, as information is limited, often wrong, but decisions must be made and implemented based on what is known today.
The mathematical models that provide a solution to real situations are extremely complex and to approach this complexity, the strategy is to achieve a real solution to an approximate problem, which is better than an approximate solution to a real problem, as shown in Figure 1.
Figure 1 – Mathematical modeling.
The concept of mathematical modeling is fully consistent with the George Box phrase, quoted earlier.
The modeling technique in epidemiology has been known since the early twentieth century (Kermack, W.O., McKendrick, A.G., 1927). The most used model is the SIR (Susceptible, Infected and Removed) model, sometimes also called the Compartmental Systems Model. It is based on a system of Ordinary Differential Equations (ODE) and makes some basic assumptions, for example that the original population (N) does not change because there is a balance between deaths and births.
There are other models for different epidemiological situations but for the current case of COVID-19, the SIR model seems more appropriate, especially at the beginning of the crisis.
First, consider that an epidemic occurs in a closed system where there is contact between healthy and infected individuals, and that the population is divided into different classes or compartments, namely:
S = healthy people, but susceptible to disease, who can become infected when they meet sick people.
I = individuals with the disease, which are the focus of transmission.
R = individuals who have already contracted the disease and recovered or died.
In the study of compartmental models, we consider that each compartment, or class, is composed of homogeneous individuals, that each individual has the same probability of meeting a susceptible individual, that no births occur, and that deaths occur only due to the contagious disease under study.
In this study we have a closed system: the population in question remains constant over time, and what changes is the size of its components, called compartments, hence the name compartmental. This can be described by equation 1:
$$N = S(t) + I(t) + R(t) \qquad (1)$$
where:
N = total size of the population;
S(t) = number of susceptible individuals at time t;
I(t) = number of infected individuals at time t;
R(t) = number of recovered individuals at time t, where the term Recovered includes both those who have recovered and those who have died from the disease.
Another way of representing the model is shown in Figure 2.
Figure 2 – Compartments considered in the SIR model.
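Modeling with ODEs (Ordinary Differential Equations) considers the following equations, written here in their standard form, with β the transmission rate and γ the recovery rate:
$$\frac{dS}{dt} = -\frac{\beta S I}{N} \qquad (2.a)$$
$$\frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I \qquad (2.b)$$
$$\frac{dR}{dt} = \gamma I \qquad (2.c)$$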
This model is based on some additional hypotheses, which are:
The rate of change of the susceptible population over time (dS/dt) is proportional to the number of encounters between the susceptible and the infected populations, that is, the infection spreads through the SI interaction (Susceptible – Infected);
The rate of change of the removed (or recovered) population (dR/dt) is proportional to the infected population (I);
A member of the population, on average, makes contact to transmit the infection to another individual at a rate dI(t)/N per unit time (law of mass action).
Notice that in the first equation (2.a) there is an interaction between S and I, which means that the contact between a susceptible and an infected person causes the epidemic to evolve.
Using these hypotheses, we can translate the differential equations in a more literal way for a better understanding, as follows:
Rate of change of S = –Infection Rate;
Rate of change of I = Infection rate – Removal rate;
Rate of change of R = Removal rate.
Solving the system of ordinary differential equations requires the values of S(0), I(0) and R(0), that is, the values at time “zero” of the epidemiological process. S(0) is the number of people that make up the population at the beginning of the process; I(0) is the first person to be infected, that is, the origin of the process, or at least the number of infected on the day of its discovery; and R(0) must be zero. The Euler method is used to solve the ordinary differential equations.
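To make the procedure concrete, here is a minimal Python sketch of the Euler method applied to the SIR model. It assumes the standard form of the equations (infection rate β·S·I/N, removal rate γ·I); the function name `simulate_sir`, the step size and all parameter values are illustrative assumptions, not the spreadsheet used for this article and not values fitted to real data.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with the explicit Euler method.

    Returns a list of (day, S, I, R) tuples sampled once per day.
    """
    s = float(population - initial_infected)  # S(0)
    i = float(initial_infected)               # I(0)
    r = 0.0                                   # R(0) is zero at the start
    steps_per_day = round(1 / dt)
    history = [(0, s, i, r)]
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            infection_rate = beta * s * i / population  # S -> I flow
            removal_rate = gamma * i                    # I -> R flow
            s += dt * (-infection_rate)
            i += dt * (infection_rate - removal_rate)
            r += dt * removal_rate
        history.append((day, s, i, r))
    return history


# Illustrative values only (not fitted to any real data).
curve = simulate_sir(population=1_000_000, initial_infected=1,
                     beta=0.30, gamma=0.10, days=300)
peak_day, _, peak_infected, _ = max(curve, key=lambda row: row[2])
print(f"Peak of I(t) on day {peak_day} with about {peak_infected:,.0f} infected")
```

A smaller `dt` gives a more accurate curve at the cost of more steps; the sum S + I + R stays equal to N at every step, which is a quick sanity check on any implementation.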
Solving the Ordinary Differential Equations using Euler method with the initial data of the epidemiological process, one obtains the curves that clearly represent the prediction of the epidemic's behavior, as shown in Figure 3.
Figure 3 – Epidemiological curves – SIR model.
In Figure 3, it is possible to observe the Infected (I) curve (red curve) that grows, reaches a maximum and then decreases, while the S (Susceptible) curve (blue curve) starts with the population size and decreases with the evolution of the epidemiological process. The R (Recovered) curve (green curve) gradually increases with a lag that represents the time it takes the individual to recover.
The Recovered compartment (class) can still be divided into Recovered and the number of deaths (purple curve), which is a curve that increases at first and then stabilizes, but fortunately at a much lower level than the Recovered one, representing the mortality rate of the epidemiological agent.
The epidemic contagion is based on the law of mass action, that is, the infection spreads more quickly the greater the concentration of susceptible individuals exposed to the infectious agent, which is hosted in an infected person. This infection rate can be described by the basic reproduction number of the pathogen (in this case the virus), called R0, defined as the average number of individuals infected by a single infected member during its infectious period, as soon as an epidemic starts. So, we have:
$$R_0 = \beta \cdot c \cdot d \qquad (3)$$
where:
β = average probability of successful infection of an infected person;
c = the average number of susceptible individuals exposed to an infected individual;
d = average period of the contagious phase.
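Another way to obtain R0 is:
$$R_0 = \frac{\beta}{\gamma} \qquad (4)$$
where β is now the transmission rate of the SIR equations and γ (gamma) is the recovery rate.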
Based on analyses in equation 3, it can be concluded that:
If R0 > 1 the number of infected will increase, generating an epidemic;
If R0 < 1 the epidemic is not self-sustaining and tends to disappear;
If R0 = 1 the disease persists endemically, but in an unstable manner: it can cause epidemics, persist or become extinct in the population.
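The three cases above can be sketched in a few lines of Python. The sketch assumes that equation 3 has the usual product form R0 = β·c·d, with β, c and d as defined in the list above; the function names and the input values are hypothetical and chosen only for illustration.

```python
def basic_reproduction_number(beta, contacts, duration):
    """R0 = beta * c * d (assumed product form of equation 3)."""
    return beta * contacts * duration


def classify(r0, tolerance=1e-9):
    """Map R0 onto the three regimes described in the text."""
    if r0 > 1 + tolerance:
        return "epidemic grows"
    if r0 < 1 - tolerance:
        return "epidemic dies out"
    return "endemic threshold"


# Hypothetical inputs: 5% chance of infection per contact,
# 10 contacts per day, 6 days of contagious phase.
r0 = basic_reproduction_number(beta=0.05, contacts=10, duration=6)
print(round(r0, 2), classify(r0))
```

Halving any one of the three factors (for example, the number of contacts through social distancing) halves R0, which is why control measures target them individually.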
Another important result in epidemiology is the Threshold theorem proposed by Kermack and McKendrick (1927), which states that there is a critical number of susceptible individuals in a population above which an epidemic can occur. That is, if a number of infected individuals are introduced into a population, we will only have an epidemic if the number of susceptible individuals is greater than the critical value. Otherwise, we will not have an epidemic. This explains why it is not necessary to vaccinate 100% of the population during epidemic outbreaks. Depending on the situation, it is possible to control a particular epidemic by vaccinating, for example, 70% of the population.
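A short sketch of where the critical value comes from, assuming the standard SIR form of the equation for the infected (β the transmission rate, γ the recovery rate, R0 = β/γ). The symbols S_c (critical number of susceptibles) and p_c (fraction to vaccinate) are introduced here only for illustration:

```latex
\begin{aligned}
\frac{dI}{dt} &= \frac{\beta S I}{N} - \gamma I
               = \gamma I \left( R_0 \frac{S}{N} - 1 \right),
               \qquad R_0 = \frac{\beta}{\gamma} \\
\frac{dI}{dt} > 0 &\iff S > S_c = \frac{N}{R_0} \\
p_c &= 1 - \frac{S_c}{N} = 1 - \frac{1}{R_0}
\end{aligned}
```

With an illustrative R0 of about 3.3 (not an estimate for COVID-19), p_c = 1 − 1/3.3 ≈ 0.70, which is the kind of situation in which vaccinating about 70% of the population is enough.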
A relationship between the Recovery Rate (gamma) and the Transmission Rate (beta) as a function of R0 can be seen in Figure 4.
Figure 4 – Relationship between β and γ for different values of R0.
The higher the R0, the lower the slope of the curve and the more time is required for the recovery of infected people. As transmission and recovery rates increase, the duration of the epidemic decreases.
This SIR model works very well in the beginning, then it fails to explain the pandemic evolution. There are other models with the same philosophy, such as SEIR (Susceptible, Exposed, Infected and Removed) and a series of variations with more compartments being inserted in the model.
After some time trying to learn about the SIR model, I realized that it was no longer adjusting well to the evolution curve of the number of actual cases and obviously with all the time available I started looking for other models that could better represent this evolution.
I then found the Gompertz curve, created by Benjamin Gompertz, a Jewish mathematician and actuary who showed that the death rate grows geometrically. Gompertz defined a law describing this geometric growth of the mortality rate. His study, aimed at actuarial calculations for life insurance, was an advance on the studies of Thomas Malthus.
Remembering again that every model is wrong but some of them are useful, the function that describes the Gompertz model is given by equation 5:
$$y = k \, e^{-e^{\,a - b x}} \qquad (5)$$
where:
y = accumulated number of cases;
k = constant that represents the upper asymptote;
e = Euler's number (2.7182818...);
a = constant related to the lower part of the curve, that is, the point where the curve starts to rise;
b = constant related to the growth rate;
x = number of days in the evolution of the cases.
The Gompertz curve is a sigmoid similar to a logistic curve, only it is not symmetrical and looks like the following (Figure 5):
Figure 5 – Gompertz sigmoidal curve.
From this model it is possible to obtain a good deal of information about the phenomenon under study, in this case COVID-19, including estimates for the following questions: when will the pandemic peak? When will it end?
The ratio of the constants a/b is the value on the horizontal axis that corresponds to the peak, and the ratio k/e is the corresponding value on the vertical axis.
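The a/b and k/e relations can be checked numerically. The sketch below assumes the double-exponential form y = k·exp(−exp(a − b·x)), which is the form consistent with those two relations; the parameter values are hypothetical and serve only to illustrate the shape.

```python
import math


def gompertz(x, k, a, b):
    """Cumulative cases y = k * exp(-exp(a - b * x)) (assumed form)."""
    return k * math.exp(-math.exp(a - b * x))


# Hypothetical parameters, chosen only to illustrate the curve.
k, a, b = 1000.0, 3.0, 0.05
x_peak = a / b                      # day of fastest growth (inflection point)
y_peak = gompertz(x_peak, k, a, b)  # cumulative cases on that day

print(f"x_peak = {x_peak:.1f}, y_peak = {y_peak:.1f}, k/e = {k / math.e:.1f}")
```

At x = a/b the inner exponent is zero, so y = k·exp(−1) = k/e. The peak of new cases therefore arrives when only about 37% of the final total has accumulated, which is what makes the curve asymmetrical.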
The derivatives (I had to study a little to remember!) are remarkably interesting. The first derivative is given by equation 6 and is represented by the curve in Figure 6.
$$\frac{dy}{dx} = k \, b \, e^{\,a - b x} \, e^{-e^{\,a - b x}} \qquad (6)$$
Figure 6 – First derivative of Gompertz.
The peak is clearly visible, together with the corresponding value on the horizontal axis, which in our case is the number of days since the beginning.
The second derivative is also remarkably interesting: the point where its curve crosses zero on the vertical axis also corresponds to the peak, and it allows the value to be read with more certainty. The equation of the second derivative and its corresponding curve are presented in equation 7 and Figure 7 respectively.
$$\frac{d^2y}{dx^2} = k \, b^2 \, e^{\,a - b x} \, e^{-e^{\,a - b x}} \left( e^{\,a - b x} - 1 \right) \qquad (7)$$
Figure 7 – Second derivative of Gompertz.
Observe the zero line in Figure 7: where the curve crosses this line, the value on the horizontal axis corresponds to the time of the pandemic peak.
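The zero crossing can also be located numerically instead of being read off the plot. This sketch assumes the Gompertz form y = k·exp(−exp(a − b·x)) and derives both derivatives from it; the bisection helper `zero_crossing` and the parameter values are illustrative assumptions.

```python
import math


def gompertz_d1(x, k, a, b):
    """First derivative: new cases per day."""
    u = math.exp(a - b * x)
    return k * b * u * math.exp(-u)


def gompertz_d2(x, k, a, b):
    """Second derivative: acceleration of the case count."""
    u = math.exp(a - b * x)
    return k * b * b * u * math.exp(-u) * (u - 1.0)


def zero_crossing(f, lo, hi, tol=1e-6):
    """Bisection: find x in [lo, hi] where f changes sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid   # sign unchanged: root is to the right
        else:
            hi = mid                # sign changed: root is to the left
    return 0.5 * (lo + hi)


k, a, b = 1000.0, 3.0, 0.05   # hypothetical values
peak = zero_crossing(lambda x: gompertz_d2(x, k, a, b), 0.0, 200.0)
print(f"second derivative crosses zero near x = {peak:.2f}; a/b = {a / b:.2f}")
```

The second derivative is positive before the peak (cases accelerating) and negative after it (cases decelerating), so a sign change is a sharper criterion than looking for the top of the flatter first-derivative curve.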
As we are more used to straight lines, it is interesting to linearize the Gompertz curve, which then takes the form of equation 8, as can be seen in Figure 8. The line is descending because the slope is negative.
$$\ln\left(-\ln\frac{y}{k}\right) = a - b x \qquad (8)$$
Figure 8 – Gompertz function linearized.
It is worth mentioning that, as with the second derivative, the peak (maximum) corresponds to the point where the line crosses zero.
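One practical use of the linearized form is that a and b can be estimated with an ordinary straight-line fit. The sketch below assumes the transform ln(−ln(y/k)) = a − b·x and that the upper asymptote k is already known; the data are synthetic and noise-free, generated from hypothetical parameters rather than real case counts.

```python
import math


def linearize(y, k):
    """Gompertz transform: ln(-ln(y / k)), valid for 0 < y < k."""
    return math.log(-math.log(y / k))


def fit_line(xs, zs):
    """Ordinary least squares for z = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx
    return mean_z - slope * mean_x, slope


# Synthetic, noise-free data generated from known (hypothetical) parameters.
k, a_true, b_true = 1000.0, 3.0, 0.05
days = list(range(0, 120, 5))
cases = [k * math.exp(-math.exp(a_true - b_true * x)) for x in days]

intercept, slope = fit_line(days, [linearize(y, k) for y in cases])
a_est, b_est = intercept, -slope   # the slope is -b, hence the sign flip
print(f"a = {a_est:.3f}, b = {b_est:.3f}, peak at x = {a_est / b_est:.1f}")
```

The fitted line crosses zero at x = a/b, which is the same peak day given by the second derivative.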
Figure 9 presents data on the number of cases accumulated in Brazil up to 06/25/2020, together with the adjustment of the Gompertz curve. Notice the adherence of the model (red curve) to the data.
Figure 9 – Gompertz curve adjusted to the data until 06/25/2020.
The adjusted Gompertz model, with an R² of 0.9990, produced the following values for the equation parameters:
k = upper asymptote, 7,705,969, corresponding to the maximum accumulated number of infected;
a = point where the curve started to rise (2.5);
b = growth rate, equivalent to 1.6% (0.016 per day).
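For readers who want to reproduce this kind of adjustment without a spreadsheet, here is a rough Python sketch. It is not the procedure used for the figures in this article: it swaps in a simple grid scan over candidate values of k, fits a and b on the linearized data for each candidate, and keeps the combination with the best R². The data are synthetic (a hypothetical Gompertz curve with a small deterministic wobble), not real case counts, and all function names are illustrative.

```python
import math


def gompertz(x, k, a, b):
    return k * math.exp(-math.exp(a - b * x))


def fit_line(xs, zs):
    """Ordinary least squares for z = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_z = sum(xs) / n, sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx
    return mean_z - slope * mean_x, slope


def r_squared(observed, predicted):
    """Coefficient of determination on the original (not linearized) scale."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fit_gompertz(xs, ys, k_candidates):
    """Scan k (each candidate must exceed max(ys)); return (R2, k, a, b)."""
    best = None
    for k in k_candidates:
        zs = [math.log(-math.log(y / k)) for y in ys]
        intercept, slope = fit_line(xs, zs)
        a, b = intercept, -slope
        score = r_squared(ys, [gompertz(x, k, a, b) for x in xs])
        if best is None or score > best[0]:
            best = (score, k, a, b)
    return best


# Synthetic data with a 0.5% deterministic wobble (not real case counts).
days = list(range(10, 121, 5))
cases = [gompertz(x, 1000.0, 3.0, 0.05) * (1 + 0.005 * math.sin(x)) for x in days]

k_grid = [max(cases) * (1.01 + 0.01 * j) for j in range(200)]
r2, k_fit, a_fit, b_fit = fit_gompertz(days, cases, k_grid)
print(f"k = {k_fit:.0f}, a = {a_fit:.2f}, b = {b_fit:.3f}, R2 = {r2:.4f}")
```

A dedicated nonlinear least-squares routine would normally be preferable; the grid scan is used here only because it needs nothing beyond the standard library. Note that k is the hardest parameter to pin down while the data are still far from the plateau, so estimates made early in an epidemic should be read with caution.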
Based on these parameters, we plotted the first and second derivative curves, which can be seen in Figures 10 and 11 respectively.
Figure 10 – Gompertz’s first derivative of BR data.
Figure 11 – Gompertz’s second derivative of BR data.
As can be seen in Figure 11, the curve of the second derivative crosses the zero line at approximately 157 on the horizontal axis, which places the peak of the pandemic in Brazil in early August.
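As a rough cross-check of that reading, the peak can also be computed directly from the fitted parameters. The arithmetic below assumes the Gompertz form y = k·exp(−exp(a − b·x)) and reads the 1.6% growth rate as b = 0.016 per day; since a and b are quoted in rounded form, small differences from the value read off the figure are to be expected.

```latex
\begin{aligned}
x_{\text{peak}} &= \frac{a}{b} = \frac{2.5}{0.016} \approx 156 \ \text{days} \\
y_{\text{peak}} &= \frac{k}{e} = \frac{7{,}705{,}969}{2.71828} \approx 2.83 \times 10^{6} \ \text{accumulated cases} \\
y'_{\max} &= \frac{k\,b}{e} \approx 0.016 \times 2.83 \times 10^{6} \approx 4.5 \times 10^{4} \ \text{new cases per day}
\end{aligned}
```

These are consequences of the fitted model only, not observed values.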
These are just two of the possible models in the field of epidemiology. There are others, but within our possibilities we present these two, SIR and Gompertz, the latter especially useful in the current stage of the pandemic.
There are several groups linked to several universities that are studying, evaluating, and proposing new models to understand this pandemic in Brazil and abroad. One of the models that I find very interesting is that of the staff at the Veterinary Faculty of UNESP in Araçatuba, coordinated by Professor Yuri Tani Utsunomiya, who works with the Gompertz model through the Moving Regression technique associated with a Hidden Markov Chain. Data can be loaded on the website https://theguarani.com.br/covid-19/, and three curves are generated based on accumulated number of cases data, one for natural growth, one for the first derivative and the last for the second derivative, which corresponds to the acceleration of the growth in the number of cases. It is worth visiting the website and reading the available paper that describes the technique they are using.
To follow the evolution of the pandemic in Brazil and also in some cities where I have clients, I initially built an Excel spreadsheet for the SIR model, which was based on a spreadsheet developed by Professor Nicolas Spogis from the Faculty of Food Engineering at UNICAMP. Sometime later I created a spreadsheet for the Gompertz curve with its derivatives and its linearization which is currently a better simulation as it fits very well into the evolution curve of the number of accumulated cases with a model explanation coefficient (R2) of about 99%.
I get the data for each city of interest from the website https://brasil.io/dataset/covid19/caso_full/, despite all the controversy over underreporting that we have seen every day, which makes the analysis more difficult.
These spreadsheets are available to anyone who is interested. Just send me an email and I will be happy to share them, because after all the motto of our company is “Sharing knowledge to improve your results”.
References:
Greenberg, R. S.; Daniels, S. R.; Flanders, W. D.; Eley, J. W.; Boring III, J. R. (2005), Epidemiologia Clínica, 3ª ed. Porto Alegre: Artmed.
Bassanezi, R. C. (2002), Ensino-aprendizagem com modelagem matemática: uma nova estratégia. São Paulo: Contexto.
Stewart, J. (2015), Cálculo, v. 1. São Paulo: Cengage Learning.
Box, G. E. P. (1976), "Science and statistics", Journal of the American Statistical Association, 71 (356): 791–799, doi:10.1080/01621459.1976.10480949.
Marros, A. M. D. (2007), Modelos matemáticos de equações diferenciais aplicados à epidemiologia, www.pgsskroton.com.br, accessed on 07/04/2020.
Kermack, W. O.; McKendrick, A. G. (1927), "A Contribution to the Mathematical Theory of Epidemics", Proceedings of the Royal Society A, 115, 700–721.
Ramon, R. (2011), Modelagem Matemática Aplicada a Epidemiologia. Monograph submitted to UFSC as part of the requirements for the degree of Specialist in Mathematics. Advisor: Dr. Daniel Norberto Kozakevich, Universidade Federal de Santa Catarina, Departamento de Matemática.
Gonçalves, B., Epidemic Modeling 101: Or why your CoVID-19 exponential fits are wrong, https://medium.com/, accessed on 20/04/2020.
https://lsbastos.github.io/covid-19/#covid-19-por-municipios, accessed on 23/04/2020.
Winsor, C. P. (1932), "A comparison of certain symmetrical growth curves", Journal of the Washington Academy of Sciences, v. 22, n. 19, p. 73–84.
https://theguarani.com.br/covid-19/, accessed on 05/06/20.
https://www.youtube.com/watch?v=ZUrtyAL6NCM&t=7s, video by Professor Nicolas Spogis, accessed on 05/03/2020.
https://brasil.io/dataset/covid19/caso_full/, site accessed since 15/05/2020.
About the Author:
Edson R. Montoro is Technical Director of ERMontoro Consulting and Training Ltda, a company focused on people development and process improvement consulting using Applied Statistics and Lean Manufacturing.
The author is a Chemist from UNESP (São Paulo State University Júlio de Mesquita Filho) - Araraquara, and holds an MBA in Business Management from FGV (Getulio Vargas Foundation), a Master Black Belt from Air Academy Associates, a Quality Engineer certification from ASQ (American Society for Quality) and a postgraduate degree in Production Management from UFSC (Federal University of Santa Catarina).