Forecasting Ebola Deaths Using Time Series Regressions

Johnny Voltz, an old college friend of mine who was once voted most likely to have a superhero named after him, sent in a great question about the recent, tragic Ebola outbreak in West Africa:

I know very little about math, and even less about medicine. From the data I have, it would suggest that the Ebola virus is growing at a logarithmic pace. Would it be fair to predict continued growth at this rate considering lack of medical care in Africa? Would 100,000 be an exaggerated number for March 2015? What are your thoughts?

He then linked to a chart which showed the logarithm of Ebola deaths and an Excel fit.


This is a classic time series problem, and I’d like to use it to illustrate the process, merits and accuracy of fitting time trends through regressions. As a final step, we’ll produce an estimate of cumulative Ebola deaths in March 2015. But first, let’s talk about regressions in general.

Understanding Regression Analysis

Regression analysis is a widely used technique which helps practitioners discover relationships between variables. In this context, we’re trying to determine how the number of Ebola deaths has been (and will be) affected by the passage of time. More generally, we’re attempting to explain a dependent variable in terms of one or more independent variables.

The basic process is as follows:

  1. Define a mathematical relationship which could relate the independent variable(s) (t) to the dependent variable (deaths_t). This relationship will include at least one parameter (also known as a coefficient) whose value is unknown.
  2. Define a “loss function”, which measures how well any particular set of parameter values fits the relationship to the data. The most commonly used loss function sums the squared differences between actual and fitted values of the dependent variable.
  3. Solve the optimization problem: which parameter values minimize the loss function? Informally, you can think of a computer tuning the parameters to find the best “fit” to the data. (Minimizing squared differences is why it’s called “least squares” regression.)
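To make the three steps concrete, here is a minimal Python sketch. The data points are made up purely for illustration (they are not the Ebola series), and the function names predict, squared_loss and fit_line are my own. For a straight line the minimization in step 3 has a well-known closed-form answer, so no iterative tuning is needed.

```python
# Made-up illustrative data (NOT the Ebola series from this post).
t = [0, 1, 2, 3, 4]
y = [1.0, 3.1, 4.9, 7.2, 9.0]


def predict(b0, b1, ti):
    """Step 1: the assumed relationship, y = b0 + b1 * t."""
    return b0 + b1 * ti


def squared_loss(b0, b1, t, y):
    """Step 2: sum of squared differences between actual and fitted values."""
    return sum((yi - predict(b0, b1, ti)) ** 2 for ti, yi in zip(t, y))


def fit_line(t, y):
    """Step 3: the (b0, b1) that minimize squared_loss, in closed form."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * t_mean
    return b0, b1


b0, b1 = fit_line(t, y)
best_loss = squared_loss(b0, b1, t, y)
print("intercept:", b0, "slope:", b1, "loss:", best_loss)

# Nudging either parameter away from the fitted value can only raise the loss.
print("loss with a nudged slope:", squared_loss(b0, b1 + 0.1, t, y))
```

Excel's trendline and regression tools solve the same minimization problem behind the scenes.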

As an oversimplified example, one could specify the following relationship between deaths and t:

deaths_t = \beta_0 + \beta_1 t

This model is easy to interpret. At time zero, there are \beta_0 deaths. With each additional unit of t, there are \beta_1 additional deaths. However, this isn’t a good fit for the data or the problem. We shouldn’t expect the number of deaths to increase linearly. The more people who currently have Ebola, the more people who can spread Ebola. So we should expect Ebola cases (and the resulting deaths) to increase exponentially.

Examining the plotted linear fit confirms our intuition.

Deaths Fig 1
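You can see the same problem without the real data. The sketch below, in Python, fits a straight line to a synthetic, perfectly exponential series (the starting level of 100 and growth rate of 0.3 are arbitrary assumptions, not estimates from the outbreak) and prints the residuals. The helper fit_line is my own name for the standard closed-form least squares fit.

```python
import math


def fit_line(t, y):
    """Closed-form least squares fit of y = b0 + b1 * t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    b1 = sxy / sxx
    return y_mean - b1 * t_mean, b1


# Synthetic exponential series; the level and growth rate are arbitrary.
t = list(range(10))
y = [100 * math.exp(0.3 * ti) for ti in t]

b0, b1 = fit_line(t, y)

# Residual = actual minus fitted. A good model leaves no pattern here.
residuals = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]
for ti, r in zip(t, residuals):
    print(ti, round(r, 1))
```

The line undershoots at both ends and overshoots in the middle. That systematic bow in the residuals is the signature of forcing a straight line onto a curve.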

Linear Regressions

As Johnny suggested, a better fit is to take a logarithm of the dependent variable. (See previous posts for an explanation of logarithms and the natural logarithm.)

ln(deaths_t) = \beta_0 + \beta_1 t

Since the dependent variable is a logarithm and the independent variable is linear, this is sometimes called a log-lin regression. Solving for deaths, this means that:

deaths_t = e^{\beta_0 + \beta_1 t}

Finance types will notice that this is very similar to the formula for continuous growth, except the parameter for the initial level is in the exponent rather than in front. Calculus types will confirm that the derivative of deaths_t with respect to t (how quickly deaths are increasing at any instant) is exactly equal to \beta_1 deaths_t. So, as we desire, the rate at which Ebola spreads depends on the current level of Ebola.
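For anyone who wants to check both claims, here is the short derivation, using only the model above. Splitting the exponent puts the initial level in front:

deaths_t = e^{\beta_0 + \beta_1 t} = e^{\beta_0} e^{\beta_1 t}

so e^{\beta_0} plays the role of the starting value in the continuous growth formula. Differentiating with respect to t gives:

\frac{d}{dt} deaths_t = \beta_1 e^{\beta_0 + \beta_1 t} = \beta_1 deaths_t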

The linear model assumes that every day a constant number of people will die. The log-lin model assumes that every day the cumulative number of deaths will increase by a constant percentage.

This assumption produces a far better fit.

Deaths Fig 2
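Here is a minimal Python sketch of a log-lin fit: take the natural log of the series, fit a straight line to the logs, then translate the coefficients back into plain terms. The series is synthetic and noise-free (a starting level of 50 and a growth rate of 0.04 per period are arbitrary assumptions), so the fit should recover those values almost exactly. The name fit_line is my own.

```python
import math


def fit_line(t, y):
    """Closed-form least squares fit of y = b0 + b1 * t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    b1 = sxy / sxx
    return y_mean - b1 * t_mean, b1


# Synthetic series, NOT the outbreak data: 50 * e^(0.04 t).
t = list(range(10))
deaths = [50 * math.exp(0.04 * ti) for ti in t]

# Log-lin regression: regress ln(deaths) on t.
log_deaths = [math.log(d) for d in deaths]
b0, b1 = fit_line(t, log_deaths)

initial_level = math.exp(b0)         # fitted value at t = 0
growth_per_period = math.exp(b1) - 1  # constant percentage increase per period
doubling_time = math.log(2) / b1      # periods needed to double

print("initial level:", initial_level)
print("growth per period:", growth_per_period)
print("doubling time:", doubling_time)
```

The doubling time is often the easiest number to communicate: it says how long the series takes to double if the fitted trend holds.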

In the future, I’ll give a list of many functional forms and how to interpret their conclusions.

Forecast Intervals

Is it reasonable to assume deaths will continue to grow exponentially at the historically observed rate? In the short term, probably. In the longer term, probably not. For as long as the disease can spread among its current population, current trends may continue. However, to put it clinically, exponential growth can only continue until the end of the petri dish. Increased education and cleanliness may decrease the growth rate. Alternatively, if the infection spreads to many densely populated areas, then the rate could get even worse. A more sophisticated model would consider different populations’ susceptibilities to the disease and the probability of the disease spreading to them.

If we assume the trend will continue at its present rate, where does this put cumulative deaths by March 2015?

At the 95% confidence level (confidence intervals are something I will also explain in a future post), I’m projecting cumulative deaths in mid-March of 2015 to be between 60,000 and 130,000, with a point estimate of roughly 88,000. This is shown in the following graphs with both a normal and a logarithmic scale.

Deaths Fig 3

Deaths Fig 4
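For the curious, here is a rough Python sketch of one textbook way to build such an interval: compute the forecast interval on the log scale, then exponentiate both ends. This is not necessarily the calculation in my workbook. The function name forecast_interval is my own, the data are synthetic, and the critical value 2.228 is the two-sided 95% Student t value for 10 degrees of freedom (12 observations minus 2 parameters). The formula also assumes independent, normally distributed errors in the logs, which a cumulative series probably violates, so treat the resulting width as optimistic.

```python
import math


def forecast_interval(t, y, t_new, t_crit):
    """Return (low, point, high) for y at t_new from a log-lin fit."""
    n = len(t)
    log_y = [math.log(v) for v in y]
    t_mean = sum(t) / n
    l_mean = sum(log_y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (li - l_mean) for ti, li in zip(t, log_y))
    b1 = sxy / sxx
    b0 = l_mean - b1 * t_mean

    # Residual standard error on the log scale (n - 2 degrees of freedom).
    sse = sum((li - (b0 + b1 * ti)) ** 2 for ti, li in zip(t, log_y))
    s = math.sqrt(sse / (n - 2))

    # Standard error of a single new observation at t_new.
    se = s * math.sqrt(1 + 1 / n + (t_new - t_mean) ** 2 / sxx)

    center = b0 + b1 * t_new
    return (math.exp(center - t_crit * se),
            math.exp(center),
            math.exp(center + t_crit * se))


# Synthetic data, NOT the outbreak series: a log-linear trend with a
# small alternating wobble so the residuals are not all zero.
t = list(range(12))
y = [math.exp(5 + 0.03 * ti + 0.05 * (-1) ** ti) for ti in t]

low, point, high = forecast_interval(t, y, t_new=30, t_crit=2.228)
print("low:", low, "point:", point, "high:", high)
```

Because the interval is symmetric in logs, it is symmetric in ratio terms once exponentiated: the point estimate is the geometric mean of the two ends. The projection above has the same shape, since 88,000 is close to the geometric mean of 60,000 and 130,000. The interval also widens the further the forecast date sits from the middle of the observed data.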

That’s a lot of deaths, and I hope the number of cases stops growing.

This is a significantly lower forecast than what the CDC predicts. However, it looks like the CDC estimates that fewer than half of Ebola cases are reported. To correct for this, all of their figures are multiplied by 2.5.

Accounting for this partially reconciles the official estimate with my outside view, but only partially.

All facts and figures were created in an Excel workbook, which you can download here.