The UK’s Covid-19 Model

hackf5.io
May 13, 2020

On April 22 a version of the Imperial College model used to predict the impact of Covid-19 was released on GitHub. This model played a key role in the UK government’s decision to implement a nationwide quarantine.

Since the code became public it has, understandably, come under heavy scrutiny and heavy criticism.

So is the model as bad as some people are saying and what can we learn for the future?

TL;DR

If the model had been used for its intended purpose, this being academic research, then the criticisms would have been unwarranted. However, it has been used to justify what is arguably the most significant policy decision in many generations. Given this, one is forced to ask why it was that the best tool we had at our disposal was a dusty old code file that has been sitting around on a professor’s hard-disk for over 13 years?

One can certainly admire the make-do British wartime spirit of it, but it’s hard to argue that we didn’t get caught with our pants firmly to the floor.

I hope this is a wake-up call, and that we start investing heavily in pandemic modelling the way we invest heavily in weather modelling, because it seems just as important for our future and that of our children.

Overview

The feedback the model has received has largely been negative; for example in the GitHub repository itself an issue was raised stating that the model was not fit for purpose due to the low quality of the code.

In the past week a number of code reviews and social media posts have been published that expand on this sentiment:

At least one of these comes from a site with a specific agenda, so it would be wise to treat it with a level of skepticism.

In my experience of working as a professional programmer for over a decade it’s rare that anyone says much good about a piece of software. When was the last time you emailed DuckDuckGo to tell them how happy you were with the search results you got back for funny cat meme? It does happen, but it’s one of those once-in-a-blue-moon things that most programmers greet with genuine shock (and delight). No, what programmers get is a seemingly endless stream of complaints and nitpicks. So it was to be expected that the majority of comments would be negative. However, the level of vitriol is significantly beyond the norm so I decided to take a look for myself.

A little bit of background about me. I got my PhD in pure mathematics from University College London in the late 00’s and started working as a programmer shortly thereafter. Until recently I was the lead developer for a commercial stochastic modelling platform used by the non-life (natural disasters and motor insurance) reinsurance industry. Pandemic modelling falls under the life (health and life insurance) umbrella and although the models themselves are quite different, there is a large amount of overlap in what the models do and how they’re built.

For those who’ve not worked in the software industry there is often a belief that most programmers are computer science graduates; however, on the mid-sized team of which I was a member the majority of the programmers had physics, mathematics or engineering degrees. Most had higher degrees and there were a few postdocs. This is fairly normal in financial software teams.

Criticisms of The Model

There are three main points that are made against the model.

  1. The code that has been published is not the code that was actually run.
  2. The code is almost unreadable.
  3. The code is not tested.

And there is one main point that is made in its defense.

  1. Academic code has different aims from commercial code, so it’s made differently.

I’m going to address each of these in turn.

The code that has been published is not the code that was actually run

What happened is described in the following tweets.

I’m conscious that lots of people would like to see and run the pandemic simulation code we are using to model control measures against COVID-19. To explain the background — I wrote the code (thousands of lines of undocumented C) 13+ years ago to model flu pandemics. I am happy to say that @Microsoft and @GitHub are working with @Imperial_JIDEA and @MRC_Outbreak to document, refactor and extend the code to allow others to use without the multiple days training it would currently require (and which we don’t have time to give).

Neil Ferguson

Before the GitHub team started working on the code it was a single 15k line C file that had been worked on for a decade, and some of the functions looked like they were machine translated from Fortran. There are some tropes about academic code that have grains of truth, but it turned out that it fared a lot better going through the gauntlet of code analysis tools I hit it with than a lot of more modern code. There is something to be said for straightforward C code. Bugs were found and fixed, but generally in paths that weren’t enabled or hit. Similarly, the performance scaling using OpenMP was already pretty good, and this was not the place for one of my dramatic system refactorings. Mostly, I was just a code janitor for a few weeks, but I was happy to be able to help a little.

John Carmack

To understand why a team of software engineers got involved it helps to know what constitutes a model. There are 4 distinct parts:

  1. The simulation logic that encodes the assumptions about the environment that is being modeled. An example would be the logic that describes how people transition from uninfected to infected.
  2. The input data that the simulation logic uses to make its predictions. An example would be the probability with which contact with an infected person results in infection.
  3. The data loader that loads the input data into the simulation logic. This data can be fragmented and often needs to be pulled from multiple files and/or databases.
  4. The results that are generated by running the input data through the simulation logic. The results data are usually in tabular form (like that found in an Excel worksheet). These data are usually post-processed into graphs for human consumption.
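To make the four parts concrete, here is a toy sketch in Python; every name and number in it is invented for illustration and has nothing to do with the Imperial code.

```python
import random

def load_inputs():
    # 3. The data loader: real models pull fragmented data from many
    # files and databases; here it is just a hard-coded dictionary
    # holding 2. the input data.
    return {"population": 1000, "p_contact_infects": 0.3, "days": 30}

def simulate(inputs, rng):
    # 1. The simulation logic: a deliberately naive model in which each
    # infected person contacts one random person per day and infects
    # them with the given probability if they are still susceptible.
    population = inputs["population"]
    infected, susceptible = 1, population - 1
    history = []
    for _ in range(inputs["days"]):
        new = sum(
            1 for _ in range(infected)
            if rng.random() < inputs["p_contact_infects"] * susceptible / population
        )
        new = min(new, susceptible)
        infected += new
        susceptible -= new
        history.append(infected)
    # 4. The results: a table of infected counts, one row per day,
    # ready to be post-processed into a graph.
    return history

history = simulate(load_inputs(), random.Random(42))
```

Even in this toy, the simulation logic is the part worth keeping; the loader and the results handling are the parts most likely to be hard-wired to one researcher's environment.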

When developing a model that you don’t expect anyone else to ever see it’s normal to focus on the core functionality, which in this case is the simulation logic and the input data.

The data are, in my experience, usually spread around various locations and not well organised. Think about how you organised your photos before your phone auto-tagged everything for you.

The data loader changes from run to run, because the input data are changing from run to run.

The results format may change from run to run, and the scripts that analyze the results will need to change to reflect the new format and to investigate different aspects of the data.

Ferguson states that the Microsoft and GitHub teams were brought in to “refactor and extend the code to allow others to use without the multiple days training it would currently require”. Refactor is a hugely overused term in software engineering: it means anything from a bit of tidying up (think painting the window frames) through to completely redoing everything from scratch (think knocking down the garage, digging up the foundations and building a condo).

My assumption, based on the twitter comments above and looking over the code, would be that Ferguson considered his simulation logic and input data to be good, but that the data loader, and possibly the results processing, could not be used by anyone other than him or one of his students. For example the data loader may have comprised many hard-coded references to Imperial College file shares and databases, and the results may only have been possible to interpret using an R script that ran in a particular version of R. I’m guessing here, but this is the sort of thing I’ve found when working with actuaries to resolve their modelling issues.

John Carmack is a highly experienced professional programmer and his team would have gone to considerable lengths to preserve the simulation logic so that the results of the refactored model are as close to the original model as possible.

This would likely have been done by transplanting the simulation logic verbatim and then chopping it up into smaller blocks to improve readability. Parity with the original model would have been verified by running the original model and the new model on the same set of input data and asserting the results (or some subset of the results) were the same. This is standard stuff. It is also worth noting that there exist a lot of tools that help considerably with this kind of work that guarantee that the application logic is preserved when making significant changes.
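A parity check of that kind can be sketched in a few lines. The models below are hypothetical stand-ins, but the shape of the test — running old and new versions on identical inputs and asserting the outputs agree — is the standard approach described above.

```python
import random

def original_model(inputs, seed):
    # Stand-in for the pre-refactoring model: any seeded computation.
    rng = random.Random(seed)
    return [rng.random() * inputs["scale"] for _ in range(inputs["n"])]

def refactored_model(inputs, seed):
    # Stand-in for the cleaned-up version; behaviour-preserving here
    # by construction, which is exactly what the parity test verifies.
    return original_model(inputs, seed)

def check_parity(old, new, inputs, seed, tolerance=1e-9):
    # Run both versions on the same inputs and seed, then assert the
    # outputs agree to within floating-point tolerance.
    expected, actual = old(inputs, seed), new(inputs, seed)
    assert len(expected) == len(actual), "output lengths differ"
    for e, a in zip(expected, actual):
        assert abs(e - a) <= tolerance, f"divergence: {e} vs {a}"

check_parity(original_model, refactored_model, {"scale": 2.0, "n": 100}, seed=1)
```

If the refactoring accidentally changed behaviour, the assertion would fire on the first diverging output, pointing straight at the regression.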

What the software team would have focused on would have been making the data loader and the results generation more robust. None of this is particularly difficult, although it can be time consuming.

In short, the Microsoft and GitHub teams were brought in to turn a prototype into something closer to a piece of production software. Ferguson’s original code would likely have been as good as useless in isolation, so the fact that it has been made easier to run and examine was almost certainly necessary. I can understand why they are refusing to release the original, as the negative comments would only be magnified.

In summary, taken in context the refactoring seems relatively benign. What it has done is allow other researchers to provide a level of perspective on the model that would have been impossible otherwise.

The code is almost unreadable

What does readable mean?

In simple terms, the more readable code is, the less of it you need to read to understand what’s going on. I would liken a readable code-base to a book with a well-written summary that describes its contents: having read the summary I’d hope to have a good idea of what the book is about and what the key ideas are. Conversely, some code is more akin to Finnegans Wake.

Is the code unreadable?

It’s not pretty, as John Carmack alludes to when he states that “there are some tropes about academic code that have grains of truth”. If a junior programmer handed this in they would probably be asked to go back and try a little harder. But is it completely unreadable? No, definitely not. The code itself is naive, with little genuine complexity. There is a lot of it and doubtless a large amount of repetition, but apart from the multi-threading it’s all completely vanilla. I know a very good programmer who loves to code in this style. My impression is that a decent C++ programmer would be able to get a fair idea of what the code is doing if they spent a while studying it. If they also understood the science it would probably be quicker still, since the concepts would be familiar.

Not quite under the unreadable umbrella, but close by, is the complaint that the model can return different results for the same inputs (in technical terms, it is said to be non-deterministic). In truth determinism is a nice-to-have, but being non-deterministic does not invalidate the outputs.

Actually it is quite hard to get a multi-threaded stochastic model to run deterministically, although most commercial software manages it because actuaries want it: determinism makes it easier to spot numerical bugs during model development, because the outputs only change if the inputs change. However, since the model output is a prediction based on random numbers, so long as the random numbers remain correctly distributed one set of outputs is as good as another.
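A common way to get that determinism, sketched here with a deliberately simple stand-in model, is to give each unit of work its own seed-derived random stream, so the results do not depend on thread scheduling. (Production systems would use a proper counter-based or jump-ahead generator rather than the naive seed arithmetic below.)

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate_unit(unit_id, base_seed):
    # Each unit of work derives its own seed, so its random stream is
    # independent of which thread runs it and in what order.
    rng = random.Random(base_seed * 1_000_003 + unit_id)
    return sum(rng.random() for _ in range(1000))

def run_model(n_units, base_seed, n_threads):
    # Map the work units over a thread pool; pool.map preserves the
    # ordering of results regardless of completion order.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda u: simulate_unit(u, base_seed), range(n_units)))

# Identical results whether the run uses 1 thread or 8.
assert run_model(64, base_seed=7, n_threads=1) == run_model(64, base_seed=7, n_threads=8)
```

The point is that determinism comes from pinning randomness to the work unit rather than to the thread, which is why a naive shared generator breaks it.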

In the non-life modelling world the models are run a great many times in the hope that the outputs converge in certain places. Life models are a bit different, but running them many times and looking at the distribution of the outputs over multiple runs would have the same effect. If you ran the model many times and the results were all over the place that would suggest the model was junk. If however there was a strong consensus between runs then you could assume that if the inputs and assumptions were valid then the outputs would also be valid.
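As a toy illustration of that convergence check, one can run a small stochastic model many times with different seeds and look at how tightly the outputs cluster. The model here is a made-up stand-in, not anything from the Imperial code.

```python
import random
import statistics

def model_run(seed, n_trials=10_000, p=0.3):
    # A stand-in stochastic model: estimate a probability by simulation.
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(n_trials)) / n_trials

estimates = [model_run(seed) for seed in range(100)]

# A strong consensus between runs: every estimate sits close to the
# mean. Wildly scattered estimates would suggest the model was junk.
assert statistics.stdev(estimates) < 0.01
assert abs(statistics.mean(estimates) - 0.3) < 0.01
```
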

There are a bunch of other gripes on the quality front. I saw one person stating that there were too many inputs. It is worth noting that big non-life models can easily have many millions of inputs, so whoever thought this had clearly never worked with any big models. This one is actually quite small.

In summary, readability isn’t great, but it’s good enough to allow another research team to work out what the model is doing, which I believe was the point of publishing it.

The code is not tested

How is software tested?

It’s tested by taking chunks of code from the code-base, passing in inputs and verifying the outputs against pre-computed expected outcomes. Ideally each line in the code-base should have at least one test that covers it. In practice this rarely happens, but good teams generally aim to cover most of their code with tests.
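As a minimal illustration of that style, here is a hypothetical chunk of epidemiological logic with a test that pins its behaviour to pre-computed expected outcomes. The function is invented for this example, not taken from the model.

```python
def attack_rate(new_infections, susceptible):
    # Fraction of the susceptible population infected over a period.
    if susceptible <= 0:
        raise ValueError("susceptible population must be positive")
    return new_infections / susceptible

def test_attack_rate():
    # Known inputs, pre-computed expected outputs.
    assert attack_rate(30, 100) == 0.3
    assert attack_rate(0, 500) == 0.0
    # The failure mode is pinned down too.
    try:
        attack_rate(10, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for zero susceptible")

test_attack_rate()
```
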

Why do programmers test their software?

From an engineering point-of-view there are two reasons:

  1. To ensure that the code does what it’s supposed to today.
  2. To ensure that the code does what it’s supposed to for as long as it remains in the code-base.

A lot of programmers think they are testing for reason 1, but in fact reason 2 is the real reason that good programmers thoroughly test their code. Reason 2 is the important one because it means that every time a programmer makes a change the tests can be run, and any new bugs that have been introduced can be identified immediately through test failures and then fixed.

It is critical that tests and code are written in tandem; once untested code has found its way into a code-base it is very unlikely to ever get tested properly. It is probably worth noting that it often takes at least as long to test code as it does to write it, and that it’s a highly skilled job in its own right. Many companies hire programmers who specialize only in software testing.

Is it a problem that the software is not tested?

In my opinion, yes, it’s a problem. Having worked on a large stochastic modelling platform for a number of years I know how easy it is to get the numbers wrong. I’m not saying that because there is no testing the results are wrong. But what I am saying is that the confidence you can have in the results must be low because there is no testing. The lack of testing makes it very hard to make changes or to fix bugs because it’s impossible to know whether a change will cause a problem elsewhere.

Although there are no tests in the code itself, I’m sure that a number of high level statistical tests will have been performed on the results to ensure that they look sane. This is what actuaries do to validate their models. However, just because the results look sane, that doesn’t mean that the model is doing what it’s supposed to do.
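In that spirit, high-level sanity checks on a model’s output might look like the following sketch (entirely made up, not anything from the Imperial validation process). Note how a model with a subtle logic bug could still pass all of them.

```python
def sanity_check(epidemic_curve, population):
    # Coarse plausibility checks on the output numbers only.
    assert all(x >= 0 for x in epidemic_curve), "negative case counts"
    assert max(epidemic_curve) <= population, "more cases than people"
    peak = epidemic_curve.index(max(epidemic_curve))
    assert epidemic_curve[peak:] == sorted(epidemic_curve[peak:], reverse=True), \
        "curve should decline after its peak"

# A plausible-looking curve passes, yet passing says nothing about
# whether the underlying transition logic is actually correct.
sanity_check([1, 5, 20, 80, 120, 90, 40, 10, 2], population=1000)
```
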

In summary, the lack of rigorous testing means it’s not possible to have much confidence in the validity of the results.

Academic code has different aims from commercial code, so it’s made differently

I’m going to quote from the Lumps ’n’ Bumps blog as this is reminiscent of a hymn-sheet that is often sung from:

Many scientists write code that is crappy stylistically, but which is nevertheless scientifically correct (following rigorous checking/validation of outputs etc). Professional commercial software developers are well-qualified to review code style, but most don’t have a clue about checking scientific validity or what counts as good scientific practice. Criticisms of the Imperial Covid-Sim model from some of the latter are overstated at best.

As I mentioned at the start, many commercial software developers have more than a clue about checking scientific validity and what counts as good scientific practice, as they have spent many years in academia themselves. I don’t know what percentage of math, physics, chemistry and engineering PhDs end up working in financial modelling, but I can guarantee it’s higher than the percentage that remain in academia.

How can it be argued that the model is scientifically correct if no tests of any form, beyond a regression test against the original model, have been published?

In the math PhD office in UCL there was a shabby old photocopied page stuck to the wall that contained a number of satirical methods of proof, among these was

proof by authority: “Well, Grothendieck says it’s true, so it must be.”

Make up your own mind; here’s some further reading.

Another quote from the blog.

Cook explains how scientists use their code as an “exoskeleton” — a constantly-evolving tool to help themselves (and perhaps a small group around them) answer particular questions — rather than as an engineered “product” intended to solve a pre-specified problem for a separate set of users who will likely never see or modify the code themselves.

Sure. It’s like being a research chemist who experiments with a large number of chemical compounds in an attempt to develop a medicine that cures some specific disease (this probably doesn’t happen, I know nothing about medicine). However, before you can sell the medicine you need to go through a vast array of tests to prove that it really does work and doesn’t have serious side effects. For me the argument that it works because I know it works just doesn’t cut it and wouldn’t cut it in most fields. Apparently though academic software is somehow special… Excuse me Prime Minister, I have a small rocket that I’ve knocked up in my back garden, it flies very well over Farmer Jones’ field, do you think we could try sending some volunteers into space with it?

I know that academics will argue that they have the peer review process to ensure quality, but from what I can tell peer review is pretty broken. Journals generally have no interest in publishing scientific reproductions, so they are rarely done; when they are attempted, they seldom succeed. In many fields there simply aren’t the resources to do any real due diligence on the articles being published. Nepotism is a huge issue, with big names getting preferential treatment. See here and here.

I’m happy to accept most of the other comments in the blog. If scientists want to hack out poorly written code for their small circle, go for it. Most commercial software is badly written. I’m of the opinion that well written code tends to pay dividends over the medium term whatever its purpose, but at the end of the day the only cost is time. Documentation! Good luck if you can find any documentation worth a damn in most commercial software.

In summary it’s reasonable to accept that research code has different aims from commercial software and therefore has standards that optimize for a different set of goals, but when a research project graduates from the laboratory it seems reasonable to expect it to reach production standards.

Conclusion

Giving it the benefit of the doubt, the Imperial College model is a useful research tool that has probably provided researchers in the field with a number of useful insights and ways of exploring their ideas.

But why is it that, while the UK government spends over £170 million per year funding the Met Office to model the weather (and I’m not suggesting we should stop), we were relying on a dusty old code file that has been sitting around on a professor’s hard-disk for over 13 years?

There was a pandemic in 2009 and a number of other significant disease events in the past decade. Insurance companies have been modelling and selling pandemic insurance for years. So to those in the know an event like this could hardly have been much of a surprise. What gives?

My personal opinion is that bashing the Imperial College model is misguided and serves little purpose. It represents one opinion, and one that is probably based on a lot of thought and experience. But the model is a very narrow one. It optimizes for a single outcome: mitigating the direct impact of a single virus. It does not in any way attempt to predict the medium- or long-term impact of the short-term solutions it suggests.

For me the major takeaway is that we should be investing heavily in pandemic modelling similarly to the way we invest in weather modelling as it seems just as important for our future and that of our children.
