Statistics – Tessa Wilkie
PhD student in Statistics and Operational Research at STOR-i CDT

Dealing with Imputation Uncertainty
Fri, 01 May 2020

This post tackles a popular method that helps you understand the amount of variability you have introduced to your analysis through replacing missing data with estimated values. This variability is known as Imputation Uncertainty.

If you haven’t read my first two posts on Missing Data, it might be worth taking a look before you read this. You can find the first post here, and the second, here.

I had some misgivings about imputation before I learnt about methods to quantify imputation uncertainty.

My misgivings centred around the fact that with imputation we are sort of making the data up (in a statistically rigorous fashion, of course!). But even so, how happy could we be with our analysis after imputing?

It turns out we can use a method that gives us insight into how much variability is down to the fact that we have imputed missing data.

This can help us to understand how confident we can be in our statistical analysis, given that it is based in part on missing data.

One popular method that gives us a measure of imputation uncertainty is Multiple Imputation.

How do we do Multiple Imputation?

  • Firstly, we create an imputed data set using any method that involves taking draws from a predictive distribution.
  • We repeat this, to create M imputed data sets.
  • We can analyse these data sets, to come up with estimates of parameters we are interested in.
  • We can then combine these estimators. There are also formulas that we can apply to calculate within imputation variance, across imputation variance, and overall variance.
  • These can give us an idea of how much of the variability in our estimates is down to the imputation process.
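The steps above can be sketched in a few lines of Python. This is an illustrative toy, not the analysis from my report (which used R): the data, the stochastic-regression imputation model, and the parameter of interest (the mean of y) are all invented. The pooling step, though, follows Rubin's standard combining rules: within-imputation variance W, between-imputation variance B, and total variance T = W + (1 + 1/M)B.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: x fully observed, y missing for some records.
n = 200
x = rng.normal(10, 2, n)
y = 3 + 0.5 * x + rng.normal(0, 1, n)
missing = rng.random(n) < 0.3  # 30% of y missing completely at random
obs = ~missing

# Regression of y on x using the observed cases only.
slope, intercept = np.polyfit(x[obs], y[obs], 1)
resid_sd = np.std(y[obs] - (intercept + slope * x[obs]), ddof=2)

M = 20  # number of imputed data sets
estimates, variances = [], []
for _ in range(M):
    y_imp = y.copy()
    # Draw imputations from the predictive distribution (stochastic regression).
    y_imp[missing] = (intercept + slope * x[missing]
                      + rng.normal(0, resid_sd, missing.sum()))
    estimates.append(y_imp.mean())            # parameter of interest: mean of y
    variances.append(y_imp.var(ddof=1) / n)   # its estimated sampling variance

# Rubin's combining rules.
q_bar = np.mean(estimates)        # pooled estimate
W = np.mean(variances)            # within-imputation variance
B = np.var(estimates, ddof=1)     # between-imputation variance
T = W + (1 + 1 / M) * B           # total variance
print(f"pooled mean {q_bar:.2f}, within {W:.4f}, between {B:.4f}, total {T:.4f}")
```

The size of B relative to T gives a sense of how much of the overall uncertainty is down to the imputation itself.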

Multiple Imputation isn’t the only method that can help us with Imputation Uncertainty. You can read more about them in some of the references below.


Further reading

You can find out more about Imputation Uncertainty in Chapter 5 of the below book. Multiple imputation is discussed in Chapters 5 and 10.

Little, R. J. A. and Rubin, D. B. (2020). Statistical analysis with missing data. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, third edition.


This paper below contains a nice summary of Multiple Imputation and goes on to discuss the issue of variable selection. In other words, it considers what to do if your different imputed data sets imply that different variables are valuable and should be kept in a statistical model, while others should be discarded.

Wood, A. M., White, I. R., and Royston, P. (2008). How should variable selection be performed with multiply imputed data? Statistics in Medicine, 27(17):3227-3246.


And here is a long report I wrote as part of my studies at STOR-i on the broader topic of Missing Data: click to read. It discusses Imputation Uncertainty, and other issues, in more depth.

What to do with missing data?
Fri, 01 May 2020

In this post I’m going to describe some simple ways of dealing with missing data and discuss some of their strengths and flaws.

Which method is appropriate will depend on the Missingness Mechanism that drives the missingness pattern. I describe Missingness Mechanisms in my previous post, which you can find here. If you haven’t read that, you may want to take a look before continuing with this post.

There are four techniques I’m going to describe: Complete Case Analysis, Unconditional Mean Imputation, Regression Imputation and Stochastic Regression Imputation.

I am going to illustrate these methods with an example. Imagine we are interested in racehorse performance and we conduct a survey of racehorse heights and weights. Unfortunately, some of the weight data has gone missing.

I wrote about Missing Data for a second, longer research report at STOR-i (I mentioned the shorter first report in this post). I created the images below from a simulation study that I carried out for this report (where I followed the racehorse example described above). I will post the link to the full report at the end of this blog.

Complete Case Analysis

One of the simplest ways of dealing with missing data is Complete Case Analysis. This means that we delete any responses that have any missing data in them. This method is fine if the data are Missing Completely at Random, or if we do not have much missing data. If neither of those nice conditions holds, however, we can introduce bias into the dataset.

For example, if tall horses do not like being weighed (perhaps the weigh bridge is claustrophobic), then if you delete all cases where weights are missing you will probably be looking at a sample with lower heights and weights than is representative of the population.
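A toy simulation (with made-up heights and weights) shows this bias in action: if taller horses are more likely to have a missing weight, the complete-case mean understates the true mean weight.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated racehorse survey (all numbers invented, for illustration only).
n = 1000
height = rng.normal(163, 5, n)                                # height in cm
weight = 500 + 2.0 * (height - 163) + rng.normal(0, 10, n)    # weight in kg

# Taller horses are more likely to refuse the weigh bridge.
p_missing = 1 / (1 + np.exp(-(height - 163) / 2))   # increases with height
missing = rng.random(n) < p_missing

# Complete case analysis: drop every record with a missing weight.
cc_weight = weight[~missing]
print(f"true mean weight:          {weight.mean():.1f} kg")
print(f"complete-case mean weight: {cc_weight.mean():.1f} kg")
```

Because the retained sample skews towards shorter (and hence lighter) horses, the complete-case mean comes out below the true mean.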

So, if you don’t use Complete Case Analysis there are still some fairly simple options open to you. These next three methods involve forms of something known as imputation.

Imputation? What is that?

The next method is the first I introduce where you impute data — this means you replace missing values with estimated values.

This may seem strange at first — isn’t this just making up values? How could this make your analysis better?

Well, imputation can help you avoid the pitfalls that simply deleting missing data could land you in.

However, you will (I hope) see that how you decide to impute the missing data can drastically affect the rigour of your analysis.

Unconditional Mean Imputation

The first method I am going to describe is the simplest: Unconditional Mean Imputation.  You take the mean of your observed data and impute any missing values with it. So, if we have missing racehorse weights, we fill in missing weight values with the average of those that we do observe.

This, you can probably imagine, can be problematic, too. If we look at the below plot, where I’ve imputed data that is simulated from the above situation — where taller horses are more reluctant to be weighed — we can see that the imputation will underestimate the true mean and the variation in the data.

Unconditional Mean Imputation: missing weight data is imputed using the mean of the observed weights
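A minimal numpy sketch of that variance shrinkage, using made-up weights (with values deleted completely at random, for simplicity): the imputed points all sit exactly at the observed mean, so the spread of the data is understated.

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up weights with some values missing (NaN).
weight = rng.normal(500, 12, 300)
missing = rng.random(300) < 0.3
weight_obs = weight.copy()
weight_obs[missing] = np.nan

# Unconditional mean imputation: replace every NaN with the observed mean.
mean_obs = np.nanmean(weight_obs)
weight_imp = np.where(np.isnan(weight_obs), mean_obs, weight_obs)

# The imputed values sit exactly at the mean, so spread is understated.
print(f"sd of fully observed data: {np.std(weight, ddof=1):.1f}")
print(f"sd after mean imputation:  {np.std(weight_imp, ddof=1):.1f}")
```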

Regression Imputation

Regression Imputation is a little better: in the situation we are considering, it should eliminate some of the bias that Unconditional Mean Imputation brings.

This works by drawing a regression line, based on the complete observations that you have, and using that to predict where a piece of missing data will fall on the regression line.

The method works better in terms of bias, but it still underestimates the variation in the data — as we can see in the figure.

Regression Imputation: the imputed data (represented by pink circles) are on the regression line
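Here is a sketch of the idea with invented racehorse numbers: fit a line to the complete cases, then read each imputed weight straight off the line. The bias shrinks, but so does the apparent spread, because the imputed points carry no scatter about the line.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up racehorse data: weight depends on height, some weights missing.
n = 400
height = rng.normal(163, 5, n)
weight = 500 + 2.0 * (height - 163) + rng.normal(0, 10, n)
missing = rng.random(n) < 0.3
obs = ~missing

# Fit a regression line on the complete cases...
slope, intercept = np.polyfit(height[obs], weight[obs], 1)

# ...and impute each missing weight as its predicted value on that line.
weight_imp = weight.copy()
weight_imp[missing] = intercept + slope * height[missing]

# Bias improves, but the imputed points have no residual scatter,
# so the overall variation is still understated.
print(f"sd of complete data:            {np.std(weight, ddof=1):.1f}")
print(f"sd after regression imputation: {np.std(weight_imp, ddof=1):.1f}")
```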

To deal with that we have our final method, Stochastic Regression Imputation.

Stochastic Regression Imputation

This method works in a similar way to Regression Imputation, but it adds a random element: instead of points being interpolated directly onto the regression line, they are scattered about it at random.

As we can see in the plot, this shows a much more realistic representation of the variation in the data.

Stochastic Regression Imputation: the imputed circles (in pink) are scattered randomly about the regression line
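The same sketch with the random element added (again, all numbers invented): each imputed weight is its predicted value plus a draw from the estimated residual distribution, so the imputed points scatter about the line like real observations.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up racehorse data, as in the regression imputation sketch.
n = 400
height = rng.normal(163, 5, n)
weight = 500 + 2.0 * (height - 163) + rng.normal(0, 10, n)
missing = rng.random(n) < 0.3
obs = ~missing

# Fit the regression line and estimate the residual spread from complete cases.
slope, intercept = np.polyfit(height[obs], weight[obs], 1)
fitted = intercept + slope * height[obs]
resid_sd = np.std(weight[obs] - fitted, ddof=2)

# Impute the predicted value PLUS a random residual draw.
weight_imp = weight.copy()
weight_imp[missing] = (intercept + slope * height[missing]
                       + rng.normal(0, resid_sd, missing.sum()))

print(f"sd of complete data: {np.std(weight, ddof=1):.1f}")
print(f"sd after stochastic regression imputation: {np.std(weight_imp, ddof=1):.1f}")
```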

You may be thinking that introducing values yourself to replace missing ones will cause its own problems.

Read my next post to find out how to account for the uncertainty that you are introducing through imputing missing data.


Want to know more?

You can read more about Complete Case Analysis (and some variations/adaptations of it) in Chapter 3 of the below book. The Single Imputation Methods I’ve described are discussed in Chapter 4.

Little, R. J. A. and Rubin, D. B. (2020). Statistical analysis with missing data. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, third edition.


Another single imputation method that I do not discuss here (but I do in my report) is Hot Deck. You can read about it in this paper:

Andridge, R. R. and Little, R. J. (2010). A review of hot deck imputation for survey non-response. International Statistical Review, 78(1):40-64.


I used the R package mice to do my analysis. You can read about it in this paper:

van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1-67.


Thank you for reading! Click here to go to the next post on Missing Data.

Click here to see my first post on Missing Data: on the different types of missingness.

You can see my full report for STOR-i on Missing Data.

Missing Data: Introducing the Missingness Mechanism
Wed, 29 Apr 2020

Often when we collect data, some is missing. What do we do? Well, there is a load of stuff to cover here (and I’m going to do it over a few posts). This post is going to cover an important question: what is causing the data to be missing?

What causes the data to be missing is known as the Missingness Mechanism. There are three main types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).

You can think of these as traffic lights: MCAR is green (easy to deal with), MAR is amber (a bit problematic but there are some decent methods out there) and MNAR is red (a total pig to deal with).

Missing Completely at Random
Markus snoozes peacefully, having decided that his data is MCAR

As the name suggests, Missing Completely at Random data means that the missingness in your data follows a totally random pattern. There isn’t anything in the data driving it that you need to worry about. This is nice, because you can get away with some simplistic methods to deal with it.

I describe some of those methods in my next post on missing data.

Missing at Random

Missing at Random data is where the missingness is driven by something in the data we are collecting, but that something is a variable we have observed. The preceding sentence starts to give me a headache if I think about it too much, so I prefer to think of it in terms of an example.

Imagine a university does a survey of previous students, to find out where they are working, what their income bracket is, etc.

Let’s say that alumni that work in a particular sector are less likely to disclose their income. But, they do disclose what sector it is that they work in. That data would be Missing at Random.
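A toy simulation of this survey (all sectors and incomes invented) makes the definition concrete: the probability that income is missing depends only on sector, which we do observe, so the mechanism is MAR.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy alumni survey (all numbers invented, for illustration).
n = 500
sector = rng.choice(["finance", "teaching", "other"], size=n)
income = rng.normal(40_000, 8_000, n)

# Income is more likely to be withheld in one sector.
# Missingness depends only on sector, which IS observed: MAR.
p_missing = np.where(sector == "finance", 0.6, 0.1)
income_reported = income.copy()
income_reported[rng.random(n) < p_missing] = np.nan

for s in ["finance", "teaching", "other"]:
    rate = np.isnan(income_reported[sector == s]).mean()
    print(f"{s:8s}: {rate:.0%} of incomes missing")
```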

Missing Not at Random

But, what if students are less likely to respond to that income question the more they earn? Then we have Missing Not at Random data. The missingness depends on something we do not observe.  

This is very difficult to deal with and often causes bias in our analysis. To make it even more difficult, we cannot test whether the missingness mechanism is Missing at Random or Missing Not at Random.

Want to know more?

You can read more about missingness mechanisms in Chapter 1 of the book below — this is a really good book on missing data in general.

Little, R. J. A. and Rubin, D. B. (2020). Statistical analysis with missing data. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, third edition.

The paper that introduced the idea of considering missingness mechanisms is:

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581-592.


Thank you for reading. Click here to see my next post in this series. This will discuss some simple methods to deal with missing data.

Or you can skip to my final post on missing data: this will discuss a method that allows you to quantify the uncertainty that you are introducing into your analysis by using some of the methods discussed in my second post.


I wrote a 20 page report on Missing Data as part of my studies at STOR-i. It discusses the ideas above in more depth.

Why statistics?
Tue, 11 Feb 2020

This is a personal post on what I think statistics is, why I was drawn to study it, and why it’s basically a super awesome cool subject everyone should know more about.

Many people appear to regard the subject with suspicion, feeling that statistics are more often used to bamboozle than to enlighten.

When, in my 20s, I started studying Applied Statistics in the evenings I got a lot of puzzled looks from friends and colleagues. What on earth did I want to do that for? What was the point?

To me this seemed odd.

As a journalist you are always trying to find out what is going on, and what it might mean. In my History degree I looked to try to find out what went on in the past and what it might mean.

These questions are very similar to those that statisticians ask. Only, statisticians have an additional tool to help them — and it’s a big one: maths.

Mathematical frameworks help us understand what conclusions we can draw from data, and how confident we can be in them. They give us tools to deal intelligently with what we do not — or cannot — know.

And this is great. Because of course, in the real world, we never get all the information we need. We are always having to piece together a picture based on what we can see, and we need a steer on what we cannot see and how important that might be.

For example: I might have a disagreement with my dog, Markus, about his biscuits. He might insist that Brand A’s dog biscuits are bigger than Brand B’s, and therefore he should be bought Brand A’s.

As he’s a scientifically minded dog, he would allow me to take a random sample of each to weigh.

What if, based on the sample, Brand A’s biscuits are slightly larger than Brand B’s? Are Brand A’s biscuits bigger, or could this be a fluke?

Well, there are statistical tests to decide whether there is actually enough evidence to accept Markus’s claim.

Markus protesting about biscuits

There are also well defined frameworks to assess the probability that you will wrongly accept the claim by chance (that is, you select a random sample that happens to be of unusually large biscuits from Brand A, when in fact Brand A’s biscuits are not bigger than Brand B’s).

Of course, advanced statistical methods deal with much more difficult and nuanced situations than my dog biscuit example.

Statistics has not advanced to the point where we can guarantee that we are right about everything all the time*. But that’s not a reason to dismiss it.

Statistical methods, if used properly, bring us a vast amount of insight into problems. Why wouldn’t you want to know about that?


*On the subject of things I’ve not always been right about: I was going to put something in this post about Benjamin Disraeli’s famous “Lies, damned lies and statistics” quote, only to discover that we don’t actually know who coined that one.



Extreme value theory: predicting the ultra rare
Tue, 28 Jan 2020

Extreme value theory is a really exciting — and kind of astonishing — area of statistics. This is because it can tell us about the probability of events happening that are so rare there is barely any data recorded on them.

This seems perverse. Very broadly, traditional statistics says that we may not be able to make accurate predictions about what may happen on an individual level (for example, how tall one puppy may grow to be in adulthood). But, if we look at a large population (the development of large numbers of puppies) we can get an idea of the range that we expect the majority to be in.

With extreme value theory, we are not interested in the behaviour of the majority. We want to look at the likelihood of a very, very rare event happening. Such as a Dachshund puppy that grows to be bigger than a Doberman.

A Dachshund puppy


Why would we want to know that? Well, let’s say you own a VW Beetle and you want to buy a Dachshund puppy. Your family is fiercely attached to the car and will only agree to getting a puppy if it means you will not need to sell the car.

You are pretty sure this will not happen. You promise them this could never happen. But then you start to worry: could the puppy grow to be too big to fit in the car? You’ve never heard of — or seen — a Dachshund that’s too big for a Beetle. But does that mean you can be certain?

The trouble with extreme events — from a statistical point of view — is that they do not happen very often, if at all. We might want to know the probability of a once in 1000 years type event. We do not have a large body of data that can give us a steer on what and when these events might occur.

So, are we stuck?

No! Thanks to extreme value theory.

Statisticians can focus on the tails of the data — meaning they can examine the events that have a very low probability of occurring. They usually do this in one of two ways in the univariate setting.*

We can look at maxima over a certain period of time. For example, we could group Dachshunds according to the year they were born, then record the tallest in each year group. Surprisingly (to me), these maxima tend to a known distribution: the Generalised Extreme Value distribution.

This is Very Good News in statistics. It means we have mathematically backed insight into the way the population of maxima behaves.
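A rough numpy sketch of the block-maxima idea, using invented Dachshund heights. Rather than a full GEV fit, it fits the Gumbel special case (the GEV with shape parameter zero) by the method of moments, which is enough to turn year-group maxima into an estimated tail probability.

```python
import numpy as np

rng = np.random.default_rng(11)

# Made-up heights (cm) for 50 year-groups of 200 Dachshunds each.
heights = rng.normal(21, 2, size=(50, 200))
annual_max = heights.max(axis=1)   # one block maximum per year-group

# Method-of-moments fit of a Gumbel distribution (the GEV with shape 0).
# Gumbel: mean = mu + gamma*beta, sd = pi*beta/sqrt(6), gamma ~ 0.5772.
gamma = 0.5772156649
beta = annual_max.std(ddof=1) * np.sqrt(6) / np.pi
mu = annual_max.mean() - gamma * beta

# Estimated probability that a year-group's tallest Dachshund tops 30 cm.
x = 30.0
p_exceed = 1 - np.exp(-np.exp(-(x - mu) / beta))
print(f"mu={mu:.2f}, beta={beta:.2f}, P(annual max > {x} cm)={p_exceed:.4f}")
```

In practice you would fit the full three-parameter GEV by maximum likelihood (for example with scipy or a dedicated extremes package); the moments fit above is just the quickest way to see the idea.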

What if we had two Dachshunds born in 2015 that grew very big? If we were looking for maxima we would only count the largest one, so we would be cutting out a potentially useful bit of data. A method that gets around this issue is to look at exceedances — data points that come above a certain threshold.

An unusually large puppy, with adult to scale


If we decide that any Dachshund taller than, say, 40cm is remarkable then we can look at the distribution of Dachshunds that exceed that level. The amounts by which they exceed it (approximately) follow a Generalised Pareto distribution.

One of the big academic issues here is choosing that threshold level: set it too high and you don’t get much data; set it too low and you are outside the tails of the distribution.
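A sketch of the threshold approach with invented heights. For simplicity it fits the Generalised Pareto distribution to the excesses by the method of moments (a quick alternative to the maximum-likelihood fit you would normally use), then combines the fitted tail with the empirical probability of exceeding the threshold.

```python
import numpy as np

rng = np.random.default_rng(13)

# Made-up Dachshund heights (cm); look only at exceedances of a threshold.
heights = rng.normal(21, 2, 10_000)
u = 25.0                           # threshold: a "remarkable" height
excess = heights[heights > u] - u  # amounts by which the threshold is exceeded

# Method-of-moments fit of a Generalised Pareto distribution to the excesses:
# mean = sigma/(1 - xi), var = sigma^2 / ((1 - xi)^2 (1 - 2 xi)).
m, v = excess.mean(), excess.var(ddof=1)
xi = 0.5 * (1 - m**2 / v)   # shape
sigma = m * (1 - xi)        # scale

# Estimated probability of exceeding a higher level x:
# P(height > x) = P(height > u) * GPD tail beyond (x - u).
x = 27.0
p_u = (heights > u).mean()   # empirical P(height > u)
if xi != 0:
    tail = max(0.0, 1 + xi * (x - u) / sigma) ** (-1 / xi)
else:
    tail = np.exp(-(x - u) / sigma)
print(f"xi={xi:.2f}, sigma={sigma:.2f}, P(height > {x}) ~ {p_u * tail:.5f}")
```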

These theories have important applications — beyond prospective dog owners with families that love their car a little too much.

Flood defences are one area where governments need to know what a really, really bad flood would look like and how to protect people from it. But, because flood defences are expensive, they also don’t want to build ones that are bigger than necessary.

Finance is another area. How likely is an extreme financial or economic shock? What measures should be in place to ensure that institutions, and the financial system itself, can withstand it? Regulators would want to make sure they are not insisting on such strongly risk-averse measures that it is impossible for companies to make a profit.

*By univariate, I mean we are looking at just one variable. For example: height of dog, or observed temperatures or daily rainfall. We are not looking at several variables together (the multivariate setting).

Want to know more? There is a whole journal dedicated to Extreme Value Theory.
