{"id":235,"date":"2020-05-01T09:05:59","date_gmt":"2020-05-01T09:05:59","guid":{"rendered":"http:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/?p=235"},"modified":"2020-05-01T13:56:00","modified_gmt":"2020-05-01T13:56:00","slug":"missing-data-part-ii-what-to-do-with-missing-data","status":"publish","type":"post","link":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/2020\/05\/01\/missing-data-part-ii-what-to-do-with-missing-data\/","title":{"rendered":"What to do with missing data?"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">In this post I\u2019m going to describe some simple ways of dealing with missing data and discuss some of their strengths and flaws.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">How well methods for dealing with missing data work depends on the Missingness Mechanism that informs the missingness pattern. I describe Missingness Mechanisms in my previous post, which you can find <a href=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/2020\/04\/29\/missing-data-part-1-introducing-the-missingness-mechanism\/\">here<\/a>. If you haven\u2019t read that, you may want to take a look before continuing with this post.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There are four techniques I\u2019m going to describe: Complete Case Analysis, Unconditional Mean Imputation, Regression Imputation and Stochastic Regression Imputation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I am going to illustrate these methods with an example. Imagine we are interested in racehorse performance and we conduct a survey of racehorse heights and weights. Unfortunately, some of the weight data has gone missing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>I wrote about Missing Data for a second, longer research report at STOR-i (I mentioned the shorter first report in <a href=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/2020\/03\/23\/censored-demand\/\">this post<\/a>). 
I created the images below from a simulation study that I carried out for this report (where I followed the racehorse example described above). The link to the full report is at the end of this post. <\/em><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Complete Case Analysis<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">One of the simplest ways of dealing with missing data is Complete Case Analysis. This means that we delete any responses that have any missing data in them. This method is okay if we have data that is Missing Completely at Random or if we do not have a lot of missing data. But if neither of these convenient conditions holds, deleting incomplete cases can introduce bias into the analysis.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For example, if tall horses do not like being weighed (perhaps the weighbridge is claustrophobic), then if you delete all cases where weights are missing you will probably be looking at a sample with lower heights and weights than is representative of the population.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If Complete Case Analysis is not suitable, there are still some fairly simple options open to you. The next three methods are all forms of something known as imputation.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Imputation? What is that? <\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The next method is the first I introduce in which you impute data: that is, you replace missing values with estimated values.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This may seem strange at first \u2014 isn\u2019t this just making up values? 
How could this make your analysis better?<\/p>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Well, imputation can help you avoid the pitfalls that simply deleting missing data could land you in.<\/p>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<p class=\"wp-block-paragraph\">However, you will (I hope) see that how you decide to impute the missing data can drastically affect the rigour of your analysis.<\/p>\n<\/div><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">Unconditional Mean Imputation<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The first method I am going to describe is the simplest: Unconditional Mean Imputation. You take the mean of your observed data and use it to fill in any missing values. So, if we have missing racehorse weights, we fill in missing weight values with the average of those that we do observe.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This, you can probably imagine, can be problematic, too. 
In the plot below, I\u2019ve imputed data simulated from the situation above, where taller horses are more reluctant to be weighed. We can see that the imputation underestimates both the true mean and the variation in the data: the observed horses tend to be the shorter, lighter ones, and every imputed weight takes the same value.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img fetchpriority=\"high\" decoding=\"async\" src=\"http:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Umean1.jpeg\" alt=\"\" class=\"wp-image-239\" width=\"432\" height=\"400\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Umean1.jpeg 864w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Umean1-300x277.jpeg 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Umean1-768x710.jpeg 768w\" sizes=\"(max-width: 432px) 100vw, 432px\" \/><figcaption>Unconditional Mean Imputation: missing weight data is imputed using the mean of the observed weights<\/figcaption><\/figure><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">Regression Imputation<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Regression Imputation is a little better: in the situation we are considering, it should help to reduce some of the bias that Unconditional Mean Imputation brings.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This works by fitting a regression line of weight on height to the complete observations that you have, and using it to predict each missing weight from the horse\u2019s observed height, so every imputed value falls on the regression line.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The method works better in terms of bias, but it still underestimates the variation in the data, as we can see in the figure.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img decoding=\"async\" 
src=\"http:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Reg1.jpeg\" alt=\"\" class=\"wp-image-237\" width=\"432\" height=\"400\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Reg1.jpeg 864w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Reg1-300x277.jpeg 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Reg1-768x710.jpeg 768w\" sizes=\"(max-width: 432px) 100vw, 432px\" \/><figcaption>Regression Imputation: the imputed data (represented by pink circles) are on the regression line<\/figcaption><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">To address that underestimated variation, we have our final method, Stochastic Regression Imputation.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Stochastic Regression Imputation<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">This method works in a similar way to Regression Imputation, but it adds a random element: instead of imputed points being placed directly on the regression line, they are scattered randomly about it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As we can see in the plot, this gives a much more realistic representation of the variation in the data.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img decoding=\"async\" src=\"http:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Stoch1.jpeg\" alt=\"\" class=\"wp-image-238\" width=\"432\" height=\"400\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Stoch1.jpeg 864w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Stoch1-300x277.jpeg 300w, 
https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/Stoch1-768x710.jpeg 768w\" sizes=\"(max-width: 432px) 100vw, 432px\" \/><figcaption>Stochastic Regression Imputation: the imputed circles (in pink) are scattered randomly about the regression line<\/figcaption><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">You may be thinking that introducing values yourself to replace missing ones will cause its own problems. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Read <a href=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/2020\/05\/01\/dealing-with-imputation-uncertainty\/\">my next post<\/a> to find out how to account for the uncertainty that you are introducing through imputing missing data. <\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h4 class=\"wp-block-heading\">Want to know more?<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">You can read more about Complete Case Analysis (and some variations\/adaptations of it) in Chapter 3 of the below book. The Single Imputation Methods I&#8217;ve described are discussed in Chapter 4.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><span class=\"has-inline-color has-very-dark-gray-color\">Little, R. J. A. and Rubin, D. B. (2020). <em>Statistical analysis with missing data<\/em>. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, third edition.<\/span><\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<p class=\"wp-block-paragraph\">Another single imputation method that I do not discuss here (but I do in my report) is Hot Deck. You can read about it in this paper: <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Andridge, R. R. and Little, R. J. (2010). A review of hot deck imputation for survey non-response. 
<em>International Statistical Review<\/em>, 78(1):40-64.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<p class=\"wp-block-paragraph\">I used the R Package <strong><a href=\"https:\/\/www.rdocumentation.org\/packages\/mice\/versions\/3.8.0\/topics\/mice\">mice <\/a><\/strong>to do my analysis. You can read about it in this paper: <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. <em>Journal of Statistical Software<\/em>, 45(3):1-67.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<p class=\"wp-block-paragraph\">Thank you for reading! <a href=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/2020\/05\/01\/dealing-with-imputation-uncertainty\/\">Click here<\/a> to go to the next post on Missing Data. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/2020\/04\/29\/missing-data-part-1-introducing-the-missingness-mechanism\/\">Click here<\/a> to see my first post on Missing Data: on the different types of missingness. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can see my full report for STOR-i on Missing Data, <a href=\"http:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-content\/uploads\/sites\/14\/2020\/05\/RT2__Missing_Data_TW_1.5_spacing.pdf\">here<\/a>. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post I\u2019m going to describe some simple ways of dealing with missing data and discuss some of their strengths and flaws.<\/p>\n","protected":false},"author":8,"featured_media":248,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-235","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-statistics"],"_links":{"self":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/posts\/235","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/comments?post=235"}],"version-history":[{"count":26,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/posts\/235\/revisions"}],"predecessor-version":[{"id":297,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/posts\/235\/revisions\/297"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/media\/248"}],"wp:attachment":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/media?parent=235"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/categories?post=235"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lanc
aster.ac.uk\/stor-i-student-sites\/tessa-wilkie\/wp-json\/wp\/v2\/tags?post=235"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}