A Pandemic Paradox: Being Cautious with COVID-19 Statistics

Robyn Goldsmith, PhD Student at the STOR-i Centre for Doctoral Training – Fri, 02 Apr 2021

TRIGGER WARNING! This blog post refers to COVID-19 and fatality rate data – if that’s something you’d rather avoid then, not to worry, you can head back here where I have lots of other blog posts for you to read!

We all know that Statistics is a powerful tool that has great significance in the real world. More recently, with coronavirus daily briefings and COVID-19 Statistics dominating the news, everyone and their mother is a self-proclaimed statistician. Because Statistics inform so many big decisions, when it comes to analysis, we need to tread carefully. After all, things might not always be what they seem at first glance! In this short post, I’m going to introduce Simpson’s Paradox, a phenomenon in Probability and Statistics.

Simpson’s Paradox occurs when the same set of data appears to support different conclusions depending on how the data is grouped. This happens because there is what’s known as a lurking variable hidden in the aggregated data. Take a look at the graphs of simulated data below. The graph on the left-hand side considers two groups separately, and we observe a positive correlation within both groups. In the graph on the right-hand side, all of the data is pooled together and we observe the opposite: a negative correlation.
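
You can recreate the flip for yourself with a few lines of Python. The numbers below are my own toy data, not the simulated data behind the graphs: each group has a perfect positive correlation on its own, yet pooling the two groups produces a negative one.

```python
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Group A sits at low x and high y; group B at high x and low y.
# Within each group, y rises with x.
xa, ya = [1, 2, 3, 4, 5], [10, 11, 12, 13, 14]
xb, yb = [11, 12, 13, 14, 15], [1, 2, 3, 4, 5]

print(pearson(xa, ya))            # positive within group A
print(pearson(xb, yb))            # positive within group B
print(pearson(xa + xb, ya + yb))  # pooled data: negative!
```

The group each point belongs to is the lurking variable: throw it away and the trend reverses.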

In real-life examples, this can have detrimental effects. I’ll show you what I mean! Take the next graph. Here, we’re comparing fatality rates for COVID-19 in Italy and China based on different age groups. Notice that for every age range, the fatality rate is higher in China than it is in Italy, suggesting that people in Italy who have COVID-19 are more likely to survive than people in China with COVID-19.

Great. Our analysis is done, no? We can now get carried away reporting these findings and firing all sorts of criticisms at the Chinese health service as we take this analysis as gospel. That’s until we notice the last two bars on this graph. The aggregation of all this data suggests the opposite, that people in China have it better with a lower fatality rate than in Italy. But how on earth can this make sense? Let’s look at another graph:

What is key about this graph is that it shows a significantly larger proportion of confirmed COVID-19 cases among older people in Italy than in China. This is important because we know that age plays a crucial role in the survivability of COVID-19, with younger people more likely to recover.

So, how does this graph help explain our conflicting analyses? Well, there is a statistical link between each country and the proportion of confirmed cases in each age group. This is our lurking variable: the fatality rate percentages in the first graph hide the number of cases within each age group. Because Italy has a much larger share of its confirmed cases among older people, who are far more at risk of dying than younger people, its overall fatality rate is higher. So, even though Italy has a lower fatality rate than China within each age group, the age distribution of its cases dominates the overall figure. Put simply, the larger proportion of confirmed cases among older people in Italy, together with the fact that elderly people are generally at higher risk of death from COVID-19, explains the discrepancy between the aggregated and categorised data. This is a perfect illustration of Simpson’s Paradox.
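
The arithmetic behind the flip is easy to check in Python. The case and death counts below are completely made up for the sake of illustration – they are not the real Italian or Chinese figures – but they reproduce the pattern: every age band favours Italy, yet the aggregate favours China.

```python
# Hypothetical (cases, deaths) per age band - invented numbers, not real data.
italy = {"young": (1_000, 1), "old": (9_000, 900)}
china = {"young": (9_000, 18), "old": (1_000, 110)}

def rate(cases, deaths):
    """Fatality rate as a percentage."""
    return 100 * deaths / cases

def total(country):
    """Aggregate (cases, deaths) over all age bands."""
    return (sum(c for c, _ in country.values()),
            sum(d for _, d in country.values()))

for band in ("young", "old"):
    # In every age band, Italy's fatality rate is LOWER than China's...
    assert rate(*italy[band]) < rate(*china[band])

# ...yet aggregated over all ages, Italy's rate is HIGHER, because far
# more of Italy's cases fall in the high-risk "old" band.
print(rate(*total(italy)))  # 9.01
print(rate(*total(china)))  # 1.28
```

The weights (how many cases each age band contributes) are what flip the comparison, exactly as in the real graphs.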

Me and my Mum. She has worked for the NHS for over twenty-five years. This week she is receiving her second dose of the COVID-19 vaccine!

This is really important and a clear indication of how we can get totally the wrong idea from statistics when we’re not paying enough attention. Regarding the pandemic, we’re commonly given country-wide statistics; if we were to group by region or county, we might draw vastly different conclusions. Nationally, we could observe a decline in COVID-19 cases despite a rising number of cases in some areas (which could potentially signal the start of a third wave). This is likely to happen when there are large disparities between groups, such as areas with vastly different populations: in national data, a spike in cases in a sparsely populated region is likely to be dwarfed by falling cases in a densely populated area like London.

There are so many other instances where Simpson’s Paradox can play a role in hiding information. We have to be careful in the ways we divide data and be mindful of potential lurking variables. You’ve been warned!

That’s it for now! This week’s tweet of the week goes to !

Missed my last post? Check it out here.

Want to know more?

This post is based on the observations of the Simpson’s Paradox found in this paper:

Clustering!

Wed, 03 Mar 2021

I don’t know about any of you but to me clustering feels like a foreign concept, at least in the physical sense. Just over a year into a global pandemic, the image of a group of people in close proximity feels like an alternative reality you’d only see in a sci-fi movie. The only kind of clustering I can be a part of these days is the statistical kind. Not quite as exciting as a mosh pit but arguably more productive. In the statistical sense, clustering is essentially grouping data together based on similar features. These similarities can be anything, depending on the data you are dealing with, from book genres to people’s heights to the types of crime committed in a particular area and just about everything in-between. This statistical technique, found commonly in Machine Learning, encompasses a number of algorithms. In this post, we’ll take a look at two of these methods.

First, let’s dive into an unsupervised learning algorithm called K-Means clustering. The algorithm for K-Means clustering is relatively simple and goes like this:

  1. Choose a number of clusters, K.
  2. Place K centre points at random locations.
  3. For each data point, evaluate the distance to each centre point and classify the point as belonging to the cluster whose centre is closest.
  4. Re-compute each cluster centre by taking the mean of all the vectors classified to that cluster.
  5. For each data point, re-evaluate the distance to each centre point and re-classify the point to the cluster whose centre is closest.
  6. Repeat steps 4 and 5 until the change in the cluster centres is sufficiently small.
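
The steps above can be sketched in a few lines of Python. This is a bare-bones toy illustration rather than a production implementation (in practice you’d reach for a library like scikit-learn):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Toy K-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)               # step 2: random initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # step 3: assign to nearest centre
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            clusters[dists.index(min(dists))].append(p)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centres[i]
               for i, cl in enumerate(clusters)]  # steps 4-5: recompute centres
        if new == centres:                        # step 6: stop once centres settle
            break
        centres = new
    return centres, clusters

# Two well-separated blobs; the centres should end up near the blob means.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centres, clusters = kmeans(points, k=2)
print(sorted(centres))
```

Note the `seed` argument: with a different seed the random initial centres change, which is exactly the consistency drawback discussed below.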

K-means clustering has the advantage that it works fast, but there are a couple of drawbacks. The random initial placement of the cluster centres means that K-means can fall short in terms of consistency: if we were to repeat the process, we may not see the same results. Also, the algorithm requires the number of clusters to be specified before we can begin, which in some cases can be hard to figure out.

Me, in a cluster of foul-smelling fans on the last day of Reading Festival 2017.

So, what else is out there? Well, there is also agglomerative hierarchical clustering. It may be a mouthful but it doesn’t require a specification of the number of clusters before the algorithm can start – woohoo! Agglomerative hierarchical clustering is what’s called a bottom-up algorithm which means it treats every single data point as a cluster to begin with. Then, as we move through the process, pairs of clusters merge. The hierarchy of clusters can be shown as a tree where the root is the cluster that contains all the data points.

The algorithm follows these steps:

  1. Treat each data point as its own cluster and define a metric for the similarity between two clusters.
  2. Merge the two clusters with the largest similarity.
  3. Repeat step 2 until there is only one cluster containing all data points, or stop when the desired number of clusters is reached.
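
As a rough sketch of the bottom-up procedure (again, an illustrative toy rather than anything you’d use in practice), here it is in Python with the centroid method as the similarity metric:

```python
def agglomerate(points, target_clusters=1):
    """Naive centroid-linkage agglomerative clustering on coordinate tuples."""
    clusters = [[p] for p in points]              # step 1: every point is a cluster
    while len(clusters) > target_clusters:        # step 3: stop at desired count
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = [sum(xs) / len(xs) for xs in zip(*clusters[i])]
                cj = [sum(xs) / len(xs) for xs in zip(*clusters[j])]
                d = sum((a - b) ** 2 for a, b in zip(ci, cj))
                if best is None or d < best:      # most similar = closest centroids
                    best, pair = d, (i, j)
        i, j = pair                               # step 2: merge the closest pair
        clusters[i] += clusters.pop(j)
    return clusters

# Three points near 0 and two near 5; asking for two clusters
# should recover exactly that split.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
print(agglomerate(points, target_clusters=2))
```

The triple-nested work per merge is where the O(n³) cost mentioned below comes from.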

There are a number of different metrics available for calculating the similarity between clusters, and thus deciding which clusters to merge. The centroid method merges the two clusters whose average points (centroids) are closest. Alternatively, Ward’s method merges the pair of clusters whose merger minimizes the increase in variance, measured by the sum of squares.

Agglomerative hierarchical clustering gives the user the freedom to stop at whatever number of clusters looks satisfactory. Additionally, various similarity metrics tend to work well for this algorithm. However, unlike K-means clustering, which has linear complexity, this method has complexity O(n³), so it isn’t as efficient.

Clustering methods are important in the data mining and machine learning communities and form a powerful statistical technique. I hope this blog post brings together (I swear I don’t find the puns, the puns find me) some of the key info about clustering for you. See below for where to go to learn more, and check back soon for more blog posts.

Now it’s time for my tweet of the week! This week’s goes to !

Missed my last post? Check it out here.

Want to know more?

Gan, G., Ma, C. and Wu, J., 2020. Data clustering: theory, algorithms, and applications. Society for Industrial and Applied Mathematics.
