Statistics – Maddie Smith
STOR-i Student at APP

The Social Network – A Super Quick Introduction to Network Modelling
/stor-i-student-sites/maddie-smith/2021/04/13/the-social-network-a-short-introduction-to-network-modelling/ (Tue, 13 Apr 2021)

I’m sure all of you have heard about networks in one way or another; perhaps you cast your mind instantly to the idea of Facebook friends upon reading this post, and not only thanks to my rip-off of a certain film title. But what are networks actually used for, and what inference can be made from them?

“We assume the data to be independent and identically distributed” – if I received £1 for every time I heard that phrase during my first week in STOR-i, I would have very little need for a stipend. This is because it is common in statistics to be working with independent and identically distributed data – and this means that making inference from the data is nice and easy. 

Network data poses more challenges than traditional independent and identically distributed data. One reason for this is that there is a dependent nature to the data. This makes sense; consider a Facebook page called APP Ducks, which posts the best pictures around of our campus ducks. 

You choose to like this page, as you like receiving all the best duck updates. Then, it is more likely that one of your university friends also likes this page, compared to a random person who is not a member of your network. 

Let’s consider another example of a network – a recommendation system. If, like me, you have been binging all the latest Netflix titles over lockdown, then you have probably come into contact with this form of network. 

When you finish watching a series on Netflix, you may have noticed that other TV shows are recommended to you. This is similar to when you shop online; perhaps you are familiar with the ‘similar shoppers also bought…’ suggestions. This idea can be modelled as a network…

Diagram showing users as blue circles on the left-hand side, and movies as orange circles on the right-hand side. Certain users are connected to certain movies by lines.

This figure demonstrates a basic recommendation system network. The coloured circles are called nodes, or vertices. In this case, the blue nodes on the left represent users, and the orange nodes on the right represent movies. Nodes can be given any name in a network; here I could have given names of STOR-i lecturers or recent Netflix releases. 

The lines linking particular users to particular movies are called edges. It is possible for edges to be directed (usually shown by having an arrow pointing along the edge), or weighted. Weighted edges have a number (or weight) associated with them. In our recommendation system case, a weighted edge could perhaps indicate the number of times a user has watched a particular movie. 

Using our network, we can see that User 1 watched movies 1, 2 and 4, while User 2 watched movies 2 and 3, and so on. Now, imagine that a fourth user joins our network. User 4, our new user, is the same age as User 2, and both users live in the UK. A recommendation system might therefore suggest movies 2 and 3 to User 4.
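As a toy illustration of this idea, here is a tiny sketch of the network in code. The viewing data and the similarity measure (number of movies in common, rather than the age/location similarity described above) are my own assumptions for the example.

```python
# Hypothetical viewing data: an edge between a user and a movie means
# "this user watched this movie".
watched = {
    "User 1": {1, 2, 4},
    "User 2": {2, 3},
    "User 3": {1, 4},
}

def recommend(new_user_watched, network):
    """Suggest movies watched by the most similar existing user.

    Similarity here is simply the number of movies in common (overlap).
    """
    best = max(network, key=lambda u: len(network[u] & new_user_watched))
    return network[best] - new_user_watched

# A new user who has watched movie 3 is closest to User 2, so we
# recommend User 2's other movie.
print(recommend({3}, watched))  # -> {2}
```

A real recommender would use a richer similarity measure (and user attributes such as age or country, as in the example above), but the underlying object is the same bipartite network.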

The degree of a node is the number of edges connected to that node. Looking back at our network, we can see that the degree of the User 1 node is 3, since there are three edges connected to this node. Similarly, the degree of the User 2 node is 2, and so on.

Using the degree of the nodes in a network, it is then possible to calculate a degree distribution. The degree distribution for a network denotes the proportion of nodes with a specific degree, and can be used to compare network models to real networks. 

There you have it – a super quick introduction to networks! Can you think of any other situations which could be modelled as networks? What about directed networks? Let me know in the comments!

If you are interested in reading more about networks, and finding out some network models that exist, then make sure you check out the further reading for this blog post!

Further reading

This post about network models for recommender systems is really interesting, and great for beginners!

Another great post about network models and recommender systems can be found here. It is one of a series, and is written from a data science perspective.

The Erdős–Rényi model is a well-known network model, first introduced in 1959 by mathematicians Paul Erdős and Alfréd Rényi. Check out this post, which introduces the model and also provides some code for generating a graph using it.

Another well-known network model is explained here in mathematical detail, alongside a few other network models.

Ch Ch Ch Ch Changepoints
/stor-i-student-sites/maddie-smith/2021/03/02/ch-ch-ch-ch-changepoints/ (Tue, 02 Mar 2021)

No, I didn’t just forget the words to David Bowie’s Changes; in today’s post we’re going to be talking about changepoints! In this brief introduction to changepoint analysis we’ll cover what it actually is, how it is useful, and when we can apply it. At the end of this post, I’ll also be sharing some code resources, which you can use to carry out your own changepoint analysis!

Changepoint analysis is a really well-established area of Statistics. It dates back as early as the 1950s, and since then has been the focus for LOTS of interesting and important research.

Changepoint detection looks at time series data. A time series is a series of data points which are indexed in time order. Usually, a time series is a sequence of discrete measurements, taken at equally spaced points in time. This could be the number of viewers for a particular TV show taken at one minute intervals over the course of an hour, or maybe the heights of ocean tides taken every hour throughout the day.

As the name suggests, the aim of changepoint detection is to identify the points in time at which the probability distribution of a time series changes. We can think of this as follows:

Let’s say we have some time series data given by y1, y2, …, yn, where yi is the measurement taken at time i. Then, if a changepoint exists at time τ, this means that the measurements y1, y2, …, yτ differ from the measurements yτ+1, …, yn in some way.
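A minimal way to make this concrete: for a single change in mean, we can try every candidate split point τ and pick the one that minimises the total within-segment sum of squared deviations. This is a simplified sketch (the data and the cost function are my own choices for illustration), not a full detection method:

```python
def sse(segment):
    """Sum of squared deviations of a segment from its own mean."""
    m = sum(segment) / len(segment)
    return sum((x - m) ** 2 for x in segment)

def single_changepoint(y):
    """Return the split point tau minimising the two-segment cost."""
    costs = {tau: sse(y[:tau]) + sse(y[tau:])
             for tau in range(1, len(y))}
    return min(costs, key=costs.get)

# Hypothetical series whose mean jumps after the 5th observation.
y = [1.0, 1.2, 0.9, 1.1, 1.0, 5.1, 4.9, 5.0, 5.2, 4.8]
print(single_changepoint(y))  # -> 5
```

Real methods add a penalty so that a change is only declared when the cost reduction is big enough, and search for multiple changepoints at once.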

If we are performing a changepoint analysis, there are some key questions that we’d like to consider:

  • Has a change occurred?
  • If yes, where is the change?
  • What is the probability that a change has occurred?
  • How certain are we of the location of the changepoint?
  • What is the statistical nature of this change?

Online v Offline Detection

Changepoint detection can either be online or offline. Imagine that we have access to some data, which describes the temperature taken at APP at 12pm every day over the course of a month. We then want to look for changepoints in this data, to see whether there were any freak increases or dips in the mean temperature, or maybe periods with very high variance. This type of analysis would require offline changepoint detection methods, because we have access to the complete time series data. That is, we are looking at the data after all the data has been collected.

On the other hand, imagine that The Great British Bake Off is on TV right now. The number of viewers tuned in for the programme is being streamed to us live every second, and we want to look for changepoints in the number of viewers now, as the programme is being aired. This type of analysis would require us to use online changepoint detection methods, which run concurrently with the process that they are monitoring.

Let’s recap that. In offline changepoint detection …

  • Live streaming data is not used.
  • The complete time series is required for statistical analysis.
  • All data is received and processed at the same time.
  • We are interested in detecting all changes in the data, and not just the most recent.
  • We usually end up with more accurate results, as the entire time series has been analysed.

Whereas in online changepoint detection …

  • The algorithm runs concurrently with the process that it is monitoring.
  • Each data point is processed as it becomes available.
  • Speed is of the essence! The goal is to detect a changepoint as soon as possible after it occurs, ideally before the arrival of the next data point!
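The points above can be sketched with a simple online detector in the spirit of CUSUM: each observation is processed as it arrives, and a change is flagged as soon as the accumulated drift above the expected mean crosses a threshold. The target mean, slack `k` and threshold `h` below are assumptions chosen for this toy stream, not tuned values:

```python
def cusum_online(stream, mean=1.0, k=0.5, h=3.0):
    """Flag an upward change in mean as soon as it is detected.

    mean: the expected (pre-change) mean of the stream.
    k: slack, ignoring small fluctuations around the mean.
    h: detection threshold on the accumulated drift.
    """
    s = 0.0
    for t, x in enumerate(stream):
        s = max(0.0, s + (x - mean - k))  # accumulate upward drift only
        if s > h:
            return t  # change flagged at time t, while data still arrives
    return None  # no change detected

# Hypothetical live stream whose mean jumps from about 1 to about 4.
stream = [1.0, 0.9, 1.1, 1.0, 4.0, 4.2, 3.9, 4.1]
print(cusum_online(stream))  # -> 5 (one observation after the jump)
```

Note the trade-off controlled by `h`: a lower threshold detects changes sooner but is more easily fooled by noise, which is exactly the robustness concern raised in the broadband example below.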

Examples

Let’s consider a fitness tracker that can tell when you are walking, running, climbing stairs … you get the idea. Maybe your mobile phone does this. One way in which devices can tell what activity you were performing at a particular point during the day is by using offline changepoint detection!

Online changepoint detection is often used in areas like quality control, or for monitoring systems. For example, a broadband provider might receive live data that details the performance of their broadband network at some site. Detection of a changepoint in this scenario might indicate that there is an issue with the network! This brings us to another required feature for a good online changepoint detection method: alongside the need for speed, it is also important that we have a method that is robust to noise, false positives and outliers. This makes sense, as the broadband provider doesn’t want to send out an engineer if there isn’t actually anything wrong with the network!

Now that we have covered what changepoint detection is, and the differences between offline and online detection methods, can you think of any other scenarios where we would want to use offline changepoint detection methods? What about online detection methods?

Further Reading

Sadly there is only so much I can write in one blog, so I have included plenty of further reading resources for you if you enjoyed today’s post!

  • Offline changepoint detection and implementation: a great place to start if you want to know more about the types of changepoint detection methods available, and if you want to have a go at applying some of the methods to data in R.
  • Online changepoint detection and implementation: this resource describes one possible method of online changepoint detection. I feel it gives a great intuitive understanding, and it also explains how to code up the method if this is something that you would like to try!
  • PELT method: a more mathematical explanation of how one of the most popular offline changepoint detection methods works. I’d recommend reading this if you are looking for a deeper understanding of how the method works.
  • Binary Segmentation method: an introduction to another popular offline changepoint detection method, which also gives some code that you can use to implement the algorithm yourself. I find that this gives a better understanding than simply using one of the available R packages.
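As a taste of the Binary Segmentation idea mentioned above, here is a rough sketch: find the best single split, and if it reduces the cost by more than a penalty, accept it and recurse on each half. The penalty value and the data are my own assumptions for the example:

```python
def sse(seg):
    """Sum of squared deviations of a segment from its own mean."""
    m = sum(seg) / len(seg)
    return sum((x - m) ** 2 for x in seg)

def binary_segmentation(y, offset=0, penalty=2.0):
    """Return changepoint locations found by recursive binary splitting."""
    n = len(y)
    if n < 2:
        return []
    no_split = sse(y)
    costs = [sse(y[:t]) + sse(y[t:]) for t in range(1, n)]
    best = min(range(len(costs)), key=costs.__getitem__)
    # Only split if the improvement beats the penalty.
    if costs[best] + penalty >= no_split:
        return []
    tau = best + 1
    return (binary_segmentation(y[:tau], offset, penalty)
            + [offset + tau]
            + binary_segmentation(y[tau:], offset + tau, penalty))

# Hypothetical series with two changes in mean, after points 3 and 6.
y = [0.1, -0.2, 0.0, 3.0, 3.1, 2.9, 0.9, 1.1, 1.0]
print(binary_segmentation(y))  # -> [3, 6]
```

This greedy recursion is fast but can miss configurations of changes that PELT, which searches over all segmentations with a pruned dynamic programme, would find.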

Interior Design and Hypothesis Testing
/stor-i-student-sites/maddie-smith/2021/02/16/interior-design-and-hypothesis-testing/ (Tue, 16 Feb 2021)

Just the other week, my fiancé and I were told we could qualify for a mortgage. As you might expect, I’ve spent the following days excitedly searching for properties online and dreaming up interior design schemes. I’m sure most people would agree that the thought of decorating your first home is a thrilling but mildly terrifying task, as up until now all the bad interior design decisions in your home could be blamed on your parents’ poor taste.

While perusing the internet for living room paint colours, I came across the statement that ‘blue is a calming colour’.

This got me thinking, who comes up with this information? Is this just a clever marketing technique designed to encourage me to paint my entire house blue (because let’s face it, who doesn’t need a bit of calming in the midst of a global pandemic)? Or is there actually some truth to this statement? A way of testing whether this statement is likely to be true would be to use hypothesis testing.

Hypothesis testing is a statistical method that is used to determine how likely or unlikely a hypothesis is for a given sample of data.

In this post, I give a very simple introduction to hypothesis testing for those of you who may not have come across it before. I try to keep things simple, so if you want a bit more information (particularly on test statistics), I’ve left some great further reading resources at the bottom!

Let’s say that we have access to some data that was gathered to determine whether or not people find the colour blue calming.

The data we have corresponds to the following experiment: 100 people were asked to fill in a survey about how they were feeling. 50 of these people carried out the survey in a blue room, and the other 50 carried out the survey in a white room. The possible survey responses were calm and normal.

Let’s assume that people in the blue room have some probability p1 of choosing the calm answer, while the probability of people in the white room choosing this answer is given by some probability p2.

We can now begin our hypothesis test!

In hypothesis testing, the null hypothesis H0 describes the case that the sample observations result purely from chance. In our case, it would mean that we’d expect to see the same proportion of people feel calm in the blue room as in the white room. Looking at our probabilities, we could say the null hypothesis is given by: H0 : p1 = p2.

On the other hand, the alternative hypothesis HA describes the case that the sample observations are influenced by some non-random cause. In our example, this corresponds to the people in the blue room having a different probability of feeling calm than those in the white room: HA : p1 ≠ p2.

The general idea with hypothesis testing is that we look to see if our data provide evidence to reject H0. This is done by calculating something called a test statistic, and then looking at the probability of observing this test statistic in the case that our null hypothesis is true.

In order to see whether or not the value indicated by the null hypothesis is supported by the data, we need to set a significance level α for our hypothesis test. This is the probability that we incorrectly decide to reject the null hypothesis in the case that it is actually true! Of course, we want this to be small, so it’s usually set at 5%.

Some more definitions…

A test statistic T is a function of the data whose value we use to test a null hypothesis. It shows us how closely the data observed in our sample match the distribution that we’d expect to see if the null hypothesis were true.

The p-value of a test is the probability of observing a test statistic at least as extreme as the one observed, if the null hypothesis is true. This means that small p-values offer evidence against H0, because a small p-value says that, if the null hypothesis were true, it would be very unlikely that we would have seen this result. Make sense?

Don’t worry if it doesn’t! If you’re new to hypothesis testing, it can be quite difficult to wrap your head around.

Let’s pause for a moment and think about what we would do in order to test our question of “Is blue a calming colour?”.

  1. Define our null hypothesis – “The colour blue has no effect on how calm a person feels. Or, in other words, the probability of a person choosing calm is the same, whether they are in the blue room or the white room.”
  2. Set our significance level – This is the probability of rejecting our null hypothesis when it is actually true. We obviously want this to be small, so α=0.05 is a good choice.
  3. Construct a test statistic – It’s up to you to choose what you would like to use as a test statistic. Basically, it is a function of the data that we can calculate to give a number. For comparing two proportions, this could be the standardised difference between the two sample proportions (a two-proportion z-statistic).
  4. Calculate the p-value – This is the probability that we would’ve obtained our test statistic value if the null hypothesis is true.
  5. If the p-value is less than our significance level α, reject our null hypothesis – We can now say that “Blue is a calming colour!”
  6. If the p-value is greater than our significance level α, do not reject our null hypothesis – “We still don’t know if blue is a calming colour.”

Note that Step 6 says “do not reject our null hypothesis” and not “accept the null hypothesis”. This is important: failing to reject the null hypothesis just means that we did not provide sufficient evidence to conclude that blue is a calming colour; in other words, it still might be! But we don’t have enough evidence to say this.
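The six steps above can be sketched in code. The survey counts below are made up for illustration (say, 32 of 50 people in the blue room answered calm, versus 20 of 50 in the white room), and I use the standard two-proportion z-statistic as the test statistic:

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided test of H0: p1 = p2 using the pooled z-statistic."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # proportion assuming H0 is true
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se  # step 3: the test statistic
    # Step 4: two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

alpha = 0.05  # step 2: significance level
z, p = two_proportion_z_test(32, 50, 20, 50)

print(round(z, 2), round(p, 3))
# Steps 5-6: compare the p-value to alpha.
print("reject H0" if p < alpha else "do not reject H0")
```

With these hypothetical counts the p-value comes out below 0.05, so we would reject H0; with a smaller difference between the rooms, we would instead fail to reject, as in Step 6.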

So there you have it, a brief introduction to hypothesis testing! I hope you enjoyed this post and found it useful. If you want to know more about hypothesis testing, be sure to check out the further reading on this post!

Further reading …

This great blog post was written by one of my fellow STOR-i students, and it explains hypothesis testing in a bit more detail, for those of you looking to carry out your own hypothesis tests.

This resource gives some great examples of simple hypothesis tests, to help get you started.
