Danielle Notice – STOR-i Student Sites

Solving Sudoku with Metaheuristics: GVNS
Mon, 25 Apr 2022
Reading Time: 5 minutes

The last time I travelled, I saw a little old lady in the airport with her crossword puzzle book. My grandmother travels with her word search book. Me? In my old age, I will travel with a Sudoku book. Before I hit old age, I will take advantage of the opportunity to combine my studies with my favourite game. As it turns out, heuristic algorithms are pretty popular for solving, creating and rating Sudoku puzzles. In this post, we will look at how a metaheuristic, general variable neighbourhood search, has been used to solve Sudoku puzzle instances.

  1. What is Sudoku?
  2. What are metaheuristics?
  3. General Variable Neighbourhood Search
  4. Solving Sudoku with GVNS

1. What is Sudoku?

Sudoku is a Japanese puzzle which consists of an n² × n² grid divided into n² sub-grids, each of size n × n. The word Sudoku is a combination of two Japanese words, Su (number) and Doku (single), and loosely translates to “solitary number”. Here n is the order of the puzzle, with n = 3 being the most popular.

The objective of Sudoku is to fill each cell in such a way that every row, column and sub-grid contains each integer between 1 and n² inclusive exactly once.

Sudoku is an example of a combinatorial optimisation (CO) problem, a class of problems whose solutions lie in a finite or countably infinite set. Both constructing a Sudoku puzzle and completing one from a partially filled grid are NP-complete problems. This means that there is no known deterministic algorithm which can solve all possible Sudoku problem instances in polynomial time. The solution space for an empty 9 × 9 Sudoku grid contains approximately 6.7 × 10²¹ possible combinations. However, the pre-filled cells serve as constraints and reduce the number of feasible combinations.

2. What are metaheuristics?

When it comes to solving optimisation problems, there are two main types of approaches: exact methods and approximate methods. Exact methods are guaranteed to find an optimal solution for every finite problem instance of a CO problem.

Approximate methods such as heuristic algorithms can be used when there is no known exact method for the problem, or when the known exact methods are too computationally expensive to be used practically. In the context of optimisation problems, a heuristic is a well-defined, intelligent procedure – based on intuition, problem context and structure – designed to find an approximate solution to the problem. Unlike exact methods, the solutions found may not be optimal, but they are acceptable in some well-defined sense. The effectiveness of a heuristic depends on the quality of the approximations that it produces.

The performance of heuristics can be improved using metaheuristics, which are high-level,
problem-independent strategies used to develop heuristic optimisation algorithms. They are designed to approximately solve a wide range of problems without needing to fundamentally change.

3. General Variable Neighbourhood Search

Variable neighbourhood search (VNS) algorithms, originally proposed by Mladenović and Hansen, are single-solution-based metaheuristics. They successively explore a set of predefined neighbourhoods which are typically increasingly distant from the current candidate solution.

Illustration of the main idea of a basic VNS algorithm

VNS’ main cycle is composed of three phases: shaking, local search and move.

In the shake phase, a random solution is selected from the kth neighbourhood of the current solution s. This is then used as the initial solution for the local search algorithm being used, which produces a new candidate solution s′′. In the move phase, if s′′ is better than the current solution s, then it replaces s and the cycle is restarted with this new solution; otherwise, the cycle is restarted with the same solution but a different neighbourhood.

Variable Neighbourhood Descent (VND) is a deterministic variant of the VNS algorithm. Its main cycle omits the random shake and uses a best-improvement method, choosing s′′ as the local optimum in neighbourhood Nk.

The General Variable Neighbourhood Search (GVNS) uses the VND as its local search procedure.
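As a rough sketch, the GVNS/VND interplay can be expressed in Python. Everything here is generic scaffolding — the neighbourhood functions and cost function are placeholders of my own, not the Sudoku-specific operators from the paper discussed in the next section:

```python
import random

def vnd(s, f, nbhds):
    """Variable Neighbourhood Descent: best improvement over each structure.

    nbhds[l](s) must return all neighbours of s in the l-th structure.
    """
    l = 0
    while l < len(nbhds):
        best_neighbour = min(nbhds[l](s), key=f)   # best-improvement step
        if f(best_neighbour) < f(s):
            s, l = best_neighbour, 0               # improvement: restart at N_1
        else:
            l += 1                                 # no improvement: next N_l
    return s

def gvns(s, f, shake_nbhds, vnd_nbhds, max_iters=100):
    """General VNS: shake, local search with VND, then move."""
    best = s
    for _ in range(max_iters):
        k = 0
        while k < len(shake_nbhds):
            shaken = shake_nbhds[k](best)          # shake: random kth-neighbour
            candidate = vnd(shaken, f, vnd_nbhds)  # local search
            if f(candidate) < f(best):             # move
                best, k = candidate, 0
            else:
                k += 1
        if f(best) == 0:                           # optional early stop when the
            break                                  # known optimal cost 0 is hit
    return best
```

For a Sudoku solver, f would be the row/column violation count and the neighbourhood functions the Invert, Insert, Swap and exchange moves described in the next section.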

4. Solving Sudoku with GVNS

We will now look at the different elements of the algorithm above and how the authors of one study used this metaheuristic to solve 9 × 9 Sudoku puzzles.

Solution representation: each sub-grid is numbered from 1 to 9, and each cell in a sub-grid is numbered from 1 to 9. So x_ij denotes the j-th cell in sub-grid i (see grid above for an example of a labelled cell).

Solution initialisation: for each cell, a random number is selected from the list of values that could be assigned to the cell without violating any of the constraints with respect to the fixed cells. This is done in such a way that the sub-grid rule is satisfied. To reduce the solution space, the authors fixed the cells that had only one possible value and repeated this until there were no more such cells.

Cost function f(s): evaluates the violation of the row and column constraints and counts how many values are repeated in each row and in each column (illustrated in figure below). The goal is to minimise the cost function. The optimal solution will have f(s)=0.

A candidate solution of the Sudoku puzzle with its fitness value. Repeated digits highlighted in first row and column.
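Written directly from this definition, the cost function is only a few lines — here assuming a candidate solution is stored as a 9 × 9 list of lists (my representation for illustration):

```python
def cost(grid):
    """f(s): total number of repeated values across all rows and columns.

    A repeat is any occurrence of a value beyond its first in a line,
    so a fully valid Sudoku solution has cost 0.
    """
    rows = [list(r) for r in grid]
    cols = [list(c) for c in zip(*grid)]        # transpose to get columns
    return sum(len(line) - len(set(line)) for line in rows + cols)
```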

Neighbourhood structures

Only one neighbourhood structure, Invert, is defined for the shake phase. In this structure two cells in the sub-grid are selected and the order of the sub-sequence of cells between them is reversed.

There are 3 neighbourhood structures defined which are used in the VND local search:

  • Insert – the value of a chosen cell in a sub-grid is inserted in front of another chosen cell.
  • Swap – the values of 2 unfixed cells in the same sub-grid are exchanged.
  • A Centered Point Oriented Exchange – a cell between the second and sixth cell in a sub-grid is selected as the centre point for finding exchange pairs. The values of pairs of cells, each equidistant from the centre, are swapped until at least one cell in a pair is fixed.

Each of these structures applies to a single sub-grid, and in the local search the neighbourhoods of each of the sub-grids are explored. Within the VND local search, a deep local search algorithm is used. This follows the best-improvement strategy, searching the whole of the current neighbourhood to find the best neighbouring solution.
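As an illustration of one of these structures, the Swap neighbourhood of a single sub-grid might be generated like this (the list-plus-fixed-set representation is my assumption, not necessarily the paper's):

```python
from itertools import combinations

def swap_neighbourhood(subgrid, fixed):
    """All candidates reachable by exchanging two unfixed cells of a sub-grid.

    subgrid: the 9 current values; fixed: indices of pre-filled cells.
    """
    free = [i for i in range(9) if i not in fixed]
    neighbours = []
    for i, j in combinations(free, 2):
        s = list(subgrid)          # copy, then exchange one pair of values
        s[i], s[j] = s[j], s[i]
        neighbours.append(s)
    return neighbours
```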

Metric Learning For Simulation Analytics
Mon, 11 Apr 2022
Reading Time: 5 minutes

Usual output analysis of simulations, which is done at an aggregate level, gives limited insight into how a system and its performance change throughout the simulation. To gain greater insight, you can think of a simulation as a generator of dynamic sample paths. Now that we are in the age of “big data”, it is pretty reasonable to keep the full sample-path data and to explore how to use it for deeper analysis. This can be done in a way that supports real-time predictions and reveals the factors that drive dynamic performance.

In this post, we’ll look at the emerging field of simulation analytics.

  1. What is simulation analytics?
  2. Metric learning for simulation
  3. A simple example
  4. Some final thoughts

1. What is Simulation Analytics?

The idea of simulation analytics was first described by Barry Nelson. It is not just “saving all the simulation data” and then applying modern data-analysis tools. It explores the differences between real and simulated data. Nelson outlines that the objectives of simulation analytics are to generate the following:

  1. dynamic conditional statements: relationships of inputs and system state to outputs; and outputs to other (possibly time-lagged) outputs.
  2. inverse conditional statements: relationships of outputs to inputs or the system state
  3. dynamic distributional statements: full characterization of the observed output behaviour
  4. statements on multiple time scales: both high-level aggregation and individual event times
  5. comparative statements: how and why alternative system designs differ

2. Metric Learning for Simulation

The remainder of this post is a discussion of the paper by one of my STOR-i colleagues, Graham Laidler, and his supervisors.

We can use the available sample path data to build a predictive model for dynamic system response. In particular, the authors use k-nearest-neighbour (kNN) classification of the system state, with metric learning to define the measure of distance [1] . In kNN classification, a simple rule is used to classify instances according to the labels of their k nearest neighbours.

From this definition, the paper uses

  • binary labels y_i \in \{0,1\}
  • instance x_i  is the system state at time t_i . More specifically, this refers to some subset of information generated by the simulation up to time t_i .

The classification for an instance x^* is

\hat{y}^* = \begin{cases} 1, & \text{if } \sum_{i=1}^k y^{*(i)} \geq c \\ 0, & \text{otherwise}, \end{cases}

where c \in [0, \infty) is some threshold and y^{*(i)} \text{ for } i = 1\cdots k are the observed classification labels that correspond to the k instances nearest to x^* . In words, if c or more of the k nearest neighbours to x^* are observed to be 1, then y^* is classified as 1 by the model.
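As a minimal sketch of this rule (using plain Euclidean distance for now, and my own function names):

```python
import math

def knn_classify(x_star, X, y, k, c, dist=math.dist):
    """Label x_star as 1 iff at least c of its k nearest neighbours have y = 1."""
    nearest = sorted(range(len(X)), key=lambda i: dist(x_star, X[i]))[:k]
    return 1 if sum(y[i] for i in nearest) >= c else 0
```

The interesting part, as the next paragraphs explain, is replacing `dist` with a learned metric.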

The discussion then turns to quantifying the similarity of instances, since nearest-neighbour classifiers assume that instances that are similar in terms of x are also similar in terms of y . The authors attempt to fully characterise the system by including multiple predictors in their kNN model. Because of the multi-dimensionality of x_i , the variables may not all be comparable with respect to scale or interpretation, so using the Euclidean distance is not appropriate.

So we now look at metric learning, which automates the process of defining a suitable distance metric.

The aim of metric learning is to adapt a distance function over the space of x . The paper uses Mahalanobis metric learning which has a distance function parametrized by M , a symmetric positive semi-definite matrix. The metric learning problem is an optimization which minimizes, with respect to M , the sum of a loss function to penalize violations of the training constraints under the distance metric and a function which regularizes the values of M . The metric learning task is subject to similarity constraints, dissimilarity constraints and relative similarity constraints which are set based on prior knowledge about the instances or using the class labels.
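The distance function itself is the easy part; given a learned positive semi-definite matrix M (hand-picked below purely for illustration), a sketch in plain Python:

```python
def mahalanobis(x, z, M):
    """d_M(x, z) = sqrt((x - z)^T M (x - z)), with M a PSD matrix (lists of lists).

    With M equal to the identity this reduces to the Euclidean distance;
    learning M reweights (and mixes) the coordinates of x.
    """
    d = [a - b for a, b in zip(x, z)]
    Md = [sum(row[j] * d[j] for j in range(len(d))) for row in M]
    return sum(di * mdi for di, mdi in zip(d, Md)) ** 0.5
```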

3. A simple example

To evaluate the model, the authors create a concrete formulation of the metric learning problem. In this formulation, the similarity and dissimilarity constraints are partly based on LMNN [2]. Because of the high-dimensional input, a global clustering of each class may not be appropriate, so a local neighbourhood approach was used when defining these constraint sets. The local neighbourhood of an instance x_i was defined as the q nearest points in Euclidean distance. Points in that local neighbourhood are labelled similar if they have the same y value and dissimilar if they do not. The aim was to minimise the sum of squared distances of instances labelled similar while keeping the average distance between dissimilar instances greater than 1. They set the local neighbourhood size q = 20 and used k = 50 nearest neighbours.

One of the illustrations they applied it to was a simple stochastic activity network. The input space was the 5 activity times and the output was whether the longest path length is greater than 5. The activity times were i.i.d. X_i \sim \text{Exp}(1) , and 10,000 replications of the network were run. Because the data-generating mechanism is exactly known, this example was useful for evaluating the model, since the authors understood what the output M should reveal.
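To make the data-generating mechanism concrete, here is a sketch of one such experiment. The path structure below — a five-arc bridge network with source-to-sink paths {1, 4}, {1, 3, 5} and {2, 5} — is my assumption for illustration (it is at least consistent with X_1, X_3, X_5 mattering most), not necessarily the exact network in the paper:

```python
import random

# Hypothetical five-arc bridge network; each tuple lists the (0-based)
# arc indices on one source-to-sink path.
PATHS = [(0, 3), (0, 2, 4), (1, 4)]

def sample_instance(rng, threshold=5.0):
    """One replication: five i.i.d. Exp(1) activity times and a binary label."""
    x = [rng.expovariate(1.0) for _ in range(5)]
    longest = max(sum(x[a] for a in path) for path in PATHS)
    return x, int(longest > threshold)

rng = random.Random(2022)
replications = [sample_instance(rng) for _ in range(10_000)]
```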

The diagonal elements of M indicate the weight given to the difference in each variable when classifying instances as similar or not. From the results, X_1, X_3, X_5 were the most relevant, as was expected from the intuition of the problem. The off-diagonal terms of M indicate the impact of interactions between variables. Under cross-validation, the metric kNN model was a better classifier than a logistic regression model.

Visualisation of M (left), ROC curves for the classification (right)

The authors then added noise variables to the model. This makes the model more realistic since multi-dimensional characterizations are likely to include variables that have little or no relationship to the output variable. Metric learning was able to filter out the noise variables while still detecting the relationship between the 5 initial variables.

M for the noise-augmented data (left), ROC curves for classification (right)

4. Some Final Thoughts

I believe this solution is valuable for two main reasons:

  1. It proposes a method for more in-depth analysis of simulation results which may be useful for real-time predictions and identifying drivers of system performance. The method is useful for revealing relationships between different components of the system and their effect on performance.
  2. The method allows us to apply kNN to high-dimensional input data without needing to manually trim the state space. This allows analysis to be done without prior knowledge about which variables may or may not be relevant, as they can all be included and the metric learning will reveal their relevance.

Learn More

[1] Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.

[2] Weinberger, K. Q., and L. K. Saul. 2009. “Distance Metric Learning for Large Margin Nearest Neighbor Classification”. Journal of Machine Learning Research 10(9):207–244.

The Tidyverse: the best* -verse for data scientist
Mon, 21 Mar 2022
Reading Time: 4 minutes

There are a couple of popular universes out there, like the MCU and its multiverse, and Zuckerberg’s Metaverse. My personal favourite, however, is actually a universe of R packages.

This post is by no means a tutorial for the tidyverse. Nor is it an introduction to these packages or style of coding using R. Instead, this is just a compilation of my favourite features of the packages that will hopefully convince you of its power and convert you to the tidy side.

  1. What is in the tidyverse?
  2. Tibbles!
  3. Pipes & Purrr
  4. A few (more) of my favourite things

1. What is in the tidyverse?

The tidyverse is a collection of R packages designed by Hadley Wickham for data science. It includes packages useful for loading, wrangling, modelling and visualising data, and a couple that make programming in R so much better. When you install and load the tidyverse, the following core packages will be loaded:

install.packages("tidyverse")
library(tidyverse)
  • readr – to import tabular data
  • tibble – the better* dataframe
  • dplyr – for data manipulation
  • tidyr – to make data tidy
  • ggplot2 – for data visualisation
  • purrr – for functional programming
  • stringr – for string manipulation
  • forcats – for factors

There are also several other packages installed alongside these, for working with specific types of vectors, importing other data types, and for modelling.

2. Tibbles!

I must start this section by addressing the * I’ve included so far. The tidyverse developers themselves describe it as an opinionated collection of R packages on the tidyverse website. So when I say that tibbles are better than data frames, that’s just my opinion as someone who has drunk the Kool-Aid and loves it.

If you’ve ever used the data.frame or data.table, unless you’ve completely mastered using them, you may agree with me that it can be a bit confusing remembering how many commas are needed, whether to use square brackets or parentheses, if something is being done in place or if you need to make a copy. A tibble is “a modern reimagining of the data.frame”. The developers put it nicely when they said that tibbles are lazy and surly data.frames: they do less and complain more.

3. Pipes & Purrr

When you’re trying to manipulate data, or doing analysis that isn’t super simple, it’s very likely that you’ll end up with nested functions. Here’s a simple example: you want to create a table of random numbers using different distributions and then add a column for a new distribution.

There are a couple of ways to approach this: you could create a new variable (or overwrite the variable) at each step. Or you could use pipes. Among other things, piping saves you from rewriting variable names, avoids nested function calls and makes code look much more elegant. The pipe operator %>% is included when you install the tidyverse.

#both bits of code do the same thing

# without pipes: create or overwrite a variable at each step
no_pipe <- tibble(N = rnorm(10), E = rexp(10))
no_pipe <- mutate(no_pipe, G = rgamma(10, 1))

# with pipes: pass each result straight into the next function
with_pipe <- 
  tibble(N = rnorm(10), E = rexp(10)) %>%
  mutate(G = rgamma(10, 1))

If you want to take it to the next level, the package magrittr includes several other piping operators. My personal favourite is the assignment pipe %<>% which allows you to modify data in place.

4. A few more of my favourite things

As I said at the start, this is not meant to be a comprehensive introduction to the tidyverse. Now that I’ve introduced a few of the basics, here are a couple of other features (each of which could really have its own post) that make these packages so great:

Tibbles and what they can store

With tibbles, a column’s type does not have to be a core data type. As well as the basics (integer, numeric, string, factor, logical), cells can contain vectors, lists, tibbles or almost anything really. And you can move between complex and simple data types easily: the packages tidyr and dplyr have useful functions to nest, unnest and pivot data into the desired shape without much hassle.

Purrr, map and all its variants

You can make code a lot easier to read by using map functions to replace for loops. This is really helpful when you have nested tibbles that you want to perform a set of operations over.

Tidyselect and dplyr

The group_by function from dplyr and the many helper functions included in the package tidyselect make summarising and manipulating groups of data super straightforward.

Grammar of Graphics

I’ll admit that when you first start using ggplot2 in R, it may seem really complicated, especially compared to the base graphics package included in R. But once you get the hang of the basics, you can create some spectacular visualisations.

R Markdown

R Markdown is an amazing way to combine code, results and commentary and save them as accessible file types. It is really good as both a lab notebook to keep track of your work and thoughts, and as a means of communicating every step of the analysis process.

Learn More

  • This (as well as lots of practice) is where I learnt most of what I know about the tidyverse.
  • A really helpful for ggplot2

Fighting in the Karate Club: Stochastic block models
Mon, 07 Mar 2022
Reading Time: 5 minutes

Imagine you love karate. You love the principles of the martial arts, you love the physical activity, and well… you love fighting. So you deal with your violent desires in a responsible way – you join your university’s Karate Club. Everything is going great, until there is some conflict over the price of lessons between the club president and the part-time karate instructor. Before you know it, the whole club is divided on the matter and it has become a conflict of ideology rather than just fees. Ultimately, the club leaders fire the instructor, and all of his supporters leave with him and form their own karate club. This is obviously not the type of fighting you had in mind when you joined.

This was studied by Wayne Zachary, an anthropologist. This and many other situations can be represented as networks to describe the social, physical and other structures where interactions between pairs of units are observed. These include social networks, biological structures, and collections of websites, documents and words.

Stochastic block models (SBMs) are a class of random graph models which are widely
studied and popular for statistical analysis of networks. These networks are modelled to discover and understand their underlying structure, which can be used to group similar elements or simulate how the network could grow. The goal is to infer unknown characteristics of the elements in the network from the observed measurements on pairwise properties.

In this blog, we discuss SBMs using the Karate club example.

  1. Stochastic Block Models
  2. Extensions of the model
    1. Degree-corrected SBMs
    2. Mixed-membership SBMs
    3. Assortative SBMs

1. Stochastic Block Models (SBMs)

Consider an SBM for the karate club. There are n members, each representing a node in the graph. The members can be divided into K mutually exclusive groups, (B_1 \cdots B_K) . These groups are unknown. What is known about this club are all the relationships between its members, which can be represented in the adjacency matrix \mathbf{Y} . This is an n \times n matrix for the graph where for a pair of nodes (p,q), \, \mathbf{Y}_{pq} is 1 if there is an edge (connection) between the nodes (members) p and q and 0 otherwise.

The SBM is defined by the stochastic block matrix, \mathbf{C}, a K \times K matrix for the graph where \mathbf{C}_{ij} \in [0,1] is the probability that there is an edge from a node in B_i to a node in B_j. A key feature of SBMs is stochastic equivalence. This means that two persons in the same group have the same probability distribution of being connected to other persons, both those within and outside the group. It then follows that for node p \in B_i and node q \in B_j,

\mathbf{Y}_{pq} \sim \text{Bernoulli}(\mathbf{C}_{ij}).
The Karate club network. Graph from SBM review paper listed below (1).
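Sampling a network from this model is a direct translation of that statement. In the sketch below, the group labels and block matrix are invented for illustration:

```python
import random

def sample_sbm(groups, C, rng):
    """Draw an undirected adjacency matrix Y from a simple SBM.

    groups[p] is the block of node p; C[i][j] is the edge probability
    between blocks i and j (assumed symmetric, no self-loops).
    """
    n = len(groups)
    Y = [[0] * n for _ in range(n)]
    for p in range(n):
        for q in range(p + 1, n):
            if rng.random() < C[groups[p]][groups[q]]:  # Bernoulli(C_ij)
                Y[p][q] = Y[q][p] = 1
    return Y
```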

2. Modifications to the SBM

There are several limitations to the simple SBM described above:

  1. Nodes within a group are expected to all have the same degree.
  2. Nodes are restricted to belong to only one group.
  3. The model is not guaranteed to produce assortative groups.

We will now look at a few modifications to the simple SBM that address each of these limitations.

Degree-corrected SBM

In the karate club (and many other networks), it is unlikely that every person in a particular group would have the same number of friends in the group. The degree-corrected SBM extends the simple SBM to account for this possibility of different node degrees.

In an undirected graph, the degree of node p is the number of edges between p and another node. Consider the network as an undirected multi-graph. The stochastic block matrix \mathbf{C} is redefined such that \mathbf{C}_{ij} \geq 0 is the expected number of edges between nodes in B_i and B_j.

Also included in the model is an n-vector \phi , where \phi_p controls the expected degree of node p: it is the probability that an edge connected to the group containing p is connected to p itself. It then follows that for node p \in B_i and node q \in B_j,

\mathbf{Y}_{pq} \sim \text{Poisson}(\phi_p\phi_q\mathbf{C}_{ij}).
Divisions of the karate club network found using the (a) uncorrected and (b) corrected SBMs. The size of each node is proportional to its degree and the shading reflects inferred group membership. The dashed line indicates the split observed in real life.

Mixed-membership SBMs

In the Karate club example, the second issue is not all that relevant (although it definitely could have been possible to be part of both clubs after the split), but it’s clear how, for other networks, this can be a major limitation. Further, the strength of a node’s affiliation with each group can differ. The mixed-membership SBM (MMSBM) extends the simple SBM to accommodate these multi-faceted relationships using a mixed-membership approach.

In addition to the variables in the simple SBM, there is also an n \times K matrix of membership probabilities \Theta where each element \Theta_{pi} represents the probability that node p \in B_i. Each row is not restricted to have only 1 non-zero element, so each node can simultaneously belong to multiple groups.

Assortative SBMs

Another consideration is if we want to model a network with a particular goal of grouping
similar elements. For the karate club, there are probably members who are more casual about the situation (they’re just there to fight and go home, not to make friends), and so only have a couple of friends in the club. The simple SBM may conclude that all such persons belong to the same group even if there are hardly any connections in the group because they each have a comparatively small but similar number of friends. This is because the SBM is not designed to prioritize community detection. However, work has been done to extend the model to improve its usefulness in clustering tasks.

Community detection or assortativeness is the property of nodes being partitioned into blocks in such a way that the edge density is high within a group and low between groups. The assortative MMSB (a-MMSB) is a special case of the MMSBM we looked at in the previous section. The model includes a “community strength” parameter \beta_i \in (0, 1) for each group, which represents how closely the nodes in the group are linked.

Learn More

  • (1) A really good review of stochastic block models
A new reality TV show idea: the Stable Marriage algorithm
Mon, 14 Feb 2022
Reading Time: 3 minutes

As a hopeful romantic, a believer in the principle of marriage and a lover of dating reality TV, I was immediately intrigued by this problem and solution. So to celebrate Valentine’s Day I thought it would be fitting to look at the stable marriage problem.

1. The Premise

Consider two disjoint sets with the same number of elements (for example a group of n men and a group of n women). A matching is a one-to-one mapping from one set onto the other (a set of n monogamous marriages between the men and women). Each man has an order of preference for the women and each woman an order of preference for the men.

A matching is unstable if there exists a possible pairing of a man and a woman (not currently married to each other) who both prefer each other to their spouses. For example, Johnny is married to Bao but prefers Myrla, and Myrla is married to Gil but prefers Johnny (IYKYK). While this would make for entertaining TV, the stable marriage problem is to find a matching that avoids this situation.

2. The Pitch

Firstly, it is always possible to find a stable matching in this situation. One possible way to find a solution is the Gale-Shapley algorithm:

First Round

  • Each man proposes to the woman he prefers the most.
  • Each woman (if she received any proposals) tentatively accepts her favourite proposal and rejects all the others.

Subsequent Rounds

  • Each unengaged man proposes to the next woman he prefers the most who has not yet rejected him, regardless of whether she is currently engaged (scandalous!)
  • Each unengaged woman tentatively accepts her favourite proposal and rejects all the others.
  • Each engaged woman considers any new proposals and leaves her current partner if she prefers one of the new proposals. She tentatively accepts that better proposal and rejects all the others.

The subsequent rounds are repeated until everyone is engaged.

Example of the Gale-Shapley algorithm
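The rounds described above translate almost line for line into code. A minimal man-proposing sketch (the dictionary-of-preference-lists layout is my own choice of representation):

```python
def gale_shapley(men_prefs, women_prefs):
    """Stable matching via the man-proposing Gale-Shapley algorithm.

    men_prefs[m] / women_prefs[w]: preference lists, most preferred first.
    Returns a dict mapping each woman to her final fiancé.
    """
    # rank[w][m]: position of m on w's list (lower means preferred)
    rank = {w: {m: i for i, m in enumerate(ps)} for w, ps in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}   # index of m's next proposal
    engaged = {}                              # woman -> man
    free_men = list(men_prefs)
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]      # best woman not yet tried
        next_choice[m] += 1
        if w not in engaged:
            engaged[w] = m                    # she tentatively accepts
        elif rank[w][m] < rank[w][engaged[w]]:
            free_men.append(engaged[w])       # she trades up; her ex is free
            engaged[w] = m
        else:
            free_men.append(m)                # she rejects m; he tries again
    return engaged
```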

3. A Problem

Important for this algorithm is who makes the proposals – if the men propose, the overall outcome is better for them than for the women. If we score each marriage in the stable matching from both the male and female perspectives, based on each person’s preferences, and take the total score for each gender, you can see a clear difference in the distribution of the scores. The difference becomes more drastic as the set size is increased.

Distribution of scores for stable matchings when males make proposal using randomly generated preference tables (female scores red, male scores blue)

4. In Practice

While I’ve introduced this problem as a pitch for a dramatic (even if biased) match-making show, Shapley and Roth won a Nobel Memorial Prize in Economic Sciences for their applications of this problem, and people have devoted whole theses to extending some of the ideas.

Here are some interesting situations that this algorithm or some variation of it have been used for in practice:

  • Matching kidney donors to transplant patients

Learn more

  • Gale, D., & Shapley, L. S. (1962). College Admissions and the Stability of Marriage. The American Mathematical Monthly, 69(1), 9–15.
Lead (probably not) in my water: zero-inflated models

Mon, 31 Jan 2022

Reading Time: 5 minutes

It’s interesting that there are some problems that the younger generation, if they even know they exist, assume have been dealt with completely. That’s what I thought about lead piping. Yet the University of Edinburgh and Scottish Water have ongoing research related to this.

It became relatively common knowledge in the 1970s that lead is dangerous. However, before its harmful health effects were discovered, lead was commonly used in water pipes. In Scotland, the water supplies do not naturally have high lead levels. Since the banning of lead pipes in 1969, Scottish Water has worked to remove lead pipes from the mains distribution system, although some pipes carrying water to customers’ houses may still be made of lead and require replacement. Additionally, properties built before 1970 are at higher risk of containing internal lead piping or tanks, and of having contaminated tap water.

  1. What is the problem?
  2. What is count regression?
  3. Why a zero-inflated model?
  4. Other considerations
  5. Conclusion

1. So just find and replace all the pipes…

In a world without limitations, Scottish Water would visit every household built before 1970 and test their tap water for lead contamination. However, that is not possible in reality. So instead it makes sense to model which areas (for example, postcodes) have more houses with lead-contaminated water, so that sampling efforts can be focused there. The goal is to identify the possible factors which increase the risk of more households in a postcode returning water samples with lead concentration greater than 1μg/L.

Data

To look at this problem, we consider data containing 308 observations, each representing a different postcode. The variables included are:

  • Location related variables: water operational area, region, postcode area and district; the coordinate location of the postcodes.
  • Scottish Water data: an indicator variable for if the postcode’s water is phosphate dosed and the dosage measured; an indicator variable for if old supply pipes have been replaced*.
  • Census data: the 2011 census households count; the Urban Rural classification code (2-fold and 8-fold).
  • Property Age related variables
  • Presence of lead: number of households sampled, number of samples with lead concentration > 1μg/L (lead presence count).

Pairwise correlation between numeric variables in the data

Map of Scotland showing locations that were sampled

2. What is count regression?

Simple linear regression has a response variable that can take any real value. Here, the models we use need to account for the fact that the data are non-negative integers. There are regression models designed specifically for count data.

Poisson Regression

Naturally, when doing any model testing, we start with the simplest: Poisson regression for count data. The response variable, conditional on the regressors, is Poisson distributed, with its mean parameter connected to the regressors and parameters by the exponential mean function.
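Written out, with x_i denoting the regressors, \beta the parameters and \mu_i the conditional mean, the model is:

y_i \mid x_i \sim \text{Poisson}(\mu_i), \qquad \mu_i = \mathrm{E}[y_i \mid x_i] = \exp(x_i^\top \beta)

The exponential link guarantees that the fitted mean is always positive, as a count mean must be.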

Despite the name of the model, it is not necessary for the response variable itself to be Poisson distributed (this is a statement about the conditional, not the marginal, distribution). However, for valid statistical inference it is necessary for the conditional mean and conditional variance to be equal, as the Poisson distribution implies.

Negative Binomial Regression

When the conditional variance exceeds the conditional mean, the data are said to be overdispersed. The negative binomial model is a standard approach to address overdispersion. It includes a dispersion parameter, \alpha, which is also estimated. Similar to the Poisson model, the data do not need to follow the negative binomial distribution, as long as the mean and variance are correctly specified.
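Under the common NB2 parameterisation, the conditional variance is a quadratic function of the conditional mean:

\mathrm{Var}[y_i \mid x_i] = \mu_i + \alpha \mu_i^2

so as \alpha \to 0 the model reduces to Poisson regression, and larger \alpha allows the variance to exceed the mean.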

3. Why a zero-inflated model?

Let’s take a step back and think about the problem for a moment. The reason many people don’t know lead piping is still an issue (aside from youthful ignorance) is that well… it’s been dealt with for the most part. So in reality many of the postcodes won’t have any households with lead contaminated samples.

This is a situation where count data may still have more zeroes than predicted by a parametric model, even when using a distribution like the negative binomial. In a zero-inflated count model, the processes generating the zero counts and the positive counts are not constrained to be the same.

The zero-inflated model specifies

Pr[y=j] = \begin{cases} \pi + (1-\pi)f_1 (0), & \text{if}\ j=0 \\ (1-\pi)f_1 (j), & \text{if}\ j>0 \end{cases}

The model is a mixture of a count model and a probability mass function degenerate at zero. The proportion of zeroes added (\pi) may be determined by a binary outcome model or be set to a constant.
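To see why the extra mass at zero matters, here is a small simulation of a zero-inflated Poisson process; the values of \pi and the Poisson mean are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
pi = 0.4   # probability of a structural zero
mu = 3.0   # mean of the Poisson count process

# Zero-inflated Poisson draw: with probability pi the count is forced
# to zero, otherwise it comes from a Poisson(mu) distribution.
structural_zero = rng.random(n) < pi
counts = np.where(structural_zero, 0, rng.poisson(mu, size=n))

# Proportion of zeros: observed, implied by the mixture, and what a
# plain Poisson(mu) model would predict.
obs_zero = (counts == 0).mean()
zip_zero = pi + (1 - pi) * np.exp(-mu)
pois_zero = np.exp(-mu)
```

With these values a plain Poisson(3) model predicts only about 5% zeros, while the zero-inflated process produces roughly 43% — exactly the kind of mismatch a zero-inflated model is built to handle.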

4. Other considerations

To account for some of the variation related to location, random effect terms were added to each of the models selected. For variables like these, random effects were more appropriate than fixed categorical variables.

If you read the first section closely, you’ll remember that included in the data was an indicator variable for if old pipes have been replaced. And you’d think, well there you go! However this actually wasn’t included in the modelling at all. That’s because it wasn’t clear if the sample was taken before or after the supply pipe was replaced. So we couldn’t really infer whether any contamination recorded was already dealt with or if it was caused by internal lead pipes.

5. Conclusion

So in the end, I chose a zero-inflated negative binomial model with a random effect. This model handled both the overdispersion and the zero-inflation in the data. It is also more generalisable than the zero-inflated Poisson model. For example, if more extensive sampling is done, the new data may be more or less overdispersed than the current sample, and a Poisson model would not account for that.

The model identified two factors that have a large impact on the risk of a postcode not being free of lead contamination: whether or not a postcode receives its water from a water treatment works (WTW) which conducts orthophosphate dosing, and whether a postcode is in urban or rural Scotland. By considering these two major factors, further sampling can be done in postcodes that are classified as urban and are not orthophosphate dosed.

Learn more

  • Scottish Water –
  • Cameron, A. Colin, and Pravin K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed. Econometric Society Monographs. Cambridge University Press.
  • R package –