Week 5 Practical: Collaborative Filtering on UK Parliament Data

We will analyse data collected by Public Whip on the votes each Member of Parliament (MP) has cast during the 2010-2015 parliament in the House of Commons. Each vote corresponds to a particular piece of legislature and is called a division. Download these data here and to load use:

In [2]:
df <- read_csv("~/data/publicwhip/parliament2010.dat")
votes <- df[,2:1227]
Parsed with column specification:
  .default = col_integer(),
  mpidstr = col_character(),
  firstname = col_character(),
  surname = col_character(),
  party = col_character()
See spec(...) for full column specifications.
  1. 664
  2. 1226

There are 664 MPs and 1226 divisions. In the data matrix votes, $+1$ indicates an aye on the division, $-1$ a no. $0$ indicates that the MP did not vote in that particular division. In reality, it is a bit more complicated than this as there are also tellers as well as the MPs who voted both ways (!), but I cleaned up the original format so that we can treat it as a simple binary dataset. Now let us consider the number of votes each MP has cast. It turns out that there is a huge range in these: there is an MP who voted only once(!), whereas another one voted in 1179 out of 1226 divisions.

In [3]:
#find number of votes cast - 
nz<-apply(votes,1,function(x) sum(x!=0))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   776.8   888.0   841.6   973.2  1179.0 
In [4]:
  1. 'Primarolo'
  2. 'Hoyle'

Ah, ok, those MPs who voted a very small number of times are actually the two deputy speakers - they don't traditionally vote. Let us remove them and recompute the number of votes cast.

In [5]:

#recompute number of votes cast
nz<-apply(votes,1,function(x) sum(x!=0))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   38.0   779.0   888.5   844.1   973.8  1179.0 

To start with, let us ignore the fact that votes which were not cast are essentially missing data (due to MPs not attending sessions, for example) and simply run PCA on this data matrix. Note that $p>n$ here so let us do SVD (don't forget to center the data matrix first!).

In [6]:
#center the data
y <- scale(votes, scale = FALSE)

y.svd <- svd(y,nu=npc,nv=npc)

proj<-y.svd$u %*% diag(y.svd$d[1:npc])

#plot pca projections
In [7]:
#now plot with party membership
In [21]:
plot(proj[,1:2],col=party,pch='x',xlab='PC 1',ylab='PC 2')

The first two principal components indicate a good split between the major parties: red are Conservatives, yellow are the Liberal Democrats and purple is Labour. What about MPs in each of these parties who are located towards the middle? Is it an indication of outlier MPs - some form of rebelliousness? Not really -- MPs close to the middle just don't vote much! The plot of the number of votes cast versus the distance from the mean in the principal components projections reveals this.

In [19]:
#find lengths of the vectors of projections
ll<-apply(proj[,1:2],1,function(x) sqrt(sum(x^2)))

#the points pulled towards the mean are not indication of rebelliousness but of not voting often!
[1] "correlation: 0.660324306181162"

This also becomes apparent when we look at the actual names of MPs. Indeed, those close to the middle are actually members of the cabinet who do not vote that often: Cameron, Clegg, Osborne, Hague (which is hardly a rebellious crowd).

In [22]:
plot(proj[,1:2],col=party,pch='x',xlab='PC 1',ylab='PC 2')