## Week 8 Practical: Aerosol Prediction Challenge¶

Solution below is just an illustration. There are many ways in which this problem may have been tackled. I will use a combination of regularised least squares on both handcrafted features for bags, as well as on some random Fourier features, which arise from large-scale approximations for kernel methods.

In [4]:
library(glmnet)

pinstance <- 7 #number of instance features
pbag <- 5 #number of bag features
p<-pinstance+pbag

ntr<-dim(ytr)[1]
ntst<-dim(ytst)[1]
n<-ntr+ntst
dlevel_indices<-1+100*(0:(n-1))

#scale first
Xdf[,1:p]<-scale(Xdf[,1:p])


Let us try a very simple summarisation of the bags first. We represent each bag by the means of instance attributes and concatenate the bag-level attributes. Then we apply $L_2$-regularised least squares regression using cross-validation for the regularisation parameter $\lambda$.

In [5]:
#############################
#linear model on just means

Phi<-data.frame(matrix(0,n,12))
for(ii in 1:n){
Phi[ii,]<-colMeans(Xdf[Xdf$bag_Id==ii,1:12]) } Phitr<-Phi[1:ntr,] Phitst<-Phi[(ntr+1):n,]  In [6]: #use 10-fold cross-validation to assess accuracy lmmean_cvfit<-cv.glmnet(as.matrix(Phitr),ytr$y,type.measure="mse",nfolds=10,alpha=0,lambda=2^((-30:6)/2))
plot(lmmean_cvfit)
cat("crossvalidation error (means only):", sqrt(min(lmmean_cvfit$cvm))) lmmean_ypred<-predict(lmmean_cvfit,as.matrix(Phitst),s="lambda.min")  crossvalidation error (means only): 0.8011483 Now let us do the same but using also the standard deviations and the 0.2 and 0.8 quantiles of each bag. In [7]: #linear model on means+sdev+quantiles qlo<-function(x) {quantile(x,0.2)} qhi<-function(x) {quantile(x,0.8)} Phi<-data.frame(matrix(0,n,33)) for(ii in 1:n){ Phi[ii,1:12]<-colMeans(Xdf[Xdf$bag_Id==ii,1:12])
Phi[ii,13:19]<-apply(Xdf[Xdf$bag_Id==ii,1:7],2,sd) Phi[ii,20:26]<-apply(Xdf[Xdf$bag_Id==ii,1:7],2,qlo)
Phi[ii,27:33]<-apply(Xdf[Xdf$bag_Id==ii,1:7],2,qhi) } Phitr<-Phi[1:ntr,] Phitst<-Phi[(ntr+1):n,]  In [9]: #use 10-fold cross-validation to assess accuracy lmmsdq_cvfit<-cv.glmnet(as.matrix(Phitr),ytr$y,type.measure="mse",nfolds=10,alpha=0,lambda=2^((-30:6)/2))
plot(lmmsdq_cvfit)
cat("crossvalidation error (means+sd+q): ", sqrt(min(lmmsdq_cvfit$cvm))) lmmsdq_ypred<-predict(lmmsdq_cvfit,as.matrix(Phitst),s="lambda.min")  crossvalidation error (means+sd+q): 0.744375 Can also try boosting decision trees on the same representation. We also cross-validate with 10 folds and try depths from 2 to 5. Looking at the RMSE in the last 10 rounds, it appears to stabilise at the level of around 0.8. In [20]: library(xgboost) dtrain <- xgb.DMatrix(as.matrix(Phitr), label = ytr$y)
for(depth in 2:5){
cvbst <- xgb.cv(data=dtrain,nrounds=200,nfold=10,max_depth=depth,verbose=FALSE)
cat("depth ",depth,":\n")
print(cvbst$evaluation_log[190:200,]) }  depth 2 : iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std 1: 190 0.3405308 0.005626707 0.8171730 0.07084886 2: 191 0.3394063 0.005757955 0.8171693 0.07078443 3: 192 0.3381343 0.005622461 0.8178292 0.07112857 4: 193 0.3368606 0.005526561 0.8173886 0.07078293 5: 194 0.3357133 0.005446230 0.8167742 0.07011427 6: 195 0.3346425 0.005604910 0.8171054 0.07005467 7: 196 0.3335177 0.005639652 0.8170846 0.06971750 8: 197 0.3323905 0.005591196 0.8167892 0.06985145 9: 198 0.3313020 0.005403414 0.8167484 0.06915680 10: 199 0.3302496 0.005314439 0.8167036 0.06905414 11: 200 0.3292970 0.005319198 0.8161652 0.06908361 depth 3 : iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std 1: 190 0.1323271 0.003695743 0.8122805 0.07720687 2: 191 0.1312963 0.003726922 0.8126330 0.07731688 3: 192 0.1300931 0.003608384 0.8127688 0.07723485 4: 193 0.1292995 0.003435556 0.8126322 0.07698781 5: 194 0.1284610 0.003363797 0.8121975 0.07661431 6: 195 0.1275089 0.003595304 0.8120819 0.07660368 7: 196 0.1264874 0.003593135 0.8120990 0.07693506 8: 197 0.1256408 0.003563828 0.8121754 0.07681704 9: 198 0.1248102 0.003535153 0.8119923 0.07682059 10: 199 0.1239083 0.003387762 0.8120850 0.07710302 11: 200 0.1230967 0.003397207 0.8121109 0.07677651 depth 4 : iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std 1: 190 0.0350295 0.002336821 0.7884588 0.06661102 2: 191 0.0345539 0.002256576 0.7885534 0.06654254 3: 192 0.0340868 0.002244931 0.7886516 0.06666072 4: 193 0.0335425 0.002239418 0.7885448 0.06665725 5: 194 0.0331140 0.002268100 0.7884728 0.06664551 6: 195 0.0327095 0.002184309 0.7883999 0.06672309 7: 196 0.0322429 0.002126396 0.7884258 0.06674836 8: 197 0.0318014 0.002132264 0.7884654 0.06674739 9: 198 0.0313515 0.002069476 0.7884212 0.06675904 10: 199 0.0309520 0.002127688 0.7883344 0.06685271 11: 200 0.0306224 0.002177868 0.7883446 0.06685822 depth 5 : iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std 1: 190 0.0066952 0.0004329565 0.8066028 0.07565139 2: 191 0.0065598 0.0004483045 0.8066118 0.07563133 3: 192 0.0064157 0.0004403633 0.8066043 0.07563331 4: 193 0.0062649 0.0004342556 0.8066009 0.07557833 5: 194 0.0061298 0.0004261572 0.8066055 0.07556719 6: 195 0.0059766 0.0003970356 0.8066217 0.07554167 7: 196 0.0058587 0.0004411807 0.8066171 0.07554324 8: 197 0.0057369 0.0004389586 0.8066013 0.07556361 9: 198 0.0056267 0.0004362777 0.8065979 0.07553457 10: 199 0.0054844 0.0004100703 0.8066031 0.07554467 11: 200 0.0053694 0.0004047864 0.8066358 0.07553597  In [21]: bst<-xgboost(data=dtrain,nrounds=200,max_depth=4,verbose=FALSE) bst_ypred<-predict(bst,as.matrix(Phitst))  Now, let us consider random Fourier feature (RFF) expansions. First, we generate random Fourier features for bag attributes and summarise them with the averages of RFFs (approximate kernel embeddings). We then concatenate random Fourier features for bag attributes. In [22]: ########################## #random feature embeddings m1<-100 m2<-50 med.heur = function(X) { ##computes the median heuristic for the kernel bandwidth theta of ##the Gaussian RBF kernel exp(-0.5||x-x'||^2/theta^2) ##heuristic is: theta=median(||x-x'||)/sqrt(2) n <- dim(X)[1] if(n>1000){ Xs <- X[sample(1:n,1000),] heur <- median(dist(Xs))/sqrt(2) } else { heur <- median(dist(X))/sqrt(2) } return(heur) } rff1 <- function(X,W1){ M<-as.matrix(X) return(cbind(cos(M%*%W1),sin(M%*%W1))) } rff2 <- function(X,W2){ M<-as.matrix(X) return(cbind(cos(M%*%W2),sin(M%*%W2))) } #use median heuristic bandwidths med1<-med.heur(Xdf[,1:7]) med2<-med.heur(Xdf[dlevel_indices,8:12]) #sample frequencies W1<-matrix(rnorm(7*m1),7,m1) W2<-matrix(rnorm(5*m2),5,m2)  In [23]: Phi<-data.frame(matrix(0,n,m1+m2)) for(ii in 1:n){ Phi1<-colMeans(rff1(Xdf[Xdf$bag_Id==ii,1:7],W1/med1))
Phi2<-rff2(Xdf[dlevel_indices[ii],8:12],W2/med2)
Phi[ii,]<-c(Phi1,Phi2)
}

Phitr<-Phi[1:ntr,]
Phitst<-Phi[(ntr+1):n,]

In [25]:
rffadditive_cvfit<-cv.glmnet(as.matrix(Phitr),ytr$y,type.measure="mse",nfolds=10,alpha=0, lambda=2^((-30:6)/2)) plot(rffadditive_cvfit) cat("crossvalidation error (rff additive): ", sqrt(min(rffadditive_cvfit$cvm)))

crossvalidation error (rff additive):  0.7754882

Now, let us try the model that captures the interactions between the instance-level attributes and the bag-level attributes. We form this by computing the outer products between the approximate kernel embeddings computed on the instance attributes and the random Fourier features for the bag attributes.

In [26]:
Phi<-data.frame(matrix(0,n,m1*m2))
for(ii in 1:n){
Phi1<-colMeans(rff1(Xdf[Xdf$bag_Id==ii,1:7],W1/med1)) Phi2<-rff2(Xdf[dlevel_indices[ii],8:12],W2/med2) Phi[ii,]<-c(Phi1%*%Phi2) } Phitr<-Phi[1:ntr,] Phitst<-Phi[(ntr+1):n,]  In [27]: rffouter_cvfit<-cv.glmnet(as.matrix(Phitr),ytr$y,type.measure="mse",nfolds=10,alpha=0, lambda=2^((-30:6)/2))
plot(rffouter_cvfit)

cat("crossvalidation error (rff outer product): ", sqrt(min(rffouter_cvfit$cvm))) rffouter_ypred<-predict(rffouter_cvfit,as.matrix(Phitst),s="lambda.min")  crossvalidation error (rff outer product): 0.7316888 OK, we are now happy with our models, so let's compare them on the test set. In [28]: rmse<-function(ypred){sqrt(mean((ypred-ytst$y)^2))}
results<-cbind(LMMEANS=rmse(lmmean_ypred),LMMSDQ=rmse(lmmsdq_ypred),