Newton’s Algorithm finds the root of a function. Let \(\boldsymbol f(\boldsymbol x)\) be a multivariate function, and define a sequence by \[ x_{k+1} = x_k - \left(\frac{\partial \boldsymbol f(x_k)}{\partial \boldsymbol x}\right)^{-1} f(x_k). \] Under reasonable conditions, if this sequence converges then it converges to a root of \(f\). We can write an R function to take generic vector-valued function and apply a Newton step to it. We will use a numerical approximation for the gradient.

## step()
#' Obtain one step of a Newton algorithm
#'
#' @param FUN    function taking single vector argument
#' @param x      current value of argument to \code{FUN}
#' @param eps    length of step size used for calculating numerical derivative
#'
#' @return numeric vector giving next \code{x} value
step = function(FUN, x, ..., eps = 1e-8) {
  val = FUN(x, ...)
  deriv = matrix(NA, length(val), length(val))
  
  for (i in 1:length(val)) {
    eps_vector = rep(0, length(val))
    eps_vector[i] = eps
    
    deriv[,i] = (FUN(x + eps_vector, ...) - FUN(x, ...)) / eps
  }
  out = -solve(deriv) %*% val
  return(out)
}

## newton()
#' Run Newton algorithm to find root of function
#'
#' @param FUN    function taking single vector argument
#' @param start  starting value of argument to FUN
#' @param tol    maximum value of function for convergence
#' @param eps    length of step size used for calculating numerical derivative
#'
#' @return numeric vector of point achieved after convergence
#' of the Newton algorithm
newton = function(FUN, start, ..., tol = 1e-8, eps = 1e-8) {
  x = start
  
  while (max(abs(FUN(x, ...))) > tol) {
    move = step(FUN, x, ..., eps=eps)
    x = x + move
  }
  
  return(x)
}

The syntax in the comments is chosen carefully to work with the R package Roxygen for generating documentation, but we don’t need to worry about that yet.

A variation of Newton’s algorithm is used to find the MLEs for generalized linear models (Fisher Scoring). Consider a Poisson GLM with \(Y_i \sim \text{Poisson}(\mu_i)\) with \[ \log \mu_i = \beta_0 + \beta_1 x_{i1} + \cdots \beta_k x_{ik}. \] You can check that the derivative of the log-likelihood for \(\beta_i\) is \[ \dot{l}(\beta_0, \boldsymbol\beta) = X^T Y - X^T \exp(\beta_0 + X \boldsymbol\beta) \] where \(\boldsymbol\beta = (\beta_1, \cdots, \beta_k)^T\). In order to use our Newton’s algorithm to solve the likelihood equations, we can write a function for \(\dot{l}\) whose first argument is the vector \((\beta_0, \boldsymbol\beta)\).

## log-likelihood function for Poisson
Dloglik = function(beta, X, y) {
  c(sum(y - exp(beta[1] + X %*% beta[-1])),
  t(X) %*% y - t(X) %*% exp(beta[1] + X %*% beta[-1]))
}

Now let’s generate some data: 1,000 observations.

set.seed(94) # reproducible results
X = matrix(rnorm(3000), ncol=3)
y = rpois(1000, lambda = exp(1 - 0.5*X[,1] + 0.2*X[,3]))

Dloglik(c(0,0,0,0),X,y)
## [1]  2169.00 -1703.72    43.97   668.55

Now let’s run our Newton algorithm:

newton(Dloglik, c(0,0,0,0), X=X, y=y)
##           [,1]
## [1,]  0.992211
## [2,] -0.534254
## [3,] -0.004111
## [4,]  0.176011

We can check the solution with glm():

glm(y ~ X, family="poisson")$coef
## (Intercept)          X1          X2          X3 
##    0.992211   -0.534254   -0.004111    0.176011

How quick is our function?

The quickest way to check the time taken by some code is to use system.time().

system.time(newton(Dloglik, c(0,0,0,0), X=X, y=y))
##    user  system elapsed 
##   0.014   0.000   0.014

The microbenchmark package allows more sophisticated comparisons.

library(microbenchmark)
microbenchmark(
  newton(Dloglik, c(0,0,0,0), X=X, y=y),
  glm(y ~ X, family="poisson")$coef,
times=10)
## Unit: milliseconds
##                                          expr    min    lq median     uq    max neval
##  newton(Dloglik, c(0, 0, 0, 0), X = X, y = y) 12.463 12.86 13.622 14.252 15.406    10
##           glm(y ~ X, family = "poisson")$coef  6.151  6.19  6.651  7.251  7.768    10

It seems our solution is a bit slower than using glm().

What’s Slowing it Down?

Use Rprof() to take a quick look at where your code takes time. This records information in a file whose default name is Rprof.out. It’s good practice to use a temporary file for this instead.

tmp = tempfile()   # get a temporary file
Rprof(tmp, interval=0.001)
newton(Dloglik, c(0,0,0,0), X=X, y=y)
Rprof(NULL)
summaryRprof(tmp)

This displays something like

$by.self
            self.time self.pct total.time total.pct
"exp"            0.46    46.94       0.46     46.94
"%*%"            0.40    40.82       0.40     40.82
"-"              0.06     6.12       0.06      6.12
"t.default"      0.04     4.08       0.04      4.08
"sum"            0.02     2.04       0.02      2.04

$by.total
            total.time total.pct self.time self.pct
"FUN"             0.98    100.00      0.00     0.00
"newton"          0.98    100.00      0.00     0.00
"step"            0.90     91.84      0.00     0.00
"exp"             0.46     46.94      0.46    46.94
"%*%"             0.40     40.82      0.40    40.82
"-"               0.06      6.12      0.06     6.12
"t.default"       0.04      4.08      0.04     4.08
"t"               0.04      4.08      0.00     0.00
"sum"             0.02      2.04      0.02     2.04

$sample.interval
[1] 0.02

$sampling.time
[1] 0.98

Most of the function’s time is spent evaluating FUN (i.e. Dloglik), so we should start with that. There are several obvious inefficiencies:

Dloglik2 = function(beta, X, y) {
  yxB = y - exp(beta[1] + X %*% beta[-1])
  c(sum(yxB), t(X) %*% yxB)
}
## roughly halves the time taken 
microbenchmark(newton(Dloglik, c(0,0,0,0), X=X, y=y),
               newton(Dloglik2, c(0,0,0,0), X=X, y=y), times=10)
## Unit: milliseconds
##                                           expr    min     lq median     uq   max neval
##   newton(Dloglik, c(0, 0, 0, 0), X = X, y = y) 11.693 11.852 12.161 12.632 12.90    10
##  newton(Dloglik2, c(0, 0, 0, 0), X = X, y = y)  6.067  6.089  6.502  6.538  6.95    10

Notice also that the function step() evaluates the function at x repeatedly, which is unnecessary; in fact this is already calculated by newton(), so we could use this instead:

step = function(FUN, x, val, ..., eps = 1e-8) {
  if(missing(val)) val = FUN(x, ...)
  deriv = matrix(NA, length(val), length(val))
  
  for (i in 1:length(val)) {
    eps_vector = rep(0, length(val))
    eps_vector[i] = eps
    
    deriv[,i] = (FUN(x + eps_vector, ...) - val) / eps
  }
  out = -solve(deriv) %*% val
  return(out)
}

newton = function(FUN, start, ..., tol = 1e-8, eps = 1e-8) {
  x = start
  val = FUN(x, ...)
  
  while (max(abs(val)) > tol) {
    move = step(FUN, x, val=val, ..., eps=eps)
    x = x + move
    val = FUN(x, ...)
  }
  
  return(x)
}

microbenchmark(newton(Dloglik2, c(0,0,0,0), X=X, y=y), times=10)
## Unit: milliseconds
##                                           expr   min    lq median    uq   max neval
##  newton(Dloglik2, c(0, 0, 0, 0), X = X, y = y) 3.544 3.557  3.641 4.016 4.143    10

We have halved the time taken again!