#+title: Genetic drift
#+seq_todo: TODO | DONE
#+property: cache no

* A simple model of evolution
  Evolution is the result of changes in the genetic composition of
  populations over time. One of the simplest models of evolution is as
  follows. There is a population of N individuals, among which there
  are two genetic types: red and blue[fn:1]. Here is the initial
  generation of the population (N=10).

#+begin_src ditaa :file drift-1-gen.png :cmdline -r :exports none
                  /----+ /----+ /----+ /----+ /----+ /----+ /----+ /----+ /----+ /----+
  Generation 1    |cRED| |cBLU| |cBLU| |cBLU| |cRED| |cRED| |cBLU| |cRED| |cRED| |cRED|
                  |    | |    | |    | |    | |    | |    | |    | |    | |    | |    |
                  +----/ +----/ +----/ +----/ +----/ +----/ +----/ +----/ +----/ +----/  
#+end_src

#+results:


  There is no mutation, no selection and no sex; the next generation
  is made up by randomly choosing 10 individuals from the previous
  generation[fn:2]. A single individual can be chosen more than once,
  or not at all; the number of times an individual is chosen
  corresponds to the number of progeny it has in the next
  generation. Even without mutation or natural selection the
  proportions of red and blue types will change, because different
  individuals will have different numbers of offspring, by chance.

  So the first two generations might look like this.

#+begin_src ditaa :file drift-2-gen.png :cmdline -r :exports none
                  /----+ /----+ /----+ /----+ /----+ /----+ /----+ /----+ /----+ /----+
  Generation 1    |cRED| |cBLU| |cBLU| |cBLU| |cRED| |cRED| |cBLU| |cRED| |cRED| |cRED|              
                  |    | |    | |    | |    | |    | |    | |    | |    | |    | |    |
                  +----/ +----/ +----/ +----/ +----/ +----/ +----/ +----/ +----/ +----/ 
                  /----+ /----+ /----+ /----+ /----+ /----+ /----+ /----+ /----+ /----+
  Generation 2    |cBLU| |cBLU| |cRED| |cRED| |cRED| |cBLU| |cRED| |cRED| |cBLU| |cBLU|              
                  |    | |    | |    | |    | |    | |    | |    | |    | |    | |    |
                  +----/ +----/ +----/ +----/ +----/ +----/ +----/ +----/ +----/ +----/ 
#+end_src

#+results:



  This is a form of evolution called "genetic drift". It is inevitable,
  although if the population is very large it will have less effect.

  Let X_t be the number of red individuals in generation t, and let p_t
  be the proportion of red individuals. So X_1 is 6 and p_1 is 0.6. To
  choose the number of red individuals in generation 2 we make 10
  choices, each time having probability 6/10 of getting a red
  individual. So X_2 is a /binomial/ random variable, with 10 trials and
  success probability 0.6. In general, the random process is described
  by the following transition probabilities.

#+begin_src latex :file transprob.png :exports results
  \begin{equation}
  \Pr(X_t=j|X_{t-1}=i) = \frac{j(j-1)}{2}\Big(\frac{i}{N}\Big)^j\Big(\frac{N-i}{N}\Big)^{n-j}
  \end{equation}
#+end_src

#+results:


  We can simulate the evolution over many generations in R. This code
  simulates the change in frequency in a single population over 100
  generations. We'll make the population larger (N=1000) but still
  start off with 60% red individuals.

#+source: simpledrift(N=1000, X1=600, ngens=100)
#+begin_src R :file simpledrift.png :exports code
  p <- numeric(ngens)
  p[1] <- X1/N
  for(g in 2:ngens)
      p[g] <- rbinom(1, size=N, prob=p[g-1]) / N
  plot(p, type="l", ylim=c(0,1), xlab="Generation", ylab="Proportion red")
#+end_src

#+results: simpledrift


  But how variable is this process? To answer this we need to repeat
  the simulation many times (i.e. simulate many identical but
  independent populations). We could do that as follows

#+begin_src R :session t
  drift.slow <- function(N, X1, ngens, nreps) {
      p <- matrix(NA, nrow=ngens, ncol=nreps)
      p[1,] <- X1/N
      for(rep in 1:nreps) {
          for(g in 2:ngens)
              p[g,rep] <- rbinom(1, size=N, prob=p[g-1,rep]) / N
      }
      p
  }
#+end_src

  But this is not a good implementation. One should make use of
  "vectorisation", which makes the simulation much more efficient when
  there are many replicates[fn:3]. Note the way that rbinom simulates
  all replicates at once, but still one generation at a time.

#+begin_src R :session t
  drift.faster <- function(N, X1, ngens, nreps) {
      p <- matrix(NA, nrow=ngens, ncol=nreps)
      p[1,] <- X1/N
      for(gen in 2:ngens)
          p[gen,] <- rbinom(n=nreps, size=N, prob=p[gen-1,]) / N
      p
  }
#+end_src

  To run the simulation:

#+source: drift(N=1000, X1=600, nreps=10, ngens=100)
#+begin_src R :session t :file repdrift.png :exports code
  p <- drift.faster(N, X1, ngens, nreps)
  matplot(p, type="l", ylim=c(0,1), lty=1)
#+end_src

#+results: drift


  And let's quickly see how much of a speed difference the vectorisation
  makes.

#+source: compare-times(N=1000, X1=600, nreps=1000, ngens=100)
#+begin_src R :session t :colnames t :results output
  functions <- c(drift.slow=drift.slow, drift.faster=drift.faster)
  times <- sapply(functions, function(f) as.numeric(system.time(f(N, X1, ngens, nreps))[1]))
  print(times)
  cat(sprintf("\nFactor speed-up = %.1f\n", times[1] / times[2]))
#+end_src

#+results: compare-times
:   drift.slow drift.faster 
:        6.053        0.204
: 
: Factor speed-up = 29.7

* Footnotes...
* Config                                                           :noexport:...
* Tasks                                                            :noexport:...