Boltzmann Weighted Combinatorics of RNA Secondary Structures

RNA and DNA molecules are very similar: both consist of a long chain-like sugar-phosphate backbone in which each unit carries one of four bases as a side chain (Adenine (A), Cytosine (C), Guanine (G), and Uracil (U) in RNA; DNA has Thymine (T) in place of Uracil). But where DNA usually comes as a pair of complementary chains, or strands, that form the well-known double helix with consecutive base pairings between the side-chain bases of the two strands, RNA molecules are usually present as just a single strand. So when an RNA molecule forms a three-dimensional structure, the side-chain bases have to form base pairs within the molecule, i.e. with other bases of the same strand. The set of base pairings in the equilibrium structure of an RNA molecule is called the secondary structure of the RNA molecule.

The combinatorics of RNA secondary structures has a long history. Waterman [7] already studied the number of different types of structures. The number of different RNA secondary structures is a proud member of The On-Line Encyclopedia of Integer Sequences [6], and asymptotically grows as

[Equation: asymptotic growth of the number of RNA secondary structures]

with sequence length n. Studies investigating the expected shape of structures can help develop more suitable models for structure combinatorics and identify features where functional RNA molecules deviate markedly from expectations. The latter can in turn be used to improve RNA gene finding.
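As a small illustration of the counting problem, the recurrence below counts the secondary structures of a chain of n bases without regard to sequence. It is a minimal sketch under two assumptions that are not stated above: structures are non-crossing (no pseudoknots), and every base pair must enclose at least min_loop unpaired bases, with min_loop = 1 as the default. The function name and its parameter are illustrative only.

```python
def count_structures(n, min_loop=1):
    """Count secondary structures on n bases, ignoring sequence content.

    Assumptions (not from the text): base pairs are non-crossing and each
    pair encloses at least min_loop unpaired bases.
    """
    s = [1] * (n + 1)  # s[m] = number of structures on m bases
    for length in range(1, n + 1):
        # Case 1: the last base is unpaired.
        total = s[length - 1]
        # Case 2: the last base pairs with base j (1-based). This splits the
        # chain into bases 1..j-1 and the enclosed bases j+1..length-1.
        for j in range(1, length - min_loop):
            total += s[j - 1] * s[length - 1 - j]
        s[length] = total
    return s[n]


print([count_structures(n) for n in range(9)])
```

With the default min_loop = 1 this prints 1, 1, 1, 2, 4, 8, 17, 37, 82 for n = 0, ..., 8. A larger min_loop, such as 3, gives a sterically more realistic but smaller count.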

Initial work considered structures devoid of sequence context, essentially assuming that all types of base pairs are valid. Though this can be said to be a valid viewpoint based on the types of interactions observed in known RNA, in reality only Cs paired with Gs, As paired with Us, and Gs paired with Us are observed with significant frequency. This has inspired the so-called Bernoulli model for random secondary structures, where two positions in the sequence can form a base pair with probability p for 0 ≤ p ≤ 1. This essentially allows the expectations to be restricted to sequences of fixed base composition. Asymptotics of these expectations have been developed in [1,5].

Even taking base pairing feasibility into account through the Bernoulli model falls far short of realistically modelling RNA secondary structure formation. Though this model does account for structure feasibility by including a factor representing the probability that two random bases can form a canonical base pair, it completely ignores the fact that not all feasible structures for a particular sequence are equally probable. In thermodynamics, structure probability is described by the Boltzmann distribution, which specifies that the probability of a structure is proportional to an exponential of the negative of its free energy. For prediction purposes, models for approximating the free energy of an RNA secondary structure have been developed and refined over several decades [2]. Boltzmann distributions based on this energy model are readily computable [3].
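To make the Boltzmann weighting concrete, the sketch below computes the partition function Z of a fixed sequence by dynamic programming, in the spirit of [3] but with a deliberately toy energy model: every canonical base pair contributes the same energy pair_energy and nothing else contributes. The energy value, the value of rt (roughly RT in kcal/mol near 37 °C), the minimum loop length and all names are assumptions made for illustration; this is not the energy model of [2].

```python
import math

CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def partition_function(seq, pair_energy=-1.0, rt=0.6, min_loop=3):
    """Partition function Z over the secondary structures of seq.

    Toy energy model (an assumption): each canonical base pair contributes
    pair_energy, all other structural features contribute 0.
    """
    n = len(seq)
    w = math.exp(-pair_energy / rt)  # Boltzmann factor of a single base pair
    # z[i][j] is the partition function of seq[i:j]. Segments too short to
    # hold a pair admit only the pair-free structure, which has weight 1.
    z = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            total = z[i][j - 1]  # last base seq[j-1] is unpaired
            for k in range(i, j - 1 - min_loop):
                if (seq[k], seq[j - 1]) in CANONICAL:
                    # seq[k] pairs with seq[j-1]: structures of seq[i:k]
                    # combine with structures enclosed by the new pair.
                    total += z[i][k] * w * z[k + 1][j - 1]
            z[i][j] = total
    return z[0][n]


def unfolded_probability(seq, **kwargs):
    """Boltzmann probability of the structure with no base pairs."""
    return 1.0 / partition_function(seq, **kwargs)


print(partition_function("GAAAC"), unfolded_probability("GAAAC"))
```

Under this toy model the probability of any structure with k base pairs is w**k / Z, so "GAAAC", which admits only the single pair between G and C, has Z = 1 + w.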

Though the full energy model is probably too complex to use in developing asymptotics of expectations, including some of its aspects should be feasible. This project proposes developing asymptotics similar to [1,5], extended to account for some aspects of Boltzmann distributions. A good starting point would be to include a factor exponentially dependent on the number of base pairs in the structure. Possible extensions would be to let this dependence be on the number of base pair stackings (i.e. neighbouring base pairs) and to also have a negative dependence on the number of loops in the structure. Another extension would be to extend the Bernoulli model to restrict expectations to sequences of fixed dinucleotide distribution, a factor that is known to influence RNA secondary structure stability [8].
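The proposed starting point can be explored numerically for small n before any asymptotics are developed. The sketch below is one possible formalisation, not a prescribed one: it counts sequence-free structures by their number of base pairs and then weights a structure with k pairs by (p * weight) ** k, where p is the Bernoulli pairing probability and weight stands for a per-pair Boltzmann factor. The minimum loop length of 1 and all names are assumptions made for illustration.

```python
def structures_by_pairs(n, min_loop=1):
    """Return c where c[k] = number of structures on n bases with k pairs."""
    # polys[m] is the list of counts, indexed by number of pairs, for m bases.
    polys = [[1] for _ in range(n + 1)]
    for length in range(1, n + 1):
        poly = list(polys[length - 1])  # last base unpaired
        for j in range(1, length - min_loop):
            left, inner = polys[j - 1], polys[length - 1 - j]
            # Combine the two parts; the new pair adds one to the pair count.
            for a, ca in enumerate(left):
                for b, cb in enumerate(inner):
                    k = a + b + 1
                    if k >= len(poly):
                        poly.extend([0] * (k + 1 - len(poly)))
                    poly[k] += ca * cb
        polys[length] = poly
    return polys[n]


def expected_pairs(n, p, weight, min_loop=1):
    """Expected number of base pairs when a structure with k pairs has
    unnormalised weight (p * weight) ** k (an assumed weighting)."""
    counts = structures_by_pairs(n, min_loop)
    x = p * weight
    z = sum(c * x**k for k, c in enumerate(counts))
    return sum(k * c * x**k for k, c in enumerate(counts)) / z


print(structures_by_pairs(5), expected_pairs(5, 0.375, 2.0))
```

Setting weight = 1 recovers a plain Bernoulli weighting, so the effect of the exponential factor can be seen by varying weight alone. Dependence on stackings or loops would need additional bookkeeping in the recurrence and is not attempted here.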

For a given fixed sequence, the problem of computing the expectation, variance and higher moments over the Boltzmann distribution has recently been considered [4]. Averaging over randomly sampled sequences can provide an excellent check on the asymptotics developed, and also give a benchmark for how adequate the structural model used is. An initial project could be to simply benchmark the Bernoulli model against expectations averaged over sampled sequences. A prerequisite for undertaking the full project of developing asymptotics in more refined models of RNA secondary structure will be a strong knowledge of generating function theory.
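A benchmark of this kind can be prototyped on a very simple quantity. The sketch below compares the Bernoulli-model expectation of the number of feasible structures with the same number averaged over sequences, either sampled or enumerated exhaustively for small n. Several choices here are assumptions made for illustration: bases are drawn uniformly and independently, so p = 6/16 for the six canonical ordered pairs; the minimum loop length is 1; and the quantity compared is the number of feasible structures rather than a Boltzmann expectation as in [4].

```python
import itertools
import random

CANONICAL = {"GC", "CG", "AU", "UA", "GU", "UG"}


def feasible_structures(seq, min_loop=1):
    """Number of structures on seq that use canonical base pairs only."""
    n = len(seq)
    z = [[1] * (n + 1) for _ in range(n + 1)]  # z[i][j] counts seq[i:j]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            total = z[i][j - 1]  # last base unpaired
            for k in range(i, j - 1 - min_loop):
                if seq[k] + seq[j - 1] in CANONICAL:
                    total += z[i][k] * z[k + 1][j - 1]
            z[i][j] = total
    return z[0][n]


def bernoulli_expectation(n, p, min_loop=1):
    """Sum of p ** (number of pairs) over all sequence-free structures."""
    e = [1.0] * (n + 1)
    for length in range(1, n + 1):
        total = e[length - 1]
        for j in range(1, length - min_loop):
            total += p * e[j - 1] * e[length - 1 - j]
        e[length] = total
    return e[n]


def sampled_mean(n, samples, rng, min_loop=1):
    """Average number of feasible structures over random sequences."""
    total = 0
    for _ in range(samples):
        seq = "".join(rng.choice("ACGU") for _ in range(n))
        total += feasible_structures(seq, min_loop)
    return total / samples


def exact_mean(n, min_loop=1):
    """Same average, taken over all 4**n sequences (small n only)."""
    seqs = itertools.product("ACGU", repeat=n)
    return sum(feasible_structures("".join(s), min_loop) for s in seqs) / 4**n


print(bernoulli_expectation(5, 6 / 16), exact_mean(5))
print(sampled_mean(30, 200, random.Random(0)))
```

For this particular quantity the two agree exactly, because the base pairs of a structure occupy disjoint positions and are therefore independently feasible. Expectations of structural features under a Boltzmann distribution are ratios and need not agree in the same way, which is what makes the proposed benchmark informative.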

Bibliography

  1. I. L. Hofacker, P. Schuster, and P. F. Stadler. Combinatorics of RNA secondary structures. Discrete Applied Mathematics, 88:207-237, 1998.
  2. D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology, 288:911-940, 1999.
  3. J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29:1105-1119, 1990.
  4. I. Miklós, I. M. Meyer, and B. Nagy. Moments of the Boltzmann distribution for RNA secondary structures. Bulletin of Mathematical Biology, 67:1031-1047, 2005.
  5. M. E. Nebel. Investigation of the Bernoulli model for RNA secondary structures. Bulletin of Mathematical Biology, 66(5):925-964, 2004.
  6. N. J. A. Sloane. The on-line encyclopedia of integer sequences. www.research.att.com/~njas/sequences/, 2006.
  7. M. S. Waterman. Secondary structure of single-stranded nucleic acids. Advances in Mathematics, Supplementary Studies, 1:167-212, 1978.
  8. C. Workman and A. Krogh. No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Research, 27(24):4816-4822, 1999.