Pseudoknots in RNA secondary structure

RNA molecules and DNA molecules are very similar as both consist of a long chainlike sugar-phosphate backbone with sidechains of amine bases. But where DNA usually comes in a pair of complementary chains, or strands, that structurally attain the well-known double helix with consecutive base pairings between the sidechain bases in the two strands, RNA molecules are usually present as just a single strand. So when an RNA molecule forms a three dimensional structure the sidechain bases have to form base pairs within the molecule, i.e. with other bases of the strand. The set of base pairings in the equilibrium structure of an RNA molecule is called the secondary structure of the RNA molecule.

To be able to predict the secondary structure of an RNA molecule, a basic ingredient is the ability to score different structures such that the true secondary structure, or a structure not too dissimilar to the true structure, gets the best score. A model based on approximating the free energy of any given structure was proposed by Tinoco et al. [1] in the early 1970's. This model has turned out to work remarkably well.

Another basic ingredient is the ability efficiently to search the possible structures of an RNA molecule to determine which one(s) have the optimal score. For the model proposed by Tinoco et al. this problem was solved in the beginning of the 1980's by algorithms that find the structure of optimal score in cubic time and quadratic space [2, 3, 4]. With current energy parameters for the Tinoco model these algorithms predict structures that on average contain more than 70% of the base pairs of the true secondary structure [5].

Though 70% does not sound too bad, especially not when compared to the accuracy of state of the art methods for other prediction problems in bioinformatics, it would be nice to do better. One way to obtain improvements would be to improve on the energy parameters used for the Tinoco model. However, no matter how much we improve on the parameters the accuracy of predictions using the Tinoco model can never reach 100%. This is due to a restriction on secondary structures in the model: the base pairs of any structure are required to form a tree-like hierarchy. I.e. for two base pairs the parts of the RNA sequence spanned by each of the base pairs are required either to be disjoint, or the part spanned by one of the base pairs have to be contained in the part spanned by the other base pair. With this restriction the structure can be broken down into a set of loops. The Tinoco model states that the free energy of an entire structure is just the sum of independent free energy contributions for each of the loops of the structure.

When the parts of the RNA sequence spanned by two base pairs are neither disjoint, nor have one contained in the other, the two base pairs form a pseudoknot. One reason that the RNA secondary structure prediction community has been able to get away with ignoring pseudoknots for nearly thirty years is that for known true RNA structures you can usually find a set of at least 95% of the base pairs that does not contain any pseudoknots; on the other hand, almost all RNA structures contain one or more pseudoknots. So one way to try to improve on prediction accuracy would be to find a feasible way to include structures with pseudoknots.

Unfortunately the general problem of predicting structures with pseudoknots seems intractable[6, 7]. So several groups have experimented with trying to generalise the Tinoco model to allow some, albeit not all, possible structures including pseudoknots[8, 9, 6]. One major drawback of these generalisations is an increase in the resource requirements for searching for the structure of optimal score. The algorithm presented for the most general of the extensions [8] require time O(n6) and space O(n4). Though polynomial, and thus from a theoretical standpoint tractable, these requirements make it infeasible to predict structures for all but the shortest naturally occuring RNA sequences.

However, for a more restricted model an algorithm is presented in [6] that can be modified to require time O(n4) and only quadratic space. This is, arguably, within reasonable bounds, so in itself an implementation of this algorithm would have merit. Moreover, an implementation would allow experiments with RNA sequences of intermediate length using different energy parameters for the pseudoknot parts of the structure. The more resource intensive algorithms in [8, 9] have both been implemented and succesfully used to predict structures of short RNA sequences. Markedly different energy parameters were used in the experiments testing the two algorithms, however. So one might suspect that the reported successes were as much due to the low number of good alternative structures for short sequences as to the actual choice of energy parameters.

Another interesting question is the optimal accuracy that can be achieved by the different restricted pseudoknot models, both compared to each other and to the Tinoco model. If the increase in the fraction of base pairs of known structures that can be captured by the model of [6] compared to the Tinoco model is insignificant, it might not be worth the effort implementing the associated prediction algorithm. On the other hand, if the fraction of base pairs that can be captured by the model of [6] is close to the fraction of base pairs that can be captured by the model of [8], the extra quadratic factors on time and space requirements would seem rather excessive compared to the gain. Ideally one should develop a framework for specifying structure models and an efficient algorithm that given a structure model and a database of known RNA structures efficiently would determine the fraction of base pairs that are captured by the model.

Finally, one should also consider the problem of using several homologous sequences to infer the structure. For the Tinoco model some work has been devoted to this problem [10, 11]. But even using just two sequences this results in increasing time requirements to O(n6) and space requirements to O(n4). Using a similar approach for the model of [6] would lead to time requirements of O(n8) and space requirements of O(n4) which is well beyond reasonable. So developing an alternative approach for efficient prediction of the shared structure of more than one homologous RNA sequences is in demand.

References

  1. Tinoco, I., Borer, P.N., Dengler, B., Levine, M.D., Uhlenbeck, O.C., Crothers, D.M., and Gralla, J. Improved estimation of secondary structure in ribonucleic acids, Nature New Biology, 246, pp. 40-41, 1973.
  2. Nussinov, R. and Jacobson, A.B. Fast algorithm for predicting the secondary structure of single-stranded RNA, Proceedings of the National Academy of Sciences of the USA, 77, pp. 6309-6313, 1980.
  3. Zuker, M. and Stiegler, P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information, Nucleic Acids Research, 9, pp. 133-148, 1981.
  4. Lyngsų, R.B., Zuker, M., Pedersen, C.N.S. Fast evaluation of internal loops in RNA secondary structure prediction, Bioinformatics, 15(6), pp. 440-445, 1999.
  5. Mathews, D.M., Sabina, J., Zuker, M., and Turner, D.H. Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure, Journal of Molecular Biology, 288(5), pp. 911-940, 1999.
  6. Akutsu, T. Dynamic programming algorithms for RNA secondary prediction with pseudoknots, Discrete Applied Mathematics, 104, pp. 45-62, 2000.
  7. Lyngsų, R.B. and Pedersen, C.N.S. RNA Pseudoknot Prediction in Energy Based Models, Journal of Computational Biology, 7(3/4), pp. 409-428, 2000.
  8. Rivas, E. and Eddy, S. A dynamic programming algorithm for RNA structure prediction including pseudoknots, Journal of Molecular Biology, 285(5), pp. 2053-2068, 1999.
  9. Uemura, Y., Hasegawa, A., Kobayashi, Y., and Yokomori, T. Tree adjoining grammars for RNA structure prediction, Theoretical Computer Science, 210, pp. 277-303, 1999.
  10. Sankoff, D. Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems, SIAM Journal on Applied Mathematics, 45(5), pp. 810-825, 1985.
  11. Gorodkin, J., Heyer, L.J., and Stormo, G.D. Finding the most significant common sequence and structure motifs in a set of RNA sequences, Nucleic Acids Research, 25(18), pp. 3724-3732, 1997.