- Back to Home »
- Efficient alignment of RNA secondary structures using sparse dynamic programming
Posted by : tes
Kamis, 03 Oktober 2013
Background
Non-coding RNAs (ncRNAs) have recently been recognized as important regulators of the biological systems [1, 2]. They participate in the control of alternative splicing [3], gene transcription [4] and translation [5], and mRNA localization [6]. Most of the ncRNAs exert their biological functions by folding into specific structures, which makes the study of the RNA structurome a critical step towards complete understanding of the operational mechanism of the biological system [7]. Recently, genome-wide RNA structurome analysis has led to many interesting discoveries regarding novel regulatory mechanisms. For example, analysis of the RNA structural elements in Drosophila melanogaster 3?-UTR suggests a cluster of ncRNA elements that can direct the localization of their upstream genes within the spermatids [8]. Similar studies have also been applied to the Ciona intestinalis genome for novel ncRNA family discovery [9]. With the finishing of the ENCODE [10] and modENCODE [11] projects, we expect that much more RNA transcripts will be experimentally identified. Many of these RNA transcripts may have exceptionally large sizes [12], and calls for more efficient computational tools to analyze their structures.
As more RNA transcripts are being discovered, the experimental approaches for probing ncRNA structures are also being revolutionized, allowing more accurate functional investigation through exploiting the structure-function relationship. Traditional RNA three-dimensional (3D) structure determination techniques such as X-ray crystallography, NMR and cryo-EM are expensive, making them inappropriate for genome-wide survey of RNA structures. Currently, the emerging massive parallel sequencing technology has been incorporated into the traditional chemical probing methods, making genome-wide experimental determination of RNA secondary structures possible and with low cost. Available techniques in this category include PARS [13], FragSeq [14], and SHAPE-seq [15]. The RNA secondary structures determined by these techniques are much more accurate than those predicted by pure computational methods. For example, when coupled with SHAPE-seq data, the free energy minimization approach [16] is able to predict the secondary structure of a 16S rRNA with over 95% accuracy [17]. In this case, the major purpose of this work is to develop an efficient and accurate RNA secondary structure alignment algorithm to facilitate genome-wide comparative studies of these RNA secondary structures.
There are many existing algorithms that focus on the RNA secondary structure alignment problem [18, 19, 20, 21, 22, 23, 24]. RNA secondary structures can be represented as tree structures, and the edit-distance between the tree structures can be used to represent their structural similarity [19]. Algorithms using such strategy are usually called tree editingalgorithms. Using heavy path decomposition, Klein [25] improved the time complexity of the tree editing algorithm to O (l 3logl ). Recently, Demaine et al. [26] further improved the time complexity to O (l 3 ) based on Klein?s algorithm. However, Jiang et al. [20] proposed to compute tree alignment distance for the comparison of trees. Algorithms that compute such a measure are called tree alignment algorithms. The tree alignment algorithm is a special case of the tree editing algorithm [27]. The tree alignment algorithm has been implemented into an RNA secondary structure alignment tool called RNAforester[21]. Both the tree editing and tree alignment algorithms rely on tree representation of the RNA structure, and make sophisticated scoring functions difficult to implement (such as the affine gap penalty for the loop regions). In addition, both tree editing and tree alignment algorithms do not treat base pairs as units of comparison, and make it difficult to implement a complete set of base-pair edit operations for RNA secondary structures editing (base-pair match, mismatch, breaking, altering, and removing; as defined by Jiang et al. [24]). We demonstrate such a problem by showing a real example from the implementation of the widely-used RNA secondary structure alignment tool RNAforester[21].
Consider that the two RNA structures shown in Figure?1 (a) are being aligned as trees. In the first RNA structure, due to the insertion of a uracil (U), an additional base pair is predicted (dashed arc, Row 1). Both structures are enclosed by G-C base pairs, and we focus on the alignment of their inner regions (boxed regions, Row 1). Following RNAforester?s extended tree representation [21], the two RNA structures can be transformed into two trees (Row 2). The ?P? node represents a base pair formed between the two corresponding nucleotides. Because there is no base pair in the second structure, the only allowed operations are bond breaking and base-pair deletion (Row 3). For the bond breaking operation, the base pair formed between A and U is broken, leaving them aligned to A and G in the second structure, respectively (blue boxes, Row 3). The alignment between the U (first structure) and G (second structure) introduces an unnecessary mismatch, making the alignment incorrect (blue boxes, Row 4). For the base-pair deletion operation, the entire base pair (including the two nucleotides A and U) is deleted (red box, Row 3). This operation opens two unnecessary gaps in the alignment (red boxes, Row 4), making it underestimate the real structural similarity. On the other hand, we expect to handle the mis-predicted base pairs in a more straightforward way. As shown in Figure?1 (b), we simply break the base pair interaction and disassociate the two corresponding nucleotides completely (red cross, Row 2). These two nucleotides are then treated as regular unpaired nucleotides. We can use the standard sequence alignment algorithm [28] (with affine gap penalty for better alignment quality in the unpaired regions) to evaluate the pure sequence similarity between the boxed hairpin-loop regions (Row 3). The resulting alignment contains only one gap, and correctly interprets the true structural difference between the two RNA structures (red boxes, Row 4)