doi:10.1016/j.jmb.2004.04.058 J. Mol. Biol. (2004) 340, 385–395

3DCoffee: Combining Protein Sequences and Structures within Multiple Sequence Alignments

Orla O'Sullivan1, Karsten Suhre2, Chantal Abergel2, Desmond G. Higgins1 and Cédric Notredame2,3*

1 Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland
2 Information Génomique et Structurale, CNRS UPR-2589, 31 Chemin Joseph Aiguier, 13402 Marseille, France
3 Swiss Institute of Bioinformatics, Chemin des Boveresse 155, 1066 Epalinges, Switzerland

Most bioinformatics analyses require the assembly of a multiple sequence alignment. It has long been suspected that structural information can help to improve the quality of these alignments, yet the effect of combining sequences and structures has not been evaluated systematically. We developed 3DCoffee, a novel method for combining protein sequences and structures in order to generate high-quality multiple sequence alignments. 3DCoffee is based on TCoffee version 2.00, and uses a mixture of pairwise sequence alignment and pairwise structure comparison methods to generate multiple sequence alignments. We benchmarked 3DCoffee using a subset of HOMSTRAD, the collection of reference structural alignments. We found that combining TCoffee with the threading program Fugue makes it possible to improve the accuracy of our HOMSTRAD dataset by four percentage points when using only one structure per dataset. Using two structures yields an improvement of ten percentage points. The measures carried out on HOM39, a HOMSTRAD subset composed of distantly related sequences, show a linear correlation between multiple sequence alignment accuracy and the ratio of the number of provided structures to the total number of sequences. Our results suggest that in the case of distantly related sequences, a single structure may not be enough for computing an accurate multiple sequence alignment. © 2004 Elsevier Ltd. All rights reserved.
*Corresponding author

Keywords: multiple alignment; structural superposition; TCoffee; threading; SAP

Abbreviations used: MSA, multiple protein sequence alignment(s); S-MSA, structure-based MSA; DP, dynamic programming; NW, Needleman & Wunsch; CS, column score. E-mail address of the corresponding author: cedric.notredame@gmail.com

Introduction

It has long been assumed that using structural information can increase the accuracy of multiple protein sequence alignments (MSA).1 Recent results2,3 suggest that accurate MSAs obtained this way are useful for making functional assignments. These findings are quite exciting in a context where a structure may soon be available for each protein family (transmembrane proteins excepted).4 However, making the best out of this wealth of data will require the development of new automatic methods able to efficiently incorporate protein structure information within MSAs. The incentive for doing so is very strong, considering the critical role MSAs play in so many sequence analysis applications,5 like phylogenetic reconstruction, structure prediction, functional characterization, database searches and non-synonymous single nucleotide polymorphism characterization.6 Despite their usefulness, accurate MSAs remain difficult to compute, owing to reasons that are both computational7 and biological.8 From a computational point of view, the assembly of an optimal MSA is a complex problem, and an exact solution can be computed only for small sets of related sequences.9 This is the reason why most packages use an approximate heuristic, the progressive alignment algorithm,10 that gives no guarantee of delivering an optimal solution but can rapidly align large sets of sequences. On the biological side, one is limited by the lack of an objective and accurate criterion to assess MSA quality.8 As a consequence, most methods use
sequence similarity (assessed with a substitution matrix) as a criterion for optimization. However, similarity is not informative enough to drive the correct alignment of distantly related sequences, a situation that typically requires using structure comparison methods so that a structure-based MSA (S-MSA) can be derived. S-MSAs constitute the de facto standard of truth for assessing sequence alignment accuracy, and several established S-MSA collections11–13 are used routinely to evaluate MSA packages.14–17 Although one may argue that these highly accurate MSAs (as judged from structural analysis) are not always optimal from an evolutionary point of view, they usually reflect well the structural and functional relationships between the considered proteins. With 3DCoffee, we show that using a small amount of structural information when assembling an MSA makes it possible to improve alignment accuracy and emulate the computation of an S-MSA. Combining sequences and structures in this manner requires the integration of three types of methods: (i) sequence alignment methods; (ii) methods for comparing two or more structures and deducing a sequence alignment; (iii) methods for comparing sequences and structures, often referred to as threading. Sequence–sequence comparison methods rely mostly on the dynamic programming (DP) algorithm to compute an alignment where gaps are disposed in such a manner that similarity is maximized between the two sequences.18,19 Given a substitution matrix and a gap penalty scheme, DP can be used to compute global or local alignments,20,21 but accurate alignments can be obtained only with pairs of sequences that are at least 30% identical.22 Structure–structure comparison has been approached using a wide variety of heuristics,23,24 and to this day more than 30 algorithms have been reported.
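The global DP alignment mentioned above is easy to sketch. The following is a minimal Needleman & Wunsch implementation with a linear gap penalty; the match/mismatch scores and gap cost are illustrative defaults, not the parameters used by any of the packages discussed here.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Global DP alignment: fill an (len(a)+1) x (len(b)+1) score matrix,
    then trace back to recover the gapped alignment."""
    n, m = len(a), len(b)
    # score matrix, with gap-initialised first row and column
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # substitution
                          S[i - 1][j] + gap,       # gap in b
                          S[i][j - 1] + gap)       # gap in a
    # traceback from the bottom-right corner
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                S[i][j] == S[i - 1][j - 1]
                + (match if a[i - 1] == b[j - 1] else mismatch)):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

The same matrix-filling scheme, restarted from zero and traced back from the best cell, yields the local (Smith & Waterman) variant used by programs such as SIM and Lalign.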
The simplest, like LSQman,25 use rigid-body superposition and let the algorithm look for an optimal superposition where intermolecular distances are minimized between superposed positions in the two structures. These methods perform well on similar structures where the 3D relationships of residues have been well preserved by evolution. These structures are usually encoded by closely related sequences. When dealing with more distantly related sequences, the residue equivalences can be worked out iteratively, as done in STAMP,26 where the equivalences are used to drive a superposition that is used, in turn, to compute a distance matrix. The algorithm uses this updated matrix to refine the set of residue equivalences and make a new superposition. The process is carried out until it converges. SAP27 uses a similar principle, although rather than being iterative, the algorithm computes the series of rigid superpositions associated with forcing the superposition of every possible pair of residues. The final alignment is computed by DP, using the summed distance matrices of all the superpositions considered. DALI produces alignments of comparable accuracy, computed by considering the local comparison of the distance maps associated with the considered structures.28 Most of these methods make it possible to use structures for aligning sequences that are less than 30% identical. Although they diverge slightly in the alignments they produce, it is hard to establish which one (if any) performs better than the others. Sequence–structure comparisons (or threading) can be achieved using two categories of methods.29,30 One may use techniques inspired from molecular replacement to check whether a sequence is compatible with a 3D fold,31 or sophisticated DP where the algorithm analyses the 3D-structure to determine local gap penalties and local substitution costs.
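The rigid-body superposition at the core of methods like LSQman can be illustrated with the classic Kabsch procedure: center both coordinate sets, find the optimal rotation by SVD, then read residue equivalences off the superposed coordinates with a distance cutoff. This toy sketch assumes the two structures come as equal-length, already-paired coordinate arrays, which is precisely the correspondence problem the real methods must solve iteratively.

```python
import numpy as np

def kabsch(P, Q):
    """Superpose coordinate set P onto Q (both N x 3) after centering,
    using the Kabsch SVD solution for the optimal rotation.
    Returns the rotated, centered P and the centered Q."""
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(U @ Vt))   # guard against an improper rotation (reflection)
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return Pc @ R, Qc

def equivalences(P, Q, cutoff=3.0):
    """Indices of paired positions closer than `cutoff` (same units as the
    coordinates) after superposition -- a toy stand-in for turning a
    structure superposition into a sequence alignment."""
    Pr, Qc = kabsch(P, Q)
    dist = np.linalg.norm(Pr - Qc, axis=1)
    return [i for i, d in enumerate(dist) if d < cutoff]
```

With real structures the cutoff would be expressed in Å; the 3 Å threshold is the one used later in this work when converting LSQman superpositions into alignments.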
Fugue is based on this principle and turns a structure into a position-specific substitution matrix, so that a sequence–structure alignment can be delivered using DP.32 Many of the structure-based alignment methods have been extended to generate S-MSAs. For instance, the double DP strategy of SAP has been coupled with a progressive algorithm to align more than two structures.33 At least two other pairwise structural alignment methods have been incorporated in a progressive alignment strategy: STAMP and COMPARER. COMPARER34 was used to assemble HOMSTRAD, the collection of multiple structural alignments used in this work for validation purposes. Other multiple structural alignment methods exist that use more specific procedures. For instance, DALI produces S-MSAs by aligning several structures to a master structure. One may use Fugue in a similar fashion by aligning several sequences to a single structural template. MNYFIT computes a consensus structure and uses it as a master to align all the others.35 The lack of method-independent reference datasets makes it difficult to benchmark these packages accurately and establish their respective strengths and weaknesses. Yet they all share a common drawback: they are all built around a specific pairwise alignment algorithm, making it difficult to combine the respective strengths of several algorithms into a single model. Furthermore, none of the available methods can seamlessly handle a mixture of sequences and structures; when such a mixture must be handled, the most common strategy is to start by aligning the structures into an S-MSA, before adding the sequences in a semi-manual fashion.2 We designed 3DCoffee to address this problem. 3DCoffee uses the TCoffee v2.00 MSA package. TCoffee computes MSAs using pre-compiled libraries of pairwise alignments. The libraries can be compiled using any method able to generate pairwise alignments, like threading and structure superposition.
This makes the library a powerful means to incorporate structural information into the MSA assembly process. Using methods like SAP or Fugue, we studied the effect of compiling the library with a mixture of sequences and structures. Our methodology could easily be extended to incorporate methods that have not yet been considered, so that biologists can integrate and combine their techniques of choice.

Principle of the 3DCoffee method

Computation of TCoffee multiple sequence alignments

We used TCoffee version 2.00 to compute non-structure-based MSAs (default mode), as well as S-MSAs. In its default mode, TCoffee does not use structures; it takes sequences as input and makes pairwise comparisons to compile a primary library. This primary library is a list of weighted pairs of residues.36 A residue pair appears in the library when it has been observed in one of the pre-compiled pairwise alignments. The pairwise alignments compiled in the primary library can be computed using any method one finds suitable. By default, TCoffee computes for each pair of sequences a global pairwise alignment obtained with the Needleman & Wunsch (NW)18 algorithm, and the ten best-scoring local alignments as given by the SIM algorithm.37 The weight associated with every residue pair obtained this way is set to the average percentage identity within the primary alignment (local or global). When two alignments contribute the same pair of aligned residues, the weights are added. The weights within the primary library are then re-estimated according to the library self-consistency,36 and the re-weighted library (named an extended library) is used as a position-specific substitution matrix to carry out a progressive multiple alignment.38 Doing so involves computing a distance matrix by comparing every pair of sequences and using this matrix to compute a neighbor-joining guide tree.39 The tree topology determines the order in which the sequences are incorporated within the MSA, using standard DP and the extended library as a position-specific substitution matrix.

Incorporation of structural information within the TCoffee library

Structural information is incorporated within the library by means of structure-based pairwise sequence alignments. We used three methods, now fully integrated within TCoffee, providing the associated package is installed. Fugue is a threading method that aligns a protein sequence with a 3D-structure.32 3DCoffee directly submits sequence/structure pairs to the official Fugue server (http://www.cryst.bioc.cam.ac.uk/~fugue/) and retrieves the resulting pairwise alignments, which are integrated into the primary library using the standard TCoffee weighting scheme. SAP uses double DP to compute a pairwise alignment based on a non-rigid structure superposition.27 When integrating these alignments within the primary library, we set to 100 the weight associated with each pair of aligned residues. This is the maximum weight an individual constraint can receive in a TCoffee primary library. Although this value is meant to reflect the high reliability of SAP, it only makes it more likely for these pairs to be aligned in the final MSA, without explicitly forcing them to be so. Not forcing every pair of the structural alignments to find its way into the final alignment is important, as some portions of the SAP alignments correspond to non-superposable portions of the structures and are therefore unreliable. These segments usually have a low consistency within the primary library, and are therefore down-weighted at the extension stage. LSQman is a rigid-body structure superposition package that makes structure-based sequence alignments.40 When turning an LSQman structure superposition into a sequence alignment, we considered two residues to be aligned if they were less than 3 Å apart in the superposition. LSQman constraint weights are set to 100, like those of SAP and for similar reasons.

Producing multiple sequence–structure alignments

We adapted TCoffee so that, given a collection of sequences and structures, one may specify which structures must be used and which methods should be applied to each possible pair. For instance, given a peptide file, 3DCoffee considers in turn every possible sequence pair within the dataset. For a given pair, the program computes a global alignment using NW and a series of local alignments using Lalign. If both sequences have an available structure, a pairwise alignment is computed using SAP and another one using LSQman. If only one sequence has a known structure, an alignment is made using the threading method Fugue. All these alignments are added to the TCoffee library using the standard procedure described above.

Benchmarking procedure

We used the February 2002 release of HOMSTRAD11 to design a benchmark strategy for 3DCoffee. HOMSTRAD is a hand-curated database of high-quality S-MSAs built around the multiple structure alignment package COMPARER. We selected within HOMSTRAD the most demanding alignments using two criteria: at least four sequences and less than 25% average identity within the MSA. This yields a collection of 43 MSAs, four of which had to be discarded (FAD-Oxidase_C, FAD-Oxidase_NC, TPR and bv) because they are impossible to align with any of the available methods and are therefore uninformative for the analysis. The 39 remaining MSAs (245 sequences) constitute our HOM39 dataset. It has the advantage of being both compact and discriminative.
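The primary-library weighting scheme described above (each aligned residue pair weighted by the percent identity of the alignment that contributed it, weights summed when several alignments contribute the same pair, and structure-based pairs given the maximal weight of 100) can be sketched as follows. The dictionary-based library and the helper names are illustrative, not TCoffee's actual data structures.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over the gap-free columns of a pairwise alignment
    given as two equal-length gapped strings."""
    pairs = [(x, y) for x, y in zip(aln_a, aln_b) if x != '-' and y != '-']
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def add_alignment(library, aln_a, aln_b, weight=None):
    """Add one pairwise alignment to the primary library.
    Each aligned residue pair (i, j) receives the alignment's percent
    identity as weight (or a fixed weight, e.g. 100 for structure-based
    pairs); weights of duplicated pairs are summed."""
    w = percent_identity(aln_a, aln_b) if weight is None else weight
    i = j = 0   # residue indices in the two ungapped sequences
    for x, y in zip(aln_a, aln_b):
        if x != '-' and y != '-':
            library[(i, j)] = library.get((i, j), 0.0) + w
        if x != '-':
            i += 1
        if y != '-':
            j += 1
    return library
```

The extension step (not sketched here) then re-weights each pair according to how consistently it is supported by alignments through third sequences.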
We assessed the biological quality of our MSAs by comparing them with their HOM39 reference MSA, using the aln_compare package,36 which computes the column score (CS), a measure of the fraction of columns aligned identically between two MSAs.41 We checked whether sequences without a known structure could benefit from being aligned with sequences whose structure is known. We named this measure the induced improvement, and measured it by removing the provided structure(s) from the reference and the target MSAs before comparing them.

System and packages

Academic licences (free of charge) to run TCoffee 2.00, SAP and LSQman were obtained for each package. These were installed on an SGI O2, running Irix 6.2. The protocols used here are now part of the TCoffee documentation.

Results

Improving MSA accuracy with a single structure

Single structures can be incorporated into an MSA only by using a threading method like Fugue. Before doing so, we evaluated the accuracy of Fugue as a pairwise method on the entire HOM39 dataset. Figure 1(a) shows a comparison between Fugue and TCoffee (TCoffee uses SIM and NW by default) where the relative performances of the two methods are assessed by comparison with the HOM39 reference. Fugue clearly outperforms TCoffee when making pairwise alignments. For instance, when comparing Fugue and TCoffee on all pairs of sequences from HOM39 (Figure 1(a)), we found Fugue to be three percentage points more accurate than TCoffee (61.8% accuracy for Fugue against 58.8% for TCoffee). The difference is significant, with a P-value of 10^-9 (Wilcoxon signed-rank test). We then computed each HOM39 MSA while providing TCoffee with one structure via the -struc_to_use flag. In each test case, we chose the most distantly related sequence (as judged by the average percentage identity in the HOM39 reference). The extent of identity between the selected structures and the rest of their MSA ranged between 12% and 24%.
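A minimal version of the column score (CS) used throughout the evaluation can be written as follows: each MSA column is encoded as the tuple of ungapped residue indices it aligns, and the score is the fraction of reference columns reproduced exactly in the test MSA. This is a sketch of the measure as described in the text; aln_compare's exact handling of edge cases may differ.

```python
def columns(msa):
    """Encode each column of an MSA (a list of equal-length gapped strings)
    as the tuple of ungapped residue indices it aligns; None marks a gap."""
    counters = [0] * len(msa)
    cols = []
    for col in zip(*msa):
        key = []
        for s, c in enumerate(col):
            if c == '-':
                key.append(None)
            else:
                key.append(counters[s])
                counters[s] += 1
        cols.append(tuple(key))
    return cols

def column_score(test, reference):
    """Fraction of reference columns aligned identically in the test MSA."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)
```

The induced improvement is obtained by simply deleting the rows of the provided structures from both MSAs before calling `column_score`.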
A new flavor of TCoffee (TC-Fugue) was designed that uses three pairwise alignment methods: SIM, NW and Fugue (Table 1A). We also used TCoffee associated with the Fugue method only (Fugue) as a control. This last procedure amounts to aligning the sequences one after the other onto the sequence with known structure, using the Fugue algorithm. Two other controls were set up using TCoffee in default mode and CLUSTAL W version 1.83 (CW-183).

Figure 1. Performances of pairwise structure-based sequence alignment methods. Each dot corresponds to a pairwise alignment taken from HOM39 (see Methods). The vertical axis represents the difference of alignment accuracy (column score) between TCoffee and (a) Fugue, (b) SAP and (c) LSQman. The horizontal axis shows the percent identity between the two sequences being considered, as measured on the reference HOM39 MSA.

Table 1. Direct (A) and induced (B) improvement when providing one structure per HOM39 dataset

A. Direct improvement
Method      N str.   Avg. acc.   Difference with TCoffee   P-value (Wilcoxon signed-rank test)
TCoffee     0        42.24       –                         –
CW-183      0        38.43       −3.8                      2 × 10^-2
Fugue       1        31.26       −10.9                     2 × 10^-4
TC-Fugue    1        46.33       +4.1                      1 × 10^-3

B. Induced improvement
TCoffee     0        52.83       –                         –
CW-183      0        45.75       −7.1                      1 × 10^-3
Fugue       1        35.53       −17.3                     3 × 10^-4
TC-Fugue    1        54.73       +1.9                      4 × 10^-1

Method indicates the method being used: TCoffee (TCoffee with NW and SIM), CW-183 (CLUSTAL W 1.83), TC-Fugue (TCoffee with NW, SIM and Fugue), Fugue (TCoffee + Fugue, without NW or SIM). N str. indicates the number of structures provided. Avg. acc. indicates the average accuracy as measured with the CS score by comparison with the HOM39 reference alignments. P-value estimates the statistical significance of the observed difference between the considered method and the default TCoffee.
Our results (Table 1A) show that providing a structure to TC-Fugue improves MSAs by four percentage points over TCoffee (or by a little less than eight percentage points over CLUSTAL W). The difference is significant, with a P-value of 10^-3, and an observed improvement on 23 of the 31 alignments that are not tied between the two methods. We found (Figure 2(a)) that the amount of reported improvement depends loosely on the structure/sequence ratio, with high ratios yielding greater improvements. The low performance of the Fugue control is probably explained by the stringency of the CS measure, which requires every sequence to be aligned correctly and is not well adapted to the pairwise-based alignment method used here. We measured the induced improvement in the TC-Fugue alignments by removing the provided structure, and found the average TC-Fugue accuracy to remain higher than that of TCoffee (Table 1B and Figure 2(b)), although in this case the difference is not statistically significant, as the observed difference is associated with a P-value of only 0.4. Note that the values in Table 1B are higher than the corresponding values in Table 1A because in Table 1B the evaluation is carried out while ignoring the provided structure (usually the hardest sequence to align).

Improving MSA accuracy with two structures

Using two structures offers the possibility of making structure–structure (SAP, LSQman) as well as structure–sequence comparisons. Before using these methods to compute an MSA, we evaluated their pairwise accuracy (Figure 1(b) and (c)). As expected, we found SAP and LSQman to outperform TCoffee significantly. A measure made on the SAP alignments of every HOM39 pair of sequences (Figure 1(b)) indicates an average accuracy of 86.3%. The difference with TCoffee is highly significant, with a P-value of 10^-11 (Wilcoxon signed-rank test).
Under the same conditions, LSQman outperforms TCoffee by 12 points, with an average accuracy of 70.3%, and a difference that is also highly significant. We computed every HOM39 MSA while providing TCoffee with two structures: the one used previously with TC-Fugue and its most distantly related homologue (lowest percentage identity) within the considered HOM39 MSA. An attempt to use the most informative pairs guided this choice. In order to judge the individual contribution of each of the three structure-based methods (Fugue, SAP and LSQman) to the overall accuracy of 3DCoffee, we used them separately, each time in conjunction with SIM and NW (Table 2A). These three new flavors of TCoffee are named TC-Fugue, TC-SAP and TC-LSQ, and the combination of all the available pairwise methods (Fugue, SAP, LSQman, SIM and NW) constitutes the new 3DCoffee method (TC-3D in the Tables). As expected, TC-Fugue, TC-SAP and TC-LSQ all outperform TCoffee (Table 2A). Furthermore, TC-3D outperforms every alternative flavor and, given two structures, it produces MSAs on average ten percentage points better than TCoffee and 4.5 percentage points better than TC-Fugue (Table 2A). As indicated in Table 2A, all the differences reported between the new methods and TCoffee are statistically significant. Here as well, the extent of the improvement depends on the structure/sequence ratio (Figure 3(a)). Similar trends were observed when measuring the induced improvement (Figure 3(b)), which amounts to slightly less than 3.5 percentage points when comparing TC-3D with TCoffee (Table 2B). Although limited in amplitude, this improvement is also statistically significant.

Improving MSA accuracy with many structures

We examined the effect of varying the structure/sequence ratio for every HOM39 MSA. We did so by applying TC-3D on each HOM39 dataset, using structural sets that contained between one and N structures (N being the total number of sequences).
Figure 2. Comparative performances of TC-Fugue and TCoffee when using one structure. (a) Direct improvement. Each dot corresponds to an MSA taken from HOM39. The vertical axis indicates the difference of accuracy between a TC-Fugue and a TCoffee MSA. The horizontal axis indicates the ratio between the number of provided structures (one structure) and the total number of sequences contained in the MSA. (b) Induced improvement. Similar to (a), except that the MSA accuracy is measured while ignoring the contribution of the provided structure.

The structural sets were assembled in an incremental manner. Given an MSA, one starts with the most distantly related structure (as described above) before adding the structures of the least similar remaining sequences one by one, until N structural sets are defined for each HOM39 MSA. We then realigned every HOM39 MSA with each of its associated structural sets and compared the resulting alignments with the HOM39 reference. This makes a total of 200 MSAs (between four and 15 for each HOM39 protein family) that were used to compute the data presented in Figure 4(a), and 161 for Figure 4(b). The results are presented in the form of a boxplot in Figure 4(a) (direct improvement) and Figure 4(b) (induced improvement).

Table 2. Direct (A) and induced (B) improvement when providing two structures per HOM39 dataset

A. Direct improvement
Method      N str.   Avg. acc.   Difference with TCoffee   P-value (Wilcoxon signed-rank test)
TCoffee     0        42.24       0.0                       1.0
CW-183      0        38.43       −3.8                      2 × 10^-2
TC-Fugue    2        46.39       +4.0                      5 × 10^-3
TC-SAP      2        50.81       +8.5                      6 × 10^-6
TC-LSQ     2        47.26       +5.0                      2 × 10^-3
TC-3D       2        52.52       +10.3                     1 × 10^-5

B. Induced improvement
TCoffee     0        56.12       0.0                       1.0
CW-183      0        50.22       −5.9                      1 × 10^-1
TC-Fugue    2        58.07       +1.9                      2 × 10^-1
TC-SAP      2        58.49       +2.4                      2 × 10^-1
TC-LSQ     2        57.52       +1.4                      4 × 10^-1
TC-3D       2        59.55       +3.4                      2 × 10^-2

Direct improvement is measured on the complete alignment, including the used structures. The induced improvement is measured only on the sequences whose structures were not used. Method indicates the method being used: TCoffee (TCoffee with SIM and NW), CW-183 (CLUSTAL W version 1.83), TC-Fugue (TCoffee + NW + SIM + Fugue), TC-SAP (TCoffee + SIM + NW + SAP), TC-LSQ (TCoffee + SIM + NW + LSQman), TC-3D (TCoffee + SIM + NW + Fugue + SAP + LSQman). N str. indicates the number of structures provided. Avg. acc. indicates the average accuracy as measured with the CS score by comparison with the HOM39 reference alignments. P-value estimates the statistical significance of the difference between the considered method and TCoffee default, using the Wilcoxon signed-rank test.

Figure 4(a) indicates the existence of a reasonable correlation between the structure/sequence ratio and the MSA accuracy, although the data are not distributed evenly. One gains roughly ten percentage points in accuracy with every 20 percentage points increase of the structure/sequence ratio. An individual analysis of each protein family suggests that this trend is consistent across most of the HOM39 dataset, although the phenomenon varies in amplitude. When using 3DCoffee and all the available structures, in a procedure that amounts to assembling a multiple structural alignment, we obtained a score of 71.9% accuracy, a value short of the theoretical maximum of 100 that might have been expected if the unreliable regions of HOM39 had been removed from the evaluation. This value is an estimate of the correlation between the two-structure superposition method SAP and COMPARER, rather than an estimate of accuracy. The induced improvement follows a similar trend, albeit more modestly (Figure 4(b)), and yields a gain of roughly two percentage points for every 20 percentage points of ratio increase.
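The incremental construction of structural sets described above can be sketched as follows, assuming a precomputed matrix of pairwise percent identities. The ordering criterion (lowest average identity first) follows the text; the tie-breaking is an arbitrary choice of this sketch.

```python
def average_identity(seq_index, identity_matrix):
    """Mean percent identity of one sequence to all the others."""
    row = identity_matrix[seq_index]
    others = [v for j, v in enumerate(row) if j != seq_index]
    return sum(others) / len(others)

def incremental_structure_sets(identity_matrix):
    """Order sequences from most to least distant (lowest average identity
    first) and return the N nested structural sets used in the benchmark:
    set k contains the k most distant sequences."""
    n = len(identity_matrix)
    order = sorted(range(n), key=lambda i: average_identity(i, identity_matrix))
    return [order[:k] for k in range(1, n + 1)]
```

Each nested set is then passed to TC-3D as the list of sequences whose structures may be used, yielding one MSA per structure/sequence ratio for every family.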
The distribution of this induced improvement is even less regular than that of the direct improvement. It indicates that, in the HOM39 dataset, sequences benefit only modestly from the incorporation of the 3D information associated with one of their remote homologues.

Conclusion

3DCoffee is a novel method that takes advantage of structural information for aligning sequences. We benchmarked 3DCoffee using HOM39, a collection of high-quality reference S-MSAs. We used the TCoffee package to mix sequences, structures and structure/sequence alignment methods, and found this new protocol to improve MSA accuracy in a manner that depends on the structure/sequence ratio within the considered dataset. Our results suggest that using structures can improve the alignment accuracy of sequences without a known structure. The 3DCoffee protocol offers several advantages. It is relatively fast: given all the pairwise alignments, it takes a few seconds to align ten sequences 200 residues long on a standard workstation. It is also very flexible and could easily be adapted to include any structure analysis method able to deliver a sequence alignment. We show here that one can effectively use this protocol to combine the output of methods based on different principles, like a rigid structure superposition method (LSQman) and a non-rigid one (SAP). This makes 3DCoffee a versatile tool that could easily be used in MSA computation the way meta-methods are used in structure prediction.42 Yet, this study lends itself to a more paradoxical conclusion. Although structural information clearly helps improve MSA accuracy, it is surprising to find that its usage lacks the dramatic effect one may have expected. For instance, using one structure on a dataset of distantly related sequences increases the average accuracy by only four percentage points (and a maximum of ten).
One may have hoped that the first one or two structures would have delivered a larger share of the potential improvement. Yet this does not happen, and every extra structure has about the same effect as the others on the overall accuracy, thus yielding a quasi-linear correlation between the structure/sequence ratio and the overall MSA accuracy. This finding suggests that the standard methods we used here are not yet able to let the structural information diffuse optimally onto distantly related sequences. As a consequence, the best way to obtain a highly accurate MSA of remote homologues is to use more than one structure and, if possible, one structure for each sequence (or group of closely related sequences). On the basis of these results, one may argue that, given current methods, the "one structure for every protein family" strategy43 may prove short of solving all the alignment problems faced by homology modeling. Achieving this purpose will require either better sequence comparison methods or more structures.

Figure 3. Comparative performances of TC-3D and TCoffee when using two structures. (a) Direct improvement. Each dot corresponds to an MSA taken from HOM39 (see Methods). The vertical axis indicates the difference in accuracy between a TC-3D and a TCoffee MSA. The horizontal axis indicates the ratio between the number of provided structures (two structures) and the total number of sequences contained in the MSA. (b) Induced improvement. Similar to (a), with the MSA accuracy computed on the sequences without known structure.

Figure 4. Alignment accuracy and structure/sequence ratio. (a) Each box indicates the average accuracy difference between TC-3D and TCoffee when computing HOM39 MSAs with various structure/sequence ratios: [0–20] (6 values), [21–40] (27 values), [41–60] (44 values), [61–80] (44 values), [81–100] (20 values).
The vertical axis shows the average difference of accuracy and the horizontal axis the average structure/sequence ratio. The boxplot was generated with the R package using standard settings. Each box stretches from its lower hinge (defined as the 25th percentile) to its upper hinge (the 75th percentile). The median is shown as a line across the box. The bottom whisker indicates the smallest data value larger than the lower inner fence. The lower inner fence (not drawn) is equal to the 25th percentile minus 1.5 times the spread. Values below the lower inner fence are plotted as dots. The upper whisker is plotted in a similar fashion, using the upper hinge as reference. (b) Induced improvement. Identical to (a), with the measure of accuracy made on the sequences without known structure only.

Acknowledgements

Orla O'Sullivan was paid from Enterprise Ireland, and Hewlett Packard provided some support. We thank Willie Taylor for helping us with setting up SAP, and Kenji Mizuguchi for helping with Fugue. The comments of both referees were very helpful in improving the manuscript. We thank Jean-Michel Claverie for his many suggestions in improving and clarifying this manuscript.

References

1. Lesk, A. M. & Chothia, C. (1980). How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol. 136, 225–270.
2. Al-Lazikani, B., Sheinerman, F. B. & Honig, B. (2001). Combining multiple structure and sequence alignments to improve sequence detection and alignment: application to the SH2 domains of Janus kinases. Proc. Natl Acad. Sci. USA, 98, 14796–14801.
3. Marchler-Bauer, A., Panchenko, A. R., Ariel, N. & Bryant, S. H. (2002). Comparison of sequence and structure alignments for protein domains. Proteins: Struct. Funct. Genet. 48, 439–446.
4. Brenner, S. E. (2001). A tour of structural genomics. Nature Rev. Genet. 2, 801–809.
5. Duret, L.
& Abdeddaim, S. (2000). Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. In Bioinformatics, Sequence, Structure and Databanks (Higgins, D. & Taylor, W., eds), pp. 135–147, Oxford University Press, Oxford.
6. Ng, P. C. & Henikoff, S. (2002). Accounting for human polymorphisms predicted to affect protein function. Genome Res. 12, 436–446.
7. Wang, L. & Jiang, T. (1994). On the complexity of multiple sequence alignment. J. Comput. Biol. 1, 337–348.
8. Thompson, J. D., Plewniak, F., Ripp, R., Thierry, J. C. & Poch, O. (2001). Towards a reliable objective function for multiple sequence alignments. J. Mol. Biol. 314, 937–951.
9. Lipman, D. J., Altschul, S. F. & Kececioglu, J. D. (1989). A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA, 86, 4412–4415.
10. Hogeweg, P. & Hesper, B. (1984). The alignment of sets of sequences and the construction of phylogenetic trees. An integrated method. J. Mol. Evol. 20, 175–186.
11. Mizuguchi, K., Deane, C. M., Blundell, T. L. & Overington, J. P. (1998). HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7, 2469–2471.
12. Raghava, G. P., Searle, S. M., Audley, P. C., Barber, J. D. & Barton, G. J. (2003). OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4, 47.
13. Thompson, J. D., Plewniak, F. & Poch, O. (1999). BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88.
14. Katoh, K., Misawa, K., Kuma, K. & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucl. Acids Res. 30, 3059–3066.
15. Lassmann, T. & Sonnhammer, E. L. (2002). Quality assessment of multiple alignment programs. FEBS Letters, 529, 126–130.
16. Lee, C., Grasso, C. & Sharlow, M. F. (2002). Multiple sequence alignment using partial order graphs. Bioinformatics, 18, 452–464.
17.
Thompson, J. D., Plewniak, F. & Poch, O. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucl. Acids Res. 27, 2682–2690.
18. Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.
19. Pearson, W. R. & Miller, W. (1992). Dynamic programming algorithms for biological sequence comparison. Methods Enzymol. 210, 575–601.
20. Huang, X., Hardison, R. C. & Miller, W. (1990). A space-efficient algorithm for local similarities. Comput. Appl. Biosci. 6, 373–381.
21. Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
22. Brenner, S. E., Chothia, C. & Hubbard, T. J. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.
23. Eidhammer, I., Jonassen, I. & Taylor, W. R. (2000). Structure comparison and structure patterns. J. Comput. Biol. 7, 685–716.
24. Sillitoe, I. & Orengo, C. (2002). Protein structure comparison. In Bioinformatics: Genes, Proteins and Computers (Orengo, C., Jones, D. & Thornton, J., eds), pp. 250–265, BIOS Scientific Publishers, Oxford.
25. Kabsch, W. (1978). A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallog. sect. A, 34, 827–828.
26. Russell, R. B. & Barton, G. J. (1992). Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins: Struct. Funct. Genet. 14, 309–323.
27. Taylor, W. R. & Orengo, C. A. (1989). Protein structure alignment. J. Mol. Biol. 208, 1–22.
28. Holm, L. & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123–138.
29. Jones, D. T., Orengo, C. A. & Thornton, J. M. (1996).
Protein folds and their recognition from sequence. In Protein Structure Prediction (Sternberg, M. J. E., ed.), 1st edit., vol. 170, pp. 173–206, Oxford University Press, Oxford.
30. Cristobal, S., Zemla, A., Fischer, D., Rychlewski, L. & Elofsson, A. (2001). A study of quality measures for protein threading models. BMC Bioinform. 2, 5.
31. Bryant, S. H. & Lawrence, C. E. (1993). An empirical energy function for threading protein sequence through the folding motif. Proteins: Struct. Funct. Genet. 16, 92–112.
32. Shi, J., Blundell, T. L. & Mizuguchi, K. (2001). FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310, 243–257.
33. Taylor, W. R., Flores, T. P. & Orengo, C. A. (1994). Multiple protein structure alignment. Protein Sci. 3, 1858–1870.
34. Sali, A. & Blundell, T. L. (1990). Definition of general topological equivalence in protein structures. J. Mol. Biol. 212, 403–428.
35. Sutcliffe, M. J., Haneef, I., Carney, D. & Blundell, T. L. (1987). Knowledge based modelling of homologous proteins. Part I: three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Eng. 1, 377–384.
36. Notredame, C., Higgins, D. G. & Heringa, J. (2000). T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217.
37. Huang, X. & Miller, W. (1991). A time-efficient, linear-space local similarity algorithm. Advan. Appl. Math. 12, 337–357.
38. Thompson, J., Higgins, D. & Gibson, T. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673–4690.
39. Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425.
40. Jones, T. A. & Kleywegt, G.
J. (1999). CASP3 comparative modeling evaluation. Proteins: Struct. Funct. Genet. Suppl. 3, 30–46.
41. Karplus, K. & Hu, B. (2001). Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 17, 713–720.
42. Bourne, P. E. (2003). CASP and CAFASP experiments and their findings. Methods Biochem. Anal. 44, 501–507.
43. Vitkup, D., Melamud, E., Moult, J. & Sander, C. (2001). Completeness in structural genomics. Nature Struct. Biol. 8, 559–566.

Edited by J. Thornton

(Received 14 November 2003; received in revised form 20 April 2004; accepted 22 April 2004)

Nucleic Acids Research, 2004, Vol. 32, Web Server issue W37–W40, DOI: 10.1093/nar/gkh382

3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment

Olivier Poirot1, Karsten Suhre1, Chantal Abergel1, Eamonn O'Toole3 and Cédric Notredame1,2,*

1Information Génomique et Structurale UPR2589-CNRS, CNRS, 31, Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France, 2Swiss Institute of Bioinformatics, Lausanne University, Chemin des Boveresses, 1066 Epalinges, Switzerland and 3hp High Performance Technical Computing Division, Hewlett Packard, BallyBrit, Galway, Ireland

Received February 14, 2004; Accepted March 16, 2004

ABSTRACT

This paper presents 3DCoffee@igs, a web-based tool dedicated to the computation of high-quality multiple sequence alignments (MSAs). 3D-Coffee makes it possible to mix protein sequences and structures in order to increase the accuracy of the alignments. Structures can be either provided as PDB identifiers or directly uploaded to the server. Given a set of sequences and structures, pairs of structures are aligned with SAP while sequence–structure pairs are aligned with Fugue. The resulting collection of pairwise alignments is then combined into an MSA with the T-Coffee algorithm. The server and its documentation are available from http://igs-server.cnrs-mrs.fr/Tcoffee/.
INTRODUCTION

The assembly of an accurate multiple sequence alignment (MSA) is a key step in many sequence analysis procedures. One could cite, among others: the identification of a protein signature such as a Prosite pattern (1), the building of a domain profile (or HMM) needed for identifying the most remote members of a protein family (2), structure prediction and homology modeling (3) and phylogenetic analysis (4). More recently, MSAs have also proven useful to characterize nsSNPs (non-synonymous Single Nucleotide Polymorphisms) (5,6). The success of such applications depends very much on the MSA quality, hence the importance of accuracy when computing an alignment. In practice, structurally correct alignments are considered to be a good starting point for most MSA applications (with the possible exception of phylogenetic reconstruction), and established collections of reference structural alignments are widely used to benchmark and train existing MSA packages (7,8). However, when state-of-the-art packages are applied to sets of distantly related sequences, they deliver alignments that are only partly correct from a structural point of view (8), thus suggesting that sequence-based alignment procedures can still be greatly improved. In the current situation, the best way to produce a high-quality MSA remains the assembly of a multiple structural alignment. Unfortunately, few examples exist where enough related structures are available to carry out such a task. An elegant alternative to the use of many structures is to mix sequences and structures, in the hope that the 3D information contained within the structures will help deliver a better alignment of the other sequences. Such a mix also constitutes a realistic solution, considering the increasing proportion of sequences without a known structure and the decreasing proportion of protein families not associated with at least one structure.
However, the problem of combining sequences and structures has not yet been extensively addressed, and only a handful of methods are available that allow the seamless combining of sequences and structures (9) while appropriately using 3D information. Here we present 3DCoffee@igs, a web server especially designed to combine sequences and structures by seamlessly integrating into T-Coffee (10) the three types of alignment methods needed for this purpose: sequence–sequence, sequence–structure and structure–structure alignment methods. When using one or more structures, the alignments thus produced are more accurate than similar alignments based on sequence information alone, as judged by comparison with reference structure-based alignments (O. O'Sullivan, K. Suhre, D. Higgins and C. Notredame, submitted for publication). The inclusion of a threading method (sequence–structure alignment) makes it possible to use as few as one structure.

METHODS

Standard T-Coffee sequence alignment assembly

We use T-Coffee to mix sequences and structures. Given a set of sequences, the regular T-Coffee procedure involves two steps.

*To whom correspondence should be addressed. Tel: +33 491 164 606; Fax: +33 491 164 549; Email: cedric.notredame@gmail.com
The time complexity previously reported for this procedure is O(N^3L^2), N being the number of sequences and L their average length. However, in 3D-Coffee, SAP is the limiting factor, with a time complexity in the order of O(L^3).

USING THE TCOFFEE@IGS SERVER

3D-Coffee is a new service available through the previously presented Tcoffee@igs server (17). It is maintained by IGS (Information Génomique et Structurale) and runs on a dedicated quad-processor Alpha ES45 server. It supports the analysis of a maximum of 100 sequences with a maximum of 2000 residues each. The 3D-Coffee service is provided in two versions, a regular and an advanced version. The regular version requires limited input from the user, while the advanced version offers more possibilities, such as uploading personal PDB structures and controlling the methods used to compute the library.

Tcoffee@igs server

The homepage of the server (http://igs-server.cnrs-mrs.fr/Tcoffee/) contains pointers to the four types of computation performed: (i) The Make a Multiple Alignment section gives access to the standard computation of a T-Coffee MSA, using the default parameters of the program, as described in (10). (ii) The Evaluate a Multiple Alignment section provides an alignment evaluation using the CORE method, as described in (17). (iii) The Combine Multiple Alignments section makes it possible to combine several alignments into one. The advanced section of each server offers extra control over the library computation (choice of the methods) as well as a larger number of output options. These servers have all been previously described in (16). (iv) The last section, Align Structures (3D-Coffee), is new and is described in the next paragraph.

Align structures and sequences with 3DCoffee::regular

The 3DCoffee::regular server takes as input a set of sequences in FASTA format. Among the sequences, those with a 3D structure must be named according to their PDB identifier.
If the PDB file contains several chains, the chain index (letter or number) must be added to the name (e.g. 1pptA). If the sequence provided in the FASTA file is a subsequence of the indicated chain, T-Coffee aligns the provided sequence with its full PDB counterpart and makes sure that only the appropriate 3D information is used for the alignment computation. This comparison also handles slight sequence discrepancies between the PDB and the user-provided sequence. In the regular mode of 3D-Coffee, the handling of the structures is entirely under T-Coffee control, which uses the FASTA information to gather the structures and chop them to the relevant portion. For users familiar with the stand-alone version of T-Coffee, we give the corresponding command line:

t_coffee -in seq:fasta Msap_pair Mfugue_pair Mslow_pair Mlalign_id_pair

sap_pair, fugue_pair, slow_pair and lalign_id_pair are pairwise methods used to compute the T-Coffee library. The first step of the regular procedure is the computation of a collection of pairwise alignments where, for each possible pair of sequences in the dataset, the program computes the best global alignment and the 10 best local alignments [using the Sim algorithm from the Lalign package (11)]. This collection of pairwise alignments is named a library. The second step of the procedure involves the assembly of an MSA with a high level of consistency with the alignments contained in the library. Since T-Coffee uses a heuristic, the optimality of this process is not guaranteed, although the results are generally satisfactory as judged by comparison with alternative optimization methods (12). The assembly procedure is very similar to that described for ClustalW (13); extensive details are available in the original publication (10).

3D-Coffee protocol

The 3D-Coffee protocol takes advantage of the method-independent manner in which T-Coffee uses its libraries.
Rather than filling the library with sequence-based pairwise alignments, 3D-Coffee compiles it using three types of pairwise methods: sequence–sequence, structure–structure and structure–sequence (threading) alignment procedures. Among the vast variety of structure comparison algorithms, we selected SAP (14) for the structure–structure alignments and Fugue (15) for the structure–sequence comparisons. A full validation of these choices is detailed in O. O'Sullivan, K. Suhre, D. Higgins and C. Notredame (submitted for publication). Our main criterion was the relatively high accuracy of these two methods and their ease of integration within the T-Coffee framework. It is nonetheless worth pointing out that any method with similar characteristics (i.e. able to deliver a sequence alignment) could easily be added to the procedure we describe here. In practice, given a sequence dataset, the program starts by identifying the sequences that are associated with a structure and those that are not. It then considers all the possible pairs and applies the appropriate methods to each pair. For instance, given a pair of structures, the program will successively make a global pairwise sequence alignment, a local pairwise sequence alignment and a structure-based sequence alignment with SAP. If only one of the two sequences has a known structure, Fugue will be used instead of SAP. The resulting pairwise alignments are compiled into a list of weighted pairs of aligned residues found in the individual alignments. Each pair receives a weight equal to the average level of identity within the pairwise alignment where it occurred. When two or more alignments contribute the same pair, their respective weights are added to yield the final weight. The collection of weighted residue pairs constitutes the T-Coffee library. T-Coffee uses the library to assemble a standard progressive alignment in a ClustalW-like manner.
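The library compilation described above can be sketched in a few lines. The sketch below is an illustration of the weighting scheme only, not the actual T-Coffee code; the input layout (tuples of sequence identifiers, aligned residue index pairs and a percent-identity weight) is an assumption made for the example.

```python
from collections import defaultdict

def build_library(pairwise_alignments):
    """Compile a T-Coffee-style library from pairwise alignments.

    Each alignment is given as (seq_a, seq_b, aligned_pairs, pct_id),
    where aligned_pairs lists (residue_index_a, residue_index_b) tuples
    and pct_id is the percent identity of that alignment. Every residue
    pair is weighted by the identity of the alignment it came from;
    pairs contributed by several alignments have their weights summed,
    as described in the text."""
    library = defaultdict(float)
    for seq_a, seq_b, pairs, pct_id in pairwise_alignments:
        for res_a, res_b in pairs:
            library[(seq_a, res_a, seq_b, res_b)] += pct_id
    return dict(library)
```

A pair supported by both a global and a local alignment thus ends up with a higher weight than a pair seen only once, which is what lets consistent signals dominate during the later progressive assembly.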
The program starts by computing the distance matrix of the sequences and uses it to estimate a guide tree. The guide tree controls the order in which the sequences are included one by one in the MSA. Each sequence is incorporated using the library in place of a substitution matrix. A recent modification of the T-Coffee algorithm (to be described elsewhere) has made it possible to significantly reduce the time complexity of the algorithm, down to O(N^2L^2) from the previously reported O(N^3L^2).

Figure 1. Typical output of a standard 3D-Coffee computation. Five structures have been aligned with a sequence (Q53396). The display is the ESPript (18) output produced by the Tcoffee@igs server. The CORE index is displayed on the alignment and indicates the relative reliability of the various sections (color code: blue, unreliable; green, low reliability; red, highly reliable portion of the alignment). DSSP (19) is used to determine the secondary structures from the PDB coordinates. Blue, green and yellow portions are mostly incorrectly aligned, as judged by comparison with the HOMSTRAD reference alignment (9).

Once the computation is over, the server returns a page of links to the produced result files. An ESPript (18) post-processing step makes it possible to visualize the secondary structure elements within the used structures (Figure 1). The returned alignment is a sequence alignment, albeit generally improved by the use of structural information. Systematic benchmarking, carried out on a subset of HOMSTRAD (O. O'Sullivan, K. Suhre, D. Higgins and C. Notredame, submitted for publication), indicates that the accuracy of mixed sequence/structure alignments increases proportionally to the amount of structural information provided.

The 3DCoffee::advanced server

The advanced server makes it possible to upload user-defined PDB structures (up to three).
The sequences of the uploaded structures should not be included within the FASTA sequences. The limitation to three private structures is arbitrary and will be increased upon request. In case the file contains more than one chain, the program extracts only the first one. It is the user's responsibility to provide the correct chain. The advanced server also makes it possible to control the computation of the T-Coffee library by selecting the methods one wishes to include. For instance, if all the sequences have a known 3D structure, it is advisable to use only sap_pair, the structure–structure alignment method, to generate a structure-based MSA.

CONCLUSION

In this paper, we present 3D-Coffee, a major extension of the Tcoffee@igs server. This new feature of the server makes it possible to combine sequences and structures within an MSA, thus producing high-quality MSAs. The method we present here is versatile and easy to use. It affords the possibility of seamlessly combining structure and sequence information, and private and public data, without the need to install additional programs such as SAP and Fugue locally. It certainly constitutes an adequate means to efficiently use available structural data. Future plans involve the addition of new modules to make it easier to map structural information onto sequence data. We strongly encourage users to send us their feedback.

REFERENCES

1. Bairoch,A., Bucher,P. and Hofmann,K. (1997) The PROSITE database, its status in 1997. Nucleic Acids Res., 25, 217–221.
2. Mulder,N.J., Apweiler,R., Attwood,T.K., Bairoch,A., Barrell,D., Bateman,A., Binns,D., Biswas,M., Bradley,P., Bork,P. et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.
3. Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
4. Phillips,A., Janies,D.
and Wheeler,W. (2000) Multiple sequence alignment in phylogenetic analysis. Mol. Phylogenet. Evol., 16, 317–330. 5. Ng,P.C. and Henikoff,S. (2002) Accounting for human polymorphisms predicted to affect protein function. Genome Res., 12, 436–446. 6. Ramensky,V., Bork,P. and Sunyaev,S. (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res., 30, 3894–3900. 7. Thompson,J.D., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88. 8. O’Sullivan,O., Zehnder,M., Higgins,D., Bucher,P., Grosdidier,A. and Notredame,C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19, I215–I221. 9. de Bakker,P.I., Bateman,A., Burke,D.F., Miguel,R.N., Mizuguchi,K., Shi,J., Shirai,H. and Blundell,T.L. (2001) HOMSTRAD: adding sequence information to structure-based alignments of homologous protein families. Bioinformatics, 17, 748–749. 10. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. 11. Huang,X. and Miller,W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math., 12, 337–357. 12. Notredame,C., Holm,L. and Higgins,D.G. (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14, 407–422. 13. Thompson,J., Higgins,D. and Gibson,T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690. 14. Taylor,W.R. and Orengo,C.A. (1989) Protein structure alignment. J. Mol. Biol., 208, 1–22. 15. Shi,J., Blundell,T.L. and Mizuguchi,K. (2001) FUGUE: sequence– structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol., 310, 243–257. 16. 
Poirot,O., O'Toole,E. and Notredame,C. (2003) Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments. Nucleic Acids Res., 31, 3503–3506.
17. Notredame,C. and Abergel,C. (2003) Using multiple sequence alignments to assess the quality of genomic data. In Andrade,M. (ed.), Bioinformatics and Genomes: Current Perspectives. Horizon Scientific Press, Norfolk, UK, pp. 30–50.
18. Gouet,P., Robert,X. and Courcelle,E. (2003) ESPript/ENDscript: extracting and rendering sequence and 3D information from atomic structures of proteins. Nucleic Acids Res., 31, 3320–3323.
19. Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577–2637.

BIOINFORMATICS Vol. 19 no. 1 2003, pages i1–i7, DOI: 10.1093/bioinformatics/btg1029

APDB: a novel measure for benchmarking sequence alignment methods without reference alignments

Orla O'Sullivan1, Mark Zehnder3, Des Higgins1, Philipp Bucher3, Aurélien Grosdidier3 and Cédric Notredame2,3,*

1Department of Biochemistry, University College, Cork, Ireland, 2Information Génétique et Structurale, CNRS UMR-1889, 31, Chemin Joseph Aiguier, 13402 Marseille, France and 3Swiss Institute of Bioinformatics, Chemin des Boveresse, 155, 1066 Epalinges, Switzerland

Received on January 6, 2000; revised on Month xx, 2000; accepted on February 20, 2000

ABSTRACT

Motivation: We describe APDB, a novel measure for evaluating the quality of a protein sequence alignment, given two or more PDB structures. This evaluation does not require a reference alignment or a structure superposition. APDB is designed to efficiently and objectively benchmark multiple sequence alignment methods.
Results: Using existing collections of reference multiple sequence alignments and existing alignment methods, we show that APDB gives results that are consistent with those obtained using conventional evaluations. We also show that APDB is suitable for evaluating sequence alignments that are structurally equivalent. We conclude that APDB provides an alternative to more conventional methods used for benchmarking sequence alignment packages.
Availability: APDB is implemented in C; its source code and its documentation are available free on request from the authors.
Contact: cedric.notredame@gmail.com

INTRODUCTION

We introduce APDB (Analyze alignments with PDB), a new method for benchmarking and improving multiple sequence alignment packages with minimal human intervention. We show how it is possible to avoid the use of reference alignments when PDB structures are available for at least two homologous sequences in a test alignment. Using this method it should become possible to systematically benchmark or train multiple sequence alignment methods using all known structures, in a completely automatic manner. There are strong justifications for improving multiple sequence alignment methods. Many sequence analysis techniques used in bioinformatics require the assembly of a multiple sequence alignment at some point. These include phylogenetic tree reconstruction, detection of remote homologues through the use of profiles or HMMs, secondary and tertiary structure prediction and, more recently, the identification of the nsSNPs (non-synonymous Single Nucleotide Polymorphisms) that are most likely to alter a protein's function. All of these important applications demonstrate the need to improve existing multiple sequence alignment methods and to establish their true limits and potential.

*To whom correspondence should be addressed.
Doing so is complicated, however, because most multiple sequence alignment methods rely on a complex combination of greedy heuristic algorithms meant to optimize an objective function. This objective function is an attempt to quantify the biological quality of an alignment. Almost every multiple alignment package uses a different empirical objective function of unknown biological relevance. In practice, most of these algorithms are known to perform well on some protein families and less well on others, but it is difficult to predict this in advance. It can also be very hard to establish the biological relevance of a multiple alignment of poorly characterized protein families. See Duret and Abdeddaim (2000) and Notredame (2002) for two recent reviews of the wide variety of techniques that have been used to make multiple alignments. Given such a wide variety of methods and such poor theoretical justification for most of them, the main option for a rational comparison is systematic benchmarking. This is usually accomplished by comparing the alignments produced by various methods with 'reference' alignments of the same sequences assembled by specialists with the help of structural information. Barton and Sternberg (1987) made an early systematic attempt to validate a multiple sequence alignment method using structure-based alignments of globins and immunoglobulins. Later on, Notredame and Higgins (1996) used another collection of such alignments assembled by Pascarella and Argos (1992). More recently, it has become common practice to use BAliBASE (Thompson et al., 1999), a collection of multiple sequence alignments assembled by specialists and designed to systematically address the different types of problems that alignment programs encounter, such as the alignment of a distant homologue or long insertions and deletions.
In this work, we examined two such reference collections: BAliBASE and HOMSTRAD (Mizuguchi et al., 1998), a collection of high-quality multiple structural alignments. There are two simple ways to use a reference alignment for the purpose of benchmarking (Karplus and Hu, 2001). One may count the number of pairs of aligned residues in the target alignment that also occur in the reference alignment and divide this number by the total number of pairs of residues in the reference. This is the Sum of Pairs Score (SPS). Its main drawback is that it is not very discriminating and tends to even out differences between methods. The more popular alternative is the Column Score (CS), where one measures the percentage of columns in the target alignment that also occur in the reference alignment. This is widely used and is considered to be a stringent measure of alignment performance. In order to avoid the problem of unalignable sections of protein sequences (i.e. segments that cannot be superimposed), it is common practice to annotate the most reliable regions of a multiple structural alignment and to consider only these core regions for the evaluation. In BAliBASE, the core regions make up slightly less than 50% of the total number of alignment columns. Such use of multiple sequence alignment collections for benchmarking is very convenient because of its simplicity. However, a major problem is the heavy reliance on the correctness of the reference alignment. This is serious because, by nature, these reference alignments are at least partially arbitrary. Although structural information can be handled more objectively than sequence information, the assembly of a multiple structural alignment remains a very complex problem for which no exact solution is known. As a consequence, any reference multiple alignment based on structure will necessarily reflect some bias from the methods and the specialist who made the assembly.
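The two reference-based scores described above are simple to compute. The sketch below is an illustration rather than code from any benchmarking package; the column representation (each alignment as a list of columns, each column holding one residue index per sequence, or None for a gap) is an assumption made for the example.

```python
def aligned_pairs(alignment):
    """All aligned residue pairs, as (seq_i, res_i, seq_j, res_j) tuples,
    from an alignment given as a list of gap-annotated columns."""
    pairs = set()
    for column in alignment:
        filled = [k for k, res in enumerate(column) if res is not None]
        for a in range(len(filled)):
            for b in range(a + 1, len(filled)):
                i, j = filled[a], filled[b]
                pairs.add((i, column[i], j, column[j]))
    return pairs

def sum_of_pairs_score(test, reference):
    """SPS: fraction of the reference's residue pairs recovered in the test
    alignment (assumes the reference contains at least one pair)."""
    ref_pairs = aligned_pairs(reference)
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)

def column_score(test, reference):
    """CS: fraction of reference columns reproduced exactly in the test."""
    test_columns = {tuple(c) for c in test}
    return sum(tuple(c) in test_columns for c in reference) / len(reference)
```

For instance, a test alignment that turns one aligned reference column into two gapped columns loses that column under both measures, but CS also penalizes any column whose full residue combination differs, which is why it is the more stringent of the two.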
The second drawback is that, given a set of structures, there can be more than one correct alignment. This plurality results from the fact that a structural superposition does not necessarily translate unambiguously into one sequence alignment. For instance, if we consider that the residues to be aligned correspond to the residues whose alpha carbons are the closest in the 3-D superposition, it is easy to imagine that sometimes an alpha carbon can be equally close to the alpha carbons of two potentially homologous residues. Most structure-based sequence alignment procedures break this tie in an arbitrary fashion, leading to a reference alignment that represents only one possible arrangement of aligned residues. This problem becomes most serious when the sequences one is considering are distantly related (less than 30% identity). Unfortunately, this is also the most interesting level of similarity, where most sequence alignment methods make errors and where it is important to accurately benchmark existing algorithms. The APDB method that we describe in this work has been designed to specifically address this problem and to remove, almost entirely, the need for arbitrary decisions when using structures to evaluate the quality of a multiple sequence alignment. In APDB, a target alignment is not evaluated against a reference alignment. Rather, we measure the quality of the structural superposition induced by the target alignment, given any structures available for the sequences it contains. By treating the alignment as the result of some sort of structure superposition, we simply measure the fraction of aligned residues whose structural neighborhoods are similar. This makes it possible to avoid the most expensive and controversial element of the MSA benchmarking methods: the reference multiple sequence alignment. APDB requires just three parameters.
This is tiny if we compare it with any reference alignment, where each pair of aligned residues can arguably be considered as a free parameter. In this work we show how the APDB measure was designed and characterized on a few carefully selected pairs of structures. Among other things, we explored its sensitivity to parameter settings and to various sequence and structure properties, such as similarity, length, or alignment quality. Finally, APDB was used to benchmark known methods using two popular data sets: BAliBASE and HOMSTRAD. These were either used as standard reference alignments or as collections of structures suitable for APDB. It should be noted that there are several methods for evaluating the quality of structure models and predictions using known structures. The development of these has been driven by the need to evaluate entries in the CASP protein structure prediction competition, and they have been reviewed by Cristobal et al. (2001). They all depend on generating a structure superposition between the model and the target and evaluating the quality of the match using, for example, the RMSD between the two, or some measure of the number of alpha carbons that superimpose well (e.g. MaxSub by Siew et al. (2000)). In principle, this could also be used to benchmark alignment methods. However, one serious disadvantage is the requirement for a superposition, which is itself a difficult problem. A second disadvantage is the way RMSD measures behave with different degrees of sequence divergence, and their sensitivity to local or global alignment differences. We have carefully designed APDB so that on the one hand it remains very simple, but on the other hand it is able to measure the similarity of the structural environments in a manner that lends itself to measuring alignment quality.

SYSTEM AND METHODS

The APDB scoring function

APDB is a measure designed to evaluate how consistent an alignment is with the structure superposition this alignment implies. Let us imagine that A and B are two homologous structures. If the structure of sequence A tells us that the residues X and Z are 9 Å apart, then we expect to find a similar distance between the two residues Y and W of sequence B that are aligned with X and Z. The difference between these two distances is an indicator of the alignment quality.

A  aaaaaaaaaaaXaaaaaaaaaaaaaaaZaaaaaaa
              |_____ 9 Å _____|
B  bbbbbbbbbbbYbbbbbbbbbbbbbbbWbbbbbbb
              |_____ 9 Å? ____|

In APDB we take this idea further by measuring the differences of distances between X:Y (X aligned with Y) and Z:W within a bubble of fixed radius centered around X and Y. The bubble makes APDB a local measure, less sensitive than a classic RMSD measure to the existence of non-superposable parts in the structures being considered. Furthermore, it ensures that a bad portion of the alignment does not dramatically affect the overall alignment evaluation. The typical radius of this bubble is 10 Å, and it contains 20 to 40 amino acids. We consider two residues to be properly aligned if the distances from these two residues to the majority of their neighbors within the bubble are consistent between the two structures. In other words, we check whether a structural neighborhood is supportive of the alignment of the two residues that sit at its center. This can be formalized as follows:

X:Y is a pair of aligned residues in the alignment.
N is the number of aligned pairs of residues.
d(X, Z) is the distance between the Cα of the two residues X and Z within one structure.
Brad is the radius of the bubble set around residues X and Y (Brad = 10 Å by default).
T1 is the maximum difference of distance between d(X, Z) and d(Y, W) (T1 = 1 Å by default).
T2 is the minimal percentage of residues that must respect the criterion set by T1 for X and Y to be considered correctly aligned (70% by default).

considered_X:Y(Z:W) is equal to 1 if the pair Z:W is in the bubble defined by the pair X:Y:

    considered_X:Y(Z:W) = 1 if d(X, Z) < Brad and d(Y, W) < Brad    (1)

correct_X:Y(Z:W) is equal to 1 if d(X, Z) and d(Y, W) are sufficiently similar, as set by T1:

    correct_X:Y(Z:W) = 1 if d(X, Z) < Brad and d(Y, W) < Brad and |d(X, Z) - d(Y, W)| < T1    (2)

aligned(X:Y) is equal to 1 if most pairs Z:W in the X:Y bubble are correct, as set by T2:

    aligned(X:Y) = 1 if [Σ_Z:W correct_X:Y(Z:W) / Σ_Z:W considered_X:Y(Z:W)] × 100 > T2    (3)

Finally, the APDB measure for the entire alignment is defined as:

    APDB Score = Σ_X:Y aligned(X:Y) / N    (4)

Given a multiple alignment of sequences with known structures, the APDB score can easily be turned into a sum-of-pairs score by summing the APDB score of each pair of structures and dividing it by the total number of sequence pairs considered.

Design of a benchmark system for APDB

In order to study the behavior of APDB, we used two established collections of reference alignments: BAliBASE (Thompson et al., 1999) and HOMSTRAD (Mizuguchi et al., 1998). First, we extracted nine structure-based pairwise sequence alignments from HOMSTRAD, which we refer to as HOM 9. These reference alignments were chosen so that their sequence identities (as measured on the HOMSTRAD reference alignments) evenly cover the range 17 to 90%. These alignments are between 200 and 300 residues long and are used for the detailed analysis and parameterization of APDB. The PDB names of the pairs of structures are given in the legend of Figure 2. Next, in order to assemble a discriminating test set, we selected the most difficult alignments from HOMSTRAD. We chose alignments which had at least four sequences and where the average percent identity was 25% or less.
This resulted in a selection of 43 alignments, which we refer to as HOM 43. BAliBASE version 1 has 141 alignments divided into 5 reference groups. We chose all alignments where 2 or more of the sequences had a known structure. This resulted in a subset of 91 alignments from the first 4 reference groups of BAliBASE, which we refer to as BALI 91. Minor adjustments had to be made to ensure consistency between BAliBASE sequences and the corresponding PDB files. The HOM 43 and BALI 91 test sets are available in the APDB distribution.

A second method for generating sub-optimal alignments was based on the PROSUP package (Lackner et al., 2000). PROSUP takes two structures, makes a rigid-body superposition and generates all the sequence alignments that are consistent with this superposition, thus producing alternative alignments that are equivalent from a structural point of view. Typically, PROSUP yields 5 to 25 alternative alignments within a very narrow range of RMSDs.

Comparison of APDB with other standard measures

In order to compare the APDB measure with more conventional measures, we used the Column Score (CS) measure as provided by the aln_compare package (Notredame et al., 2000). CS measures the percentage of columns in a test alignment that also occur in the reference alignment. In BAliBASE this measure is restricted to those columns annotated as core regions in the reference. Although alternative measures have recently been introduced (Karplus and Hu, 2001), CS has the advantage of being one of the most widely used and the simplest method available today.

Fig. 1. Tuning of Brad, the bubble radius, using sub-optimal alignments of two sequences from HOM 9. Each graph represents the correlation between CS and APDB for 4 different bubble radius values (Brad of 6, 8, 10 and 12 Å). In each graph, each dot represents a sub-optimal alignment from HOM 9, sampled from the genetic algorithm.
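For readers who prefer code to formulas, the APDB score of a pair of structures (equations (1)-(4) above) can be sketched as follows. This is a minimal illustration, not the APDB package itself: it takes plain Cα coordinate lists rather than PDB files, and all function and variable names are mine.

```python
from math import dist  # Euclidean distance, Python >= 3.8

def apdb_score(coords_a, coords_b, pairs, brad=10.0, t1=1.0, t2=70.0):
    """Sketch of the APDB score for two aligned structures.

    coords_a, coords_b: lists of C-alpha (x, y, z) tuples, one per residue.
    pairs: (i, j) index pairs aligned between structure A and structure B.
    Returns the fraction of aligned pairs judged correct (equation 4).
    """
    aligned = 0
    for x, y in pairs:
        considered = correct = 0
        for z, w in pairs:
            if (z, w) == (x, y):
                continue
            dxz = dist(coords_a[x], coords_a[z])
            dyw = dist(coords_b[y], coords_b[w])
            if dxz < brad and dyw < brad:        # equation (1): Z:W is inside the bubble
                considered += 1
                if abs(dxz - dyw) < t1:          # equation (2): the two distances agree
                    correct += 1
        if considered and 100.0 * correct / considered > t2:   # equation (3)
            aligned += 1
    return aligned / len(pairs)                  # equation (4)
```

On two identical toy structures every pair is supported by its whole bubble and the score is 1.0; displacing a single residue in one structure only removes that residue's pair from the count, which illustrates the local character of the measure.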
Generation of multiple alignments

We compared the performance of APDB on two different multiple alignment methods. We tested the widely used ClustalW version 1.80 (Thompson et al., 1994). We also tested the more recent T-Coffee version 1.37 (Notredame et al., 2000), using default parameters.

Generation of sub-optimal alignments

In order to evaluate the sensitivity of APDB to the quality of an alignment, we used an improved version of the genetic algorithm SAGA (Notredame and Higgins, 1996) to generate populations of sub-optimal alignments. In each case, a pair of sequences was chosen in HOM 9 and 50 random alignments were generated and allowed to evolve within SAGA so that their quality gradually improved (as measured by their similarity with the HOMSTRAD reference alignment). Ten alignments were sampled at each generation in order to build a collection of alternative alignments with varying degrees of quality. The algorithm was stopped when optimality was reached, typically yielding collections of a few hundred alignments.

RESULTS AND DISCUSSION

Fine tuning of APDB

Three parameters control the behaviour of APDB: Brad (the bubble radius), T1 (the difference-of-distance threshold) and T2 (the fraction of the bubble neighbourhood that must support the alignment of two residues). We exhaustively studied the tuning effect of each of these parameters using HOM 9 and parameterised APDB so that its behaviour is as consistent as possible with the behaviour of CS on HOM 9. In Figure 1 we show the relationship between CS and APDB for 250 sub-optimal alignments generated by genetic algorithm for one of the 9 test cases from HOM 9, over 4 different settings of Brad, the bubble radius. While the two scoring schemes are in broad agreement, the correlation improves dramatically as Brad increases. This trend can be summarised using the correlation coefficient measured on each of the graphs similar to those shown in Figure 1.
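The correlation coefficient used to summarise each graph can be computed as, for example, the standard Pearson coefficient between the CS and APDB scores of a population of sub-optimal alignments (the paper does not name the exact coefficient, so Pearson is my assumption here, and the score lists below are invented for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists.

    No special handling for constant lists (zero variance) in this sketch.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# One point of Figure 2: correlate the CS score of every sub-optimal alignment
# with its APDB score under a given bubble radius (made-up numbers).
cs_scores = [20.0, 35.0, 50.0, 65.0, 80.0, 95.0]
apdb_scores = [18.0, 30.0, 48.0, 60.0, 75.0, 90.0]
r = pearson(cs_scores, apdb_scores)
```

Repeating this computation for each Brad setting and each test case yields one dot per (test set, Brad) combination, as plotted in Figure 2.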
The overall results for all nine HOM 9 test cases are shown in Figure 2. These results clearly show that the behaviour of APDB is best for values of Brad of 10 Å or above. With these values, the level of correlation between CS and APDB increases, and so does the agreement across all 9 test cases. We chose 10 Å as the default value in order to ensure a proper behaviour while retaining as much as possible the local character of the measure. Given the default value of 10 Å for Brad, we examined T1 and T2 in a similar fashion and found the most appropriate values to be 1 Å for T1 and 70% for T2.

Fig. 2. Correlation between the Column Score measure (CS) and APDB on HOM 9. Each HOM 9 test set is labelled according to its average percent sequence identity as measured on the HOMSTRAD reference. The horizontal axis indicates the value of Brad. The vertical axis indicates the correlation coefficient between CS and APDB as measured on a population of sub-optimal alignments similar to the ones in Figure 1. Each dot indicates a correlation coefficient measured on one HOM 9 test set, using the indicated value of Brad. Each HOM 9 test set is an alignment between two sequences whose PDB names are as follows: 17: 2gar versus 1fmt, 18: 1jfl versus 1b74, 33: 1isi versus 11be, 43: 2cev versus 1d3v, 52: 1aq0 versus 1ghs, 63: 2gnk versus 2pii, 71: 1hcz versus 1cfm, 82: 1dvg versus 1qq8, 89: 1k25 versus 1qme.

Sensitivity of APDB to sequence and structure similarity

It is important to verify that the behaviour of APDB remains consistent across a wide range of sequence similarity levels. It is especially important to make sure that when two different alignments of the same sequences are evaluated, the best one (as judged by comparison with the HOMSTRAD reference) always gets the best APDB score. In order to check for this, we used the genetic algorithm to generate sub-optimal alignments for each test case in HOM 9.
In each case, we gathered a collection of 250 sub-optimal alignments with CS scores of 0–40%, 41–60%, 61–80% and 81–100%. The CS score measures the agreement between an alignment and its reference in HOMSTRAD. We then measured the average APDB score in each of these collections. Each of these measures corresponds to a dot in Figure 3, where vertically aligned series of dots correspond to different measures made on the same HOM 9 test set. Figure 3 clearly shows that regardless of the percent identity within the HOM 9 test set being considered, alignments with better CS scores always correspond to better APDB scores (as a result, the lines never cross one another in Figure 3). We did a similar analysis using the RMSD as measured on the HOMSTRAD alignment in place of sequence identity. The behaviour was the same and clearly indicates that APDB gives consistent results regardless of the structural similarity between the structures being considered.

Suitability of APDB for analysing sub-optimal alignments

Collections of sub-optimal alignments for each of the nine HOM 9 test sets were generated using SAGA and evaluated for their CS and APDB scores. These results were pooled and are displayed on the graph shown in Figure 4. This figure indicates good agreement between the CS and the APDB scores, regardless of the level of optimality of the alignment being considered. It suggests that APDB is informative over the complete range of CS values. It also confirms that APDB is not 'too generous' with sub-optimal alignments. We also checked whether sequence alignments that are structurally equivalent obtain similar APDB scores even if they are different at the sequence level. For this purpose, we used PROSUP (Lackner et al., 2000). Given a pair of structures, PROSUP generates several alignments that are equally good from a structural point of view (similar RMSD), but can be very different at the sequence level (different Column Score).
We manually identified two such test sets in HOMSTRAD and the results are summarized in Table 1. For each of these two test sets, we selected in the output of PROSUP two alignments (aln1 and aln2) to which PROSUP assigns similar RMSDs. aln1 is used as a reference and therefore gets a CS score of 100, while the CS score of the second alignment (aln2) is computed by direct comparison with its aln1 counterpart.

Fig. 3. Estimation of the sensitivity of APDB to sequence identity. On this graph, each set of vertically aligned dots corresponds to a single HOM 9 test set. The 9 HOM 9 test sets are arranged according to their average identity (17–89%, see Figure 2). Each dot represents the average APDB score of a population of 250 sub-optimal alignments (generated by genetic algorithm) with a similar CS score (binned in four groups representing CS of <40%, 41–60%, 61–80% and 81–100%) generated for one of the 9 HOM 9 test sets.

Table 1. Evaluating PROSUP sub-optimal alignments with APDB

Set  St1    St2    ALN   RMSD     CS     APDB
1    1e96B  1a17   aln1  1.45 Å   100.0  80.2
1    1e96B  1a17   aln2  1.50 Å   65.6   80.7
2    1cd8   1qfpa  aln1  2.95 Å   100.0  18.7
2    1cd8   1qfpa  aln2  2.95 Å   55.1   17.9

Set indicates the test set index, St1 and St2 indicate the two structures being aligned by PROSUP, ALN indicates the alignment being considered, RMSD shows the RMSD associated with this alignment, and CS indicates its CS score, with the CS score of aln1 alignments being set to 100 because they are used as references. APDB indicates the APDB score.

Fig. 4. Correlation between CS and APDB on the complete HOM 9 test set. Each dot corresponds to a sub-optimal alignment of one of the HOM 9 test cases, generated by genetic algorithm. For each alignment, the graph plots the APDB score against its CS counterpart.

In both test sets, using aln1 as a reference for the CS measure leads to the conclusion that aln2 is mostly incorrect (cf. the CS column of Table 1). This is not true, since these alignments are structurally equivalent, as indicated by their RMSDs. In such a situation, APDB behaves much more appropriately and gives each aln1/aln2 couple scores that are consistent with their RMSDs, indicating that APDB can equally well reward two sub-optimal alignments when these are equivalent from a structural point of view.

Using APDB to benchmark alignment methods

Table 2 shows the average CS and APDB scores for the test sets in each of the four Bali 91 categories being considered here and in HOM 43. The highest scores in all cases, for both measures, come from the reference column (the last column). This is desirable, providing the reference alignments really are consistent with the underlying structures. If we now compare the columns two by two, we find that every variation of CS from one column to another agrees with the corresponding variation of APDB. For instance, in row 1 (Bali 91 Ref1), when T-Coffee/CS is lower than ClustalW/CS, T-Coffee/APDB is also lower. This observation is true for the whole table, regardless of the pair of results being considered. When considering the 134 alignments one by one, this observation remains true in more than 70% of the cases.

Table 2. Correlation between APDB and CS on BAliBASE and HOMSTRAD

Test set  N   ClustalW CS  ClustalW APDB  T-Coffee CS  T-Coffee APDB  Reference CS  Reference APDB
B91 R1    35  70.1         59.9           67.7         58.3           100           64.7
B91 R2    23  32.7         26.6           33.9         47.1           100           55.2
B91 R3    22  46.4         38.5           48.6         46.9           100           53.2
B91 R4    11  52.0         59.5           52.5         64.5           100           65.7
H43       43  35.4         60.2           38.9         61.6           100           72.9

Test set indicates the test set being considered, either one of the BaliBase 91 references (B91 R#) or HOM 43 (H43), a subset of HOMSTRAD. N indicates the number of test alignments in this category. ClustalW indicates a set of measures made on alignments generated with ClustalW. T-Coffee indicates similar measures made on T-Coffee generated alignments. Reference indicates measures made on the reference alignments as provided in BAliBASE or in HOMSTRAD. CS columns are the Column Score measures, while APDB columns indicate similar measures made using APDB.

CONCLUSION

This work introduces APDB, a novel method that makes it possible to evaluate the quality of a sequence alignment when two or more tertiary structures of the sequences it contains are available. This method does not require a reference alignment and it does not depend on any complex procedure such as structure superposition or sequence alignment. We show here that APDB sensitivity is comparable with that of CS, a well-established measure that compares a target alignment with a reference alignment. Our results also indicate that APDB can discriminate better than CS between structurally correct sub-optimal sequence alignments and structurally incorrect sequence alignments, even when the structures being considered are distantly related. Apart from the cost associated with their assembly, a serious problem with reference alignments is that they need to be annotated to remove from the evaluation regions that correspond to non-superposable portions of the structures. This is necessary because otherwise these regions (whose alignment cannot be trusted) would bias a CS evaluation toward rewarding the arbitrary alignment conformation displayed in the reference. Table 2 illustrates well the fact that such an annotation is not necessary in APDB. In our measure, thanks to the combination of local evaluation and the absence of a reference alignment, the only possible effect of non-superposable regions is to decrease the proportion of residues found aligned in a structurally optimal sequence alignment, thus yielding scores lower than 100 in the case of distantly related structures.

A key advantage of APDB is its simplicity. It only requires three parameters and a few PDB files. Most importantly, APDB does not require any arbitrary manual intervention such as the assembly of a reference alignment. In the short term, all the existing collections of reference alignments could easily be integrated and extended with APDB. In the longer term, APDB could also be used to evaluate and compare existing collections of alignments such as profiles, when structures are available.

REFERENCES

Barton,G.J. and Sternberg,M.J.E. (1987) A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol., 198, 327–337.
Cristobal,S., Zemla,A., Fischer,D., Rychlewski,L. and Elofsson,A. (2001) A study of quality measures for protein threading models. BMC Bioinformatics, 2, 5.
Duret,L. and Abdeddaim,S. (2000) Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. In Higgins,D. and Taylor,W. (eds), Bioinformatics, Sequence, Structure and Databanks. Oxford University Press, Oxford.
Karplus,K. and Hu,B. (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 17, 713–720.
Lackner,P., Koppensteiner,W.A., Sippl,M.J. and Domingues,F.S. (2000) ProSup: a refined tool for protein structure alignment. Protein Eng., 13, 745–752.
Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469–2471.
Notredame,C. (2002) Recent progress in multiple sequence alignments. Pharmacogenomics, 3, 131–144.
Notredame,C. and Higgins,D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24, 1515–1524.
Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel algorithm for multiple sequence alignment. J. Mol. Biol., 302, 205–217.
Pascarella,S. and Argos,P. (1992) A data bank merging related protein structures and sequences. Protein Eng., 5, 121–137.
Siew,N., Elofsson,A., Rychlewski,L. and Fischer,D. (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776–785.
Thompson,J., Higgins,D. and Gibson,T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690.
Thompson,J., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics, 15, 87–88.

BIOINFORMATICS APPLICATIONS NOTE  Sequence analysis  Vol. 22 no. 19 2006, pages 2439–2440  doi:10.1093/bioinformatics/btl404

APDB: a web server to evaluate the accuracy of sequence alignments using structural information

Fabrice Armougom1, Olivier Poirot1, Sébastien Moretti1, Desmond G. Higgins2, Phillip Bucher3, Vladimir Keduas1 and Cédric Notredame1,*

1CNRS UPR2589, Institute for Structural Biology and Microbiology (IBSM), Parc Scientifique de Luminy, 163 Avenue de Luminy, FR-13288, Marseille cedex 09, France, 2The Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Ireland and 3Institut Suisse de Recherche et d'Expérimentation sur le Cancer, Ch. des Boveresses 155, CH-1066 Epalinges, Switzerland

Received on March 31, 2006; revised on June 17, 2006; accepted on July 21, 2006
Associate Editor: Thomas Lengauer

ABSTRACT
Summary: The APDB web server uses structural information to evaluate the alignment of sequences with known structures. It returns a score correlated with the overall alignment accuracy, as well as a local evaluation. Any sequence alignment can be analyzed with APDB provided it includes at least two proteins with known structures.
Sequences without a known structure are simply ignored and do not contribute to the scoring procedure.
Availability: APDB is part of the T-Coffee suite of tools for alignment analysis; it is available on www.tcoffee.org. A standalone version of the package is also available as free open-source software from the same address.
Contact: cedric.notredame@gmail.com

1 INTRODUCTION

Structure-based sequence alignments have become a key component in the design and the improvement of sequence alignment methods. The extensive usage of structural information to align sequences results mostly from the observation that 3D folds evolve more slowly than primary sequences (Chothia and Lesk, 1986) and can be used to derive accurate sequence alignments, even when the sequences themselves have diverged beyond recognition. This property is often used to compute sequence alignments of remote homologues or to assemble collections of reference alignments that are used as gold standards to validate, benchmark and improve sequence alignment methods (Edgar, 2004; Thompson et al., 2005; Van Walle et al., 2005). Detailed analysis, however, shows that structure alignment methods often disagree on distantly related proteins (Kolodny et al., 2005). For instance, the alignments delivered by CE (Shindyalov and Bourne, 1998) and DALI (Holm and Sander, 1996) only agree on 70% of the positions, as judged on the 1682 pairs of homologous structures contained in the Prefab database (Edgar, 2004). These variations probably explain why established collections of structure-based alignments are sometimes inconsistent with one another. In previous work, we argued that it may sometimes be more reliable to evaluate sequence alignments by directly using the structural information they are associated with, rather than depending on an intermediate reference alignment that constitutes a potentially distorted interpretation of the original structural signal.
In an attempt to provide such a direct measure, we developed an algorithm named APDB (Analyze PDB) (O'Sullivan et al., 2003) that uses the structural information independently of any structural alignment or superposition. APDB relies on the simple observation that if two similar structures are aligned correctly, the intramolecular distances between equivalent Cα (as defined by the alignment) must be similar. By simply measuring the geometric compatibility of two structures according to their alignment, APDB makes no reference to any authoritative alignment and is therefore independent of any kind of optimization, unlike similar methods such as MaxSub (Siew et al., 2000), LSQman (Kleywegt and Jones, 1999) or TM-score (Zhang and Skolnick, 2004). This also makes APDB suitable for comparing alternative alignments of the same sequences, as long as the corresponding structures are available.

*To whom correspondence should be addressed.

2 USING THE APDB SERVER

The server is available on http://www.tcoffee.org/. It only makes sense to use the APDB server if the alignment one wishes to evaluate contains at least two sequences with a known structure. These sequences must be named according to their structure PDB identifier (with the chain index appended if appropriate). Whenever there is an imperfect match between the user's sequence and the PDB sequence, the program makes an automatic alignment-based reconciliation. This process explicitly fails when the sequences are less than 80% identical. Sequences without a known structure are simply ignored and do not contribute to the scoring procedure. The server returns the overall APDB scores and a color-coded alignment with local APDB scores (Fig. 1). The overall APDB score is an estimation of the fraction of columns likely to be correctly aligned within a pairwise structural alignment, and the color code estimates the potential correctness of each individual alignment position (red: very likely; orange: possible; green/blue: unlikely).

Fig. 1. Output of a standard APDB computation. The 1cvl_1tca Prefab dataset has been aligned with Muscle (Edgar, 2004) (a) and ClustalW (Thompson et al., 1994) (b). The resulting alignments were evaluated with the APDB server and the figure displays the local APDB score. Sequences in red and orange are considered correctly aligned by APDB.

Figure 1 shows the evaluation of two alternative alignments of the same structures. The first one was produced by Muscle (3.52) and is estimated to be 43.8% accurate as compared with its Prefab reference (Edgar, 2004). The second one, produced by ClustalW (1.81), is expected to have an accuracy of 55.7% according to the Prefab criterion. The score returned by APDB is in broad agreement with these figures (ClustalW APDB: 50.3%, Muscle APDB: 47.5%).

ACKNOWLEDGEMENTS

The authors thank Prof. Jean-Michel Claverie (head of IGS) for material support. The development of the server was supported by CNRS (Centre National de la Recherche Scientifique), Sanofi-Aventis Pharma SA, Marseille-Nice Génopole and the French National Genomic Network (RNG).

Conflict of Interest: none declared.

REFERENCES

Chothia,C. and Lesk,A.M. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J., 5, 823–826.
Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 283, 595–602.
Kleywegt,G.J. and Jones,T.A. (1999) Software for handling macromolecular envelopes. Acta Crystallogr. D Biol. Crystallogr., 55, 941–944.
Kolodny,R. et al. (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol., 346, 1173–1188.
O'Sullivan,O. et al. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19 (Suppl. 1), i215–i221.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Siew,N. et al. (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776–785.
Thompson,J. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690.
Thompson,J.D. et al. (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins, 61, 127–136.
Van Walle,I. et al. (2005) SABmark—a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21, 1267–1268.
Zhang,J. and Skolnick,J. (2004) Scoring function for automated assessment of protein structure template quality. Proteins, 57, 702–710.

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

BIOINFORMATICS  Vol. 14 no. 5 1998  Pages 407–422

COFFEE: an objective function for multiple sequence alignments

Cédric Notredame1, Liisa Holm1 and Desmond G. Higgins2

1EMBL Outstation – The European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1SD, UK and 2Department of Biochemistry, University College, Cork, Ireland

Received on January 19, 1998; revised and accepted on February 24, 1998

Abstract

Motivation: In order to increase the accuracy of multiple sequence alignments, we designed a new strategy for optimizing multiple sequence alignments by genetic algorithm. We named it COFFEE (Consistency based Objective Function For alignmEnt Evaluation). The COFFEE score reflects the level of consistency between a multiple sequence alignment and a library containing pairwise alignments of the same sequences.
Results: We show that multiple sequence alignments can be optimized for their COFFEE score with the genetic algorithm package SAGA. The COFFEE function is tested on 11 test cases made of structural alignments extracted from 3D_ali. These alignments are compared to those produced using five alternative methods. Results indicate that COFFEE outperforms the other methods when the level of identity between the sequences is low. Accuracy is evaluated by comparison with the structural alignments used as references. We also show that the COFFEE score can be used as a reliability index for multiple sequence alignments. Finally, we show that given a library of structure-based pairwise sequence alignments extracted from FSSP, SAGA can produce high-quality multiple sequence alignments. The main advantage of COFFEE is its flexibility. With COFFEE, any method suitable for making pairwise alignments can be extended to making multiple alignments.
Availability: The package is available along with the test cases through the WWW: http://www.ebi.ac.uk/~cedric
Contact: cedric.notredame@ebi.ac.uk

Introduction

Multiple alignments are among the most important tools for analysing biological sequences. They can be useful for structure prediction, phylogenetic analysis, function prediction and polymerase chain reaction (PCR) primer design. Unfortunately, accurate multiple alignments may be difficult to build. There are two main reasons for this. First of all, it is difficult to evaluate the quality of a multiple alignment. Secondly, even when a function is available for the evaluation, it is algorithmically very hard to produce the alignment having the best possible score (the optimal alignment). Cost functions or scoring functions roughly fall into two categories. First of all, there are those that rely on a substitution matrix. These are the most widely used.
They require a substitution matrix (Dayhoff, 1978; Henikoff and Henikoff, 1992) that gives a score to each possible amino acid substitution, a set of gap penalties that gives a cost to deletions/insertions (Altschul, 1989), and a set of sequence weights (Altschul et al., 1989; Thompson et al., 1994b). Under this scheme, an optimal multiple alignment is defined as the one having the lowest cost for substitutions and insertions/deletions. One of the most widely used scoring methods of this type is the ‘weighted sums of pairs with affine (or semi-affine) gap penalties’ (Altschul and Erickson, 1986). The main limitation of these scoring schemes is that they rely on very general substitution matrices, usually established by statistical analysis of a large number of alignments. These may not necessarily be adapted to the set of sequences one is interested in. To compensate for this drawback, a second type of scoring scheme was designed: profiles (Gribskov et al., 1987) and Hidden Markov Models (HMMs) (Krogh and Mitchison, 1995). Profiles allow the design of a sequence-specific scoring scheme that will take into account patterns of conservation and substitution characteristic of each position in the multiple alignment of a given family. To some extent, HMMs can be regarded as generalized profiles (Bucher and Hofmann, 1996). In HMMs, sequences are used to generate statistical models. The sequences of interest are then aligned to the model one after another to generate the multiple sequence alignment. The main drawback of HMMs is that to be general enough, the models require large numbers of sequences. However, this can be partially overcome by incorporating in the model some extra information such as Dirichlet mixtures (the equivalent of a substitution matrix in an HMM context) (Sjolander et al., 1996). Whatever scoring scheme one wishes to use, the optimization problem may be difficult.
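As a concrete, deliberately tiny illustration of the first category of scoring scheme, the sums-of-pairs score of a single alignment column can be sketched as follows. The substitution values and gap penalty are invented for the example, not taken from Dayhoff or BLOSUM, and only linear (not affine) gaps are shown:

```python
# Minimal sketch of a sums-of-pairs score for one alignment column.
# The toy substitution "matrix" and gap penalty are illustrative values only.

SUBST = {("A", "A"): 4, ("A", "G"): 0, ("G", "A"): 0, ("G", "G"): 6}
GAP_PENALTY = -4  # cost charged once per residue/gap pair (linear gaps)

def sum_of_pairs(column, weights=None):
    """Score one alignment column as the (weighted) sum over all residue pairs."""
    n = len(column)
    if weights is None:
        weights = [[1.0] * n for _ in range(n)]  # unweighted sums of pairs
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            a, b = column[i], column[j]
            if a == "-" and b == "-":
                continue  # gap/gap pairs are conventionally ignored
            elif a == "-" or b == "-":
                total += weights[i][j] * GAP_PENALTY
            else:
                total += weights[i][j] * SUBST[(a, b)]
    return total
```

The `weights` argument stands in for the sequence-weighting schemes cited above; with all weights equal to 1 the function reduces to the plain sums of pairs.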
There are two types of optimization strategies: the greedy ones that rely on pairwise alignments and those that attempt to align all the sequences simultaneously. The main tool for making pairwise alignments is an algorithm known as dynamic programming (Needleman and Wunsch, 1970), often used for optimizing the sums of pairs. The complexity of the algorithm makes it hard to apply to more than two sequences (or two alignments) at a time. Nevertheless, it allows greedy progressive alignments as described by Feng and Doolittle (1987) or Taylor (1988). In such a case, the sequences are aligned in an order imposed by some estimated phylogenetic tree. The alignment is called progressive because it starts by aligning together closely related sequences and continues by aligning these alignments two by two until the complete multiple alignment is built. Some of the most widely used multiple sequence alignment packages like ClustalW (Thompson et al., 1994a), Multal (Taylor, 1988) and Pileup (Higgins and Sharp, 1988) are based on this algorithm. They have the advantage of being fast and simple, as well as reasonably sensitive. Their main drawback is that mistakes made at the beginning of the procedure are never corrected and can lead to misalignments due to the greediness of the strategy. It is to avoid this pitfall that the second type of methods have been designed. They mostly involve aligning all the sequences simultaneously. For the sums of pairs, this is a difficult problem that has been shown to be NP-complete (Wang and Jiang, 1994). However, using the Carrillo and Lipman (1988) algorithm implemented in the Multiple Sequence Alignment program MSA (Lipman et al., 1989), one can simultaneously align up to 10 sequences.
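The pairwise dynamic programming step underlying these progressive methods can be sketched for two sequences as follows. The match/mismatch scores and linear gap penalty are illustrative; real packages use substitution matrices and affine gaps:

```python
# Minimal Needleman-Wunsch global alignment score (illustrative parameters).

def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    n, m = len(s), len(t)
    # F[i][j] = best score aligning the prefix s[:i] with the prefix t[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # substitution
                          F[i - 1][j] + gap,      # gap in t
                          F[i][j - 1] + gap)      # gap in s
    return F[n][m]
```

A traceback through `F` (omitted here for brevity) recovers the alignment itself; progressive methods apply this same recursion to profiles instead of single sequences.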
Other global alignment techniques using the sums of pairs cost function involve the use of stochastic heuristics such as simulated annealing (Ishikawa et al., 1993a; Godzik and Skolnick, 1994; Kim et al., 1994), genetic algorithms (Ishikawa et al., 1993b; Notredame and Higgins, 1996) or iterative methods (Gotoh, 1996). Simulated annealing can also be used to optimize HMMs (Eddy, 1995). The stochastic methods have two main advantages over the deterministic ones. First of all, they have a lower complexity. This means that they do not have strong limitations on the number of sequences to align or on the length of these sequences. Secondly, these methods are more flexible regarding the objective function they can use. For instance, MSA is restricted to an approximation of the sums of pairs using semi-affine gap penalties (Lipman et al., 1989) instead of the natural ones shown to be biologically more realistic (Altschul, 1989). This is not the case with simulated annealing (Kim et al., 1994). The main drawback of stochastic methods is that they do not guarantee optimality. However, in some previous work, we showed that with the Sequence Alignment Genetic Algorithm (SAGA), results similar to MSA could be obtained (Notredame and Higgins, 1996). We also showed that the package was able to handle test cases with sizes much beyond the scope of MSA. The robustness of SAGA as an optimizer was confirmed by results obtained on a different objective function for RNA alignment (Notredame et al., 1997) and motivated our choice to use SAGA for optimizing the new objective function described here. The main argument for aligning all the sequences simultaneously instead of making a greedy progressive alignment is that using all the available information should improve the final result. However, one limitation of such methods is that regions of low similarity may induce some noise that will weaken the signal of the correct alignment (Morgenstern et al., 1996).
In order to avoid this, one would like a scheme that filters some of the initial information and allows its global use. The approach we propose here is an attempt to do so. The underlying principle is to generate a set of pairwise alignments and look for consistency among these alignments. In this case, we define the optimal multiple alignment as the most consistent one and produce it using the SAGA package. The idea of using the consistency information in a multiple sequence alignment context is not new (Gotoh, 1990; Vingron and Argos, 1991; Kececioglu, 1993). In his scheme, Gotoh (1990) proposed the identification of regions that are fully consistent among all the pairwise alignments. These regions are used as anchor points in the multiple alignment, in order to decrease complexity. A similar strategy was described by Vingron and Argos (1991), allowing the computation of a multiple alignment from a set of dot matrices. Although very interesting, these methods had several pitfalls, including a sensitivity to noise (especially when some sequences are highly inconsistent with the rest of the set) and a high computational complexity. The work of Kececioglu (1993) bears a stronger similarity to the method we propose here. Kececioglu directly addressed the problem of finding a multiple alignment that has the highest level of similarity with a collection of pairwise alignments. Such an alignment is named ‘maximum weight trace alignment’ (MWT), and its computation was shown to be NP-complete. An optimization method was also described, based on dynamic programming and limited to a small number of sequences (six maximum). More recently, a method was described that allows the construction of a multiple alignment using consistent motifs identified over the whole set of sequences by a variation of the dynamic programming algorithm (Morgenstern et al., 1996). 
This algorithm should be less sensitive to noise than the one described by Vingron and Argos, but its main drawback is that it does rely on a greedy strategy for assembling the multiple alignment. An important aspect of multiple sequence alignment often overlooked is the estimation of reliability. Since all the alignment scoring functions available are known to be intrinsically inaccurate, identifying the biologically relevant portions of a multiple alignment may be more important than increasing the overall accuracy of this alignment. A few techniques have been proposed to identify accurately aligned positions in pairwise (Vingron and Argos, 1990; Mevissen and Vingron, 1996) and multiple sequence alignments (Gotoh, 1990; Rost, 1997). We show here that our method allows a reasonable estimation of a multiple alignment's local reliability. The measure we use for reliability is in fact very simple and could easily be extended much further to incorporate other methods such as the one described by Mevissen and Vingron (1996).

Methods

The overall approach relies on the definition of an objective function (OF) describing the quality of multiple protein sequence alignments. Given a set of sequences and an ‘all-against-all’ collection of pairwise alignments of these sequences (library), the score of a multiple sequence alignment is defined as the measure of its consistency with the library. This objective function was optimized with the SAGA package. Sets of sequences with a known structure and for which a multiple structural alignment is available were extracted from the 3D_ali database (Pascarella and Argos, 1992) and used in order to validate the biological relevance of the new objective function. Two other test cases were designed using the DALI server (Holm and Sander, 1996a) and aligned using libraries made of structural pairwise alignments extracted from the FSSP database (Holm and Sander, 1993).
Objective function

The OF is a measure of quality for multiple sequence alignments. Ideally, the better its score, the more biologically relevant the multiple alignment. The method proposed here requires two components: (i) a set of pairwise reference alignments (library), and (ii) the OF that evaluates the consistency between a multiple alignment and the pairwise alignments contained in the library. We named this objective function COFFEE (Consistency based Objective Function For alignmEnt Evaluation).

Creation of the library

A library is specific for a given set of sequences and is made of pairwise alignments. Taken together, these alignments should contain at least enough information to define a multiple alignment of the sequences in the set. In practice, given a set of N sequences, we included in the library a pairwise alignment for each of the (N² − N)/2 possible pairs of sequences. This choice is arbitrary since, in theory, there is no limit regarding the amount of redundancy one can incorporate into a library. For instance, instead of each pair of sequences being represented by a single pairwise alignment, one could use several alternative alignments of this pair, obtained by various methods. In fact, the library is mostly an interface between any method one can invent for generating pairwise alignments and the COFFEE function optimized by SAGA. However, the method follows the rule ‘garbage in/garbage out’, and the overall properties of the COFFEE function will most likely reflect the properties of the method used to build the library. The amount of time it takes to build the library depends on the alignment method used and increases quadratically with the number of sequences. Inside the evaluation algorithm, the library is stored in a look-up table. If each pair of sequences is represented only once, the amount of memory required for the storage increases quadratically with the number of sequences and linearly with the length of the alignments.
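A minimal sketch of such a library as a look-up table follows. The helper names and data layout are our own illustration, and `align_pair` stands for any pairwise method (ClustalW, a structure superposition, etc.) returning two gapped rows:

```python
from itertools import combinations

def aligned_residue_pairs(row_i, row_j):
    """Extract (pos_in_i, pos_in_j) pairs from one gapped pairwise alignment;
    positions are 0-based indices into the ungapped sequences."""
    pairs, x, y = set(), 0, 0
    for a, b in zip(row_i, row_j):
        if a != "-" and b != "-":
            pairs.add((x, y))
        x += a != "-"
        y += b != "-"
    return pairs

def build_library(sequences, align_pair):
    """One entry per pair of sequences, i.e. (N^2 - N)/2 pairwise alignments,
    stored as a look-up table keyed by the pair of sequence indices."""
    library = {}
    for i, j in combinations(range(len(sequences)), 2):
        row_i, row_j = align_pair(sequences[i], sequences[j])
        library[(i, j)] = aligned_residue_pairs(row_i, row_j)
    return library
```

Storing each alignment as a set of residue-index pairs makes the later consistency checks simple set intersections.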
For the analyses presented here, two types of libraries were built. The first one relies on ClustalW. Given a set of N sequences, each possible pair of sequences was aligned using ClustalW with default parameters. The collection of output files obtained that way constitutes the library (ClustalW library). The motivation for using ClustalW as a pairwise method stems from the fact that Clustal uses local gap penalties, even for two sequences. In order to show that COFFEE is not dependent on the method used to construct the library, a second category of library was created using the FSSP database (Holm and Sander, 1996b). FSSP is a database containing all the PDB structures aligned with one another in a pairwise manner. For each test case, a set of sequences was chosen and the (N² − N)/2 pairwise structure alignments involving these sequences were extracted from the FSSP database to construct an FSSP library. We also used as references the multiple alignments contained in FSSP. An FSSP entry is always based around a guide structure to which all the other structures are aligned in a pairwise manner. This collection of pairwise alignments can be regarded as a pairwise-based multiple alignment. This means that if one is interested in a set of N protein structures, FSSP contains the N corresponding pairwise-based multiple alignments, each using one structure of the set as a guide. Generally speaking, these N multiple alignments do not have to be consistent with one another, but only with the subset of the pairwise alignments that was used to produce them.

Evaluation procedure: the COFFEE function

Let us assume an alignment of N sequences and an appropriate library built for this set. Evaluation is made by comparing each pair of aligned residues (i.e. two residues aligned with each other or a residue aligned with a gap) observed in the multiple alignment to those present in the library (Figure 1).
In such a comparison, residues are identified by their position in the sequence (gaps are not distinguished from one another). In the simplest scheme, the overall consistency score is equal to the number of pairs of residues present in the multiple alignment that are also found in the library, divided by the total number of pairs observed in the multiple sequence alignment. This measure gives an overall score between 0 and 1. The maximum value a multiple alignment can have depends on the library. For the optimal score to be 1, all the alignments in the library need to be compatible with one another (e.g. when all the pairwise alignments have been extracted from the same multiple sequence alignment or when the sequences are almost identical). In practice, this scheme needs extra readjustments to incorporate some important properties of the sequence set. For instance, the significance of the information content of each pairwise alignment is not identical. Several schemes have been described in the literature for weighting sequences according to the amount of information they bring to a multiple alignment (Altschul et al., 1989; Sibbald and Argos, 1990; Vingron and Sibbald, 1993; Thompson et al., 1994a). In COFFEE, our main concern was to decrease the amount of noise produced by inaccurate pairwise alignments in the library. To do so, each pairwise alignment in the library is given a weight that is a function of its quality. For this purpose, we used a very simple criterion: the weight of a pairwise alignment is equal to the per cent identity between the two aligned sequences in the library. This may seem counterintuitive, since weighting schemes are normally used in order to decrease the amount of redundancy in a set of sequences (i.e. down-weighting sequences that have a lot of close relatives).
Doing so makes sense in the context of profile searches (Gribskov et al., 1987; Thompson et al., 1994b), where it is important to prevent domination of the profile by a given subfamily. However, in the case of multiple sequence alignments made by global optimization, it is more important to make sure that closely related pairs of sequences are correctly aligned, regardless of the background noise introduced by other, less related sequences. In such a context, a weight can be regarded as a constraint. The consequence is that the alignment of a given sequence will mostly be influenced by its closest relatives. On the other hand, if a sequence lacks any really close relative, its alignment will mostly be influenced by the consistency of its pairwise alignments with the rest of the library.

The COFFEE function can be formalized as follows. Given N aligned sequences S_1 ... S_N in a multiple alignment, A_{i,j} is the pairwise projection (obtained from the multiple alignment) of the sequences S_i and S_j, LEN(A_{i,j}) is the length of this alignment, SCORE(A_{i,j}) is the overall consistency (level of identity) between A_{i,j} and the corresponding pairwise alignment in the library, and W_{i,j} is the weight associated with this pairwise alignment. Given these definitions, the COFFEE score is defined as follows:

\[ \text{COFFEE score} = \left[ \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j}\,\text{SCORE}(A_{i,j}) \right] \Big/ \left[ \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j}\,\text{LEN}(A_{i,j}) \right] \tag{1} \]

with:

\[ \text{SCORE}(A_{i,j}) = \text{number of aligned pairs of residues shared between } A_{i,j} \text{ and the library} \tag{2} \]

The COFFEE function presents some similarities with the ‘weighted sums of pairs’ (Altschul and Erickson, 1986). Here as well, we consider all the pairwise substitutions in the multiple alignment, and weight these in a way that reflects the relationships between the sequences. The library plays the role of the substitution matrix.
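The COFFEE score defined in equation (1) can be sketched in a few lines of Python. This is our own illustration, not the SAGA implementation; the dictionary layout is an assumption, and LEN(A_ij) is approximated here by the number of residue-residue pairs in the projection:

```python
def coffee_score(projections, library_pairs, weights):
    """Weighted fraction of residue pairs in the multiple alignment that
    are also found in the library (cf. equation (1)).

    projections[(i, j)]   -> set of residue-position pairs in the pairwise
                             projection A_ij of the multiple alignment
    library_pairs[(i, j)] -> set of residue-position pairs in the library
    weights[(i, j)]       -> W_ij, e.g. per cent identity of the library pair
    """
    num = den = 0.0
    for key, proj in projections.items():
        num += weights[key] * len(proj & library_pairs[key])  # SCORE(A_ij)
        den += weights[key] * len(proj)                       # ~LEN(A_ij)
    return num / den if den else 0.0
```

Because the numerator can never exceed the denominator, the score is normalized between 0 and 1, as stated in the text.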
The main differences between the COFFEE function and the ‘weighted sums of pairs’ are that (i) no extra gap penalties are applied in our scheme, since this information is already contained in the library, (ii) the COFFEE score is normalized by the value of the maximum score (i.e. its value is between 0 and 1) and (iii) the cost of the substitutions is made position dependent, thanks to the library (i.e. two similar pairs of residues will have potentially different scores if the indices of the residues are different). Under this formulation, an alignment having an optimal COFFEE score will be equivalent to an MWT alignment using a ‘pairwise alignment graph’ (Kececioglu, 1993).

The score defined above is a global measure for an entire alignment. It can also be adapted for local evaluation. We have defined two types of local scores: the residue score and the sequence score. The residue score is given below. S_i^x is the residue x in sequence S_i and A_{i,j}^{x,y} is the pair of aligned residues S_i^x and S_j^y in the pairwise alignment A_{i,j}:

\[ \text{residue score}(S_i^x) = \sum_{j=1, j \neq i}^{N} W_{i,j}\,\text{OCCURRENCE}(A_{i,j}^{x,y}) \Big/ \sum_{j=1, j \neq i}^{N} W_{i,j} \tag{3} \]

OCCURRENCE(A_{i,j}^{x,y}) is equal to the number of occurrences of the pair A_{i,j}^{x,y} in the reference library (0 or 1 when using the libraries described here). The sequence score is the natural extension of the residue score. It is defined as the sum of the score of each residue in a sequence divided by the number of residues in that sequence.

Optimizing an alignment for its COFFEE score: SAGA-COFFEE

The aim is to create an alignment having the best possible COFFEE score (optimal alignment). Doing so is a difficult task. The computational complexity of a dynamic programming solution is known to be NP-complete (Kececioglu, 1993). For reasons discussed in the Introduction, we used SAGA V0.93 (Notredame and Higgins, 1996).

Fig. 1. COFFEE scoring scheme. This figure indicates how a column of an ALIGNMENT is evaluated by the COFFEE function using a REFERENCE LIBRARY. Each pair in the alignment is evaluated (SCORE MATRIX). In the score matrix, a pair receives a score of 0 if it does not appear in the library or a score equal to the WEIGHT of the pair of sequences in which it occurs in the PAIRWISE LIBRARY. Since the matrix is symmetrical, the column score is equal to the sum of half of the matrix entries, excluding the main diagonal. This value is divided by the maximum score of each entry (i.e. the sum of the weights contained in the library). The residue score is equal to the sum of the entries contained by one line of the matrix, divided by the sum of the maximum score of these entries.

Table 1. Accuracy of the prediction made on the category of substitution

Test case    Length  Nseq  Proportion   Avg Id.  COFFEE score     Accuracy (H+E) %  Accuracy (ov.) %  CPU time  N.G.
                           (H+E) (%)    (%)      Clustal  SAGA    Clustal  SAGA     Clustal  SAGA     (s)
ac_prot      248     14    57           21       0.48     0.56    39.2     50.2     35.2     45.9     21009     535
binding      500     7     68           31       0.72     0.84    50.0     64.5     50.0     61.7     1003      166
cytc         146     6     43           42       0.84     0.87    89.1     90.7     88.3     86.1     699       259
fniii        136     9     48           17       0.49     0.62    42.0     47.0     35.7     43.6     936       480
gcr          52      8     57           36       0.86     0.89    80.8     83.1     76.7     80.2     91        55
globin       183     17    74           24       0.78     0.80    86.4     85.2     82.1     81.7     28477     222
igb          194     37    53           24       0.63     0.67    74.8     78.1     65.6     69.4     110453    132
lzm          213     6     53           39       0.87     0.87    72.2     72.3     72.3     72.4     256       105
phenyldiox   90      8     67           22       0.59     0.65    58.5     64.7     60.4     61.4     388       110
sbt          331     7     57           61       0.96     0.97    96.9     96.9     93.6     93.6     644       127
s_prot       229     15    51           27       0.69     0.74    62.5     66.6     57.7     61.2     44978     744
ceo          882     7     /            14       /        /       /        /        /        /        13756     882
vjs          1207    8     /            12       /        /       /        /        /        /        43568     1400

Test case: generic name of the test case, taken from 3D_ali for the first 11 (ac_prot: acid proteases, binding: sugar/amino acid binding proteins, cytc: cytochrome c, fniii: fibronectin type III, gcr: crystallins, globin: globins/phycocyanins/collicins, igb: immunoglobulin fold, lzm: lysozymes/lactalbumin, phenyldiox: dihydroxybiphenyl dioxygenase, sbt: subtilisin, s_prot: serine protease fold) and from the DALI server for the last two. ceo includes: 1cbg, 1ceo, 1edg, 1byb, 1ghr, 1xyzA and vjs includes: 1cnv, 1vjs, 1smd, 2aaa, 1pamA, 2amg, 1ctn, 2ebn. Length: length of the reference alignment. Nseq: number of sequences in the alignment. Proportion (H+E): percentage of the substitutions involving E→E or H→H. Avg. Id.: average level of identity between the sequences. COFFEE score: score measured with the COFFEE function using a ClustalW library on the ClustalW or the SAGA alignments. Accuracy (H+E): percentage of the (H+E) substitutions found identical between the SAGA (or ClustalW) alignment and the reference. Accuracy (ov.): percentage of substitutions similar in the SAGA (or ClustalW) alignment and in the reference. CPU time: CPU time in seconds on an Alpha 8300 machine. N.G.: number of generations needed by SAGA to find the solution. The results of the analysis for the two last test cases are presented in Table 6.

SAGA follows the general principles of genetic algorithms as described by Goldberg (1989) and Davis (1991). In SAGA, a population of alignments evolves through recombination, mutation and selection. This process goes on through a series of cycles known as generations. At every generation, the alignments are evaluated for their score (COFFEE). This score is turned into some fitness measure.
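The fitness-proportional ("wheel") selection used in this kind of genetic algorithm can be sketched as follows. This is a generic illustration, not SAGA's code; the function and variable names are ours:

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Roulette-wheel selection: each individual gets a slice of the wheel
    proportional to its fitness, so fitter alignments are only statistically,
    not deterministically, favoured."""
    total = sum(fitnesses)
    spin = rng.uniform(0.0, total)  # spin the wheel once
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if spin <= running:
            return individual
    return population[-1]  # guard against floating-point drift
```

Because selection is stochastic rather than greedy, weaker alignments occasionally survive, which is what keeps the search from collapsing onto the first local minimum it encounters.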
In an analogy with natural selection, the fitter an alignment, the more likely it is to survive and produce offspring. From one generation to the next, some alignments will be kept unchanged (statistically the fittest), others will be randomly modified (mutations), combined with another alignment (cross-over) or will simply die (statistically the least fit). The new population generated that way will once again undergo the same chain of events, until the best scoring alignment cannot be improved for a specific number of generations (typically 100).

Operators play a central role in the GA strategy. They can be subdivided into two categories. First, the cross-overs, which combine the content of two multiple alignments. Thanks to them, and to the pressure of selection, good blocks tend to be merged into the same alignments. On their own, cross-overs cannot create new blocks; this needs to be done by the second category of operators: the mutations. These are specific programs that take an alignment as input and modify it by inserting or moving patterns of gaps. Mutations can be slightly greedy (attempt to make some local optimization) or completely random.

A key concept in the genetic algorithm strategy is that the fitness-based selection is not absolute but statistical. To select an individual, a virtual wheel is created. On this wheel, each individual is given a number of slots proportional to its fitness. To make a selection, the wheel is spun. Therefore, the best individuals are simply more likely to survive, or to be selected for a cross-over or a mutation. This form of selection protects the GA search from excessive greediness, hence preventing it from converging onto the first local minimum encountered during the search. SAGA V0.93 is mostly similar to the Version 0.91 described in Notredame and Higgins (1996).
Most of the changes between Version 0.93 and 0.91 have to do with some improvement in the implementation and the user interface, but do not affect the algorithm itself. To optimize the COFFEE scores, SAGA was run using the default parameters described for SAGA 0.91 in Notredame and Higgins (1996). SAGA was also modified so that it could evaluate any alignment (including a ClustalW alignment) using the COFFEE function.

Test cases

To assess the biological accuracy of the COFFEE function and the efficiency of its optimization by SAGA, 13 test cases were designed. They are based on sequences with known three-dimensional structures, for which a structural alignment is available. This choice was guided by the fact that structure-based alignments are usually biologically more correct than any other alternative, especially when they involve proteins with low sequence similarity. For this reason, we used these structure-based alignments as a standard of truth in our analyses. Eleven test cases were extracted from the 3D_ali Release 2.0 (Pascarella and Argos, 1992). Alignments were selected according to the following criteria: alignments with more than five structures and with a consensus length larger than 50. In 3D_ali, 18 alignments meet this requirement. Among these, we removed those for which ClustalW produces a multiple alignment >95% identical to its structural counterpart (four alignments). We also removed three alignments which were impossible to align accurately using ClustalW or SAGA/COFFEE. These consist of sets of very distantly related sequences with unusually long insertions/deletions (barrel, nbd and virus in 3D_ali). These alignments are beyond the scope of conventional global sequence alignment algorithms. This leaves a total of 11 alignments used in our analyses. Their characteristics are shown in Table 1. The two last test cases were extracted from the FSSP database.
As opposed to the 11 other test cases, they have been specifically designed for making a multiple sequence alignment using a structure-based reference library. This explains their low level of average sequence identity, as can be seen in Table 1 (the two last entries, vjs and ceo).

Evaluation of the COFFEE function accuracy

When evaluating a new OF with SAGA, two main issues are involved: the quality of the optimization and the biological relevance of the optimal alignment. Another aspect involves the comparison of the new OF with already existing methods. The evaluation of the biological relevance of the COFFEE function required the use of some references. The structural alignments described above were used for this purpose. Comparison between a sample alignment and its reference was made following the protocol described in Notredame and Higgins (1996), inspired by the method used by Vogt et al. (1995) for substitution matrix comparison and by Gotoh (1996). All the pairs (excluding gaps) observed in the sample alignment were compared to those present in the reference. The level of similarity is defined as the ratio of the number of identical pairs in the two multiple alignments to the total number of pairs in the reference. This procedure gives an overall comparison. It does not reflect the fact that in a global structural alignment, some positions are not correctly aligned because they cannot be aligned (this is true of any position where the two structures cannot be superimposed). In practice, structural alignment procedures may deal with these situations in different ways, producing sequence alignments that are sometimes locally arbitrary (especially in the loops). While in DALI these regions are explicitly excluded from the alignment, it is not so obvious to identify them in the multiple sequence alignments contained in 3D_ali. To overcome this type of noise, a procedure was designed that should be less affected by misalignments.
For this alternative measure of biological relevance, we only take into account substitutions that involve a conservation of secondary structural state in the reference alignment (helix to helix and beta strand to beta strand). In the text and the tables, this category of substitution is referred to as (H+E). In most of the test cases, the (H+E) category makes up the majority of the observed pairs, as can be seen in Table 1. For each of the first 11 test cases (3D_ali), the evaluation procedure involved making multiple alignments with five different methods (cf. the next section) and a ClustalW library (default parameters). The ClustalW library was used with SAGA for producing a multiple alignment having an optimal COFFEE score. The biological relevance of this alignment was then assessed by comparison with the structural reference, and compared to the accuracy obtained with the other methods on the same sequences. On the two last test cases (vjs and ceo), alignments were made using FSSP libraries. Alignment accuracy was assessed using the DALI scoring measure. Given a pairwise alignment, this is a measure of the quality of the structure superimposition implied by the alignment. The program used for this purpose (trimdali) returns the DALI score (Holm and Sander, 1993) and two other values: the length of the consensus (number of residues that could be superimposed) and the average RMS (the average deviation between equivalent Cα atoms). These values were computed for each possible pairwise projection of the multiple alignments and averaged. The scores obtained that way for the SAGA alignments were compared to similar scores measured on the FSSP pairwise-based multiple alignments.
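The pair-based accuracy comparison described above can be sketched as a one-line helper. This is our own illustration, operating on sets of residue-position pairs with gap pairs already excluded:

```python
def alignment_accuracy(sample_pairs, reference_pairs):
    """Fraction of residue pairs in the reference alignment that are
    reproduced in the sample alignment (both given as sets of
    (position_in_seq_a, position_in_seq_b) pairs, gaps excluded)."""
    if not reference_pairs:
        return 0.0
    return len(sample_pairs & reference_pairs) / len(reference_pairs)
```

Restricting both sets to pairs whose reference columns conserve secondary structure yields the (H+E) variant of the measure.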
Comparison of COFFEE with other methods

In total, six alignment methods were used to align the 3D_ali test cases: ClustalW v1.6 (Thompson et al., 1994a), SAGA with the MSA objective function (SAGA-MSA) (Notredame and Higgins, 1996), PRRP (V 2.5.1), the iterative alignment method recently described by Gotoh (1996), PILEUP (Higgins and Sharp, 1988) in GCG v9.1, SAM (v2.0), an HMM package (Hughey and Krogh, 1996), and SAGA-COFFEE. Apart from SAM, all these methods were used with the default parameters that came along with the package. In the case of SAM, since it is known that HMMs usually require large sets of sequences in order to evaluate a model, we used the Dirichlet mixture regularizer provided in the package, which is supposed to compensate for this type of problem. SAGA-MSA is the package previously described (Notredame and Higgins, 1996); Rationale 2 weights (Altschul et al., 1989) were computed using the MSA package (Lipman et al., 1989). When possible, MSA was used on the same sequences as SAGA-MSA in order to control the quality of the optimization. Results were consistent with those previously reported.

On the Alpha 8300 workstation, it takes ∼4 s/generation for the gcr test case and ∼7 min/generation for the igb test case. Unfortunately, establishing the complexity in terms of the number of generations needed to reach a global optimum is much harder. This depends on several factors: number of sequences, length of the consensus, relative similarity of the sequences, complexity of the pattern of gaps needed for optimality, and the operators used for mutations and cross-overs. Since the pattern of gaps is an unknown factor, it is impossible really to predict how many generations will be required for one specific test case. On the other hand, judging from the data in Table 1 (N.G. column), it seems that the length of the alignment has a stronger effect than the number of sequences.
Implementation

The COFFEE function and the procedure for building ClustalW pairwise libraries have been implemented in ANSI C. These programs have been integrated in version 0.93 of the SAGA package, also written in ANSI C, and are available upon request from the authors (http://www.ebi.ac.uk/∼cedric).

Comparison of the COFFEE function with other methods

Multiple alignments were produced with SAGA-COFFEE using ClustalW libraries (best-scoring alignment out of 10 runs). These alignments were compared with the structural references. Multiple alignments of the same sets, generated with five other methods, were also compared with the references in order to evaluate relative performance. Since, as used here, the COFFEE function depends heavily on ClustalW, special emphasis was given to the comparison of these two methods (Table 1). The results are unambiguous. In the overall comparison, SAGA improves over ClustalW in nine test cases (in two of these the improvement is >10%). The trend is similar when looking only at (E+H) substitutions, where 10 test cases out of 11 show an improvement. In the few cases where it occurs, the degradation caused by SAGA is always <2%. The extent of the observed improvements usually correlates well with the differences in the scores measured with the COFFEE function. Degradation is observed only in cases where the ClustalW alignment already has a high level of consistency with the reference library (>75%), as can be seen with the globin (COFFEE score of the ClustalW alignment = 0.78) and cytochrome c (COFFEE score of the ClustalW alignment = 0.84) test cases. In order to put SAGA-COFFEE in a wider context, comparisons were made with five other methods (Table 2). These results show that in most cases SAGA-COFFEE does reasonably well. When its alignment is not the best, it is usually within 3% of the best (except for the binding and gcr test cases, for which the difference is greater).
Apart from the HMM method (SAM), which performs poorly, it is relatively hard to rank the existing methods. PRRP is one of the newest methods available. It has been described as one of the most accurate (Gotoh, 1996) and happens to be the only one that significantly outperforms SAGA-COFFEE on some test cases. It is also interesting to note that SAGA-COFFEE is always among the best for test cases having a low level of identity. This trend is confirmed by the results shown in Table 3, where the sequences are grouped according to their average similarity with the rest of their family (as measured on the reference structural alignment). In this table, we analysed the overall performance of each method and compared it with SAGA-COFFEE by counting (i) the overall per cent of (E+H) residues correctly aligned and (ii) the number of sequences for which SAGA-COFFEE makes a better (b)/worse (w) alignment than a given method. [Note that the (b) and (w) counts do not necessarily add up to the overall number of sequences, because sequences scoring identically with the two methods compared are not counted.] Overall, the results confirm that SAGA-COFFEE seems to do better than the other methods when the sequences have a low level of identity with the rest of their set. The poor performance of SAM can probably be explained by two factors: the small number of sequences in each test case, and perhaps some inadequate default settings in the program (in practice, SAM is often used as an alignment improver rather than on its own).

Table 2. Method comparison on the 3D_ali test cases

Test case    Nseq.  Avg. id. (%)  SAGA-COFFEE (%)  PRRP (%)  ClustalW (%)  PILEUP (%)  SAGA-MSA (%)  SAM (%)
ac prot      14     21            50.2             48.8      39.2          40.9        *51.2         27.9
binding      7      31            64.5             *76.2     50.0          66.6        64.2          36.9
cytc         6      42            90.7             89.4      89.1          *94.6       67.3          67.3
fniii        9      17            *47.0            36.3      42.0          37.8        45.2          16.2
gcr          8      36            83.1             *92.8     80.8          80.8        80.8          85.7
globin       17     24            85.2             *87.0     86.4          72.6        78.0          67.8
igb          37     24            *78.1            74.9      74.8          52.4        70.1          67.2
lzm          6      39            *72.3            71.1      72.2          *72.3       *72.3         55.3
phenyldiox   8      22            *64.7            49.9      58.5          37.4        55.6          45.7
sbt          7      61            96.9             96.7      96.9          *97.4       96.0          90.6
s prot       15     27            66.6             64.3      62.5          57.9        *68.5         61.7

*Indicates the method performing best on a test case.
Test case: generic name of the test case, taken from 3D_ali (see 3D_ali for PDB identifiers); see Table 1 for more details. Nseq.: number of sequences in the alignment. Avg. id.: average level of identity between the sequences. SAGA-COFFEE: accuracy of the alignments obtained with SAGA-COFFEE, as judged by comparison with the structural alignment and considering only the (E+H) substitutions. ClustalW: similar, but with ClustalW alignments. PRRP: similar, but with alignments produced by the Gotoh PRRP algorithm (see the text). PILEUP: the pileup method from the GCG package. SAGA-MSA: SAGA using the MSA objective function. SAM: sequence alignment modelling by hidden Markov model.

Table 3. Method comparison on the 3D_ali test cases: global results

Sequence identity  Nseq.  Nres.   SAGA-COFFEE (%)  PRRP (%)  ClustalW (%)  PILEUP (%)  SAGA-MSA (%)  SAM (%)
[00.0–20.0]        28     3424    *63.3            62.2      49.7          42.4        56.9          36.4
[20.0–40.0]        88     12010   *76.2            74.6      66.1          60.2        69.7          59.1
[40.0–100.0]       18     3808    89.3             *90.9     84.6          89.8        87.8          64.3

*Indicates the method performing best on a given range of identity.
Sequence identity: minimum and maximum average identity of the sequences of each category with the rest of their alignment, as measured on the reference structural alignment. Nseq.: number of sequences in a category. Nres.: number of residues. SAGA-COFFEE: percentage of the (E+H) substitutions present in the reference structural alignment that are observed in the SAGA-COFFEE alignment. PRRP, ClustalW, PILEUP, SAGA-MSA, SAM: similar, for the alignments produced by the corresponding method.

These results also indicate that there is no such thing as an ideal method. Even if COFFEE seems to do better on average, one can see in Tables 2 and 3 that the alignments it produces are not always the best. In fact, it seems that, depending on the test case, any method can do better than the others. Unfortunately, as discussed by Gotoh (1996), it is hard to discriminate the factors that should guide the choice of a method. For this reason, being able to identify the correct portions of a multiple sequence alignment may be even more important than being able to produce a very accurate alignment.

Results

Accuracy and complexity of the optimization

Since our approach relies on the ability of SAGA to optimize the COFFEE function, we checked that this optimization was performed correctly. For each test case, a dummy library was created, containing sets of pairwise alignments identical to those observed in the reference multiple structure alignment. In such a case, the structural alignment has a score of 1, since it agrees completely with the library; the maximum score that can be reached by SAGA therefore also becomes 1. Since, under these artificial conditions, the score of the optimum is known, we could test the accuracy of SAGA's optimization. Several runs made on the same set reached the optimum value in an average of 5.4 runs out of 10. The lowest reproducibility was found with the largest test cases of Table 1 (igb and s prot, with a score of 1 being reached one and two times out of 10 runs, respectively). However, even when the optimal score is not reached, it is always possible to produce an alignment with a score better than 0.94. Although these results do not constitute a full proof, they support the assumption that SAGA is a good choice for optimizing the COFFEE function. An important aspect is the complexity of the program and the factors that influence it. As we previously reported when optimizing the sums of pairs with SAGA (Notredame and Higgins, 1996), establishing the complexity is not straightforward. The evaluation of a COFFEE score is quadratic in the number of sequences and linear in the consensus length. For a given population size, the time required for one generation will vary accordingly; representative per-generation timings are given above.

Fig. 2. Correlation between sequence score and alignment accuracy. (a) The average level of identity of each sequence with the rest of its alignment was measured on the reference structural alignment. The average level of accuracy of the SAGA-COFFEE alignment of each of these sequences was also estimated on the (E+H) category. The two values are plotted against one another. (b) For each sequence, the sequence score was measured on the SAGA-COFFEE alignment; this value was plotted against the accuracy of the sequence alignment. The coefficient of linear correlation was estimated on these points (r = 0.65).

Correlation between COFFEE score and alignment accuracy

As mentioned in Methods, the score can be assessed at a local level (sequence score or residue score). One of the benefits of such an evaluation is that local score and accuracy can be correlated, thus allowing the identification of potentially correct portions of an alignment with a known risk of error.
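The consistency score at the heart of this evaluation can be sketched as follows (a minimal illustration of a COFFEE-style score, not the SAGA implementation; the library format, weights and toy alignment are assumptions made for the example):

```python
# Minimal sketch of a COFFEE-style consistency score: the score of an MSA is
# the weighted fraction of its residue pairs that also occur in a library of
# pairwise alignments. Illustrative only (not the SAGA code); the library
# format, weights and sequences are invented for the example.

from itertools import combinations

def residue_indices(gapped):
    """Per-column residue index of a gapped sequence (None on gaps)."""
    out, idx = [], 0
    for ch in gapped:
        out.append(None if ch == '-' else idx)
        if ch != '-':
            idx += 1
    return out

def coffee_score(msa, library, weights):
    """Weighted fraction of the MSA's residue pairs present in the library."""
    cols = {name: residue_indices(seq) for name, seq in msa.items()}
    hit = tot = 0.0
    for a, b in combinations(sorted(msa), 2):
        w = weights[(a, b)]
        for ia, ib in zip(cols[a], cols[b]):
            if ia is not None and ib is not None:
                tot += w
                if (a, ia, b, ib) in library:
                    hit += w
    return hit / tot if tot else 0.0

msa = {'s1': 'AC-D', 's2': 'ACED'}
library = {('s1', 0, 's2', 0), ('s1', 1, 's2', 1), ('s1', 2, 's2', 3)}
weights = {('s1', 's2'): 1.0}
print(coffee_score(msa, library, weights))  # 1.0: every aligned pair is in the library
```

A SAGA-style optimizer would then search alignment space for the MSA maximizing this score; the library here plays the role of the ClustalW or FSSP libraries discussed in the text.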
The 3D_ali structure-based alignments were used once more to validate this approach. Generally speaking, a high residue score indicates that the pairs in which a given residue is involved are also found in the pairwise library. Conversely, if none of the pairings in which a given residue is involved is found in the library, that residue will be considered unaligned. We evaluated the COFFEE score of each sequence in each alignment. For each of these sequences the (E+H) average accuracy was also measured. The graph in Figure 2b shows the relationship between sequence score and (E+H) average accuracy. The correlation between these two quantities is reasonable (r = 0.65). When considering the values used for this graph, we found that for >85% of the sequences it is possible to predict the actual accuracy of the alignment within a ±10% error margin. In terms of prediction, this is a substantial improvement over what can be obtained by measuring the average level of identity between one sequence and its multiple alignment, as shown in Figure 2a.

Table 4. Average accuracy of the alignment of each sequence as a function of its sequence score (3D_ali test cases)

                 N. residues (%)      N. sequences (%)     Accuracy (E+H) (%)
Sequence score   ClustalW   SAGA      ClustalW   SAGA      ClustalW   SAGA
[0.00–0.33]      5.8        2.6       6.7        3.0       14.3       9.9
[0.33–0.66]      36.8       33.7      40.3       38.1      63.2       67.2
[0.66–1.00]      57.4       63.7      53.0       59.0      82.0       82.5
TOTAL            19 242 residues      134 sequences

Sequence score: minimum and maximum score of the sequences in each category. N. residues: percentage of residues belonging to each category, estimated on the SAGA or ClustalW alignments. N. sequences: percentage of the total sequences belonging to each category of score, as measured on the SAGA and ClustalW alignments. Accuracy (E+H): accuracy associated with each category of score in the SAGA and ClustalW alignments. TOTAL: total number of residues and sequences in the comparison.
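The sequence-level correlation reported above (r = 0.65 on the 3D_ali data) can be reproduced in miniature; the per-sequence scores and accuracies below are invented for illustration:

```python
# Sketch of the sequence-level analysis: correlate each sequence's consistency
# score with its alignment accuracy. The two lists are toy data; the paper
# reports r = 0.65 on the real 3D_ali sequences.

import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sequence_score = [0.29, 0.45, 0.61, 0.72, 0.88, 0.93]  # per-sequence COFFEE scores (toy)
accuracy       = [0.20, 0.41, 0.55, 0.70, 0.79, 0.95]  # per-sequence (E+H) accuracy (toy)
print(round(pearson(sequence_score, accuracy), 3))
```

On real data the scatter is wider, which is why the paper frames the prediction as a ±10% error margin rather than a deterministic mapping.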
Table 5. Accuracy of the prediction made on category 5 of substitution

             Accuracy (%)          Correct substitutions (%)
Test case    ClustalW   SAGA       ClustalW   SAGA
ac prot      56.8       68.2       9.6        15.7
binding      64.3       61.4       31.5       40.0
cytc         96.2       93.9       72.1       73.5
fniii        81.5       77.7       13.8       15.6
gcr          75.5       77.4       63.4       74.5
globin       97.2       95.0       63.1       66.5
igb          88.8       85.5       39.0       42.3
lzm          91.5       91.8       61.3       61.5
phenyl       78.0       72.5       34.3       40.0
s prot       82.2       82.4       45.2       50.1
sbt          89.7       89.7       85.2       87.0

Test case: generic name of the test case, taken from 3D_ali. Accuracy: fraction of the substitutions in category 5 (in the SAGA and ClustalW alignments) that are present in the reference alignment. Correct substitutions (%): percentage of the correct substitutions (over the total number, all substitution categories included) identified in category 5.

The correlation between score and accuracy becomes slightly more apparent when looking at the data in a more global way (Table 4). In this case, the sequences were grouped according to their score, and the accuracy of their alignment was measured. One can see that the higher the score of a sequence, the higher its average alignment accuracy. We also found that the distribution of the sequences among the three categories changes when using ClustalW instead of SAGA: SAGA produces more high-scoring sequences than ClustalW does. This means that not only are SAGA alignments more accurate than ClustalW's, it is also possible to identify them as being so. In practice, the sequence score, as imperfect as it may seem, provides a fast and simple way to identify sequences that do not really belong to a set, or that are so remotely related to the rest that their alignment should be considered with care. The sequence score is a global measure, however; it does not reflect the local variations that occur at the residue level.
To analyse this kind of data and assess its utility for predicting correct portions of an alignment, the score of each residue in each multiple sequence alignment was evaluated using equation (3). These scores were scaled to a range of 0 to 9: a residue has a score of 9 if >90% of the pairs in which it is involved are also present in the reference library, a score of 8 for [80–90%], and so on. Once residue scores have been evaluated, substitution classes can be defined. For instance, class 5 of substitutions includes all the residues of a multiple alignment having a residue score greater than or equal to 5 (Figure 3a), while class 0 includes all the residues in the alignment. Figure 3a gives an example of such an evaluation. In this alignment, each residue is replaced by its score, and the residues that belong to category 5 are boxed. Figure 3b shows the correctly aligned residues in this category. One can see that, using our measure, it is possible to identify some of the correct portions in the SAGA fniii alignment. As can be gathered from Table 1, fniii is a very demanding test case. Except for the first two sequences, which are almost identical, all the other members of this set have a very low level of identity with one another. This is especially true of the sequence 2hft_1, which illustrates well the advantages and limits of our method. This sequence is not the most remotely related in the set: it has an average identity of 14%, whereas two other members (3hhr_c and 2hft) are more distantly related, with 12% average identity. Despite this fact, 2hft_1 gets the lowest sequence score in the multiple alignment (0.29). This correlates well with the fact that it also has the lowest alignment accuracy of the set [18% overall, 20% for the (E+H) category]. Similarly, the only non-terminal stretch of this sequence that belongs to category 5 is one of the few portions to be correctly aligned (Figure 3a and b).
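The residue-score scaling and the category-5 selection described above amount to a simple bucketing, sketched here with invented scores:

```python
# Sketch of the residue-score bucketing described in the text: each residue's
# consistency score (fraction of its pairs found in the library, 0..1) is
# scaled to the 0..9 integer range, and "category 5" keeps residues with a
# scaled score >= 5. The raw scores below are invented.

def scale(score):
    """Map a 0..1 consistency score to the 0..9 integer scale."""
    return min(9, int(score * 10))

residue_scores = [0.95, 0.52, 0.12, 0.78, 1.0, 0.49]
scaled = [scale(s) for s in residue_scores]
category5 = [i for i, s in enumerate(scaled) if s >= 5]
print(scaled)      # [9, 5, 1, 7, 9, 4]
print(category5)   # positions of reliably aligned residues: [0, 1, 3, 4]
```

In Figure 3a each residue of the fniii alignment is replaced by exactly this kind of scaled score, and the boxed regions correspond to `category5`.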
The same type of analysis was carried out on the 10 other test cases (Table 5). Our measures indicate that, using category 5 of substitution, a substantial portion of correctly aligned residues can be identified. When comparing ClustalW and SAGA, we found that more correct residues can be identified with SAGA. This improvement is sometimes achieved at the cost of a slightly lower accuracy (more false positives) in the SAGA alignments. A global estimation was made to evaluate the accuracy that can be expected when using any of the 10 substitution categories on a SAGA alignment. The proportion of correct substitutions predicted that way was also measured. These results are presented in Figure 4a and b, respectively. Residues are grouped in three classes, depending on the score of the sequences in which they occur.

Fig. 3. Evaluation of the accuracy of the fniii test case (fibronectin type III family). (a) Sequences in the fniii test case were aligned by SAGA-COFFEE using a ClustalW library. The alignment obtained that way was evaluated locally. The sequence names are the PDB identifiers; the suffixes _1, _2, etc. indicate that several portions of the same sequence have been used (cf. 3D_ali for further details). In this alignment, each residue has been replaced by its score. The gray boxes indicate all the residues that belong to category 5 of substitution (i.e. having a score ≥5). The sequence score box lists the values measured on each sequence. (b) The accuracy in category 5 of substitution (boxes) was evaluated by comparison with the reference alignment. Residues shadowed in gray in the boxes are correctly aligned to one another. Boxed residues that are not shadowed are not correctly aligned with each other or with the rest of the category 5 residues. Residues outside the boxes are not taken into account in this evaluation.
Figure 4a confirms that high-scoring residues are usually correctly aligned (high accuracy). However, the higher the substitution category, the smaller the number of residues on which a prediction can be made, as shown in Figure 4b. These graphs confirm that the residue score can be used as a measure for predicting accuracy; they also indicate that the sequence score is informative when making a prediction on a residue.

Making a multiple structural alignment

The analysis carried out with the ClustalW libraries represents only one possible application of the COFFEE function. Generally speaking, the COFFEE scheme allows the combination of the information contained in any reference library, regardless of the method used for its construction. To illustrate this fact, we show that it is possible to build a structure-based multiple sequence alignment when a library of high-quality pairwise structural alignments is available. We used COFFEE on two sets of proteins (vjs and ceo) with the appropriate FSSP libraries. It was impossible to improve significantly over FSSP for the ceo test case, made of endoglucanases and other related carbohydrate-degradation enzymes. This can be explained by the fact that the FSSP alignment with the best DALI score (the one using 1ceo as a guide) already has a high level of consistency with the library (COFFEE score = 0.82); accordingly, this alignment is 88% similar to the SAGA-COFFEE one. The second set is made of amylases and other carbohydrate-degradation enzymes. Table 6 compares the SAGA-COFFEE alignment of these sequences with the corresponding FSSP pairwise-based multiple alignments. These results clearly indicate that the alignment produced by SAGA is better than any of the FSSP multiple alignments, regardless of the criterion used to evaluate this improvement (DALI score, consensus length or RMS).
This result was to be expected, since SAGA makes use of much more information than any of the FSSP alignments. In Table 6, entries are sorted according to the DALI score. This makes it possible to see that the DALI and COFFEE scores correlate well for the FSSP alignments, and supports the idea that the COFFEE score is also a good indicator of alignment quality when the library is based on structural alignments.

Fig. 4. Prediction of correctly aligned residues using the residue COFFEE score. (a) The accuracy of the alignments (number of correct substitutions in one of the categories divided by the total number of substitutions in this category) of each sequence was measured. To do so, sequences were divided into three groups, depending on their sequence score. The graph was made for each of the three groups. (b) For each sequence, the number of correct substitutions contained in each category was evaluated and divided by the overall number of substitutions involving that sequence. This value was plotted against the category of substitution.

Table 6. Comparison of FSSP and SAGA multiple alignments

Guide sequence   Avg. DALI score   Avg. consensus length   Avg. RMS (Å)   COFFEE score
2ebn             1152.6            186.5                   3.73           0.53
1cnv             1205.2            196.4                   3.63           0.59
1vjs             1258.4            198.8                   3.62           0.50
1ctn             1331.2            196.9                   3.53           0.60
1smd             1667.1            219.4                   3.40           0.65
2amg             1672.9            217.7                   3.42           0.67
2aaa             1766.8            224.9                   3.45           0.69
1pamA            1786.3            225.8                   3.30           0.70
SAGA-COFFEE      1860.0            230.9                   3.20           0.79

Guide sequence: sequence used as a guide in the FSSP multiple alignment (SAGA-COFFEE indicates the alignment obtained with SAGA-COFFEE). Avg. DALI score: average DALI score of each pair of sequences in the alignment; the table is sorted according to this column. Avg. consensus length: average number of residues superimposable by DALI in each pair of sequences.
Avg. RMS: average of the RMS values measured by DALI on each pair of the alignment, in Ångströms. COFFEE score: score given by SAGA to the multiple alignment using the same library.

In theory, we could have used the DALI score as an objective function and optimized it with SAGA. In such a context, DALI would be used to evaluate all the pairwise projections in order to give a score similar to the one shown in the 'DALI score' column of Table 6. In practice this is not possible, because the computation of a DALI score is much more expensive than the evaluation of a COFFEE score: the DALI score of a multiple alignment is quadratic in both the number of sequences and the length of the alignment, whereas the COFFEE score is quadratic in the number of sequences but only linear in the length of the alignment. In consequence, even if a global DALI score were assumed to be biologically more realistic than the FSSP library-based COFFEE score, COFFEE still appears to be a good trade-off between approximating DALI and saving on computational cost.

Discussion

In this work, we show that alignments can be evaluated for their maximum weight trace (MWT) score using the COFFEE function and subsequently optimized with the genetic algorithm package SAGA. Given a heterogeneous, non-consistent collection of pairwise alignments, one can extract the corresponding multiple alignment with COFFEE and SAGA. We have shown here that the SAGA-COFFEE scheme outperforms most of the commonly used alternative packages when applied to sequences having low levels of identity. The comparison with other global optimization techniques, such as SAGA-MSA and PRRP, indicates that the method is better not only because it performs a global optimization, but probably also because of the way it uses information, filtering some of the noise through the library of pairwise alignments. The weighting scheme also plays a role in this improvement.
It helps turn the relationships between the sequences into some of the constraints that define the optimal alignment. It is because all these constraints (library and weights) are unlikely to be fully consistent that the genetic algorithm strategy proves to be a very appropriate means of optimization. There is little doubt that the performance of our method will also depend on the relationships between the sequences. Sets with many intermediate sequences (i.e. a dense phylogenetic tree) are likely to lead to more accurate alignments. However, the fact that COFFEE proves able to deal with sequences having a very low level of identity is quite encouraging with regard to the robustness of the method. One of the main advantages of the COFFEE strategy is the flexibility given to the user in defining the library. Here, by using two completely different pairwise alignment methods, we managed to produce high-quality multiple alignments in both cases. This is interesting, but constitutes only a first step. The structure of the libraries we have been using is very simple: they rely only on an 'all-against-all' comparison using one type of pairwise alignment algorithm per library. In practice, this scheme can easily be extended to much more complex library structures. It is common sense to have higher confidence in results that can be reproduced using independent methods. Some prediction methods rely on this type of assumption, such as the block definition strategy described by Henikoff et al. (1995). These methods usually limit themselves to identifying highly conserved patterns among a set of solutions. With the COFFEE strategy, we go much further and make it possible to find a consensus solution whatever the number of constraints and whatever their relative compatibility. Of course, it is not enough for a solution to exist; one also needs to know how accurate this solution is.
In this work, we have shown that the level of consistency of a solution is a good indicator of such accuracy. This accuracy prediction constitutes the other main aspect of the COFFEE function. Several methods have been proposed that attempt to predict correct portions of a pairwise alignment given a set of sub-optimal alignments (Gotoh, 1990; Vingron and Argos, 1990; Mevissen and Vingron, 1996). Using these methods, libraries could be designed with large numbers of sub-optimal alignments. Here again, the difference between the COFFEE method and previously proposed approaches is that it is possible not only to predict the correct portions of an alignment, but also to optimize a multiple alignment for having as many reliable regions as possible. SAGA-COFFEE still needs to be improved in several respects. For instance, further work will involve the definition of more complex libraries that will hopefully lead to more meaningful consistency indices; the main source of inspiration here will be the work done on pairwise alignment stability (Mevissen and Vingron, 1996). The other direction we plan to take has to do with the combination of scoring schemes. We have seen here that there is no uniform solution to the multiple sequence alignment problem. For this reason, it would make sense to generate libraries containing alternative alignments made by all the available methods (PRRP, ClustalW, HMM, etc.). COFFEE could then be used to merge this information and hopefully extract the best of each alignment. This will require some improvement of the COFFEE function and its adaptation to highly redundant libraries. Another crucial aspect will be increasing the efficiency of the algorithm. At present, SAGA-COFFEE is an extremely slow method; however, we hope to improve on this by using a more appropriate type of seeding.
Finally, another important aspect of our approach will involve the refinement of the method used here for building multiple structural alignments. The project will be based on a procedure similar to the one described above: the design of more efficient weights and an attempt to use the alternative 420 COFFEE: an objective function for multiple sequence alignments structural alignments that can be produced by the DALI method, using a wider range of DALI test cases. Acknowledgements The authors wish to thank Miguel Andrade and Thure Etzold for very useful comments and corrections. They also wish to thank the referees for their useful remarks and interesting suggestions. References Altschul,S.F. (1989) Gap costs for multiple sequence alignment. J. Theor. Biol., 138, 297–309. Altschul,S.F. and Erickson,B.W. (1986) Optimal sequence alignment using affine gap costs. Bull. Math. Biol., 48, 603–616. Altschul,S.F., Carroll,R.J. and Lipman,D.J. (1989) Weights for data related by a tree. J. Mol. Biol., 207, 647–653. Bucher,P. and Hofmann,K. (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, St Louis, MO. Carrillo,H. and Lipman,D.J. (1988) The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48, 1073–1082. Davis,L. (1991) The Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York. Dayhoff,M.O. (1978) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC. Eddy,S.R. (1995) Multiple alignment using hidden Markov models. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England. AAAI Press, Menlo Park, CA. Feng,D.-F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360. Godzik,A. and Skolnik,J. 
(1994) Flexible algorithm for direct multiple alignment of protein structures and sequences. Comput. Applic. Biosci., 10, 587–596. Goldberg,D.E. (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, New York. Gotoh,O. (1990) Consistency of optimal sequence alignments. Bull. Math. Biol., 52, 509–525. Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinements as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838. Gribskov,M., McLachlan,M. and Eisenberg,D. (1987) Profile analysis: Detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919. Henikoff,S., Henikoff,J., Alford,W. and Pietrokovsky,S. (1995) Automated construction and graphical representation of blocks from unaligned sequences. Gene, 163, GC17–26. Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244. Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138. Holm,L. and Sander,C. (1996a) Alignment of three-dimensional protein structures: network server for database searching. Methods Enzymol., 266, 653–662. Holm,L. and Sander,C. (1996b) The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res., 24, 206–210. Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Applic. Biosci., 12, 95–107. Ishikawa,M., Toya,T., Hoshida,M., Nitta,K., Ogiwara,A. and Kanehisa,M. (1993a) Multiple sequence alignment by parallel simulated annealing. Comput. Applic. Biosci., 9, 267–273. Ishikawa,M., Toya,T. and Tokoti,Y. (1993b) Parallel iterative aligner with genetic algorithm. 
In Artificial Intelligence and Genome Workshop, 13th International Conference on Artificial Intelligence, Chambery, France. Kececioglu,J.D. (1993) The maximum weight trace problem in multiple sequence alignmnet. Lecture Notes Comput. Sci., 684, 106–119. Kim,J., Pramanik,S. and Chung,M.J. (1994) Multiple sequence alignment using simulated annealing. Comput. Applic. Biosci., 10, 419–426. Krogh,A. and Mitchison,G. (1995) Maximum entropy weighting of aligned sequences of proteins or DNA. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England. AAAI Press, Menlo Park, CA. Lipman,D.J., Altschul,S.F. and Kececioglu,J.D. (1989) A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA, 86, 4412–4415. Mevissen,H.T. and Vingron,M. (1996) Quantifying the local reliability of a sequence alignment. Protein Eng., 9, 127–132. Morgenstern,B., Dress,A. and Wener,T. (1996) Multiple DNA and protein sequence based on segment-to-segment comparison. Proc. Natl Acad. Sci. USA, 93, 12098–12103. Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453. Notredame,C. and Higgins,D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24, 1515–1524. Notredame,C., O’Brien,E.A. and Higgins,D.G. (1997) RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res., 25, 4570–4580. Pascarella,S. and Argos,P. (1992) A data bank merging related protein structures and sequences. Protein Eng., 5, 121–137. Rost,B. (1997) AQUA Server. http://www.ebi.ac.uk/∼rost/Aqua/ aqua.html Sibbald,P.R. and Argos,P. (1990) Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J. Mol. Biol., 216, 813–818. 421 C.Notredame, L.Holm and D.G.Higgins Sjolander,K., Karplus,K., Brown,M., Huguey,R., Krogh,A., Saira,M. and Haussler,D. 
(1996) Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Applic. Biosci., 12, 327–345. Taylor,W.R. (1988) A flexible method to align large numbers of biological sequences. J. Mol. Evol., 28, 161–169. Thompson,J., Higgins,D. and Gibson,T. (1994a) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994b) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Applic. Biosci., 10, 19–29. Vingron,M. and Argos,P. (1990) Determination of reliable regions in protein sequence alignment. Protein Eng., 3, 565–569. Vingron,M. and Argos,P. (1991) Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol., 218, 33–43. Vingron,M. and Sibbald,P. (1993) Weighting in sequence space: a comparison of methods in terms of generalized sequences. Proc. Natl Acad. Sci. USA, 90, 8777–8781. Vogt,G., Etzold,T. and Argos,P. (1995) An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol., 249, 816–831. Wang,L. and Jiang,T. (1994) On the complexity of multiple sequence alignment. J. Comput. Biol., 1, 337–348.

BIOINFORMATICS Vol. 22 no. 14 2006, pages e35–e39 doi:10.1093/bioinformatics/btl218

The iRMSD: a local measure of sequence alignment accuracy using structural information

Fabrice Armougom1, Sébastien Moretti1, Vladimir Keduas1 and Cédric Notredame1,*

1 Laboratoire Information Génomique et Structurale, CNRS UPR2589, Institute for Structural Biology and Microbiology (IBSM), Parc Scientifique de Luminy, case 934, 163 Avenue de Luminy, FR-13288, Marseille cedex 09
More than 20 structure alignment packages have been developed (Goldsmith-Fischman and Honig, 2003). All these packages tend to produce different alignments because of their different underlying optimization algorithms. Furthermore, the lack of a universally accepted criterion for describing the quality of a structural alignment makes it difficult to determine the relative merits of these packages (Kolodny et al., 2005). The most common procedure for evaluating structure superpositions is to use the root mean square deviation (RMSD) of the superposed atoms. This measure estimates the root mean square distance between the equivalent alpha-carbons of the two superposed structures. It can be ambiguous because of its dependence on two critical parameters: the minimization method and the procedure used to exclude structurally non-equivalent regions (loops, for instance). Having several methods that deliver structure-based sequence alignments and not knowing which one does best is a major issue in a context where structure-based alignments are routinely used to improve and guide the development of sequence alignment methods (Wallace et al., 2005). A direct consequence of this situation has been the development of at least five collections of reference structure-based sequence alignments (Edgar, 2004; Mizuguchi et al., 1998; O'Sullivan et al., 2004; Raghava et al., 2003; Thompson et al., 2005; Van Walle et al., 2005). These collections are all used for a similar purpose: the benchmarking of sequence alignment algorithms. Since it is virtually impossible to compare these datasets and decide whether some are more informative than others, the most common practice is to use them all and look for common trends in the global results (Katoh et al., 2005). While results measured on these reference collections tend to agree for datasets with more than 30% identity, variations appear when considering sets of remote homologues (Katoh et al., 2005).
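As a point of reference for the discussion above, the standard RMSD over already-superposed equivalent alpha-carbons reduces to a few lines (a minimal illustrative sketch: the function name and coordinate representation are our own, and the contentious steps, superposition and exclusion of non-equivalent regions, are assumed to have been done beforehand):

```python
import math

def rmsd(ca_a, ca_b):
    """RMSD between equivalent alpha-carbons of two superposed structures.

    ca_a, ca_b: equal-length sequences of (x, y, z) coordinates for the
    residues judged equivalent. The superposition itself, and the choice of
    which regions (e.g. loops) to exclude, must happen before this call;
    those two steps are precisely the ambiguous part discussed in the text.
    """
    assert len(ca_a) == len(ca_b) and ca_a
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(ca_a, ca_b))
    return math.sqrt(total / len(ca_a))
```

The formula itself is unambiguous; the disagreements between packages come entirely from the inputs fed to it.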
Aside from potential accuracy problems, the simplest explanation for these discrepancies is the possibility for alternative sequence alignments to be structurally equivalent, especially when considering remote homologues (Lackner et al., 2000). In this context, setting one specific alignment as a reference becomes an arbitrary choice and therefore a bias toward specific alignment methods. In practice, the authors try to minimize that effect by specifying the core regions that should be used for the comparison, but this choice is also difficult and somewhat arbitrary. We suggest in this paper that replacing the reference alignments with an RMSD-style measure would be more objective.

ABSTRACT
Motivation: We introduce the iRMSD, a new type of RMSD, independent of any structure superposition and suitable for evaluating sequence alignments of proteins with known structures.
Results: We demonstrate that the iRMSD is equivalent to the standard RMSD although much simpler to compute, and we show that it is suitable for comparing sequence alignments and benchmarking multiple sequence alignment methods. We tested the iRMSD score on six established multiple sequence alignment packages and found the results to be consistent with those obtained using an established reference alignment collection like Prefab.
Availability: The iRMSD is part of the T-Coffee package and is distributed as open source freeware (http://www.tcoffee.org/).
Contact: cedric.notredame@gmail.com; cedric.notredame@igs.cnrs-mrs.fr

1 INTRODUCTION
The computation of accurate sequence alignments constitutes a prerequisite for an ever-increasing number of biological analyses. These include phylogenetic reconstruction, structure prediction, domain-based analysis, function prediction and comparative genomics. In all these cases, the purpose of the alignment is to exploit evolutionary variations in order to reveal biologically meaningful patterns.
The discovery and the proper analysis of these patterns depend entirely on the alignment correctness. In many cases, an alignment is considered to be biologically correct when it accurately reflects the structural relationship between the considered sequences. This result is achieved by matching structurally equivalent residues. Assembling such an alignment is trivial when the sequences are highly similar but becomes harder for remote homologues. When considering alignments of sequences with less than 25% identity (the so-called twilight zone), standard scoring schemes like substitution matrices become uninformative, and it can be difficult to determine the alignment accuracy, or even whether the sequences are truly related at all. So far, the most satisfying way of aligning remote homologues has been to use structural information whenever possible (Huang and Bystroff, 2006; Lesk and Chothia, 1980). The use of structural information, however, carries its own perils, and while the sequence analysis community tends to consider structure-based alignments as unambiguous and unquestionable gold standards, a closer look reveals a much less clear-cut situation.

* To whom correspondence should be addressed.
© The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model.
aligned pair Z and W within a sphere of radius r centered on X and Y, i.e. every pair that verifies:

d(XZ) < r \quad \mathrm{AND} \quad d(YW) < r \qquad (1)

The ensemble of pairs ZW that verify Equation 1 is named the neighborhood and noted N(XY). The default value of r is 10 Å (O'Sullivan et al., 2003), which corresponds to a neighborhood size of 20-40 residues. The local iRMSD can be estimated as follows:

\mathrm{iRMSD}(XY) = \sqrt{ \frac{ \sum_{ZW} \left( d(XZ) - d(YW) \right)^2 }{ N(XY) } } \qquad (2)

The summation is made over all the aligned ZW pairs within the neighborhood (Equation 1). Pairs XY with an empty neighborhood have their local iRMSD undefined. The global measure is obtained by summing over every pair XY and dividing by the number N of pairs with a non-empty neighborhood:

\mathrm{iRMSD} = \frac{ \sum_{XY} \mathrm{iRMSD}(XY) }{ N } \qquad (3)

The iRMSD thus defined is not suitable for comparing alternative alignments, as it tends to give a better score to alignments with long gaps and few well-aligned residues. In order to simultaneously take into account the superposition accuracy and the extent of the alignment (i.e. the number of matched residues), we adapted the CI formula of Kleywegt and Jones (1994) to turn the iRMSD into a normalized iRMSD (NiRMSD):

\mathrm{NiRMSD} = \mathrm{iRMSD} \times \frac{ \min(L_1, L_2) }{ N } \qquad (4)

Fig 1. Basic principle of the iRMSD. Equivalences implied by the sequence alignment are tested on the structure. The assumption is that if XY and ZW are correctly aligned, then the distances d(XZ) and d(YW) must be similar. ZW pairs are only considered if they lie within a sphere of radius r centered on X and Y.

Such a measure would be a more objective way to evaluate the sequence alignments of proteins. The RMSD has two advantages over standard methods: no dependence on a reference alignment, and the possibility to quantify the structural correctness of any protein sequence alignment (provided the protein structures are known).
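Equations 1-4 above can be sketched in code as follows (an illustrative reimplementation, not the T-Coffee distribution's code; the function names, the list-of-index-pairs alignment representation, and the exclusion of a pair from its own neighborhood are assumptions of this sketch):

```python
import math

def dist(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def irmsd(alignment, ca_a, ca_b, r=10.0):
    """Global iRMSD (Equations 2 and 3).

    alignment: list of (i, j) pairs, residue i of A aligned with residue j of B.
    ca_a, ca_b: alpha-carbon coordinates of A and B, indexable by residue.
    Returns (iRMSD, N), where N is the number of pairs with a non-empty
    neighborhood. The pair (X, Y) itself is excluded from its own
    neighborhood, a detail the equations leave implicit.
    """
    local = []
    for x, y in alignment:
        # Neighborhood N(XY): aligned pairs (Z, W) with d(X,Z) < r and
        # d(Y,W) < r (Equation 1).
        nb = [(z, w) for z, w in alignment if (z, w) != (x, y)
              and dist(ca_a[x], ca_a[z]) < r and dist(ca_b[y], ca_b[w]) < r]
        if not nb:
            continue  # local iRMSD undefined for an empty neighborhood
        ss = sum((dist(ca_a[x], ca_a[z]) - dist(ca_b[y], ca_b[w])) ** 2
                 for z, w in nb)
        local.append(math.sqrt(ss / len(nb)))  # Equation 2
    return sum(local) / len(local), len(local)  # Equation 3

def nirmsd(alignment, ca_a, ca_b, len_a, len_b, r=10.0):
    """Normalized iRMSD (Equation 4): iRMSD * min(L1, L2) / N."""
    score, n = irmsd(alignment, ca_a, ca_b, r)
    return score * min(len_a, len_b) / n
```

Note that only intra-molecular distances are ever computed, which is why no superposition step is needed.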
The main drawback, however, is the reliance of the RMSD on a structure superposition strategy. This key step admits many alternative solutions whose relative merits are difficult to estimate (Kolodny et al., 2005). We redesigned the RMSD measure to make it independent of any structure superposition procedure. We named this measure iRMSD because it is an RMSD based on intra-molecular distance comparisons. The iRMSD is a follow-up of the APDB measure (O'Sullivan et al., 2003), designed to evaluate alignments for their compatibility with the structural superposition they imply. While APDB was a complex measure depending on three semi-arbitrary parameters, the new iRMSD algorithm requires only one parameter. We show here that the iRMSD behaves just like a standard RMSD both numerically (range of values) and structurally (similar structural meaning). We finally show that a straightforward normalization makes the iRMSD perfectly suitable for evaluating and comparing sequence alignment methods without the need for pre-established reference alignment collections.

L1 and L2 are the respective lengths of the two sequences, and N the number of residue pairs with a non-empty neighborhood. This formula amounts to incorporating a gap penalty that deals with indels and with aligned pairs whose neighborhood is empty.

2 METHODS

2.1 The iRMSD measure

2.2 Validation procedure using Prefab

We used the Prefab (Edgar, 2004) collection of reference alignments to analyze the iRMSD. Prefab is an extensive collection of 1682 pairwise structural alignments obtained by combining the output of two structure alignment programs: CE (Shindyalov and Bourne, 1998) and DALI (Holm and Sander, 1993). In each of these alignments the authors have defined core regions where the DALI and CE methods agree, and have used these regions for evaluation purposes.
Given one Prefab reference alignment and an alternative target alignment of the same sequences, the Qscore is defined as the fraction of core columns in the reference alignment found aligned identically in the target. In order to evaluate multiple sequence alignment packages, Prefab also includes in each dataset a collection of about 48 sequences homologous to the two structures. When evaluating an MSA package, the large dataset is aligned and the Qscore is measured on the core regions of the induced alignment of the two structures. We evaluated the RMSD and the iRMSD of the Prefab alignments. However, because of various inconsistencies between the ATOM and SEQRES fields of the PDB entries and the sequences of the Prefab alignments, LSQMAN could only handle 587 of the original Prefab entries. This sample had roughly the same identity distribution as the entire Prefab (243 datasets with less than 20% identity on the reference Prefab alignment, 172 between 20% and 40% identity and 171 with more than 40% identity). We believe it to be representative and large enough for the purpose of the present analysis.

The iRMSD measure follows the underlying principle of APDB: given a correct alignment of two protein sequences A and B (Figure 1), if X is aligned with Y and Z with W, then the XZ distance (d(XZ)) must be similar to d(YW). The better the alignment of A and B, the smaller the average difference between all possible pairs d(XZ) and d(YW). The iRMSD associated with the aligned pair X and Y is estimated by considering every aligned pair in its neighborhood (Equations 1 and 2).

2.3 Evaluation of the standard RMSD

We used the LSQMAN package (Kleywegt and Jones, 1999) to estimate the standard RMSD associated with the Prefab alignments. The local RMSD was estimated by superposing the residues contained in a window of size 21 (2×10+1) centered on a pair of aligned residues. The superposition was made using the Xalignment function of the LSQMAN package.
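For a pairwise alignment, the Qscore defined at the start of this section reduces to counting reproduced residue pairs (a minimal sketch with our own function name and pair-set representation, not Prefab's actual evaluation code):

```python
def qscore(reference_pairs, target_pairs, core=None):
    """Fraction of reference residue pairs reproduced in the target alignment.

    reference_pairs, target_pairs: iterables of (i, j) tuples, each meaning
    residue i of the first structure is aligned with residue j of the second.
    core: optional iterable restricting the evaluation to the residue pairs
    that fall inside the annotated core columns.
    """
    ref = set(reference_pairs)
    if core is not None:
        ref &= set(core)
    if not ref:
        return 0.0
    return len(ref & set(target_pairs)) / len(ref)
```

A Qscore of 1.0 means every (core) column of the reference is aligned identically in the target.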
The overall RMSD was obtained by sliding the window and averaging over all the windows.

2.4 Multiple sequence alignment methods

We benchmarked the iRMSD measure on the alignments produced using the public distributions of six multiple sequence alignment packages: ClustalW (Version 1.83) (Thompson et al., 1994), DialignII (Version 2.2.1) (Morgenstern, 1999), Muscle (Version 3.6) (Edgar, 2004), Mafft (Version 5.6) (Katoh et al., 2005), ProbCons (Version 1.10) (Do et al., 2005) and T-Coffee (Version 3.75) (Notredame et al., 2000).

2.5 Availability

The iRMSD package is part of the t_coffee package. It is open source freeware that can be downloaded from http://www.tcoffee.org/. It comes with extensive documentation.

3 RESULTS

We started by comparing the iRMSD with the standard RMSD. We did so by measuring the scores associated with the 587 Prefab alignments. The measurements were made either on core regions (Figure 2a) or on the entire Prefab alignments (Figure 2b). Both panels indicate a very strong correlation between the two measures. The core analysis gave an r² correlation coefficient of 0.92, while the measure on the entire alignments gave an r² of 0.93. As expected, the dispersion increases with the RMSD values. The Prefab alignments are high-quality structure-based alignments, but we also checked the behavior of the methods when analyzing alignments of lower quality (Figure 2c). We selected the Dialign method, whose alignments have an average Prefab Qscore of 0.65 on the entire dataset (0.32 in the [0-20] identity range). Figure 2c shows that the two measures remain correlated up to an RMSD of 2.5 Å (r² = 0.75), indicating a saturation of the iRMSD measure for values above 1.6 Å. This apparent saturation is a consequence of the different local substructures compared by each method (windows for the RMSD and spheres for the iRMSD), and it does not occur when measuring the standard RMSD on spheres of radius 10 Å rather than on windows.
When doing so the correlation is very good (r² = 0.91 over the full range, data not shown). We further checked the local aspect of the measures by plotting both the local iRMSD and the local RMSD along several Prefab alignments. The 1aoh_1anu example is displayed in Figure 3 and clearly shows that the two measures closely track each other all along the alignment. While the iRMSD indicates two narrow peaks not found in the RMSD, both methods agree on the final series of peaks. We used LSQMAN to superpose the two structures and were satisfied to find that the peaks in the iRMSD curve indeed correspond to poorly superposed regions. Although the iRMSD seems to reveal these locations more sharply, it is fair to say that the standard RMSD could probably be parameterized to yield similar results (for instance by lowering the window size). Having established that the iRMSD behaves like a standard RMSD measure, we then estimated whether that measure is suitable for evaluating the relative accuracy of multiple sequence alignment packages. For that purpose, we aligned the Prefab datasets with six MSA methods, and for each of these methods we evaluated the Qscore and the Normalized iRMSD (NiRMSD, Equation 4), and estimated the fraction of alignments having a NiRMSD better or

Fig 2. Correlation between the iRMSD and a standard LSQMAN RMSD. (a) RMSD versus iRMSD of 587 Prefab reference alignments. The (i)RMSDs were only measured on the regions annotated as core in Prefab. The iRMSD is on the vertical axis and the regular RMSD, as obtained from LSQMAN, is on the horizontal axis. Each dot corresponds to one dataset. (b) RMSD versus iRMSD on 587 Prefab reference alignments. The (i)RMSDs were measured on the entire alignments. (c) RMSD versus iRMSD on 587 Prefab datasets, aligned by Dialign. The dataset is the same as before and the (i)RMSDs were measured on the entire alignments.

Fig 3. Local comparison of the iRMSD against a standard LSQMAN RMSD.
The comparison was made on the Prefab reference alignment of 1aohA_1anu. The two structures were superposed by LSQMAN (1aohA: violet, 1anu: blue). The alignment was then evaluated locally using either LSQMAN to measure the RMSD (blue line) or T-Coffee/iRMSD to measure the local iRMSD. The (i)RMSD values were plotted on the vertical axis against the alignment positions. Portions of the superposition corresponding to the peaks were extracted and encapsulated.

Table 1a. Average Qscore

Range    N    Dialign  Clustal  Muscle  TCoffee  ProbC.  MAFFT  PREFAB
0-20     243  0.32     0.34     0.43    0.44     0.48    0.49   -
20-40    171  0.80     0.83     0.86    0.87     0.88    0.88   -
40-100   173  0.96     0.96     0.97    0.98     0.97    0.98   -
Total    587  0.65     0.67     0.71    0.73     0.74    0.75   -

a) Average Qscore: Range is the range of identity of the considered Prefab datasets, as measured on the reference alignments. N is the number of Prefab datasets in each range. Dialign, ClustalW, Muscle, TCoffee, ProbCons and Mafft are the average Qscores measured on the alignments produced by these packages. The entries corresponding to the best performance in each category are underlined and in bold. The best Qscores are the highest.

Table 2a. Consistency between the NiRMSD and the Qscore (core regions)

Range    Npair   Consistent  Inconsistent
0-20     7290    0.86*       0.14*
20-40    5130    0.90*       0.10*
40-100   5190    0.94*       0.06*
Total    17610   0.90*       0.10*

a) Core regions: Range is the range of identity of the considered Prefab datasets, as measured on the reference alignments. Npair is the number of pairs on which the comparison was carried out. Consistent is the fraction of pairs for which the Qscore and the NiRMSD score were consistent. For the purpose of this table, two pairs were considered consistent whenever their Qscore differed by less than 1 percentage point and their NiRMSD by less than 0.05 Å. A binomial test was carried out on the results; entries marked with * have a p-value lower than 0.000001.
Table 1b. Average NiRMSD (core regions)

Range    N    Dialign  Clustal  Muscle  TCoffee  ProbC.  MAFFT  PREFAB
0-20     243  3.46     2.10     1.82    2.16     1.85    1.76   0.85
20-40    171  0.91     0.82     0.80    0.79     0.77    0.77   0.67
40-100   173  0.44     0.58     0.44    0.44     0.44    0.43   0.43
Total    587  1.83     1.28     1.11    1.25     1.12    1.08   0.67

b) Average NiRMSD: the labels are the same as in Table 1a. The measure is the average NiRMSD as measured on the core regions of the alignments. The Prefab column corresponds to the evaluation of the Prefab reference alignments. The best NiRMSD scores are the lowest.

Table 2b. Consistency between the NiRMSD and the Qscore (entire alignments)

Range    Npair   Consistent  Inconsistent
0-20     7290    0.79*       0.21*
20-40    5130    0.84*       0.16*
40-100   5190    0.84*       0.16*
Total    17610   0.82*       0.18*

b) Same as Table 2a, but with the NiRMSDs measured on the entire alignments.

Table 1c. Best NiRMSD fraction

Range    N    Dialign  Clustal  Muscle  TCoffee  ProbC.  MAFFT  PREFAB
0-20     243  0.02     0.10     0.05    0.09     0.06    0.10   -
20-40    171  0.36     0.36     0.46    0.56     0.57    0.54   -
40-100   173  0.86     0.89     0.89    0.92     0.89    0.91   -
Total    587  0.36     0.40     0.42    0.47     0.45    0.47   -

c) Best NiRMSD fraction: fraction of alignments having a NiRMSD better than or equal to the Prefab reference, as measured on the core regions. The labels are the same as in Table 1a.

equal to the Prefab reference (Best NiRMSD fraction), as measured on the core regions. The results (Tables 1a, b and c) are unambiguous and clearly show a high correlation between the Qscore, the average NiRMSD and the Best NiRMSD fraction. As expected, the Prefab reference alignments outperform every other method (Table 1b, Prefab), with a NiRMSD always lower than the rest, especially in the distant-homologue category (Table 1b, Prefab, [0-20]). The rankings suggested by each score are in broad agreement when considering equivalent lines in each table. We looked at the statistical significance of all these analyses. To do so, we considered every dataset individually and estimated the consistency between the Qscore and the NiRMSD measured on two alternative alignments.
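This consistency check can be sketched as follows. The tie tolerances follow the footnote of Table 2, and the binomial test assumes an equal probability of 0.5 for consistency and inconsistency; the function names and the treatment of ties as consistent are our own reading of the procedure, not the authors' code:

```python
import math

def consistent(q1, q2, n1, n2, q_tol=0.01, n_tol=0.05):
    """Do the Qscore and the NiRMSD agree on the ranking of two alignments?

    q1, q2: Qscores of alignments 1 and 2 (higher is better).
    n1, n2: NiRMSDs of alignments 1 and 2 (lower is better).
    Differences below the tolerances are treated as ties and counted as
    consistent (cf. the footnote of Table 2a).
    """
    dq = q1 - q2
    dn = n2 - n1  # flipped sign so that positive means alignment 1 is better
    if abs(dq) < q_tol or abs(dn) < n_tol:
        return True  # a tie on either score counts as consistent
    return (dq > 0) == (dn > 0)

def binomial_p_value(k, n, p=0.5):
    """One-sided P(X >= k) for X ~ Binomial(n, p): the probability of seeing
    at least k consistent pairs by chance under the null of no correlation."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))
```

Applying `consistent` to every pair of methods on every dataset, then `binomial_p_value` to the resulting counts, reproduces the structure of the significance analysis described here.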
For instance, given a dataset and two alignments (aln1 and aln2) generated by two different methods, the Qscore and the NiRMSD are consistent if they indicate the same relationship between the two alignments (e.g. aln1 better than aln2 according to both the Qscore AND the NiRMSD). This measure was used to analyze every possible pair of methods (Table 2a,b). The results show that the Qscore and the NiRMSD are highly correlated, with 90% consistency between the two measures on core regions and 82% when considering entire alignments. The correlation is not affected by the level of identity between the considered sequences. These figures were measured on more than 17,000 pairs of alignments. We checked these results for statistical significance, using a binomial test and assuming an equal probability of 0.5 for consistency and inconsistency. The results are highly significant in each category, with P-values systematically lower than 10^-6. These results confirm that the NiRMSD measure is at least as discriminative as Prefab.

The authors also thank Dr Philipp Bucher, who provided many of the original ideas through useful discussions.

REFERENCES

Do,C.B., Mahabhashyam,M.S., Brudno,M. and Batzoglou,S. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340. Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797. Goldsmith-Fischman,S. and Honig,B. (2003) Structural genomics: computational methods for structure analysis. Protein Sci., 12, 1813–1821. Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138. Huang,Y.M. and Bystroff,C. (2006) Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions. Bioinformatics, 22, 413–422. Katoh,K., Kuma,K., Toh,H. and Miyata,T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518.
Kleywegt,G.J. and Jones,T.A. (1994) Superposition. CCP4/ESF-EACBM Newsletter Protein Crystallog., 31, 9–14. Kleywegt,G.J. and Jones,T.A. (1999) Software for handling macromolecular envelopes. Acta Crystallogr. D Biol. Crystallogr., 55, 941–944. Kolodny,R., Koehl,P. and Levitt,M. (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol., 346, 1173–1188. Lackner,P., Koppensteiner,W.A., Sippl,M.J. and Domingues,F.S. (2000) ProSup: a refined tool for protein structure alignment. Protein Eng., 13, 745–752. Lesk,A.M. and Chothia,C. (1980) How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol., 136, 225–270. Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469–2471. Morgenstern,B. (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211–218. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. O'Sullivan,O., Suhre,K., Abergel,C., Higgins,D.G. and Notredame,C. (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol., 340, 385–395. O'Sullivan,O., Zehnder,M., Higgins,D., Bucher,P., Grosdidier,A. and Notredame,C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19 (Suppl. 1), i215–i221. Raghava,G.P., Searle,S.M., Audley,P.C., Barber,J.D. and Barton,G.J. (2003) OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4, 47. Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Thompson,J., Higgins,D. and Gibson,T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690. Thompson,J.D., Koehl,P., Ripp,R. and Poch,O. (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins, 61, 127–136. Van Walle,I., Lasters,I. and Wyns,L. (2005) SABmark: a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21, 1267–1268. Wallace,I.M., Blackshields,G. and Higgins,D.G. (2005) Multiple sequence alignments. Curr. Opin. Struct. Biol., 15, 261–266.

CONCLUSION

We describe the iRMSD, a measure with all the advantages and properties of a standard RMSD that does not require any structure superposition. A simple normalization makes it possible to use the iRMSD for evaluating the accuracy of structure-based sequence alignments. This measure, named the NiRMSD, was applied to the alignments produced by six popular multiple sequence alignment packages. In 90% of the cases, the NiRMSD measure was in agreement with the Prefab ranking (Qscore). These findings, highly significant from a statistical point of view, suggest the suitability of this new measure for evaluating sequence alignment accuracy whenever structural information is available. We also expect that the method can easily be extended to sequences having a close homologue with a known structure. Future developments will involve applying the iRMSD to multiple structure alignment analysis. We are also planning to use the NiRMSD measure to compare structure alignment packages and check whether some methods clearly outperform the others or whether a structure alignment meta-method should be designed instead. Further refinements could also involve exploring the capacity of the iRMSD measure to automatically identify and exclude unalignable positions.
ACKNOWLEDGEMENTS

The development of this project was supported by CNRS (Centre National de la Recherche Scientifique), Sanofi-Aventis Pharma SA, Marseille-Nice Génopole and the French National Genomic Network (RNG). We thank Prof. Jean-Michel Claverie (head of IGS) for useful discussions and material support.

BIOINFORMATICS Vol. 17 no. 0 2001 Pages 1–3

Mocca: semi-automatic method for domain hunting

Cédric Notredame, Information Génétique et Structurale, CNRS-UMR 1889, 31 Ch. Joseph Aiguier, 13 402 Marseille, France

Received on Month xx, 2000; revised and accepted on Month xx, 2000

ABSTRACT
Motivation: Multiple OCCurrences Analysis (Mocca) is a new method for repeat extraction. It is based on the TCoffee package (Notredame et al., JMB, 302, 205–217, 2000). Given a sequence or a set of sequences, and a library of local alignments, Mocca extracts every segment of sequence homologous to a pre-specified master. The implementation is meant for domain hunting and makes it fast and easy to test for new boundaries or extend known repeats in an interactive manner. Mocca is designed to deal with highly divergent protein repeats (less than 30% amino acid identity) of more than 30 amino acids.
Availability: Mocca is available on request (cedric.notredame@gmail.com). The software is free of charge and comes along with complete documentation.

Given some approximate information concerning the whereabouts of one of the repeats (the master repeat), Mocca allows the user to tune the parameters describing the repeat family (i.e. start position, length of the master repeat and stringency of the search) and to extract the other occurrences of that repeat within the dataset. The procedure is fast and simple.

INTRODUCTION
Many proteins consist of separately evolved, independent structural units called modules or domains.
The great diversity of protein functions is partly due to the vast number of possibilities to arrange a finite number of those basic units (Chothia, 1992). It is generally agreed that a domain is a self-folding unit made of a minimum of 25 amino acids (Bairoch et al., 1997; Corpet et al., 1998). Many of these domains appear as homologous subsequences repeated within a sequence or within a set of sequences, hence the importance of repeat identification in the course of domain hunting. Many tools exist for discovering and extracting these repeats; without being exhaustive, one can cite PSI-BLAST (Altschul et al., 1997), dot matrices (Junier and Pagni, 2000), Repro (Heringa and Argos, 1993) and the Gibbs Sampler (Lawrence et al., 1993). More recently, Heger and Holm developed a method meant to scan databases for repeats without manual intervention (Heger and Holm, 2000). These automatic methods all share the same drawback: while none of them is 100% accurate, they give the user little scope for testing their own hypotheses in a seamless manner. Multiple OCCurrences Alignment (Mocca) addresses that specific problem.

© Oxford University Press 2001

METHODS
Mocca uses a pair-wise sequence alignment algorithm (Durbin et al., 1998). The cost associated with the alignment of each pair of residues uses the 'library extension' developed for T-Coffee (Notredame et al., 1998, 2000). Figure 1 outlines the strategy used to generate the T-Coffee scoring scheme. Firstly, a primary library is compiled; it contains a series of local alignments obtained using Lalign, an implementation of the Sim algorithm (Huang and Miller, 1991). Given two sequences, Lalign extracts the N top-scoring non-overlapping local alignments. We used a modified version that compares two sequences (or a sequence with itself) and extracts every top-scoring alignment having a length of more than ten residues and an average level of identity higher than 30%.
Lalign reports each alignment along with a score that indicates its statistical significance. In our primary library, such local alignments appear as a series of pairs of residues, where each pair receives a weight equal to the score of the alignment it comes from. Given a set of N sequences, the library contains the result of all the possible pair-wise comparisons (including the self-comparisons). This library is fed into T-Coffee to generate the position-specific scoring scheme using the 'library extension' algorithm (Notredame et al., 2000). In Mocca, a prerequisite to repeat extraction is the estimation of at least one basic repeat unit among the sequences being analysed (the master repeat). In the context of this work, we made the estimation using Dotlet, a Java-based dot matrix method (Junier and Pagni, 2000). The master repeat is a sub-string selected within the sequence(s) used to build the library. Mocca extracts every sub-string homologous to the master in a single pass over the target sequences. It is the library extension that makes it possible for a single repeat to 'recognize' each of its homologues, even the distant ones. The main computational requirement is the construction of the Lalign library, O(N²L²); the motif extraction itself requires little time (12 s on an IRIX O2 station for 20 sequences totalling 5000 residues). If the position of one of the repeats is known, the procedure can also be run automatically from the command line. It is recommended to use Mocca in conjunction with other means for the initial estimation of the repeat boundaries (PSI-BLAST, Altschul et al., 1997; Dotlet, Junier and Pagni, 2000; Dotter, Sonnhammer and Durbin, 1995; …). Our tests show that Mocca can properly deal with sets of repeats whose multiple alignment indicates less than 15% average identity. While we currently use Lalign as a source of local information, any other sensible source could be considered.
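The weighted primary library described above can be sketched as follows (a simplified illustration: Lalign itself is not reimplemented, and the input format, a list of (score, residue-pair-list) tuples, is our own; the length and identity filtering is assumed to have happened upstream):

```python
from collections import defaultdict

def build_primary_library(local_alignments):
    """Turn local alignments into a weighted residue-pair library.

    local_alignments: list of (score, pairs) tuples, where pairs is a list
    of (seq1_residue, seq2_residue) index pairs coming from one Lalign-style
    local alignment. Each residue pair receives the score of the alignment
    it comes from; weights accumulate when the same pair occurs in several
    local alignments.
    """
    library = defaultdict(float)
    for score, pairs in local_alignments:
        for pair in pairs:
            library[pair] += score
    return dict(library)
```

A library of this shape, one weight per residue pair, is exactly the kind of input that could come from sequence comparison or from any other source of weighted local correspondences.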
For instance, structural information could easily be added to our procedure, using off the shelf libraries of local structural similarities such as the Dali Domain Dictionary (Holm and Sander, 1998). The input format of Mocca is straightforward and well documented. Mocca is a refinement tool for the discovery and the establishment of new domains. If the master repeat is replaced with a profile or a collection of known characterized repeats, Mocca could also be used to improve the model of a given repeat family and extend the predictive power of its profiles. Fig. 1. Layout of the Mocca strategy. The main steps required to extract a repeat with Mocca method are shown. Square blocks designate procedures while rounded blocks indicate data structures. ACKNOWLEDGEMENTS The author wishes to thank the following people: Des Higgins for very helpful comments. Jaap Heringa, Philipp Bucher and Kay Hoffmann for useful discussions and advice at an early stage of the project, Hiroyuki Ogata for helpful comments on the program. REFERENCES Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. Bairoch,A., Bucher,P. and Hofmann,K. (1997) The PROSITE database, its status in 1997. Nucleic Acids Res., 25, 217–221. Chothia,C. (1992) Proteins: 1000 families for the molecular biologist. Nature, 357, 543–544. Corpet,F., Gouzy,J. and Kahn,D. (1998) The ProDom database of protein domain families. Nucleic Acids Res., 26, 323–326. Durbin,R., Eddy,S., Krogh,A. and Mitchinson,G. (1998) Biological Sequence Analysis. 1 vols, Cambridge University Press, Cambridge. Heger,A. and Holm,L. (2000) Rapid automatic detection and alignment of repeats in protein sequences. Proteins, 41, 224–237. Heringa,J. and Argos,P. (1993) A method to recognise distant repeats in protein sequences. Proteins: Struct. Funct. Genet., 17, 391–411. Holm,L. and Sander,C. 
(1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96. makes it possible for a single repeat to ‘recognize’ each of its homologues (even the distant ones). The extraction process relies on a very efficient dynamic programming procedure known as repeated matches (Durbin et al., 1998). This algorithm reports a series of non-overlapping sub-strings each of them having an alignment to the master associated with a score higher than some pre-specified threshold T h. T h is empirically set to be a function of the maser repeat length (L): Th = S ∗ L S has a value between 0 and 1. By default, S = 0.05, but its value can be modified interactively. Two other parameters can also be modified to increase sensitivity and accuracy: the gap opening penalty and the gap extension. Mocca is part of the T-Coffee package. It is written in Perl and ANSI C. It runs on any UNIX or LINUX platform. It is available free of charge along with documentation. Copies can be obtained on request by sending an e-mail to cedric.notredame@gmail.com. The main com2 Mocca and domain hunting Huang,X. and Miller,W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math., 12, 337–357. Junier,T. and Pagni,M. (2000) Dotlet: diagonal plots in a web browser. Bioinformatics, 16, 178–179. Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S., Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214. Notredame,C., Holm,L. and Higgins,D.G. (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14, 407–422. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: A novel algorithm for multiple sequence alignment. JMB, 302, 205–217. Sonnhammer,E.L. and Durbin,R. (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene, 167, GC1-10. 
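The repeated-matches extraction described in the Methods above can be sketched as follows. This is an illustrative re-implementation of the recursion given by Durbin et al. (1998), not Mocca's own C code: the scoring parameters are arbitrary toy values, and in Mocca the threshold would be derived as Th = S × L (S = 0.05 by default) rather than passed in directly.

```python
# Sketch of the "repeated matches" dynamic programming of Durbin et al. (1998):
# report non-overlapping segments of a target x, each aligning to (part of)
# the master y with a score above the threshold T. Scores are toy values.
def repeated_matches(x, y, match=2.0, mismatch=-1.0, gap=2.0, T=1.0):
    """Return (total_score, segments), segments being 0-based (start, end)
    slices of x, one per reported match."""
    n, m = len(x), len(y)
    NEG = float("-inf")
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    P = [[None] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0.0
    for i in range(1, n + 1):
        # column 0 is the "unmatched" state: carry it over, or close a match
        # (closing subtracts T, so only matches scoring above T contribute)
        F[i][0], P[i][0] = F[i - 1][0], ("stay",)
        for j in range(1, m + 1):
            if F[i - 1][j] - T > F[i][0]:
                F[i][0], P[i][0] = F[i - 1][j] - T, ("close", j)
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j], P[i][j] = max(
                (F[i][0], ("open",)),              # a new match may start here
                (F[i - 1][j - 1] + s, ("diag",)),  # align x[i-1] with y[j-1]
                (F[i - 1][j] - gap, ("up",)),      # gap in the master
                (F[i][j - 1] - gap, ("left",)),    # gap in the target
            )
    # virtual end cell: stay unmatched, or close one final match
    i, j, end, segments = n, 0, None, []
    total = F[n][0]
    for jj in range(1, m + 1):
        if F[n][jj] - T > total:
            total, j, end = F[n][jj] - T, jj, n
    while i > 0 or j > 0:
        move = P[i][j]
        if move[0] == "stay":
            i -= 1
        elif move[0] == "close":
            i -= 1
            j, end = move[1], i
        elif move[0] == "open":
            j = 0
        elif move[0] == "diag":
            i, j = i - 1, j - 1
        elif move[0] == "up":
            i -= 1
        else:  # "left"
            j -= 1
        if j == 0 and end is not None:   # we just left a matched segment
            segments.append((i, end))
            end = None
    segments.reverse()
    return total, segments
```

With two exact copies of a master "AB" in the target "ABXAB", the sketch reports the two non-overlapping segments, each paying the threshold once.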
Optimization of ribosomal RNA profile alignments

Motivation: Large alignments of ribosomal RNA sequences are maintained at various sites. New sequences are added to these alignments using a combination of manual and automatic methods. We examine the use of profile alignment methods for rRNA alignment and try to optimize the choice of parameters and sequence weights.

Results: Using a large alignment of eukaryotic SSU rRNA sequences as a test case, we empirically compared the performance of various sequence weighting schemes over a range of gap penalties. We developed a new weighting scheme which gives most weight to the sequences in the profile that are most similar to the new sequence. We show that it gives the most accurate alignments when combined with a more traditional sequence weighting scheme.

Availability: The source code of all software is freely available by anonymous ftp from chah.ucc.ie in the directory /home/ftp/pub/emmet, in the compressed file PRNAA.tar.

Contact: emmet@chah.ucc.ie, des@chah.ucc.ie

Introduction

Ribosomal RNA (rRNA) sequences are widely used to estimate the phylogenetic relatedness of groups of organisms (e.g. Sogin et al., 1986; Pawlowski et al., 1996), especially those of the small subunit (SSU rRNA). The SSU rRNA has been sequenced from thousands of different species, and large alignments are maintained at several sites (Maidak et al., 1997; Van de Peer et al., 1997). The alignments are large and complex, and the addition of new sequences is a demanding task, either for the alignment curators or for individuals who wish to align new sequences with existing aligned sequences. In simple cases, automatic alignment programs such as Clustal W (Thompson et al., 1994a) may be used to align groups of closely related sequences or as a prelude to manual refinement.
There may be large stretches of unambiguous alignment with high sequence identity which may be useful for phylogenetic purposes. The fully automated, accurate alignment of rRNA sequences remains a difficult problem, however. In principle, one can use profile alignment methods (Gribskov et al., 1987), which use dynamic programming algorithms (Needleman and Wunsch, 1970; Gotoh, 1982), to align a new sequence against an existing 'expert' alignment. For example, one could take an alignment of all SSU rRNA sequences from one of the rRNA collections and use it as a guide, aligning each new sequence in turn and treating the large alignment as a profile. This approach has the advantage of simplicity and speed, but the final accuracy may be limited by the lack of any ability to use secondary-structure information. The RNALIGN approach (Corpet and Michot, 1994) and the stochastic context-free grammar approach (Eddy and Durbin, 1994; Sakakibara et al., 1994) provide elegant methods for the alignment of rRNA sequences, taking both primary sequence and secondary structure into account. These methods, however, are very demanding in computer resources and cannot deal easily with pseudoknots, so their immediate application to the alignment of SSU rRNA sequences is not trivial. In this paper, we examine, empirically, the effectiveness of profile alignment methods for the alignment of rRNA sequences. We remove test sequences from existing 'expert' alignments and measure the extent to which they can be realigned with the original alignment automatically. We use the eukaryotic SSU rRNA sequences from Van de Peer et al. (1997) as a test case. For a range of test sequences, we measure the number of positions that can be correctly realigned over a range of different parameters (gap opening and gap extension penalties). Sequence weighting has been shown to increase the reliability of profile alignments using amino acid sequences (Thompson et al., 1994b).
This can be used to give less weight to clusters of closely related sequences and increased weight to sequences with no close relatives, in order to counteract the effect of unequal sampling across a phylogenetic tree of possible sequences. We examine the effectiveness of one commonly used scheme (Thompson et al., 1994b). We also propose a new weighting scheme which is designed to give increased weight to those sequences in the profile (reference alignment) which are closest (highest sequence identity) to the new sequence being aligned. If a new mammalian sequence is being aligned, for example, it makes most sense to give a high weight to other mammalian sequences and decreasing weights to sequences that are more and more distantly related.

BIOINFORMATICS © Oxford University Press

Some sections of SSU rRNA sequences are from regions whose secondary structure is conserved across many species. These conserved, 'core', regions are relatively easy to align with high accuracy, but are interspersed with less conserved regions that may be very difficult to align. We empirically determine which regions of the eukaryotic reference alignment can be aligned with high accuracy by a simple jack-knife experiment. We remove each sequence, one at a time, and try to realign it with the rest. It is then a simple matter to count how often each nucleotide of each sequence is correctly realigned.
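The per-residue bookkeeping used in this jack-knife can be sketched as follows. This is a minimal sketch with invented function names (the authors' C implementation is not shown in the paper): a residue counts as correctly realigned when it lands in the same column it occupied in the reference alignment.

```python
# Sketch only: scoring a realignment against the reference alignment.
def residue_columns(aligned_row, gap="-"):
    """Map each residue of one aligned row, in sequence order, to the
    alignment column it occupies."""
    return [col for col, ch in enumerate(aligned_row) if ch != gap]

def realignment_accuracy(reference_row, realigned_row, gap="-"):
    """Fraction of residues placed in the same column as in the reference.
    Both rows carry the same sequence, so the residue counts agree."""
    ref = residue_columns(reference_row, gap)
    new = residue_columns(realigned_row, gap)
    assert len(ref) == len(new), "rows must contain the same sequence"
    correct = sum(1 for a, b in zip(ref, new) if a == b)
    return correct / len(ref)
```

Summing the same per-residue comparison over columns instead of rows gives the per-column reliability used later to delimit the conserved core regions.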
This gives a definition of conserved core regions that is purely empirical and that can be used to delimit regions of the alignment which can safely be used in phylogenetic research. Finally, we examine the effect of the G+C content of each sequence on the accuracy of alignment. Sequences of high or low G+C may be expected to be more difficult to align than those with more balanced nucleotide compositions.

System and methods

Small subunit ribosomal RNA

An alignment of eukaryotic, nuclear SSU rRNA sequences (dated May 6, 1997) was obtained from the World Wide Web server at http://www-rrna.uia.ac.be/ssu/index.html (Van de Peer et al., 1997). After removal of the columns consisting only of gaps, the two incomplete sequences of Butomus umbellatus and the unaligned sequence of Babesia bovis, the alignment contains 1517 sequences and is 5370 characters long. Individual sequences vary widely in length, from <1300 nucleotides to >2500. Sixteen test sequences were removed from, and realigned with, the reference alignment in order to measure the accuracy with which it was possible to recreate their original alignment. The sequences used were Drosophila melanogaster, Xenopus laevis, Homo sapiens, Caenorhabditis elegans, Saccharomyces cerevisiae, Oryza sativa, Dictyostelium discoideum, Euglena gracilis, Ammonia beccarii, Physarum polycephalum, Entamoeba histolytica, Vahlkampfia lobospinosa, Giardia sp., Naegleria gruberi, Hexamita sp. and Trypanosoma brucei. These sequences were chosen based on a phylogenetic tree of all the sequences in the alignment, in order to give a spread of test cases over a wide range of different positions in the tree. Re-alignment was carried out over a range of gap penalties and using a number of sequence weighting schemes, as described below.

Dynamic programming

The reference alignment was converted into a profile (Gribskov et al., 1987) which contains information on the frequency of each residue, and of gaps, at each position. The test sequences were aligned with this profile using a dynamic programming algorithm (Needleman and Wunsch, 1970). We used Gotoh's algorithm (Gotoh, 1982) and maximized the similarity between the sequence and the profile. A homogeneous column in the profile (just one of the four residues, with no gaps) will get a score of 1.0 when aligned with the same residue in the test sequence and a score of 0 otherwise. Other columns score in proportion to the frequency of each of the four residue types. At positions in the profile where one or more of the sequences has a gap, gaps were treated as a class of residue for the frequency calculations. Other methods have been proposed for generating profiles using the natural logarithms of residue frequencies, which may be normalized by overall residue frequencies to give log-odds scores (see Henikoff and Henikoff, 1996 for a review). We carried out some tests using the latter scheme and found that its performance was comparable although slightly inferior to that obtained using simple frequencies. Therefore we only present results obtained using the frequencies.

Gap penalties

A range of gap opening and extension penalties was used in alignment generation. For each test sequence and each weighting scheme, a total of 81 alignments was carried out: gap opening penalties ranging from 1 to 9 in increments of 1, and gap extension penalties ranging from 0.1 to 0.9 in increments of 0.1. This range of ratios between gap penalties and residue match scores was chosen as it encompasses values empirically shown to give alignments of biological relevance. Terminal gaps were penalized solely with an extension penalty. Position-specific gap opening penalties were derived from the frequency of gaps at each position along the alignment. At each position, a value equal to the number of residues (non-gap characters) in the column divided by the number of sequences in the alignment was derived. This value was then multiplied by the gap opening penalty, taken from the range above, to give a specific gap opening penalty at each position. This gives gap opening penalties which are higher at positions mostly occupied by residues than at positions mostly occupied by gaps.

Sequence weighting

By default, each sequence in the existing alignment will have an equal effect on the alignment of new sequences with the profile. If additional information is available concerning the relationships of the sequences within the alignment to each other and to the sequence being aligned, this may not be optimal. For example, if a new sequence is identical to a sequence already in the alignment, the correctly aligned position of each residue in the new sequence could be deduced solely from that one identical sequence, and no information concerning the other sequences is necessary. Further, sampling bias can lead to an unequal representation of taxa within the alignment (e.g. there might be very many sequences from some taxa and very few from others), and it is possible to use sequence weighting to correct for this also.

E.A.O'Brien, C.Notredame and D.G.Higgins
Fig. 1. Tree of the sequences that were used as test cases. The weights for these sequences under different weighting schemes are given in Table 1.

Three different weighting schemes were applied to the sequences in the SSU rRNA alignment, and compared with the default of equal weights. The first weighting scheme, referred to as tree-based weights, is based on a phylogenetic tree of the sequences in the alignment. A neighbour-joining tree (Saitou and Nei, 1987) of all the sequences in the profile was generated using the DNADIST and NEIGHBOR programs of the PHYLIP package (Felsenstein, 1989). Weights were then derived from the branch lengths as described by Thompson et al. (1994b). These weights are then normalized to have a mean of 1.0.
This gives a total weight for the profile equal to that where each sequence is weighted equally, which is necessary in order to keep the effects of changing gap penalties congruent across the different schemes. The general effect of these tree-based weights is to downweight sequences with many close relatives, in order to prevent the more densely populated regions of the tree exerting a disproportionate effect on the alignment of sequences from other regions of the tree.

The second weighting scheme is based on the level of similarity between the sequence being aligned and each individual sequence in the alignment, and is referred to as identity-based weighting. The new sequence is first aligned with the profile using equal weights. A distance is then calculated between the new sequence and each other sequence in the alignment, equal to the mean number of differences per site in this initial approximate alignment. This is the percent difference divided by 100; there is no correction for multiple hits or for unequal rates of transition and transversion. The reciprocal of this distance is used as a weight for each sequence, and these weights are again normalized to give a mean of 1.0. This weighting scheme has the effect of upweighting sequences more similar to the sequence being added relative to those that are more distantly related. The upweighting effect increases as the sequences become more similar to the sequence being aligned.

The third scheme is a combination of these weighting schemes, in which the weight derived for each sequence based on branch lengths is multiplied by the weight derived from sequence identities, and the values are again renormalized. This scheme is referred to as combination weights. Table 1 shows the values given by the various weighting schemes for the case shown in the example tree in Figure 1. The tree-based weights are independent of the new sequence that is to be added, being derived wholly from the structure of the existing data.
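The identity-based and combination weighting described above can be sketched as follows. The function names are ours, and the sketch assumes every distance is non-zero (a sequence identical to one already in the alignment would need special handling):

```python
# Sketch only: the renormalization and the identity/combination weights
# described in the text; names are assumed, not from the authors' code.
def normalize(weights):
    """Rescale so the mean weight is 1.0, keeping the total weight of the
    profile equal to that obtained with equal weighting."""
    mean = sum(weights.values()) / len(weights)
    return {name: w / mean for name, w in weights.items()}

def identity_weights(distances):
    """distances: mean differences per site (0..1, assumed non-zero) between
    the new sequence and each profile sequence; weight = 1/distance,
    renormalized to a mean of 1.0."""
    return normalize({name: 1.0 / d for name, d in distances.items()})

def combination_weights(tree_w, ident_w):
    """Product of tree-based and identity-based weights, renormalized."""
    return normalize({name: tree_w[name] * ident_w[name] for name in tree_w})
```

With these definitions, close relatives of the new sequence are upweighted by the identity term, while the tree term still damps densely sampled taxa.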
Weights are calculated using the method of Thompson et al. (1994b) and then renormalized to give a mean of 1, leaving the values shown. The identity-based weights are derived by taking the distance of each sequence in the tree from the new sequence, defined as the mean number of differences per aligned pair of residues, ignoring any pairs with a gap in either sequence. The reciprocals of these values are renormalized around 1 to give the figures shown. For the final set of combination weights, the product is taken of the weights in each of the preceding columns and again renormalized to give a mean of 1.

Table 1. The weights assigned to the sequences in the test tree shown in Figure 1 when the sequences Mus musculus and Plasmodium gallinaceae were added

Sequence                    (a)     (b)     (c)      (d)     (e)     (f)
Ammonia beccarii            1.000   0.746   0.273    1.256   0.379   0.991
Caenorhabditis elegans      1.000   0.974   0.289    1.008   0.522   1.038
Dictyostelium discoideum    1.000   0.875   0.250    1.049   0.406   0.968
Drosophila melanogaster     1.000   0.727   0.349    1.054   0.470   0.809
Entamoeba histolytica       1.000   1.194   0.225    0.984   0.500   1.241
Euglena gracilis            1.000   1.519   0.198    0.809   0.557   1.298
Giardia sp.                 1.000   1.340   0.193    0.773   0.481   1.094
Hexamita sp.                1.000   1.266   0.206    0.854   0.484   1.141
Homo sapiens                1.000   0.411   10.628   1.053   8.088   0.456
Naegleria gruberi           1.000   1.212   0.204    0.942   0.459   1.205
Oryza sativa                1.000   0.511   0.390    1.235   0.370   0.667
Physarum polycephalum       1.000   1.435   0.205    0.856   0.547   1.298
Saccharomyces cerevisiae    1.000   0.516   0.377    1.302   0.361   0.708
Trypanosoma brucei          1.000   1.488   0.211    0.846   0.583   1.329
Vahlkampfia lobospinosa     1.000   1.398   0.196    0.889   0.508   1.313
Xenopus laevis              1.000   0.383   1.798    1.082   1.278   0.438

Columns represent the following schemes: (a) equal sequence weights, (b) tree-based sequence weights, (c) identity-derived weights for each sequence for the alignment of Mus musculus, (d) identity-derived sequence weights for each sequence for the alignment of Plasmodium gallinaceae, (e) combination of tree and
identity-derived weights for Mus musculus, (f) combination of tree and identity-derived weights for Plasmodium gallinaceae.

For each of the three defined weighting schemes and the default of equal weights, alignments were generated using position-specific gap-opening penalties across the range of gap extension penalties and base gap opening penalties described above. This procedure was repeated for each of the test sequences. The number of residues correctly placed in each alignment was determined by comparison with the sequence as originally aligned in the reference alignment, and this was then divided by the total number of residues in the sequence to give a percentage score for the alignment. From the scores for the alignments across the range of gap opening and gap extension penalties for each test case, the gap penalties giving the best performance across all or most of the test cases were obtained.

Results

The performance of a set of weights was judged by its efficacy across the range of gap opening and gap extension penalties used. The peak score and the range of gap penalties giving a comparable score were taken into account in making this judgement (Table 2). For scoring purposes, each residue is counted as distinct, and is only considered correctly aligned if it is in the same position as the same residue in the reference sequence. The score for a sequence is counted as the percentage of the total number of residues in the sequence that have been correctly realigned. The main results are presented in Table 2. In the first column, the percentage alignment accuracy scores are given for each of the 16 test cases. These scores are the best obtained across the range of gap opening and extension penalties with no sequence weights. The scores are low, ranging from 43% (Euglena) up to 88% (Oryza). The addition of position-specific gap penalties has a dramatic effect.
The scores all increase by about 10–15%, which represents several hundred additional residues of the original sequences being correctly aligned. The use of sequence weights yields further improvements, although not as dramatic. It should be noted that an improvement in score of just 1% is the equivalent of 20 residues in a molecule of 2000 nucleotides. We only give the peak scores from across the full range of gap opening and extension penalties. These were all obtained with a gap opening penalty of between 5.0 and 7.0 and a gap extension penalty of either 0.1 or 0.2.

Implementation

Programs were developed and/or run on DEC Alpha workstations running DEC UNIX. All new code was written in the C programming language and is freely available by anonymous FTP (log in as anonymous to chah.ucc.ie and transfer the compressed tar archive PRNAA.tar). The code is not designed for portability, and users will have to download their own rRNA alignments and build their own profiles; a JAVA version of the programs is being developed which will be used to provide future access to all the methods via the Internet.

Table 2. The highest % identity between the reference alignment and the realigned sequence obtained using each of the weighting schemes

Sequence          (a)     (b)     (c)     (d)     (e)
A.beccarii        71.65   84.19   83.66   84.05   83.96
C.elegans         69.26   83.98   83.98   86.99   87.84
D.discoideum      64.42   78.95   78.95   79.59   79.06
D.melanogaster    70.14   82.72   82.97   81.11   84.02
E.histolytica     55.68   73.50   74.83   75.04   78.17
E.gracilis        43.12   60.22   60.22   60.22   61.08
Giardia sp.       55.00   73.89   73.96   76.81   77.29
Hexamita sp.      56.13   73.10   73.61   78.39   77.16
H.sapiens         79.88   91.01   92.88   91.49   92.30
N.gruberi         50.37   63.60   63.74   67.81   67.86
O.sativa          88.85   97.08   97.13   96.69   97.35
P.polycephalum    53.62   65.02   64.66   68.64   67.52
S.cerevisiae      86.71   93.94   94.55   93.38   94.10
T.brucei          47.62   62.86   63.39   64.77   65.04
V.lobospinosa     46.23   56.20   55.69   56.20   58.96
X.laevis          82.47   93.59   95.18   94.25   95.07

(a) Fixed gap penalties and equal sequence weights, (b) position-specific gap penalties and equal sequence weights, (c) position-specific gap penalties and identity-based weights, (d) position-specific gap penalties and tree-based weights, (e) position-specific gap penalties and combination weights. The underlined values are the absolute maximum scores obtained for each sequence.

Table 3. Alignment percentage accuracy scores for various weighting schemes and gap penalties

Gap extension penalty (a) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (b) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

58 58 59 58 58 58 58 57 57
59 59 60 59 59 59 60 60 60
60 61 62 60 60 61 61 61 61
62 62 63 61 61 62 62 62 62
62 62 63 63 63 63 62 62 61
61 62 63 63 63 63 62 62 61
61 62 63 63 63 63 62 62 61
61 62 63 63 63 63 62 62 61
61 62 63 63 63 63 62 62 61

51 50 51 51 51 51 52 51 51
53 53 53 54 53 53 53 54 54
55 54 55 54 54 54 54 54 54
56 55 55 55 55 55 55 55 55
56 55 55 55 55 55 55 55 55
56 55 55 55 55 55 55 55 55
56 55 55 55 55 55 55 55 55
56 55 55 55 55 55 55 55 55
56 55 55 55 55 55 55 55 55

Cont....
46 32 16 13 4 2 2 1 1 47 32 17 12 4 2 2 1 1 47 33 17 12 5 2 2 1 1 47 33 17 12 5 2 2 1 1 47 33 16 12 5 2 2 1 1 47 33 16 12 5 2 2 1 1 47 33 16 12 5 2 2 1 1 47 33 16 12 5 2 2 1 1 47 33 16 12 5 2 2 1 1 45 31 17 10 6 5 4 3 3 46 33 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 Trypanosoma brucei gap opening penalty 1 2 3 4 5 6 7 8 9 Vahlkampfia lobospinosa gap opening penalty 1 2 3 4 5 6 7 8 9 336 rRNA profiles Table 3. Continued Gap extension penalty (c) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (d) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (e) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 62 60 61 60 60 59 59 59 59 64 62 62 62 60 61 61 61 62 64 63 63 62 62 63 62 62 62 65 64 64 64 63 63 63 62 62 65 64 64 64 64 63 63 62 62 65 64 64 64 64 62 63 62 62 65 64 64 64 64 62 63 62 62 65 63 64 64 64 63 63 62 62 65 63 64 64 64 63 63 62 62 56 56 55 55 55 55 55 55 55 58 57 57 57 57 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 62 61 60 61 60 59 59 59 59 63 63 63 62 61 61 61 61 61 65 64 63 63 62 62 61 61 61 65 64 64 64 64 63 62 62 61 64 64 64 64 64 63 61 61 61 64 64 64 64 64 63 61 61 61 64 64 64 64 64 63 61 61 61 64 64 64 64 64 63 61 61 61 64 64 64 64 64 63 61 61 61 51 50 51 51 51 51 52 51 51 53 53 53 54 53 53 53 54 54 55 54 55 55 54 54 54 54 54 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 58 59 59 58 58 58 58 57 57 59 59 60 58 59 59 60 60 60 60 61 62 60 60 61 61 61 61 62 62 63 61 61 61 62 62 62 62 62 63 63 63 63 62 62 62 61 62 63 63 63 63 62 62 62 61 62 62 62 62 63 62 61 61 61 62 62 62 62 63 62 61 61 61 62 62 62 62 63 62 61 61 51 50 51 51 51 51 52 51 51 53 53 53 54 54 53 53 54 54 55 54 54 54 54 54 54 54 54 56 55 55 
55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55

Trypanosoma brucei gap opening penalty; Vahlkampfia lobospinosa gap opening penalty

Italics represent those regions at or above the highest score attainable with equal sequence weights. Underlining represents the highest score attained across all the different parameters. Parameter sets are: (a) fixed gap penalties and equal sequence weights, (b) position-specific gap penalties and equal sequence weights, (c) position-specific gap penalties and identity-based sequence weights, (d) position-specific gap penalties and tree-derived sequence weights, (e) position-specific gap penalties and weights derived from a combination of tree-based and identity-based weights.

In nine out of the 16 test cases, the single best alignment score generated across the ranges of gap penalties was obtained using the combined weights (the last column of Table 2). In three of the remaining cases, tree-based weights give the best performance (column d). The identity-based weights give the highest score in three cases, and Ammonia beccarii is aligned most accurately with equal weights. Both identity-based and tree-based methods of sequence weighting are shown to improve over equal weights in most cases, with the combination of both giving the best overall performance.

Two examples are shown in detail in Table 3, where the scores for all values of gap opening and gap extension penalties are given for each weighting scheme for just two of the test cases: Vahlkampfia lobospinosa and Trypanosoma brucei. In both cases, the results with uniform gap penalties, shown in row (a), are very poor and depend strongly on the exact values of the parameters. There is a huge improvement in row (b), where the values for position-specific gap penalties are shown.
Here, the values are much higher than in row (a) and there is almost no dependence on the exact values chosen for the gap penalties. In the case of Vahlkampfia there is no noticeable difference between the use of tree-based or identity-based weights [the results are shown in rows (c), (d) and (b)]. Use of the combined weighting scheme, as seen in row (e), gives a consistent improvement, showing increase of 2% across the entire range of gap penalties. In the case of Trypanosoma the relative performance of each weighting scheme is more dis- 337 E.A.O’Brien, C.Notredame and D.G.Higgins tinct. In comparing identity weights to equal weights in this case, there is improvement for some values of gap penalty. The effect of using tree-based weights is to produce improvement across a larger range of gap penalties, particularly for gap extension penalties <0.3. The combination of the two weighting schemes again shows a synergistic effect, with a further increase visible across the range of gap penalties. The values of gap opening and gap extension penalties giving the maximum scores for each test case are given in Table 4. These are the optimum parameters when using the combined weighting scheme with position specific gap penalties. They all fall in a very narrow range. Table 4. Gap opening and extension penalties giving optimum alignment scores for each test case using combined weights Gap opening A.beccarii C.elegans D.discoideum D. melanogaster E.histolytica E.gracilis Giardia sp. Hexamita sp. H.sapiens N.gruberii O.sativa P.polycephalum S.cerevisiae T.brucei V.lobospinosa X.laevis 6.0 6.0 6.0 5.0 6.0 6.0 7.0 5.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 Gap extension 0.2 0.1 0.1 0.2 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.2 0.2 0.1 0.1 0.2 sequences with average G+C contents (∼50%). As expected, sequences with extreme nucleotide compositions (very high or very low G+C content) tend to be less easy to align accurately. 
High levels of a particular nucleotide increase the chance that a residue in the sequence being aligned may align with the wrong column in the profile. The test cases cover a range of G+C content from 38.4% (Entamoeba histolytica) to 68.5% (Giardia sp.). Discussion The generation of alignments under various parameters shows that position-specific gap opening penalties have a very strong positive effect on the accuracy with which alignments can be generated. Fixed gap penalties perform extremely poorly, particularly at high values of gap extension penalty. This corresponds to situations in which the long gaps that occur in virtually all sequences in certain regions of the alignment, which correspond to long insertions in a few sequences, are penalized very heavily and do not occur in an alignment giving an optimum score. Experimentation with position-specific gap extension penalties did not give any further improvement. Sequence weighting can have a further positive effect on alignment quality. Both weighting schemes based on sequence identity and those based on the tree structure and branch lengths are seen to have generally positive effects. As expected, the tree-based weights are seen to perform at their best in the case of sequences which are quite distant from the main taxa, with few or no close relatives, such as Hexamita, and to be of least benefit to alignment quality with sequences which have many close relatives such as O.sativa. With identity-based weights the greatest positive effects are seen in sequences within highly represented taxa such as S.cerevisiae. These two weighting schemes have opposing effects on the values of the sequence weights in the case of sequences aligning into densely populated regions of the tree, and so the net result of combining them, in cases such as S.cerevisiae, may not perform any better than either of the weighting schemes used individually. 
The examples given (Table 3) indicate that there are cases where tree-based and identity-based weights show a synergistic effect when combined, the combination outperforming either of the schemes applied individually. The combined weights give the best result in more than half of the test cases, and the average difference between the score generated with the combined weights and the overall best score is substantially less than the corresponding difference for any of the other weighting schemes. This synergy is seen to occur most strongly in sequences which are distant from the main bulk of the alignment and therefore more difficult to align correctly. Sequences located in highly represented taxa do not show such strong effects from any of the weighting schemes, but these tend to be the sequences which have the best alignments initially.

In order to tell which sections of the reference alignment may be reliably aligned, each of the 1517 sequences in turn was removed from the alignment and re-aligned with the remaining sequences. Each column of the original, reference alignment was scored according to the percentage of its residues that can be re-aligned in the correct positions. Figure 2 shows the estimated secondary structure of the Saccharomyces cerevisiae nuclear SSU rRNA with those positions from the full alignment which can be re-aligned with ≥95% accuracy marked in black and those which re-align with <95% accuracy in grey. Stems forming pseudoknots are not displayed in this representation. This is a conservative estimate of the regions that may be reliably aligned, as there are some positions that are not found in this molecule, and sequences from some taxonomic groupings may be aligned almost perfectly. Figure 3 shows the accuracy with which each sequence can be re-aligned compared to its original alignment as a function of G+C content.

Fig. 2. Secondary structure of Saccharomyces cerevisiae SSU rRNA with stable regions indicated in black, generated using the ESSA program (Chetouani et al., 1997).

Fig. 3. Graph of percentage of sequence correctly re-aligned against G+C content for each of the 1517 sequences in the reference alignment.

We have shown how to improve the accuracy of alignment of rRNA sequences using some simple methods. It is quite possible that alignments of 100% accuracy cannot be achieved, owing to errors introduced manually into the reference alignment. Nonetheless, we can already see that some sequences may be aligned with >95% accuracy (Oryza and Xenopus), and across the entirety of the alignment 89.84% of all residues can be re-aligned correctly. Some sequences are still disappointing, and this can partially be explained by very biased G+C content (e.g. Giardia). Others come from poorly sampled parts of the overall eukaryote phylogenetic tree, and these will become easier to align as new sequences are added. Nonetheless, it may be difficult for users to evaluate the quality of a new alignment. We provide one extremely simple method for choosing regions of the overall alignment that can be reliably aligned in almost all cases. This covers about half of the positions in any given molecule and provides a selection of sites which can be reliably chosen for phylogenetic purposes. This site selection can be fine-tuned by looking at regions which may be reliably aligned in specific taxa. Finally, it is very obvious that these methods could benefit from some consideration of secondary structure, which could be used for evaluation of alignments or as part of the alignment process. We are investigating the use of genetic algorithms to optimize the quality of profile alignments where secondary structure is considered (Notredame et al., 1997).
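The column-reliability measure described above — remove each sequence in turn, re-align it against the remaining sequences, and score each reference column by the fraction of its residues that are placed back in the same column — can be sketched as follows. The position-list representation (one column index per residue of each sequence) is an assumption made for illustration:

```python
def column_scores(ref_cols, realigned_cols, n_columns):
    """ref_cols / realigned_cols: {seq_name: [column of each residue]}.
    Returns, per reference column, the fraction of its residues that
    the leave-one-out re-alignment placed back in the same column."""
    placed = [0] * n_columns
    total = [0] * n_columns
    for name, cols in ref_cols.items():
        for residue, col in enumerate(cols):
            total[col] += 1
            if realigned_cols[name][residue] == col:
                placed[col] += 1
    return [p / t if t else 0.0 for p, t in zip(placed, total)]

# Toy example: two sequences, four columns; s1's third residue drifted.
ref = {"s1": [0, 1, 2], "s2": [0, 2, 3]}
rea = {"s1": [0, 1, 3], "s2": [0, 2, 3]}
scores = column_scores(ref, rea, 4)
reliable = [c for c, s in enumerate(scores) if s >= 0.95]  # the 95% cut-off
```

Columns passing the 95% cut-off correspond to the positions marked in black in Figure 2.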
We will use a genetic algorithm to optimize the quality function of Corpet and Michot (1994), but based on profiles rather than pairs of sequences.

Acknowledgements

The authors thank Richard Durbin for suggesting the use of the 1/d weights. We also thank Manolo Gouy for his help with rRNA sequences in general. This work was supported by a grant (BIO4-CT95-0130) from the EU Biotechnology programme.

References

Chetouani, F., Monestie, P., Thebault, P., Gaspin, C. and Michot, B. (1997) ESSA: an integrated and interactive computer tool for analysing RNA secondary structure. Nucleic Acids Res., 25, 3514–3522.
Corpet, F. and Michot, B. (1994) RNAlign program: alignment of RNA sequences using both primary and secondary structures. Comput. Applic. Biosci., 10, 389–399.
Eddy, S. and Durbin, R. (1994) RNA sequence analysis using covariance models. Nucleic Acids Res., 22, 2079–2088.
Felsenstein, J. (1989) Cladistics, 5, 164–166.
Gotoh, O. (1982) J. Mol. Biol., 162, 705.
Gotoh, O. (1995) A weighting system and algorithm for aligning many phylogenetically related sequences. Comput. Applic. Biosci., 11, 543–551.
Gribskov, M., McLachlan, A. and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Henikoff, J. and Henikoff, S. (1996) Using substitution probabilities to improve position-specific scoring matrices. Comput. Applic. Biosci., 12, 135–143.
Luthy, R., Xenarios, I. and Bucher, P. (1994) Improving the sensitivity of the sequence profile method. Protein Sci., 3, 139–146.
Maidak, B., Olsen, G., Larsen, N., Overbeek, R., McCaughey, M. and Woese, C. (1997) The Ribosomal Database Project (RDP). Nucleic Acids Res., 25, 109–111.
Needleman, S. and Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Neefs, J.-M., Van de Peer, Y., Hendriks, L. and De Wachter, R. (1990) Database on the structure of small subunit ribosomal RNA. Nucleic Acids Res., 18, 2237–2317.
Notredame, C., O'Brien, E.A. and Higgins, D.G. (1997) RAGA: RNA alignment by genetic algorithm. Nucleic Acids Res., 25, 4570–4580.
Pawlowski, J., Bolivar, I., Fahrni, J.F., Cavalier-Smith, T. and Gouy, M. (1996) Early origin of Foraminifera suggested by SSU rRNA gene sequences. Mol. Biol. Evol., 13, 445–450.
Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.
Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjolander, K., Underwood, R.C. and Haussler, D. (1994) Stochastic context-free grammars for tRNA modelling. Nucleic Acids Res., 22, 5112–5120.
Sogin, M., Elwood, H. and Gunderson, J. (1986) Evolutionary diversity of eukaryotic small-subunit rRNA genes. Proc. Natl Acad. Sci. USA, 83, 1383–1387.
Thompson, J., Higgins, D. and Gibson, T. (1994a) CLUSTAL W: improving the sensitivity of progressive multiple alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Thompson, J., Higgins, D. and Gibson, T. (1994b) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Applic. Biosci., 10, 19–29.
Van de Peer, Y., Jansen, J., De Rijk, P. and De Wachter, R. (1997) Database on the structure of small ribosomal subunit RNA. Nucleic Acids Res., 25, 111–116.
Long Noncoding RNAs with Enhancer-like Function in Human Cells

Ulf Andersson Ørom,1 Thomas Derrien,2 Malte Beringer,1 Kiranmai Gumireddy,1 Alessandro Gardini,1 Giovanni Bussotti,2 Fan Lai,1 Matthias Zytnicki,2 Cedric Notredame,2 Qihong Huang,1 Roderic Guigo,2 and Ramin Shiekhattar1,2,3,*

1The Wistar Institute, 3601 Spruce Street, Philadelphia, PA 19104, USA
2Centre for Genomic Regulation (CRG), UPF, Barcelona, Spain
3Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
*Correspondence: shiekhattar@wistar.org
DOI 10.1016/j.cell.2010.09.001
Cell 143, 46–58, October 1, 2010 ©2010 Elsevier Inc.

SUMMARY

While the long noncoding RNAs (ncRNAs) constitute a large portion of the mammalian transcriptome, their biological functions have remained elusive. A few long ncRNAs that have been studied in any detail silence gene expression in processes such as X-inactivation and imprinting. We used a GENCODE annotation of the human genome to characterize over a thousand long ncRNAs that are expressed in multiple cell lines. Unexpectedly, we found an enhancer-like function for a set of these long ncRNAs in human cell lines. Depletion of a number of ncRNAs led to decreased expression of their neighboring protein-coding genes, including the master regulator of hematopoiesis, SCL (also called TAL1), Snai1 and Snai2. Using heterologous transcription assays we demonstrated a requirement for the ncRNAs in activation of gene expression. These results reveal an unanticipated role for a class of long ncRNAs in activation of critical regulators of development and differentiation.

INTRODUCTION

Recent technological advances have allowed the analysis of the human and mouse transcriptomes with an unprecedented resolution. These experiments indicate that a major portion of the genome is being transcribed and that protein-coding sequences account for only a minority of cellular transcriptional output (Bertone et al., 2004; Birney et al., 2007; Cheng et al., 2005; Kapranov et al., 2007). The discovery of RNA interference (RNAi) in C. elegans (Fire et al., 1998) and the identification of a new class of small RNAs known as microRNAs (Lee et al., 1993; Wightman et al., 1993) led to a greater appreciation of RNA's role in the regulation of gene expression. MicroRNAs are endogenously expressed noncoding transcripts that silence gene expression by targeting specific mRNAs on the basis of sequence recognition (Carthew and Sontheimer, 2009). Over 1000 microRNA loci are estimated to be functional in humans, modulating roughly 30% of protein-coding genes (Berezikov and Plasterk, 2005). While microRNAs represent a minority of the noncoding transcriptome, the tangle of long and short noncoding transcripts is much more intricate, and is likely to contain as yet unidentified classes of molecules forming transcriptional regulatory networks (Efroni et al., 2008; Kapranov et al., 2007).

Long ncRNAs are transcripts longer than 100 nts which in most cases mirror the features of protein-coding genes without containing a functional open reading frame (ORF). Long ncRNAs have been implicated as principal players in imprinting and X-inactivation. The imprinting phenomenon dictates the repression of a particular allele, depending on its paternal or maternal origin. Many clusters of imprinted genes contain ncRNAs, and some of them have been implicated in the transcriptional silencing (Yang and Kuroda, 2007). Similarly, X chromosome inactivation relies on the expression of a long ncRNA named Xist, which is thought to recruit, in a cis-specific manner, protein complexes establishing repressive epigenetic marks that encompass the chromosome (Heard and Disteche, 2006). There is also a report indicating that a long ncRNA expressed from the HOXC locus may affect the expression of genes in the HOXD locus, which is located on a different chromosome (Rinn et al., 2007).
More recently, a set of long ncRNAs has been identified in mouse through the analysis of chromatin signatures (Guttman et al., 2009). There have also been reports of divergent transcription of short RNAs flanking the transcriptional start sites of active promoters (Core et al., 2008; Preker et al., 2008; Seila et al., 2008).

In search of a function for long ncRNAs, we used the GENCODE annotation (Harrow et al., 2006) of the human genome. To simplify our search we subtracted transcripts overlapping the protein-coding genes. Moreover, we filtered out the transcripts that may correspond to promoters of protein-coding genes and the transcripts that belong to known classes of ncRNAs. We identified 3019 putative long ncRNAs that display differential patterns of expression. Functional knockdown of multiple ncRNAs revealed their positive influence on the neighboring protein-coding genes. Furthermore, detailed functional analysis of a long ncRNA adjacent to the Snai1 locus using reporter assays demonstrated a role for this ncRNA in an RNA-dependent potentiation of gene expression. Our studies suggest a role for a class of long ncRNAs in positive regulation of protein-coding genes.

Figure 1. Identification of Novel Long ncRNAs in Human Annotated by GENCODE
(A) Analysis of coding potential using GeneID for ancestral repeats (AR), long ncRNAs annotated by GENCODE and protein-coding genes. (B) Conservation of the genomic transcript sequences for AR, long ncRNAs, protein-coding genes, and (C) of their promoters. (D) Expression analysis of 3019 long ncRNAs in human fibroblasts, HeLa cells and primary human keratinocytes, showing numbers for transcripts detected in each cell line and the overlaps between cell lines. All microarray experiments have been done in four replicates. See also Figure S1 and Table S1 and Table S2.

RESULTS

Noncoding RNAs Are Expressed and Respond to Cellular Differentiating Signals

To assign a function to uncharacterized human long ncRNAs, we identified unique long noncoding transcripts using the annotation of the human genome provided by GENCODE (Harrow et al., 2006) and performed by the human and vertebrate analysis and annotation (HAVANA) group at the Sanger Institute. Such genomic annotation is being produced in the framework of the ENCODE project (Birney et al., 2007). At the time of our analysis, the GENCODE annotation encompassed about one third of the human genome. Such an annotation relies on the human expert curation of all available experimental data on transcriptional evidence, such as cloned cDNA sequences, spliced RNAs and ESTs mapped onto the human genome.

We focused on ncRNAs that do not overlap the protein-coding genes in order to simplify the interpretation of our functional analysis of ncRNAs. This included the subtraction of all transcripts mapping to exons, introns and the antisense transcripts overlapping the protein-coding genes. We also excluded transcripts within 1 kb of the first and the last exons so as to avoid promoter- and 3′-associated transcripts (Fejes-Toth et al., 2009; Kapranov et al., 2007), which display a complicated pattern of short transcripts (Core et al., 2008; Preker et al., 2008; Seila et al., 2008). Furthermore, we excluded all known noncoding transcripts from our list of putative long ncRNAs. This analysis resulted in 3019 ncRNAs, which are annotated by HAVANA to have no coding potential, expressed from 2286 unique loci (some loci display multiple alternatively spliced transcripts) of the human genome (Experimental Procedures, Table S1 available online). The average size of the noncoding transcripts is about 800 nts, with a range from 100 nts to 9100 nts. Interestingly, the long ncRNAs display a simpler transcription unit than that of protein-coding genes (Figure S1A). Nearly 50% of our long ncRNAs contain a single intron in their primary transcript (Figure S1A). Moreover, analysis of their chromatin signatures indicated similarities with protein-coding genes. Transcriptionally active ncRNAs display histone H3K4 trimethylation at their 5′-end (Figure S1B) and histone H3K36 trimethylation in the body of the gene (Figure S1C).

Analysis of the protein-coding potential of the ncRNAs using GeneID (Blanco et al., 2007; Parra et al., 2000) shows ncRNA coding potential comparable to that of ancestral repeats (Lunter et al., 2006), supporting the HAVANA annotation of these transcripts as noncoding (Figure 1A). Moreover, comparison of ncRNAs with protein-coding genes and control sequences corresponding to ancestral repeats (Lunter et al., 2006) reveals that ncRNA sequence conservation is lower than that of protein-coding genes, but higher than that of ancestral repeats (Figure 1B). A similar case is seen with the promoter regions (Figure 1C). These results are in concordance with previous observations in the mouse genome (Guttman et al., 2009; Ponting et al., 2009).

Next we used custom-made microarrays (Experimental Procedures) which were designed to include an average of six probes (nonrepetitive sequences) against each ncRNA transcript to detect their expression. We analyzed the expression pattern of ncRNAs using three different human cell lines (Figure 1D). Overall, we detected 1167 ncRNAs expressed in at least one of the three cell types and 576 transcripts common among the three cell types (Figure 1D). We validated the expression of 16 ncRNAs that mapped to the 1% of the human genome investigated by the original ENCODE study (Birney et al., 2007) using quantitative polymerase chain reaction (qPCR) in three different cell lines (Table S2). Furthermore, we could find evidence for expression of 80% of our noncoding transcripts in at least one human tissue in a recent high-throughput sequencing of the human transcriptome (Wang et al., 2008).

To assess whether ncRNAs respond to cellular differentiating signals, we induced the differentiation of human primary keratinocytes using 12-O-tetradecanoylphorbol 13-acetate (TPA). We monitored the expression of ncRNAs using custom microarrays. Expression of protein-coding genes was monitored using conventional Agilent arrays containing nearly all human mRNAs. We prepared RNA from human primary keratinocytes before and following treatment with TPA. As shown in Figure 2A and Table S3, we could detect 687 ncRNAs in keratinocytes, of which 104 (or 15.1%) respond to TPA treatment by over 1.5-fold. Similarly, 21.3% of protein-coding genes display a change in expression of over 1.5-fold (Figure 2B). While around half of the TPA-regulated protein-coding genes increase and a similar proportion decrease their expression following differentiation, 70% of the TPA-regulated ncRNAs increase their expression whereas only 30% show a decrease (Figures 2A and 2B). Furthermore, analysis of the protein-coding genes in the 500 kb window surrounding the TPA-regulated ncRNAs indicates a significant enrichment in genes involved in differentiation and morphogenesis (Figure 2C). An example of such a change in expression of an important gene involved in the extracellular matrix is shown in Figure 2D.
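The fold-change bookkeeping used above (a transcript counts as responsive when it changes by more than 1.5-fold in either direction) is a simple symmetric classification; a minimal sketch with made-up expression ratios:

```python
def classify(fold_changes, cutoff=1.5):
    """Split expression ratios (treated / control) into induced and
    repressed sets at a symmetric fold-change cut-off."""
    induced = [f for f in fold_changes if f >= cutoff]
    repressed = [f for f in fold_changes if f <= 1 / cutoff]
    return induced, repressed

# Hypothetical ratios for six transcripts before/after TPA treatment.
ratios = [2.1, 1.6, 1.1, 0.9, 0.5, 0.65]
ind, rep = classify(ratios)
frac_responding = (len(ind) + len(rep)) / len(ratios)
```

Applying the same cut-off at 2-fold instead of 1.5-fold gives the stricter counts shown in the bar-plots of Figures 2A and 2B.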
The Extracellular Matrix Protein 1 (ECM1) gene and an ncRNA adjacent to it displayed a 5- and 1.7-fold induction following TPA treatment, respectively (Figure 2D, upper panel). qPCR analysis shows the TPA-mediated induction of ECM1 and the ncRNA as 14- and 4-fold, respectively (Figure 2D, bottom panel). Taken together, we found that many of the GENCODE-annotated transcripts are expressed in multiple cell lines and that they display gene expression responsiveness to differentiation signals.

Noncoding RNAs Display a Transcriptional Activator Function

To assess the function of our set of long ncRNAs, we reasoned that, similar to the function of long ncRNAs at the imprinting loci, our collection of ncRNAs may act to regulate their neighboring genes. To test this hypothesis, we used RNA interference to deplete a set of ncRNAs. We initially chose ncRNAs that showed a differential expression following keratinocyte differentiation. However, to obtain a reproducible knockdown we had to use cell lines that are permissive to transfection by siRNAs. We used five different cell lines for our analyses in which the candidate ncRNAs display a detectable expression (Figure 3). We validated the expression of our experimental set of ncRNAs and the absence of protein-coding potential using rapid amplification of 5′ and 3′ complementary DNA ends (5′ and 3′ RACE), PCR and in vitro translation (Figure S3). These experiments confirmed the expression of ncRNAs and showed that they do not yield a product in an in vitro translation assay (Figures S3A and S3B), supporting the noncoding annotation of our set of ncRNAs. In two cases, the ncRNAs adjacent to the Snai2 and TAL1 loci, we found evidence of a longer ncRNA transcript than that annotated by HAVANA (Figure S3).
We began by examining small interfering RNAs (siRNAs) against the ncRNA next to ECM1 in order to assess its functional role following its depletion (for reasons that will follow, this class of RNA is designated noncoding RNA-activating 1 through 7, ncRNA-a1–7). HEK293 cells were used for these experiments because of the ease of functional knockdown and the detectable amounts of ncRNA-a1 and ECM1 in this cell line. We compared the results obtained using two siRNAs against ncRNA-a1 to data obtained following the transfection of two control siRNAs (for visual simplicity only one siRNA is shown in Figure 3A; the values for both siRNAs can be seen in Table S4). The two siRNAs produced comparable results. We interrogated a 300 kb window around ncRNA-a1 containing six protein-coding genes using qPCR. Surprisingly, unlike the silencing action of long ncRNAs in imprinting and X-inactivation, depletion of ncRNA-a1 adjacent to ECM1 resulted in a concomitant decrease in expression of the neighboring ECM1 gene (Figure 3A). This effect was specific, as we did not detect any change in the other protein-coding genes surrounding ncRNA-a1 (Figure 3A). To ascertain that ncRNA-a1 is not a component of the ECM1 3′ untranslated region, we used primer pairs spanning the ECM1 and ncRNA-a1 genes. We were not able to detect a transcript comprised of the two genes in HEK293 cells, supporting the contention that the two transcripts are independent transcriptional units (Figure S2A). Furthermore, published ChIP experiments (Euskirchen et al., 2007) show the presence of RNA polymerase II and trimethyl H3K4 peaks at the transcription start site of ncRNA-a1 in several cell lines, further attesting to an independent transcriptional start site for ncRNA-a1.
Moreover, knocking down the ECM1 gene did not affect the expression level of ncRNA-a1 or any of the other protein-coding genes analyzed in the locus, further supporting the independence of the ECM1 transcript from that of ncRNA-a1 (Figure S2B). Next we analyzed ncRNA-a2, flanking the histone demethylase JARID1B/KDM5B, which also shows increased expression following keratinocyte differentiation. These experiments were performed in HeLa cells as they showed detectable expression of ncRNA-a2. Interestingly, while depletion of ncRNA-a2 did not change JARID1B/KDM5B levels, KLHL12, a gene on the opposite strand known for its negative regulation of the Wnt-beta-catenin pathway, displayed a significant reduction (Figure 3B). Although the decrease in KLHL12 was small (about 20%), no other protein-coding gene in the locus displayed a difference in expression (Figure 3B). To extend our findings and to determine whether regulation of neighboring protein-coding genes is a common function of ncRNAs, we interrogated ncRNA-a3, flanking the stem cell leukemia gene (SCL, also called TAL1). TAL1 is a basic helix-loop-helix protein which serves as the master regulator of hematopoiesis (Lecuyer and Hoang, 2004). This locus contains two ncRNAs on different strands of DNA. We used MCF-7 cells to assess the depletion of ncRNA-a3, since the expression of ncRNA-a3 and TAL1 could be readily detected in these cells. However, neither PDZK1IP1 nor ncRNA-a4 could be detected by qPCR in MCF-7 cells. Depletion of ncRNA-a3 resulted in a specific and potent reduction of TAL1 expression (Figure 3C). While depletion of ncRNA-a3 did not affect either the STIL or CMPK1 genes, a significant reduction in the CYP4A11 gene on the opposite strand of the DNA was detected (Figure 3C). We next turned our attention to ncRNA-a4, which was not expressed at a detectable level in MCF-7 cells.
We could reliably detect ncRNA-a4 in Jurkat cells. While we could not efficiently knock down ncRNA-a3 in Jurkat cells, siRNAs specific to ncRNA-a4 reproducibly reduced its levels by about 50% (Figure 3D).

Figure 2. Long ncRNAs Display Responsiveness to Differentiation Signals in Human Primary Keratinocytes
(A and B) Distribution of differentially expressed transcripts (dark colors) following TPA treatment for long ncRNAs (A) and mRNAs (B). Lighter colors show the total number of transcripts; darker colors and percentages show the number of differentially expressed transcripts. Bar-plots show the number and fraction of transcripts induced (red) or repressed (green) at different fold-change cut-offs. (C) Gene ontology analysis of genes flanking the differentially expressed long ncRNAs (red) compared to genes flanking random positions (black). (D) Graphic representation of a locus with induction of the long ncRNA ncRNA-a1 and the adjacent ECM1 gene, with expression values from microarrays (upper panel) and qPCR quantification of transcripts (lower panel). Microarray experiments and qPCR validation are done in four replicates. Data shown are mean ± SD. See also Figure S2 and Table S3.
Figure 3. Stimulation of Gene Expression by Activating RNAs
The thick black line representing each gene shows the span of the genomic region including exons and introns. The targeted activating RNAs are shown in red. Bar-plots show RNA levels as determined by qPCR. (A) ncRNA-a1 locus in HEK293 cells. (B) ncRNA-a2 locus in HeLa cells. (C) ncRNA-a3 locus in MCF-7 cells. (D) ncRNA-a4 locus in Jurkat cells. (E) ncRNA-a5 locus in HeLa cells. (F) ncRNA-a6 locus in A549 cells. All values are relative to GAPDH expression and relative to control siRNA transfected cells set to an average value of 1. The scale bar represents 100 kb and applies to all figure panels. Error bars show mean ± SEM of at least three independent experiments. *p < 0.05, **p < 0.01, ***p < 0.001 by two-tailed Student's t test. See also Figure S3 and Table S4.

Importantly, reduced levels of ncRNA-a4 resulted in a consistent and significant decrease in the level of the CMPK1 gene, which is over 150 kb downstream of ncRNA-a4 (Figure 3D). We do not detect any changes in the other protein-coding genes surrounding ncRNA-a4. Next we depleted ncRNA-a5, which is adjacent to the E2F6 gene, an important component of a polycomb-like complex (Ogawa et al., 2002). Knockdown of ncRNA-a5 did not affect the E2F6 gene.
However, depletion of ncRNA-a5 resulted in a specific reduction in the expression levels of ROCK2, which is located upstream of ncRNA-a5, in HeLa cells (Figure 3E). Finally, we examined the Snai1 and Snai2 loci in A549 cells (Figure 3F and Figure 4). The Snail family of transcription factors is implicated in the differentiation of epithelial cells into mesenchymal cells (epithelial-mesenchymal transition) during embryonic development (Barrallo-Gimeno and Nieto, 2005; Savagner, 2001). Snai2 shows a significant reduction in expression when the adjacent ncRNA-a6 is depleted, an effect that is not seen on EFCAB1, the only other protein-coding gene within 300 kb of ncRNA-a6 (Figure 3F).

In total, we have examined 12 loci where we were able to efficiently knock down the ncRNAs using siRNAs (Table S5). We were able to show that in 7 cases, the ncRNA acts to potentiate the expression of a protein-coding gene within 300 kb of the ncRNA. It is possible that the remaining ncRNAs, which did not display a positive effect on the neighboring genes within the 300 kb window, exert their action over longer distances, which was not assessed in our analysis. Taken together, our results indicate that a subset of ncRNAs has activating functions, and therefore we have named them ncRNA-activator (ncRNA-a), followed by a number to distinguish each activating long ncRNA.

Figure 4. Knockdown of ncRNA-a7 Specifically Targets Snai1 Expression
(A) As in Figure 3, the ncRNA-a7 locus is depicted showing effects on RNA levels for the surrounding genes with and without knockdown of ncRNA-a7. The results represent mean ± SEM of at least six independent experiments. **p < 0.01 by one-tailed Student's t test. (B) Migration assay of A549 cells with control (right panel) or ncRNA-a7 (left panel) siRNA transfections. (C) Quantification of the data shown in (B). Experiments in (B) and (C) are done in three replicates and are shown as mean ± SEM. ***p < 0.001 by two-tailed Student's t test. See also Figure S4 and Table S5.

ncRNA-a7 Is a Regulator of Snai1

As mentioned above, Snai1 is a member of the Snail zinc-finger family, which comprises transcription factors with diverse functions in development and disease (Barrallo-Gimeno and Nieto, 2005; Nieto, 2002). The Snail gene family is conserved among species from Drosophila to human and has been shown to function as mesodermal determinant genes (Barrallo-Gimeno and Nieto, 2005; Nieto, 2002). Snail genes are regulators of cell adhesion, migration and epithelial-mesenchymal transition (EMT) (Barrallo-Gimeno and Nieto, 2005; Nieto, 2002). Analysis of the ncRNA close to the Snai1 gene provided us with an opportunity to combine our gene expression analysis with analysis of changes in cellular migration. Knockdown of ncRNA-a7 resulted in a specific reduction in Snai1 levels (Figure 4A). The expression of the four other protein-coding genes in this locus does not change following the depletion of ncRNA-a7. Concomitantly, knockdown of ncRNA-a7 has a significant phenotypic effect in cell migration assays, reducing the number of migrating cells to about 10% of that of the control (Figures 4B and 4C), consistent with the phenotypic changes following the depletion of Snai1 (Figures 4B and 4C). Since the knockdown of ncRNA-a7 or Snai1 had similar consequences on cellular migration, we assessed the effect of their depletion on gene expression in A549 cells using Agilent arrays. We could not detect the basal level of Snai1 on the array, while Snai1 was readily detectable using quantitative PCR.
Interestingly, depletion of Snai1 or ncRNA-a7 resulted in similar changes in gene expression profiles (Figure 5A and Table S6). Not only did we observe a similar trend in genes that were affected upon the knockdown of either gene, but a significant number of the upregulated genes were common to both treatments (Figures 5A and 5B). Since Snai1 is a known transcriptional repressor, depletion of Snai1 or ncRNA-a7 should result in an upregulation of Snai1 target genes. Indeed, a number of the commonly upregulated genes were direct targets of Snai1 (Figure 5C, upper panel) (De Craene et al., 2005). Depletion of either ncRNA-a7 or Snai1 also resulted in downregulation of a set of genes, with a partial overlap between the genes downregulated following the two treatments (Figure 5B). Interestingly, Aurora kinase A (AURKA), a gene that is 6 Mb downstream of ncRNA-a7, was specifically downregulated following the depletion of ncRNA-a7, suggesting a long-range effect for ncRNA-a7 (Figure 5C). Taken together, these results indicate that while the depletion of ncRNA-a7 partially mimics the gene expression profile observed following Snai1 depletion, a number of gene expression changes resulting from the ncRNA-a7 depletion occur independently of changes in Snai1. Therefore, it is likely that depletion of ncRNA-a7 may have other effects on gene expression which may be mediated through other targets in trans.

Cell 143, 46–58, October 1, 2010 ©2010 Elsevier Inc. 51

Figure 5. Microarray Analysis of Snai1 and ncRNA-a7 Knockdown
Snai1 or ncRNA-a7 were knocked down using siRNA in A549 cells and the isolated RNA analyzed on microarrays in duplicate experiments.
(A) All genes differentially expressed (>1.5-fold or <0.6-fold compared to control) in either Snai1 or ncRNA-a7 knockdown, or both, are shown clustered in a heat map according to expression profile. Numbers are log2-transformed and the color scale is shown below the heat map.
(B) Analysis of genes showing upregulation (>1.5-fold) or downregulation (<0.6-fold) in both Snai1 and ncRNA-a7 knockdowns. Numbers represent the number of genes regulated in the indicated condition.
(C and D) (C) Validation of microarray data by qPCR and (D) analysis of the Snai1 locus and targets of Snai1 upon overexpression of ncRNA-a7. ncRNA-a7 was overexpressed from a vector in A549 cells and the expression of select genes was measured by qPCR. Y axes show expression of the indicated gene relative to GAPDH. Values are normalized to those of control siRNA-transfected cells, set to 1. **p < 0.01, ***p < 0.001 by one-tailed Student’s t test. See also Table S6.

To specifically address whether ncRNA-a7 may exert its effects in trans, we assessed the gene expression changes in the Snai1 locus, as well as some of the targets that were changed by depletion of ncRNA-a7 or Snai1, following the overexpression of ncRNA-a7 (Figure 5D). Overall, we did not observe changes in gene expression for any of the ncRNA-a7 targets following its overexpression (Figure 5D; ncRNA-a7 was overexpressed 150-fold). While these results suggest that ncRNA-a7 exerts its local gene expression changes in cis, it is likely that other targets may be influenced in trans. Taken together, these experiments reveal a role for ncRNA-a in the positive regulation of expression of neighboring protein-coding genes and show that this effect is not specific to any one locus and may represent a general function for ncRNAs in mammalian cells.

ncRNA Activation of Gene Expression of a Heterologous Reporter
Previous studies have shown that distal activating sequences/enhancers can stimulate transcription when placed adjacent to a heterologous promoter, a methodology widely used to validate potential enhancers (Banerji et al., 1983, 1981; Gillies et al., 1983; Heintzman et al., 2009; Kong et al., 1997). To functionally dissect the influence of ncRNA activation on the expression of an adjacent gene, we constructed vectors with inserts containing either ncRNA-a3 and -a4 from a bidirectional promoter, ncRNA-a5, or ncRNA-a7, and placed them downstream of Firefly luciferase driven by a thymidine kinase (TK) promoter in a reporter vector (pGL3-TK-ncRNA-a) (Figure 6A). We included 1–1.5 kb upstream of the ncRNA-as to contain their endogenous promoters, and 500 bp downstream, in the reporter vector. We also produced a control vector (pGL3-TK-control) in which 4 kb of DNA without transcriptional potential was cloned downstream of Firefly luciferase, similar to the ncRNA activation reporters (Figure 6B). A vector containing Renilla luciferase was used to control for transfection efficiency. Importantly, inclusion of any of the three ncRNA-a inserts resulted in an enhancement of transcription ranging from 2- to 7-fold (Figures 6C–6E). This effect is specific, as the pGL3-TK-control vector did not enhance basal TK promoter activity (Figures 6C–6E). To demonstrate that the observed potentiation of gene expression is mediated through the action of ncRNA-a, we knocked down the ncRNA-a in question for each reporter construct using specific siRNAs (Figures 6C–6E). Interestingly, while depletion of ncRNA-a7 and ncRNA-a5 completely abolished the increased transcription, depletion of ncRNA-a3 and/or ncRNA-a4 resulted in a partial decrease in transcriptional enhancement (Figures 6C–6E). These results suggest that while the ncRNA-a play a major role in transcriptional activation, other DNA elements in the cloned ncRNA-a3/4 region may also contribute to increased transcription.

Figure 6. ncRNA-Activators Potentiate Transcription of a Reporter Gene
(A) ncRNA-a3/4, -a5 and -a7 were cloned and inserted downstream of luciferase driven by a TK promoter in a reporter plasmid as shown.
(B) Graphical representation of the inserts in the various vectors used. The pGL3-TK-Control vector contains an insert of approximately 4 kb containing no annotated evidence of transcription. The depicted inserts show exons and the transcriptional direction of the ncRNA-a.
(C–E) Luciferase reporter assays. The Firefly luciferase vectors were cotransfected with a Renilla luciferase vector (pRL-TK) as a transfection control. (C) The vector containing ncRNA-a3 and ncRNA-a4 from a bidirectional promoter, with control siRNA or siRNAs toward either of the two ncRNA-a, or both. (D) Reporter with ncRNA-a5, and (E) reporter with ncRNA-a7 inserted downstream of luciferase. X axes show relative Firefly (FL) to Renilla (RL) luciferase activity. Cotransfected siRNAs are indicated to the right of the bars. All data shown are mean ± SE from six independent experiments. *p < 0.05, **p < 0.01, ***p < 0.001 by one-tailed Student’s t test.

Dissection of the ncRNA-a7 in a Reporter Construct
An important property of enhancing sequences is their orientation independence (Imperiale and Nevins, 1984; Khoury and Gruss, 1983; Kong et al., 1997). We designed reporter constructs (Figure 7A) in which the ncRNA-a7 sequence is reversed (pGL3-TK-ncRNA-a7-RV) in order to assess its orientation independence. The ncRNA-a7-RV construct displayed transcriptional enhancing activity similar to that of the construct containing the ncRNA-a7 insert in its endogenous orientation with respect to the regulated gene (Figure 7B). To show that luciferase expression requires a promoter and that ncRNA-a7 cannot act as a proximal promoter, we deleted the TK promoter from the reporter vectors. As shown in Figure 7C, ncRNA-a7 cannot drive transcription of the Firefly luciferase in the absence of a proximal TK promoter.

Figure 7. RNA-Dependent Activation of a Reporter Gene by ncRNA-a7
(A) Properties of the ncRNA-a7-containing luciferase reporter vector.
(B, C, E, and F) Luciferase reporter assays. The Firefly luciferase vectors were cotransfected with a Renilla luciferase vector (pRL-TK) as a transfection control.
(D) Semiquantitative PCR of ncRNA-a7.
(B) Reporter experiments with the ncRNA-a7 insert reversed, as indicated in the left panel.
(C) The TK promoter driving luciferase expression was deleted from the construct; expression values are shown relative to the pGL3-TK control plasmid as a reference.
(E) Truncated reporter constructs containing the ncRNA-a7 promoter and downstream sequences, but not the ncRNA-a7 sequence [pGL3-TK-delta(ncRNA-a7)], or one with a poly(A) signal at the beginning of ncRNA-a7 to induce premature polyadenylation [pGL3-TK-ncRNA-a7-p(A)]. See also (D) for analysis of expression from these plasmids.
(F) Protein-coding sequences were inserted in place of ncRNA-a7, downstream of the ncRNA-a7 promoter. Full-length GTSF1L or ID1 sequences were used. X axes show relative Firefly (FL) to Renilla (RL) luciferase activity. All data shown are mean ± SE from six independent experiments. ***p < 0.001 by one-tailed Student’s t test.
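The differential expression analysis for Figure 5 reduces to classifying genes by fold change (>1.5-fold up, <0.6-fold down, relative to control) and intersecting the calls from the two knockdowns. A minimal sketch of that logic; the gene names come from the figure, but the fold-change values below are illustrative, not the published data:

```python
# Sketch of the Figure 5-style classification: fold changes >1.5 are called
# upregulated, <0.6 downregulated, and the two knockdown gene sets intersected.
# Fold-change values are illustrative placeholders, not the published data.

def classify(fold_changes, up=1.5, down=0.6):
    upregulated = {g for g, fc in fold_changes.items() if fc > up}
    downregulated = {g for g, fc in fold_changes.items() if fc < down}
    return upregulated, downregulated

snai1_kd = {"CDH1": 2.8, "PKP2": 2.1, "PLOD2": 0.4, "AURKA": 1.0}
ncrna_a7_kd = {"CDH1": 2.5, "PKP2": 1.9, "PLOD2": 0.5, "AURKA": 0.3}

up_a, down_a = classify(snai1_kd)
up_b, down_b = classify(ncrna_a7_kd)

print(sorted(up_a & up_b))      # upregulated in both knockdowns
print(sorted(down_b - down_a))  # down only after ncRNA-a7 depletion (cf. AURKA)
```

With these placeholder numbers, CDH1 and PKP2 come out as commonly upregulated, while AURKA is called down only in the ncRNA-a7 knockdown, mirroring the long-range effect described in the text.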
These experiments demonstrate that sequences corresponding to the ncRNA-a7 transcription unit can activate expression from a heterologous promoter in an orientation-independent manner, but cannot themselves act as a promoter. To further verify that ncRNA-a7 is the active component of the transcriptional enhancement, we constructed two reporters in which the ncRNA-a7 sequences are either deleted or shortened by placing a strong polyadenylation signal within the ncRNA-a7 genomic sequence, close to the transcriptional start site, to induce premature polyadenylation (Figures 7D and 7E). Both modifications resulted in loss of the increased gene expression compared to constructs where ncRNA-a7 is expressed (Figure 7E). Finally, to assess whether the RNA corresponding to ncRNA-a7 is critical for increased gene expression, we developed constructs in which DNA sequences corresponding to two different protein-coding genes were positioned in place of ncRNA-a7 (Figure 7F), keeping the endogenous ncRNA-a7 promoter. Neither of these constructs displayed increased gene expression compared to the control constructs (Figure 7F). Taken together, these experiments demonstrate that the potentiation of gene expression is mediated by the ncRNA-a itself and is not merely a result of transcription of the ncRNA.

DISCUSSION

We used the annotation of the human genome performed by GENCODE to arrive at a collection of long ncRNAs that are expressed from loci independent of those of protein-coding genes or previously described ncRNAs. The GENCODE annotation encompasses both protein-coding and noncoding transcripts and relies on experimental data obtained through the analysis of cDNAs, ESTs and spliced RNAs. Our collection of 3,000 transcripts corresponds to the manual curation of about a third of the human genome. Analysis of the GENCODE data indicates that nearly all of its annotated noncoding transcripts are spliced (Figure S1A).
Importantly, the median distance of an ncRNA transcript to a protein-coding gene is over 100 kb, making it unlikely that these ncRNAs are extensions of protein-coding transcripts (Figures S2C and S2D). Moreover, transcriptionally active ncRNAs display chromatin modifications similar to those seen at expressed protein-coding genes (Figures S1B and S1C). Furthermore, the analyzed ncRNAs display RNA Pol II, p300 and CBP occupancy at levels similar to those of the surrounding protein-coding genes, consistent with their transcriptional independence (Figure S4). Although our analysis focused on understanding the function of a set of ncRNAs annotated by GENCODE, the human transcriptome includes other forms of ncRNAs with important regulatory functions that were not included in our study. These include antisense transcripts arising from protein-coding genes, precursors of microRNAs, and a wealth of unspliced transcripts described in multiple studies (Guttman et al., 2009; Kapranov et al., 2007; Rinn et al., 2007). Taken together, the novelty of our work lies in the following. First, we show that at multiple loci of the human genome, depletion of a long ncRNA leads to a specific decrease in the expression of neighboring protein-coding genes. Previous studies analyzing the function of long ncRNAs in X-inactivation or imprinting point to a role in silencing of gene expression (Mattick, 2009). Second, we show that the enhancement of gene expression by ncRNAs is not cell specific, as we observe the effect in five different cell lines. Third, this enhancement of gene expression is mediated through RNA, as depletion of such activating ncRNAs abrogates the increased transcription of the neighboring genes. Fourth, through the use of heterologous reporter assays, we suggest that activating ncRNAs mediate this RNA-dependent transcriptional responsiveness in cis.
Fifth, we show that, similar to classically defined distal activating sequences, ncRNA-mediated activation of gene expression is orientation independent. Sixth, we present evidence that, like defined activating sequences, ncRNAs cannot drive transcription in the absence of a proximal promoter. Finally, we demonstrate that the activation of gene expression in the heterologous reporter system is mediated through RNA, as multiple approaches that deplete the RNA lead to abrogation of the stimulatory response. Therefore, we have uncovered a new biological function for a class of ncRNAs in human cells: the positive regulation of gene expression. There are previous reports of individual ncRNAs having a positive effect on gene expression. The 3.8 kb Evf-2 ncRNA was shown to form a complex with the homeodomain-containing protein Dlx2, leading to transcriptional enhancement (Feng et al., 2006). Similarly, the ncRNA HSR1 (heat-shock RNA-1) forms a complex with HSF1 (heat-shock transcription factor 1), resulting in induction of heat-shock proteins during the cellular heat-shock response (Shamovsky et al., 2006), and an isoform of the ncRNA SRA (steroid receptor RNA activator) functions to coactivate steroid receptor responsiveness (Lanz et al., 1999). Our finding that activating ncRNAs positively regulate gene expression extends these previous studies and demonstrates that the activation of gene expression by long ncRNAs may be a general function of this class. Whether the ncRNA effects seen in our study are mediated through association with specific transcriptional activators is not known; however, this is a likely scenario given previous examples of RNA-mediated responsiveness. Other possibilities include the formation of an RNA–DNA hybrid at the locus of the ncRNA or of the protein-coding gene, which may result in enhanced binding of sequence-specific DNA-binding proteins or chromatin-modifying complexes.
A recent study uncovered a set of bidirectional transcripts (termed eRNAs) that are derived from sites in the human genome showing occupancy by CBP and RNA polymerase II and decorated by monomethylated histone H3 lysine 4 (H3K4) (Kim et al., 2010). Moreover, the authors show that the expression of such transcripts is correlated with that of their nearest protein-coding genes. There are fundamental differences between their collection of 2,000 transcripts and our GENCODE set of transcripts. First, while all their eRNAs are bidirectional, only about 1% of our ncRNAs show evidence of bidirectionality (see the example shown in the TAL1 locus). Second, our analysis of the histone modifications of a subset of ncRNAs that are expressed in lymphocytes (Barski et al., 2007) indicates the presence of H3K4 trimethylation at the transcriptional start sites and H3K36 trimethylation over the body of the gene (Figures S1B and S1C). This is in stark contrast to eRNA loci, where H3K4 trimethyl marks are absent and the predominant chromatin signature is monomethyl H3K4. Third, eRNAs are reported to be predominantly not polyadenylated. The majority of our collection of ncRNAs show evidence of polyadenylation, as they were amplified using oligo-dT-primed reactions, and furthermore 41% display a canonical polyadenylation site. Analysis of protein-coding transcripts revealed that a similar proportion (52%) contain the canonical polyadenylation sites. Finally, while we show that a set of our ncRNAs function to enhance gene expression, there is no evidence yet for eRNAs exerting a biological function. While we believe that eRNAs designate a different class of ncRNAs than the ncRNA-a described in our study, it is tempting to speculate that many of the ncRNA-a and their promoters may correspond to mammalian enhancers or polycomb/trithorax response elements (PRE/TREs).
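The polyadenylation comparison above (41% of the ncRNAs versus 52% of protein-coding transcripts carrying a canonical signal) can be approximated by scanning the 3′ end of each transcript for the canonical hexamer. A minimal sketch; the AATAAA/ATTAAA motifs, the 50-nt search window, and the toy sequences are illustrative assumptions, not the paper's exact procedure:

```python
# Sketch: fraction of transcripts carrying a canonical poly(A) signal
# (AATAAA, or its common ATTAAA variant) near the 3' end.
# Motif set, 50-nt window, and sequences are illustrative assumptions.
CANONICAL = ("AATAAA", "ATTAAA")

def has_polya_signal(seq, window=50):
    """True if a canonical hexamer occurs within the last `window` nt."""
    tail = seq[-window:].upper()
    return any(hexamer in tail for hexamer in CANONICAL)

transcripts = {
    "tx1": "GCGC" * 30 + "AATAAA" + "GCTTGC",  # signal near the 3' end
    "tx2": "ATGC" * 40,                         # no signal
}
fraction = sum(has_polya_signal(s) for s in transcripts.values()) / len(transcripts)
print(f"{fraction:.0%} carry a canonical signal")
```

A real analysis would of course operate on annotated 3′-end coordinates rather than raw sequence tails, but the counting logic is the same.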
In such a scenario, binding of polycomb or trithorax proteins to the proximal promoters of ncRNA-a would regulate the expression of the ncRNA-a, which in turn would impact the expression of the protein-coding gene at a distance. Another set of recently published ncRNAs has been termed long intervening noncoding RNAs, or lincRNAs (Guttman et al., 2009). A comparison of our ncRNAs and the lincRNAs shows that about 13% of the ncRNAs defined by GENCODE overlap the broad regions encoding a set of recently identified human lincRNAs (Khalil et al., 2009). The overlap between our ncRNAs and lincRNAs is even smaller (4%) if one considers only the exons corresponding to lincRNAs. Importantly, the studies of lincRNAs did not reveal any transcriptional effects on neighboring genes (Khalil et al., 2009). Therefore, it is likely that lincRNAs describe a distinct set of ncRNAs compared to those annotated by GENCODE. Similar to the diverse functions of proteins, ncRNAs such as lincRNAs may play other roles in regulating gene expression. The GENCODE annotation used in this study encompasses only a third of the human genome. Therefore, the number of ncRNAs in human cells is likely to grow and may equal or even surpass the number of protein-coding genes. Our selection criteria excluded all ncRNAs associated with protein-coding genes and their promoters, as well as known ncRNAs. Therefore, the repertoire of noncoding transcripts in human cells contains many more transcripts than those cataloged in this study. Specifically, there have been reports of pervasive antisense transcription, as well as transcription mapping to the promoter regions of protein-coding genes (Core et al., 2008; Denoeud et al., 2007; Kapranov et al., 2007; Preker et al., 2008; Seila et al., 2008). Whether such transcripts have biological functions similar to those described for activating ncRNAs in our study is not known.
However, it is clear that future genome-wide genetic analysis of ncRNAs in mammalian cells will begin to shed light on the different classes of ncRNAs. The precise mechanism by which our ncRNAs function to enhance gene expression is not known. We envision a mechanism by which ncRNAs, by virtue of sequence or structural homology, target the neighboring protein-coding genes to bring about increased expression. Our experimental evidence using a heterologous promoter points to a mechanism of action for activating ncRNAs operating in cis. However, genome-wide analysis following depletion of ncRNA-a7 suggested changes in gene expression that may not be related to the action of ncRNA-a7 on its local environment and may be a result of wider trans-mediated effects of ncRNA-a7. Such regulatory functions of ncRNAs could be achieved through RNA-mediated recruitment of a transcriptional activator, displacement of a transcriptional repressor, or recruitment of a basal transcription factor or a chromatin-remodeling factor. While we favor a transcription-based mechanism for ncRNA activation, effects on RNA stability cannot be excluded. Taken together, the next few years will bring new prospects for long ncRNAs as central players in gene expression.

EXPERIMENTAL PROCEDURES

Extracting Long ncRNA Data
The HAVANA annotation was downloaded using the DAS server provided by the Sanger Institute (version of July 16, 2008). We removed all annotated biotypes or functional elements belonging to specific categories such as pseudogenes or protein-coding genes. We excluded all transcripts overlapping known protein-coding loci annotated by HAVANA, RefSeq or UCSC. Transcripts falling within a 1 kb window of any protein-coding gene were also removed. Finally, we excluded all transcripts covered by known noncoding RNAs such as miRNAs (miRBase version 11.0, April 2008).
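The selection steps above (drop protein-coding and pseudogene biotypes, drop transcripts overlapping or within 1 kb of a protein-coding gene, drop already-known ncRNAs) amount to interval filtering. A simplified sketch, with hypothetical (chrom, start, end) tuples standing in for the real HAVANA/RefSeq/UCSC records:

```python
# Simplified sketch of the ncRNA selection described above.  Intervals are
# (chrom, start, end) tuples standing in for real annotation records; the
# brute-force overlap test is for illustration, not a production pipeline.

def near(tx, gene, margin=1_000):
    """True if tx overlaps gene or lies within `margin` bp of it."""
    c1, s1, e1 = tx
    c2, s2, e2 = gene
    return c1 == c2 and s1 <= e2 + margin and e1 >= s2 - margin

def select_ncrnas(transcripts, coding_genes, known_ncrnas):
    kept = []
    for name, biotype, tx in transcripts:
        if biotype in {"protein_coding", "pseudogene"}:
            continue  # excluded biotypes
        if any(near(tx, g) for g in coding_genes):
            continue  # overlapping, or within 1 kb of, a coding gene
        if any(near(tx, n, margin=0) for n in known_ncrnas):
            continue  # covered by a known ncRNA (e.g. a miRNA)
        kept.append(name)
    return kept

coding = [("chr1", 10_000, 20_000)]
known = [("chr1", 50_000, 50_100)]
txs = [
    ("candidate", "noncoding", ("chr1", 120_000, 125_000)),
    ("too_close", "noncoding", ("chr1", 20_500, 22_000)),
    ("pseudo", "pseudogene", ("chr1", 200_000, 201_000)),
]
print(select_ncrnas(txs, coding, known))  # only the distant noncoding transcript survives
```

At genome scale the same filters would be run with an interval tree or a tool such as bedtools rather than the quadratic scan shown here.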
To estimate the evolutionary constraints among mammalian sequences, we constructed the cumulative distribution of PhastCons scores for ancestral repeats (ARs), RefSeq genes and long ncRNAs. The cumulative distributions of these transcripts or repeats are plotted using a log scale on the y axis.

Cell Culture and siRNA Transfections
Human primary keratinocytes from four different biological donors were grown in keratinocyte medium (KFSM, Invitrogen). Differentiation was induced with 2.5 ng/ml 12-O-tetradecanoylphorbol-13-acetate (TPA) for 48 hr. HEK293, A549, HeLa, and MCF-7 cells were cultured in complete DMEM (GIBCO) containing 10% FBS and 1× Anti/Anti (GIBCO). Jurkat cells were cultured in complete RPMI (GIBCO) containing 10% FBS and 1× Anti/Anti (GIBCO). Migration assays were performed as previously described (Gumireddy et al., 2009). For transfections of 293, HeLa, A549, and MCF-7 cells we used Lipofectamine 2000 (Invitrogen) according to the manufacturer’s recommendations and an siRNA concentration of 50 nM. Jurkat cells were transfected using HiPerFect (QIAGEN) according to the manufacturer’s recommendations and an siRNA concentration of 100 nM.

RNA Purification, cDNA Synthesis, and Quantitative PCR
Cells were harvested and resuspended in TRIzol (Invitrogen), and RNA was extracted according to the manufacturer’s protocol. cDNA synthesis was done using MultiScribe reverse transcriptase and random primers (Applied Biosystems). Quantitative PCR was done using SYBR Green reaction mix (Applied Biosystems) and an HT7900 sequence detection system (Applied Biosystems). For all quantitative PCR reactions, Gapdh was measured as an internal control and used to normalize the data.

Cloning of pGL3-TK Reporters and Luciferase Assay
pGL3-Basic was digested with BglII and HindIII, and the TK promoter from pRL-TK was inserted into these sites. Inserts were amplified from genomic DNA and cloned into the BamHI and SalI sites 5′ to the luciferase gene.
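The text states only that Gapdh served as the internal control for qPCR normalization; the standard way to implement this is the comparative Ct (2^(-ΔΔCt)) method, sketched below. The formula choice and the Ct values are assumptions for illustration, not data from the paper:

```python
# Sketch of GAPDH normalization via the standard comparative Ct method.
# The paper says only that Gapdh was the internal control; the 2**(-ddCt)
# formula is the conventional choice, and the Ct values are illustrative.

def relative_expression(ct_gene, ct_gapdh, ct_gene_ctrl, ct_gapdh_ctrl):
    delta_treated = ct_gene - ct_gapdh            # dCt in treated cells
    delta_control = ct_gene_ctrl - ct_gapdh_ctrl  # dCt in control cells
    # Fold change relative to control (control itself evaluates to 1.0).
    return 2 ** -(delta_treated - delta_control)

# e.g. a target after knockdown: 2 cycles later than control -> 4-fold lower
print(relative_expression(ct_gene=28.0, ct_gapdh=18.0,
                          ct_gene_ctrl=26.0, ct_gapdh_ctrl=18.0))  # → 0.25
```

This matches the figure legends' convention of setting control siRNA-transfected cells to 1 and reporting other conditions relative to that baseline.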
Luciferase assays were performed in 96-well white plates using Dual-Glo (Promega) according to the manufacturer’s protocol.

Microarrays
Custom-made microarrays (Agilent) were designed based on the library of 3,019 long ncRNA sequences, with on average six probes targeting each transcript. Human whole-genome mRNA arrays were from Agilent (G4112F). Total RNA samples were converted to cDNA using oligo-dT primers. Labeling of the cDNA and hybridization to the microarrays were performed according to Agilent standard dye-swap protocols. Data analysis was done using the AFM 4.0 software. All microarrays were done in four biological replicates.

SUPPLEMENTAL INFORMATION
Supplemental Information includes Extended Experimental Procedures, four figures, and six tables and can be found with this article online at doi:10.1016/j.cell.2010.09.001.

ACKNOWLEDGMENTS
Thanks to the HAVANA team for use of their genome annotation. We also thank the CRG Genomic Facility and the Functional Genomics Core Facility at Wistar and UPenn for expertise in microarray analysis. We thank Dr. Ken Zaret for helpful discussions. U.A.O. is supported by a grant from the Danish Research Council; M.B. is supported by an HFSPO fellowship; A.G. was supported by a fellowship from the American Italian Cancer Foundation; R.G. was supported through the Spanish ministry, GENCODE U54 HG004555-01, and NIH; and R.S. was supported by a grant from NIH, GM 079091.

Received: April 23, 2010
Revised: July 1, 2010
Accepted: August 13, 2010
Published: September 30, 2010

REFERENCES
Banerji, J., Olson, L., and Schaffner, W. (1983). A lymphocyte-specific cellular enhancer is located downstream of the joining region in immunoglobulin heavy chain genes. Cell 33, 729–740.
Banerji, J., Rusconi, S., and Schaffner, W. (1981). Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell 27, 299–308.
Barrallo-Gimeno, A., and Nieto, M.A. (2005).
The Snail genes as inducers of cell movement and survival: implications in development and cancer. Development 132, 3151–3161.
Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Schones, D.E., Wang, Z., Wei, G., Chepelev, I., and Zhao, K. (2007). High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837.
Berezikov, E., and Plasterk, R.H. (2005). Camels and zebrafish, viruses and cancer: a microRNA update. Hum. Mol. Genet. 14, Spec. No. 2, R183–R190.
Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., Samanta, M., Weissman, S., et al. (2004). Global identification of human transcribed sequences with genome tiling arrays. Science 306, 2242–2246.
Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E., et al. (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816.
Blanco, E., Parra, G., and Guigo, R. (2007). Using geneid to identify genes. Curr. Protoc. Bioinformatics, Chapter 4, Unit 4.3.
Carthew, R.W., and Sontheimer, E.J. (2009). Origins and mechanisms of miRNAs and siRNAs. Cell 136, 642–655.
Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H., Helt, G., et al. (2005). Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308, 1149–1154.
Core, L.J., Waterfall, J.J., and Lis, J.T. (2008). Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science 322, 1845–1848.
De Craene, B., Gilbert, B., Stove, C., Bruyneel, E., van Roy, F., and Berx, G. (2005). The transcription factor snail induces tumor cell invasion through modulation of the epithelial cell differentiation program. Cancer Res. 65, 6237–6244.
Denoeud, F., Kapranov, P., Ucla, C., Frankish, A., Castelo, R., Drenkow, J., Lagarde, J., Alioto, T., Manzano, C., Chrast, J., et al. (2007). Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 17, 746–759.
Efroni, S., Duttagupta, R., Cheng, J., Dehghani, H., Hoeppner, D.J., Dash, C., Bazett-Jones, D.P., Le Grice, S., McKay, R.D., Buetow, K.H., et al. (2008). Global transcription in pluripotent embryonic stem cells. Cell Stem Cell 2, 437–447.
Euskirchen, G.M., Rozowsky, J.S., Wei, C.L., Lee, W.H., Zhang, Z.D., Hartman, S., Emanuelsson, O., Stolc, V., Weissman, S., Gerstein, M.B., et al. (2007). Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies. Genome Res. 17, 898–909.
Fejes-Toth, K., Sotirova, V., Sachidanandam, R., Assaf, G., Hannon, G.J., Kapranov, P., Foissac, S., Willingham, A.T., Duttagupta, R., Dumais, E., and Gingeras, T.R. (2009). Post-transcriptional processing generates a diversity of 5′-modified long and short RNAs. Nature 457, 1028–1032.
Feng, J., Bi, C., Clark, B.S., Mady, R., Shah, P., and Kohtz, J.D. (2006). The Evf-2 noncoding RNA is transcribed from the Dlx-5/6 ultraconserved region and functions as a Dlx-2 transcriptional coactivator. Genes Dev. 20, 1470–1484.
Fire, A., Xu, S., Montgomery, M.K., Kostas, S.A., Driver, S.E., and Mello, C.C. (1998). Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391, 806–811.
Gillies, S.D., Morrison, S.L., Oi, V.T., and Tonegawa, S. (1983). A tissue-specific transcription enhancer element is located in the major intron of a rearranged immunoglobulin heavy chain gene. Cell 33, 717–728.
Gumireddy, K., Li, A., Gimotty, P.A., Klein-Szanto, A.J., Showe, L.C., Katsaros, D., Coukos, G., Zhang, L., and Huang, Q. (2009).
KLF17 is a negative regulator of epithelial-mesenchymal transition and metastasis in breast cancer. Nat. Cell Biol. 11, 1297–1304.
Guttman, M., Amit, I., Garber, M., French, C., Lin, M.F., Feldser, D., Huarte, M., Zuk, O., Carey, B.W., Cassady, J.P., et al. (2009). Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223–227.
Harrow, J., Denoeud, F., Frankish, A., Reymond, A., Chen, C.K., Chrast, J., Lagarde, J., Gilbert, J.G., Storey, R., Swarbreck, D., et al. (2006). GENCODE: producing a reference annotation for ENCODE. Genome Biol. 7 (Suppl. 1), S4.1–S4.9.
Heard, E., and Disteche, C.M. (2006). Dosage compensation in mammals: fine-tuning the expression of the X chromosome. Genes Dev. 20, 1848–1867.
Heintzman, N.D., Hon, G.C., Hawkins, R.D., Kheradpour, P., Stark, A., Harp, L.F., Ye, Z., Lee, L.K., Stuart, R.K., Ching, C.W., et al. (2009). Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459, 108–112.
Imperiale, M.J., and Nevins, J.R. (1984). Adenovirus 5 E2 transcription unit: an E1A-inducible promoter with an essential element that functions independently of position or orientation. Mol. Cell. Biol. 4, 875–882.
Kapranov, P., Cheng, J., Dike, S., Nix, D.A., Duttagupta, R., Willingham, A.T., Stadler, P.F., Hertel, J., Hackermuller, J., Hofacker, I.L., et al. (2007). RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316, 1484–1488.
Khalil, A.M., Guttman, M., Huarte, M., Garber, M., Raj, A., Rivea Morales, D., Thomas, K., Presser, A., Bernstein, B.E., van Oudenaarden, A., et al. (2009). Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc. Natl. Acad. Sci. USA.
Khoury, G., and Gruss, P. (1983). Enhancer elements. Cell 33, 313–314.
Kim, T.K., Hemberg, M., Gray, J.M., Costa, A.M., Bear, D.M., Wu, J., Harmin, D.A., Laptewicz, M., Barbara-Haley, K., Kuersten, S., et al. (2010). Widespread transcription at neuronal activity-regulated enhancers. Nature.
Kong, S., Bohl, D., Li, C., and Tuan, D. (1997). Transcription of the HS2 enhancer toward a cis-linked gene is independent of the orientation, position, and distance of the enhancer relative to the gene. Mol. Cell. Biol. 17, 3955–3965.
Lanz, R.B., McKenna, N.J., Onate, S.A., Albrecht, U., Wong, J., Tsai, S.Y., Tsai, M.J., and O’Malley, B.W. (1999). A steroid receptor coactivator, SRA, functions as an RNA and is present in an SRC-1 complex. Cell 97, 17–27.
Lecuyer, E., and Hoang, T. (2004). SCL: from the origin of hematopoiesis to stem cells and leukemia. Exp. Hematol. 32, 11–24.
Lee, R.C., Feinbaum, R.L., and Ambros, V. (1993). The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75, 843–854.
Lunter, G., Ponting, C.P., and Hein, J. (2006). Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput. Biol. 2, e5.
Mattick, J.S. (2009). The genetic signatures of noncoding RNAs. PLoS Genet. 5, e1000459.
Nieto, M.A. (2002). The snail superfamily of zinc-finger transcription factors. Nat. Rev. Mol. Cell Biol. 3, 155–166.
Ogawa, H., Ishiguro, K., Gaubatz, S., Livingston, D.M., and Nakatani, Y. (2002). A complex with chromatin modifiers that occupies E2F- and Myc-responsive genes in G0 cells. Science 296, 1132–1136.
Parra, G., Blanco, E., and Guigo, R. (2000). GeneID in Drosophila. Genome Res. 10, 511–515.
Ponting, C.P., Oliver, P.L., and Reik, W. (2009). Evolution and functions of long noncoding RNAs. Cell 136, 629–641.
Preker, P., Nielsen, J., Kammler, S., Lykke-Andersen, S., Christensen, M.S., Mapendano, C.K., Schierup, M.H., and Jensen, T.H. (2008). RNA exosome depletion reveals transcription upstream of active human promoters. Science 322, 1851–1854.
Using multiple alignment methods to assess the quality of genomic data analysis

Cédric Notredame and Chantal Abergel
Information Génétique et Structurale, UMR 1889, 31 Chemin Joseph Aiguier, 13006 Marseille
Email: cedric.notredame@igs.cnrs-mrs.fr, chantal.abergel@igs.cnrs-mrs.fr

ABSTRACT

The analysis of multiple sequence alignments can generate essential clues in genomic data analysis. Yet, to be informative, such analyses require some means of estimating the reliability of a multiple alignment. In this chapter we describe a novel method allowing the unambiguous identification of the residues correctly aligned within a multiple alignment.
This method uses an index named CORE (Consistency of the Overall Residue Evaluation) based on the T-Coffee multiple sequence alignment algorithm. We provide two examples of application: one where the CORE index is used to identify correct blocks within a difficult multiple alignment, and another where the CORE index is used on genomic data to identify the proper start codon and a frameshift within one of the sequences.

Introduction

Biological analysis largely relies on the assembly of elaborate models meant to summarize our knowledge of life's complex mechanisms. For that purpose, vast amounts of data are collected, analyzed, validated and then integrated within a model. In an ideal world, an existing model would be available to explain every bit of experimental data. In the real world, this is rarely the case, and every day existing models need to be modified to accommodate new findings. Sometimes, data that cannot be explained is kept at bay until the accumulation of new evidence prompts the design of an entirely new model. Unexplained data can be viewed as the stuff inflating an inconsistency bubble. Eventually, the bubble bursts and a new model is designed. A multiple alignment is nothing less than such a model. Given a series of sequences and an alignment criterion (structural similarity, common phylogenetic origin), the multiple alignment contains a series of hypotheses regarding the relationships between the sequences it is made of. This alignment can accommodate data generated experimentally (e.g. the alignment of two homologous catalytic residues) or combine the results of various sequence analysis methods. The importance of multiple sequence alignments in the context of sequence analysis has been recognized for a long time and is so well established that most bioinformatics protocols make use of them.
Multiple alignments have been turned into profiles (Gribskov et al., 1987) and hidden Markov models (Krogh et al., 1994) to enhance the sensitivity and the specificity of database searches (Altschul et al., 1997). State-of-the-art methods for protein structure prediction depend on the proper assembly of a multiple sequence alignment (Jones, 1999), as do phylogenetic analyses (Duret et al., 1994). Over recent years, multiple sequence alignment techniques have been instrumental to improvements made in almost every key area of sequence analysis. Yet, despite its importance, the accurate assembly of a multiple sequence alignment remains a complex process: the biological knowledge and the computational abilities it requires are far beyond our current capacities. As a consequence, biologists are left to use approximate programs that attempt to assemble proper alignments without providing any guarantee that they will do so. The lack of a 'perfect', or at least reasonably robust, method explains why so many multiple sequence alignment packages exist. The variations among these packages are not only cosmetic; they include the use of very different algorithms, different parameters and, generally speaking, different paradigms. For a recent review of state-of-the-art techniques, see (Duret and Abdeddaim, 2000). Database searches, structure predictions and phylogenetic analyses are enough on their own to make multiple alignments compulsory in a genome analysis task. Yet, thanks to the sanity checks they provide, multiple alignments can also be instrumental in tackling the plague of genomic analysis: faulty data. When dealing with genomes, faulty data arise from two major sources: sequencing errors and wrong predictions. The consequence is that a predicted protein sequence may have accumulated errors both at the DNA level and when its frame was predicted (this is especially true in eukaryotic genes, where exons may be missed, added or improperly predicted).
In the worst cases, the effect of such errors will be amplified in the high-level analysis, leading to an improper analysis of the available data. On the other hand, once they have been identified, these errors are usually easily corrected, either by extra sequencing or by data extrapolation. Therefore, any method providing a reasonable sanity check that earmarks areas of a genome likely to be problematic would be a major improvement. In this chapter we will show how multiple sequence alignments can be used to carry out part of this task. For that purpose we will focus on the applications of T-Coffee, a recently described method (Notredame et al., 2000).

Generating Multiple Alignments With T-Coffee

Despite the large variety of multiple sequence alignment methods publicly available, the number of packages effectively used for data analysis is surprisingly small, and a vast majority of the alignments found in the literature are produced using only two programs: ClustalW (Thompson et al., 1994) and its X-Window implementation ClustalX. ClustalW uses the progressive alignment strategy described by Taylor (Taylor, 1988) and by Feng and Doolittle (Feng and Doolittle, 1987), refined in order to incorporate sequence weights and a local gap penalty scheme. Recently, the ClustalW algorithm was further modified in order to improve the accuracy of the produced alignments by making the evaluation of the substitution costs position-dependent. This improved algorithm is implemented in the T-Coffee package (Notredame et al., 2000). The aim of T-Coffee is to build a multiple alignment that has a high level of consistency with a library of pre-computed pair-wise alignments. This library may contain as many alignments as one wishes, and it may also be redundant and inconsistent with itself. For instance, it may contain several alternative alignments of the same sequences aligned using various gap penalties.
It may also contain alternative alignments obtained by applying different methods to the sequences. Overall, the library is a collection of alignments believed to be correct. Within this library, each alignment receives a weight that is an estimation of its biological likeliness (i.e. how much one trusts this alignment to be correct). For that purpose, one may use any suitable criterion such as percent identity, P-value estimation or any other appropriate measure. The T-Coffee algorithm uses this library in order to compute the score for aligning two residues with one another in the multiple alignment. This score is named the extended weight, because it requires an extension of the library. The extended weight takes into account the compatibility of the alignment of two residues with the rest of the alignments observed within the library; its derivation is extensively described in (Notredame et al., 2000). The principle is straightforward: in order to compute the extended weight associated with two residues R and S of two different sequences, one considers whether, when R is found aligned in the library with some residue X of a third sequence, S is also found aligned with that same residue X in another entry of the library. If that is the case, then the weight associated with R and S is increased by the minimum of the two weights RX and SX. The final extended weight is obtained once every possible X has been considered and the resulting contributions summed up. Although this operation seems very expensive from a computational point of view, its effective cost is kept low thanks to the sparseness of the primary library (i.e. for most pairs of residues RS, very few Xs need to be considered). In the end, a pair of residues is highly consistent (and has a high extended weight) if most of the other sequences contain at least one residue that is found aligned both to R and to S in two different pair-wise alignments.
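The extension step described above can be sketched in a few lines. This is a simplified illustration, not the package's actual data structures: residues are assumed to be (sequence, position) tuples, and the library a plain dict mapping residue pairs to primary weights.

```python
def extended_weight(library, r, s):
    """Extended weight of aligning residues r and s (illustrative sketch).

    Start from the direct library weight of the pair (if any), then add,
    for every third residue x found aligned to both r and s in the
    library, the minimum of the two supporting weights min(w(r,x), w(s,x)).
    `library` maps frozenset({residue, residue}) keys to weights."""
    total = library.get(frozenset((r, s)), 0)
    # Partners of r and of s found anywhere in the library.
    partners_r = {x: w for pair, w in library.items() if r in pair
                  for x in pair if x != r}
    partners_s = {x: w for pair, w in library.items() if s in pair
                  for x in pair if x != s}
    for x, w_rx in partners_r.items():
        if x == s:          # the direct pair was already counted
            continue
        if x in partners_s:
            total += min(w_rx, partners_s[x])
    return total
```

Because the primary library is sparse, only the few residues actually paired with both r and s contribute, which is what keeps the extension cheap in practice.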
A key property of this weight extension procedure is to concentrate information: the extended score of RS incorporates information coming from all the sequences in the set, and not only from the two sequences contributing R and S. The main advantage of the extended weights is that they can be used in place of a substitution matrix. While standard substitution matrices do not discriminate between two identical residues (e.g. all cysteines are the same for a PAM (Dayhoff et al., 1979) or a BLOSUM (Henikoff and Henikoff, 1992) matrix), the extended weights are truly position-specific and make it possible to discriminate between two identical residues that only differ by their positions. Once the library has been assembled (potential ways of assembling that library are described later) and the extended weights computed, T-Coffee closely follows the ClustalW procedure, using the extended weights instead of a substitution matrix. The overall T-Coffee strategy is outlined in Figure 1. All the sequences are first aligned two by two, using dynamic programming (Needleman and Wunsch, 1970) and the extended library in place of a substitution matrix. The distance matrix thus obtained is then used to compute a neighbor-joining tree (Saitou and Nei, 1987). This tree guides the progressive assembly of a multiple sequence alignment: the two closest sequences are first aligned by normal dynamic programming, using the extended weights to align the residues in the two sequences; no gap penalty is applied (because it has already been applied to generate the alignments contained in the library). This pair of sequences is then fixed, and any gaps that have been introduced cannot be shifted later. The program then aligns the next closest two sequences, or adds a sequence to the existing alignment of the first two, depending on which is suggested by the guide tree. The procedure always joins the next two closest sequences or pre-aligned groups of sequences.
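The pair-wise dynamic programming step just described can be illustrated with a bare-bones Needleman and Wunsch routine in which the substitution matrix is replaced by a table of extended weights and no gap penalty is applied. This is a sketch only; the `ew` score table and the tuple-based output are assumptions for illustration, not the package's API.

```python
def align_with_extended_weights(len_a, len_b, ew):
    """Global alignment of two sequences of lengths len_a and len_b.

    `ew[i][j]` is the extended weight for pairing residue i of the first
    sequence with residue j of the second; the gap penalty is zero, as
    in the T-Coffee progressive stage (the library already accounts for
    gaps).  Returns the optimal score and the aligned residue pairs."""
    # score[i][j]: best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            score[i][j] = max(score[i - 1][j - 1] + ew[i - 1][j - 1],
                              score[i - 1][j],   # gap in second sequence
                              score[i][j - 1])   # gap in first sequence
    # Trace back to recover the aligned pairs; ties favour pairing.
    pairs, i, j = [], len_a, len_b
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + ew[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return score[len_a][len_b], list(reversed(pairs))
```

The same recursion is reused during the progressive stage, with column-averaged extended weights standing in for the single-residue scores.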
This continues until all the sequences have been aligned. To align two groups of pre-aligned sequences, one uses the extended weights as before, but taking the average library scores in each column of the existing alignments. The key feature of T-Coffee is the freedom given to the user to build his own library following whatever protocol may seem appropriate. For this purpose, one may mix structural information with database results, knowledge-based information or pre-established collections of multiple alignments. It may also be used to explore a wide range of parameters for a given computer package. A simple library format was designed for that purpose; it is shown in Figure 2. A library is a straightforward ASCII file that contains a listing of every pair of aligned residues that needs to be described. Any knowledge-based information can easily be added manually to an automatically generated library, or the other way round. This figure also shows clearly that the library can contain ambiguities and inconsistencies (i.e. two possible alignments for the first residue of Seq1 with Seq2). These ambiguities are resolved while the alignment is being assembled, on the basis of the score given by the extended weights. The library does not need to contain a weight for each possible pair of residues. On the contrary, an ideal library only contains pairs that will effectively occur in the correct multiple alignment (i.e. N²L pairs rather than N²L² pairs). While this flexibility to design and assemble one's own library is a very desirable property, in practice it is also convenient to have a standard automatic protocol available. Such a protocol exists and is fully integrated within the T-Coffee package. It is run in the default mode and does not require the user to be aware of T-Coffee's underlying concepts (library, extension, progressive alignment).
This default protocol, extensively described and validated in (Notredame et al., 2000), requires two distinct libraries to be compiled and combined within the primary library before the extension. The first one contains a ClustalW pair-wise alignment of each possible pair of sequences within the dataset. For that purpose, ClustalW (Thompson et al., 1994) is run using default parameters. This library is global because it is generated by aligning the sequences over their whole length (global alignments) using a linear-space version of the Needleman and Wunsch algorithm (Needleman and Wunsch, 1970). The second library is local: for each possible pair of sequences, it contains the ten best non-overlapping local alignments, as reported by the Lalign program (Huang and Miller, 1991) run with default parameters. In both the local and the global libraries, each pair of residues found aligned is associated with a weight equal to the average level of identity within the alignment it came from. When a specific pair is found more than once, the weights associated with each occurrence are added. The main strength of this protocol is to combine local and global information within a multiple alignment. The level of consistency within the library will depend on the nature of the sequences. For instance, if the sequences are very diverse, the requirement for long insertions/deletions will often cause the global alignments to be incorrect and inconsistent, while the local alignments will be less sensitive to that type of problem. In such a situation, the measure of consistency will enhance the local alignment signal and let it drive the multiple alignment assembly. Conversely, if the global alignments are good enough, they will help remove the noise associated with the collection of local alignments (local alignments do not have any positional constraints).
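The weighting scheme just described, in which each aligned pair is weighted by the average percent identity of the alignment it came from and recurring pairs have their weights summed, can be sketched as follows (illustrative data layout only, not the library file format itself):

```python
def add_alignment_to_library(library, aligned_pairs, percent_identity):
    """Add one pair-wise alignment to a primary library (sketch).

    `aligned_pairs` lists the residue pairs of a single global or local
    alignment; each pair receives the alignment's average percent
    identity as its weight, and the weights of pairs that recur across
    alignments (e.g. global and local) are summed."""
    for pair in aligned_pairs:
        key = frozenset(pair)
        library[key] = library.get(key, 0) + percent_identity
    return library
```

A pair supported by both the global and the local library therefore accumulates a higher weight, which is exactly how agreement between the two signals is rewarded before the extension.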
Overall, the current default T-Coffee protocol contains three distinct elements that lead to the collection of extended weights: the global library, the local library, and the library extension that turns the sum of the two libraries into an extended library. Earlier work demonstrated that each of these components plays a significant part in improving the overall accuracy of the program. Table 1 shows that the current version of T-Coffee (version 1.29) outperforms other popular multiple sequence alignment methods, as judged by comparison on BaliBase (Thompson et al., 1999), a database of hand-made reference alignments based on structural comparison (see the Table 1 legend for a description of BaliBase and the comparison protocol). These results illustrate well the good performance of T-Coffee across the wide range of situations that occur in BaliBase. It is especially interesting to point out that T-Coffee is the only method equally well suited to situations that require a global alignment strategy (categories 1, 2 and 3) and to situations that are better served by a local alignment strategy (categories 4 and 5, with long internal and terminal insertions/deletions). The other methods are either good for global alignments (like ClustalW) or for local alignments (like Dialign2 (Morgenstern et al., 1998)). It should be noted that T-Coffee still uses ClustalW 1.69 to construct the primary global library, because this was the last 'naïve' version of ClustalW, not tuned on BaliBase. The latest version (1.81) has been tuned on the BaliBase references (hence its improved performance over the results originally reported for ClustalW). Using this ClustalW 1.81 version when benchmarking T-Coffee would make the process circular. Nonetheless, as good as it may seem, T-Coffee still suffers from the same shortcoming as any other package available today: it is not always the best method.
Even if on average it does better than any of its counterparts, one cannot guarantee that T-Coffee will always generate the best alignment. For instance, although Dialign2 is significantly less good overall, it outperforms T-Coffee on 17 test sets (11%). ClustalW is better than T-Coffee in 24% of the cases. We may conclude from this that, in practice, there will always be situations where some alternative method beats T-Coffee. Furthermore, even in cases where the T-Coffee improvement over any alternative method is very significant, it may still produce an alignment much less than 100% correct. Such an alignment is of limited use on its own: for practical purposes, it would be much more helpful to know where the correctly aligned portions lie. Indeed, a method that is only 20% correct but comes with a proper estimation of its reliability would be much more useful than a method that is merely more accurate 'on average'. Several situations exist in which a biologist can make use of this reliability information. For instance, if the purpose of the alignment is to extrapolate some experimental data onto an otherwise uncharacterized genomic sequence, one will need to be very careful not to deduce anything from an unreliable portion of the alignment. More generally, unreliable positions within a multiple sequence alignment should not be used for predictive purposes. For instance, when turning a multiple alignment into a profile in order to scan databases for remote homologues, it is essential to exclude regions whose alignment cannot be trusted and that may obscure some otherwise highly conserved position. Used in this fashion, reliability information allows a significant decrease of the noise induced by locally spurious alignments. The other important application of a reliability measure is the identification of regions within a multiple alignment that are properly aligned without being highly conserved.
These regions are extremely important when the alignment is used in conjunction with a predictive method that bases its analysis on mutation patterns. For instance, structure and phylogeny prediction methods require the presence of non-conserved positions to yield informative results. Any scheme that allows discriminating between positions that are degenerate but correctly aligned and positions that are simply misaligned may induce a dramatic improvement in the accuracy of these prediction methods. Furthermore, a reliability measure will help identify faulty data and provide some clues on how to correct it. In the next section, we show how consistency can be measured on a T-Coffee alignment and how that measure provides a fairly accurate reliability estimator.

Measuring The Consistency On A Multiple Sequence Alignment

T-Coffee is a heuristic algorithm that attempts to optimize the consistency between a multiple alignment and a list of pre-computed pair-wise alignments known as a library (Figure 2). By consistency we mean that a pair of residues described as aligned in the library will also be found aligned in the multiple alignment. While the theoretical maximum for the consistency is 100%, the score of an optimal alignment will only be equal to the level of self-consistency within the library. Figure 2 shows the example of a library that is not self-consistent, because it is ambiguous regarding the alignment of some of the residues it contains. Of course, the more ambiguous the library, the less consistency it will yield. For instance, given two residues R1 and R2 taken from two different sequences S1 and S2, one can easily measure the consistency CS(R1, R2) between the alignment of these two residues and all the other alignments contained in the library, by comparing ES(R1, R2), the extended score of the pair, with the sum of the extended scores of all the other potential pairs that involve S1 and S2 and either R1 or R2.
If we want to use it as a quality factor, this measure suffers from two major drawbacks. Firstly, it is expensive to compute: given a multiple alignment of N sequences and of length L, each pair of residues found in the multiple alignment needs O(L) extension operations, each requiring a minimum of O(N) operations. "O(L)" is the standard big-O notation, meaning that the computation time is proportional to L, up to a constant factor. Since there are L·N² pairs of residues in a multiple alignment, this leads to O(L²N³) operations for an estimate of the CS of every pair. This cubic complexity becomes problematic with large numbers of sequences. The second limitation of this measure is that with sequences rich in repeats, the summation factor can become artificially high and cause a dramatic decrease of the consistency score. In practice, we found it much more effective to use the extended score of the best scoring pair contained in the alignment as a normalization factor. This defines the aCS (approximate Consistency Score):

    CS(R1, R2) = ES(R1, R2) / [ Σx ES(R1, x) + Σy ES(y, R2) ]    (1)

    aCS(R1, R2) = ES(R1, R2) / Max{ ES(Ri, Rj) }    (2)

where the sums in (1) run over all the other potential pairs that involve S1 and S2 and either R1 or R2, and the maximum in (2) is taken over every pair of residues Ri, Rj found aligned in the multiple alignment. Our measurements on the BaliBase dataset indicate that the CS and the aCS are well correlated. An important criterion, when using the aCS as a reliability measure, is its ability to discriminate between correct and incorrect alignments within the so-called twilight zone (Sander and Schneider, 1991). Given two sequences, the twilight zone is a range of percent identity (between 0 and 30%) that has been shown to be non-informative regarding the relationship that exists between two sequences. Two sequences whose alignment yields less than 30% identity can either be unrelated, or related and incorrectly aligned, or related and perfectly aligned.
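Equation (2) can be sketched as follows. This is a toy illustration: the `es` table of extended scores and the list of aligned pairs are assumed data layouts, not T-Coffee's internals.

```python
def acs(es, pair, aligned_pairs):
    """Approximate consistency score aCS of one residue pair, eq. (2).

    The extended score of `pair` is normalised by the best extended
    score found among all the residue pairs of the multiple alignment,
    which avoids both the O(L^2 N^3) summation of eq. (1) and the
    repeat-induced inflation of its denominator.  `es` maps frozenset
    residue pairs to extended scores (sketch)."""
    best = max(es.get(frozenset(p), 0) for p in aligned_pairs)
    return es.get(frozenset(pair), 0) / best if best else 0.0
```

Normalising against a single maximum makes the measure cheap to compute once the extended scores of the alignment's pairs are known.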
To check how good the aCS is when used as an accuracy measure, each of the 142 BaliBase datasets was aligned using T-Coffee 1.29 and the similarity of each pair of sequences was measured within the obtained alignments. Pairs of sequences with less than 30% identity (5088 pairs) were extracted, and the accuracy of their alignment was assessed by comparison with their counterparts in the reference BaliBase alignment; the aCS score was also computed on each pair of aligned residues and averaged along the sequences. Figure 3a shows the scatter graph of identity versus accuracy (see the Figure legend for the definitions). There is only a weak correlation between these two measurements, and the percent identity is a poor predictor of the alignment accuracy. For 75% of the sequence pairs (identity lower than 25%), the accuracy indication given by the percent identity falls within a 40% range (i.e. the average identity indicates the average accuracy +/- 20%). On the other hand, when the accuracy is plotted against the aCS score (Figure 3b), the correlation is improved, and for pairs of sequences having an aCS higher than 20 (which is true for 60% of the 5088 pairs) this measure is a much better predictor of alignment accuracy than the percent identity. While they do not solve the twilight zone problem, these results indicate that the aCS measure provides us with a powerful means of assessing an alignment's reliability within the twilight zone. Nonetheless, from a practical point of view, the aCS measure by itself is of limited use, since one is often more concerned with the overall quality (i.e. is residue r of sequence S correctly aligned with the rest of the sequences?) than with pair-wise relationships. In order to answer this type of question, the aCS measure was used to derive three very useful non-pair-wise indexes.
The Consistency of the Overall Residue Evaluation (CORE) is obtained by averaging the scores of each of the aligned pairs involving a residue within a column:

    CORE(Rx) = Σ y=1..N, y≠x aCS(Rx, Ry) / (N − 1)    (3)

where Rx and Ry are two residues found aligned in the same column of an alignment of N sequences. The CORE index and equivalent approaches have been shown in the literature to be good indicators of the local quality of a multiple sequence alignment (Heringa, 1999; Notredame et al., 1998), as judged by comparison with reference biological alignments. In the T-Coffee package, an option makes it possible to output multiple alignments with the CORE index (a rounded value between 0 and 9) replacing each residue. It is also possible to produce a colorized version (pdf, postscript or html) of that same multiple alignment, where residues receive a background coloration proportional to their CORE index (blue/green for low-scoring residues and orange/red for high-scoring ones). Such outputs are shown in Figures 5 and 6. The CORE index described in equation (3) is merely an average aCS measure, and whether that measure provides some indication of the multiple alignment quality is a key question. We tested that hypothesis on the complete BaliBase dataset. Given each T-Coffee alignment, residues were divided into four categories: (i) true positives (TP) are correctly aligned residues rightfully predicted to be so; (ii) true negatives (TN) are incorrectly aligned residues rightfully predicted to be so; (iii) false positives (FP) are residues predicted to be well aligned when they are not; (iv) false negatives (FN) are residues wrongly predicted to be misaligned. Following previously described definitions (Notredame et al., 1998), a residue is said to be correctly aligned if at least 50% of the residues to which it is aligned in the reference alignment are found in the same column in the T-Coffee alignment.
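A minimal sketch of equation (3) follows. The rescaling of the average aCS onto the 0-9 display range is an assumption made for illustration; the package's exact rounding rule may differ.

```python
def core_index(acs_scores):
    """CORE index of one residue, eq. (3) (sketch).

    `acs_scores` lists the aCS values (assumed to lie in [0, 1]) of the
    N-1 aligned pairs the residue forms with the other residues of its
    column.  The average is rescaled to the 0-9 range used in
    T-Coffee's colourised output (rescaling rule assumed here)."""
    if not acs_scores:
        return 0
    mean = sum(acs_scores) / len(acs_scores)
    return min(9, max(0, round(mean * 9)))
```

Averaging over the whole column is what turns the pair-wise aCS into a per-residue reliability index.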
Each of the 10 CORE indexes (0 to 9) was used in turn as a threshold to discriminate between correctly and incorrectly aligned residues in the T-Coffee alignments. The BaliBase reference alignments were then used to evaluate the TP, TN, FP and FN counts. Sensitivity and specificity were then computed according to Sneath and Sokal (Sneath and Sokal, 1973) and plotted on a graph (Figure 4). Our results indicate that the best trade-off between sensitivity and specificity is obtained when CORE=3 is used as the threshold (i.e. every residue with a score higher than or equal to 3 is considered to be properly aligned). In that case the specificity is 84% and the sensitivity is 82%. These high figures partly reflect the overall quality of the T-Coffee alignments, in which 80.5% of the residues are correctly aligned according to the criterion used here. It is therefore more interesting to note that when the CORE index reaches 7, the specificity is 97.7% and the sensitivity is close to 50%. This means that, thanks to the CORE index, half of the residues properly aligned in a multiple alignment can be unambiguously identified (i.e. more than 40% of all the residues contained in BaliBase). In the next section we will see that this proper identification sometimes occurs in cases that are far from trivial, even for an expert eye. Similar results were observed when applying the CORE index to multiple alignments obtained using another method (i.e. ClustalW alignments evaluated with a standard T-Coffee library). This suggests that the CORE measure may be used to evaluate the local quality of a multiple alignment produced by any source. However, one should be well aware that the relevance of the CORE measure regarding the reliability of an alignment is entirely dependent on the way in which the library was derived. All the conclusions drawn here only apply to libraries derived using the standard T-Coffee protocol.
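The threshold scan described above can be sketched as below, using one common definition of sensitivity and specificity; the chapter's exact Sneath and Sokal formulas may differ slightly.

```python
def sensitivity_specificity(scored_residues, threshold):
    """Sensitivity and specificity of the CORE index as a predictor of
    residue correctness at a given threshold (sketch).

    `scored_residues` holds (core_index, is_correct) tuples, where
    is_correct comes from comparison with a reference alignment; a
    residue with core_index >= threshold is predicted correctly aligned."""
    tp = sum(1 for c, ok in scored_residues if c >= threshold and ok)
    fn = sum(1 for c, ok in scored_residues if c < threshold and ok)
    tn = sum(1 for c, ok in scored_residues if c < threshold and not ok)
    fp = sum(1 for c, ok in scored_residues if c >= threshold and not ok)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```

Scanning the ten possible thresholds with this function is enough to reproduce a trade-off curve of the kind shown in Figure 4.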
The sequence CORE (sCORE) is obtained by averaging the CORE scores over all the residues contained within one sequence:

    sCORE(Sx) = Σ i=1..L CORE(Ri) / L    (4)

where the sum runs over the L residues Ri of sequence Sx. That measure can be helpful for identifying an outlier among the sequences: a sequence that should not be part of the set, either because it is not homologous or because it is too distantly related to the other members to yield an informative alignment. The alignment CORE (alCORE) may be obtained by averaging the sCOREs over all the sequences. Our analysis suggests that this index may be a reasonable indicator of the overall alignment accuracy. Yet, to be fully informative, it requires the sequence set to be homogeneous (i.e. the standard deviation of the sCOREs should be as low as possible).

Using The CORE Measure To Assess Local Alignment Quality

The driving force behind the development of the CORE index is the identification of correctly aligned blocks of residues within a multiple sequence alignment. It is common practice to identify these blocks by scanning the multiple alignment and marking highly conserved regions as potentially meaningful. ClustalW and ClustalX provide a measure of conservation that may help the user when carrying out this task. Unfortunately, situations exist where it is difficult to make a decision regarding the correct alignment of some residues within an alignment. Such an example is provided in Figure 5 with the BaliBase alignment known as 1pamA_ref1, made of 6 alpha-amylases. This set is difficult to align because it contains highly divergent sequences. Not only have these sequences accumulated mutations while they were diverging, but they have also undergone many insertions/deletions that make it difficult to reconstruct their relationships with accuracy. The average level of identity measured on the BaliBase reference is 18%, the two closest sequences being less than 20% identical.
As such, 1pamA_ref1 constitutes a classic example of a test set deceptive for most multiple sequence alignment methods. The fact that less than one third of the 1pamA_ref1 reference alignment is annotated as trustable in BaliBase confirms that suspicion. When run on this test set, existing alignment programs generate different results: Prrp finds 37% of the columns annotated as trustable in BaliBase, ClustalW (1.81) 40%, T-Coffee 54% and Dialign2 56%. Regardless of the method used, such an alignment is completely useless unless the correctly aligned portions can be identified. This is exactly the information that the CORE index provides us with. An alignment colorized according to its CORE indexes is shown in Figure 5. The results are in good agreement with those reported in Figure 4. Out of the 905 correctly aligned residues (42% of the total), 267 have a score higher than 7. No incorrectly aligned residue has a score higher than or equal to 7. Using 7 as a prediction threshold gives a sensitivity of 29% and a specificity of 100%. Residues with a CORE index of 3 or higher (pale yellow) yield a sensitivity of 65% and a specificity of 79%. In this alignment, the main features are the red/dark-orange blocks: they are 100% correct. These blocks could be fed as they are to any suitable method (structure prediction, phylogeny, etc.). They are not very well conserved at the sequence level and are therefore very informative for structural and phylogenetic analysis. For instance, block II in Figure 5 is perfectly aligned although, within that block, the average pair-wise identity is lower than 30% (41% for the two most closely related sequences). The measure of consistency can also help question positions that may seem unambiguous from a sequence point of view.
In the column annotated as I, the position marked with a “*” could easily be mistaken for a correct one: it is within a block, and aromatic positions are usually fairly well conserved and, owing to their relative rarity, unlikely to occur by chance. Yet the green color code indicates that this position may be incorrectly aligned (the green tyrosine has a CORE index of 1). This is confirmed by comparison with the reference, which shows that the correct alignment incorporates another tyrosine at this position. When analyzing these patterns, one should always keep in mind that the consistency information only has a positive value. In other words, inconsistent regions are those where the library does not support the alignment. This does not mean they are incorrectly aligned, but rather that no information is at hand to support or disprove the observed alignment.

Identifying Faulty Gene Predictions

Another possible application of the T-Coffee CORE index is to reveal and help resolve sequence ambiguities in predicted genes. In the structural genomics era, many projects involve hypothetical proteins, for which an accurate prediction of the start and stop codons is needed to properly express the gene product. Since over-predicted N- or C-termini are rarely conserved at the amino acid level, sequence comparison provides us with a very powerful means of identifying this type of problem. A simple procedure consists of multiply aligning the most conserved members of a protein family and then measuring the T-Coffee CORE index on the resulting alignment. Inspection of the CORE patterns offers a diagnostic regarding the correctness of the data. This approach can also be applied to frame-shift detection, where the identification of abnormally low-scoring segments may lead to their correction. Such an alignment will make it possible to decide whether the abnormal length of a coding region could be due to a sequencing error (and the resulting frame-shift).
At the very least, the CORE measure will indicate that a thorough examination is needed. Of course, one could also detect these frame-shifts using standard pair-wise comparison methods such as GeneWise (Birney and Durbin, 2000), but the advantage of using a multiple sequence alignment is that the simultaneous comparison of several sequences can strengthen the evidence that the frame-shift is real. Furthermore, thanks to the multiple alignment, one may be able to detect mistakes in sequences that lack a very close homologue. To illustrate this potential usage of T-Coffee, we chose the example of an Escherichia coli K-12 gene (Accession # U00096) predicted to encode a protein of unknown function, yifB. Orthologous genes were found in complete genomes using BLAST (Altschul et al., 1997) and the four most conserved sequences (identity >70% relative to the Escherichia coli K-12 gene, see figure for ID numbers) were retrieved along with their flanking regions (80 nucleotides on the N-terminus side), in order to check whether these supposedly non-coding regions contained any coding information. The ‘elongated’ sequences were translated in the same frame as their core coding region, their multiple alignment was carried out using T-Coffee and the CORE indexes were measured. The resulting alignment is displayed in Figure 6 with the CORE indexes color-coded (low CORE in blue and green, high CORE in orange and red). The main feature on the N-terminus is an abrupt transition (II) from low to high CORE indexes. This position is also a conserved methionine. The combination of these two observations suggests that the starting point of these five sequences is probably where the transition occurs, ruling out other methionines as potential starting points in the first sequence (I). Another discrepancy in this alignment is also emphasized by the CORE analysis: the sequence yifB_SALTY_1 yields a very low N-terminal CORE index relative to the other family members.
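The screening step described above, flagging abnormally low-scoring segments as frame-shift or terminus candidates, can be sketched with a simple sliding window; the window size and cutoff below are arbitrary choices, not values from the text:

```python
def low_core_segments(core_indexes, window=10, cutoff=3.0):
    """Return start positions of windows whose mean CORE index falls below
    the cutoff; a run of flagged windows in an otherwise well-scoring
    sequence is worth re-examining for a frame-shift or a bad terminus."""
    flagged = []
    for i in range(len(core_indexes) - window + 1):
        if sum(core_indexes[i:i + window]) / window < cutoff:
            flagged.append(i)
    return flagged

# A made-up sequence: well-aligned stretch (CORE 8), a low-scoring stretch
# (CORE 1), then well-aligned again; flagged windows bracket the bad region.
print(low_core_segments([8] * 20 + [1] * 10 + [8] * 20))   # [18, 19, 20, 21, 22]
```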
The CORE score of this sequence abruptly comes into phase with the other sequences at the position marked III. This pattern is a clear indication of a frame-shift: a protein highly similar to the other members of its family but locally unrelated. To verify that hypothesis, we used data provided by SwissProt (Bairoch and Boeckmann, 1992) and found that in the corresponding entry, the nucleotide sequence has been corrected to remove the frame-shift we observed (entry P57015). The corrected sequence has been added to the bottom of the alignment in Figure 6 (non-colored sequence). The position where yifB_SALTY_1 and its corrected version start agreeing is also the position where the CORE score changes abruptly from a value of 2 (yellow) to a value of 7 (orange). That position also turns out to be the one where the frame-shift occurs in the genomic sequence.

Conclusion

In this chapter, we introduced an extension of the T-Coffee multiple sequence alignment method: the CORE index. The CORE index is a means of assessing the local reliability of a multiple sequence alignment. Using the CORE index, correct blocks within a multiple sequence alignment can be identified. This measure also makes it possible to detect potential errors in genomic data, and to correct them. The CORE index is a relatively ad hoc measure and, even if it may prove extremely useful from a practical point of view, it still needs to be attached to a more theoretical framework. One would really need to be able to turn the consistency estimation into some sort of P-value. For instance, to assess efficiently the local value of an alignment, one would like to ask questions of the following kind: what is the probability that library X was generated using dataset Y? What is the probability that alignment A yields p% consistency with library X? Altogether, these questions may open more avenues for the automatic processing of multiple alignments.
That issue may prove crucial for the maintenance of resources that rely on a large-scale usage of multiple sequence alignments, such as HOBACGEN (Perriere et al., 2000), HOVERGEN (Duret et al., 1994) or ProDom (Corpet et al., 2000).

Figure Legends

Figure 1: Layout of the T-Coffee algorithm. This figure indicates the chain of events that leads from unaligned sequences to a multiple sequence alignment using the T-Coffee algorithm. Data processing steps are boxed while data structures are indicated by rounded boxes.

Figure 2: Library Format. An example of a library used by T-Coffee. The header contains the sequences and their names. ‘# 1 2’ indicates that the following pairs of residues will come from sequences 1 and 2. Each pair of aligned residues contains three values: the index of residue 1, the index of residue 2 and the weight associated with the alignment of these two residues. No order or consistency is expected within the library.

Figure 3: a) Percentage identity vs accuracy in the twilight zone: the 5088 pairs of sequences that have less than 30% identity in the BaliBase reference alignments were extracted. The accuracy of their alignment was measured by comparison with the reference, and the resulting graph was plotted. b) Approximate Consistency Score (aCS) vs accuracy in the twilight zone: the aCS was measured on the 5088 pairs of sequences previously considered and was plotted against the average accuracy previously reported. The vertical line indicates aCS=25 and separates the pairs for which the aCS is informative from those whose aCS seems to be non-informative.

Figure 4: Specificity and sensitivity of the CORE measure. The sensitivity and the specificity of the CORE index used as an alignment quality predictor were evaluated on the BaliBase test sets. Measures were carried out on the entire BaliBase dataset.
The sensitivity and the specificity (shown as the two curves) were measured on the T-Coffee alignments after considering that every residue with a CORE index higher than x was properly aligned (see text for details).

Figure 5: Identifying correct blocks with the CORE measure. An example of the T-Coffee output on a BaliBase test set (1pamA_ref1) that contains five alpha-amylases. This alignment was produced using T-Coffee 1.29 with default parameters and requesting the score_pdf output option. The color scale goes from blue (CORE=0, bad) to red (CORE=9, good). The residues in capitals are correctly aligned (as judged by comparison with the BaliBase reference). Those in lower case are improperly aligned. Box I indicates a conserved position that is not properly aligned; box II indicates a block of distantly related segments that is correctly aligned by T-Coffee.

Figure 6: Identifying frame shifts and start codons. The chosen sequences are YifB_ECOLIA (Escherichia coli, Accession # AE005174), YifB_SALTY_1 (Salmonella typhi, C18 chromosome, Sanger Center), YifB_HAIN (Haemophilus influenzae, Accession # L42023), YifB_PASMU (Pasteurella multocida, Accession # AE004439) and YifB_PSEAE (Pseudomonas aeruginosa, Accession # AE004091). They were aligned using the standard T-Coffee alignment procedure, requesting the score_pdf output option. The corrected Salmonella typhi YifB protein sequence was later added for further reference (YifB_SALTY, SP: P57015), but it was not used for coloring the residues (non-colored sequence) or for improving the multiple alignment. The figure only shows the N-terminal portion of the alignment, and the arrows indicate the positions annotated as start codons in SwissProt (except for Salmonella typhi). Box I indicates a putative start codon in YifB_ECOLIA, box II indicates the true start codon in most sequences, and box III indicates the position where the frame-shift occurs in YifB_SALTY_1.
Table 1

            cat 1   cat 2   cat 3   cat 4   cat 5   avg 1   avg 2
cw          79.53   32.91   48.72   74.02   67.84   67.89   61.82
prrp        78.62   32.45   50.14   51.12   82.72   66.45   60.25
dialign2    70.99   25.21   35.12   74.66   80.38   61.54   57.99
T-Coffee

To produce this table, each dataset contained in BaliBase was aligned using one of the following methods: cw: ClustalW 1.81 (Thompson et al., 1994), Prrp (Gotoh, 1996), Dialign2 (Morgenstern et al., 1998) and T-Coffee 1.29 (Notredame et al., 2000). In BaliBase, reference alignments are classified into 5 categories: category 1 contains closely related sequences; category 2 contains a group of closely related sequences and an outlier; category 3 contains two groups of distantly related sequences; category 4 contains families with long internal indels; category 5 contains sequences with long terminal indels. The resulting alignments were then compared to their reference counterparts in BaliBase, using only the regions annotated as trustable. Under this scheme, we define the accuracy of an alignment to be the number of reference columns it reproduces exactly, divided by the total number of columns within that reference. The comparison is restricted to the portions annotated as trustworthy in the reference alignment. avg 1 is the average of the results obtained on each of the 142 test cases; avg 2 is the average of the results obtained in each category. Bold numbers indicate the best performing method.

Bibliography

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Bairoch, A. and Boeckmann, B., 1992. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 20: 2019-2022.
Birney, E. and Durbin, R., 2000.
Using GeneWise in the Drosophila annotation experiment. Genome Res. 10: 547-548.
Corpet, F., Servant, F., Gouzy, J. and Kahn, D., 2000. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28: 267-269.
Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C., 1979. A model of evolutionary change in proteins. Detecting distant relationships: computer methods and results. In: M.O. Dayhoff (Editor), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, D.C., pp. 353-358.
Duret, L. and Abdeddaim, S., 2000. Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. In: D. Higgins and W. Taylor (Editors), Bioinformatics: Sequence, Structure and Databanks. Oxford University Press, Oxford.
Duret, L., Mouchiroud, D. and Gouy, M., 1994. HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res. 22: 2360-2365.
Feng, D.-F. and Doolittle, R.F., 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25: 351-360.
Gotoh, O., 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264: 823-838.
Gribskov, M., McLachlan, M. and Eisenberg, D., 1987. Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. 84: 4355-4358.
Henikoff, S. and Henikoff, J.G., 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915-10919.
Heringa, J., 1999. Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Computers and Chemistry 23: 341-364.
Huang, X. and Miller, W., 1991. A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. 12: 337-357.
Jones, D.T., 1999.
Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195-202.
Krogh, A., Brown, M., Mian, I.S., Sjölander, K. and Haussler, D., 1994. Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235: 1501-1531.
Morgenstern, B., Frech, K., Dress, A. and Werner, T., 1998. DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14: 290-294.
Needleman, S.B. and Wunsch, C.D., 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-453.
Notredame, C., Higgins, D.G. and Heringa, J., 2000. T-Coffee: a novel algorithm for multiple sequence alignment. J. Mol. Biol. 302: 205-217.
Notredame, C., Holm, L. and Higgins, D.G., 1998. COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
Perriere, G., Duret, L. and Gouy, M., 2000. HOBACGEN: database system for comparative genomics in bacteria. Genome Research 10: 379-385.
Saitou, N. and Nei, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406-425.
Sander, C. and Schneider, R., 1991. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics 9: 56-68.
Sneath, P.H.A. and Sokal, R.R., 1973. Numerical Taxonomy. W.H. Freeman, San Francisco.
Taylor, W.R., 1988. A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28: 161-169.
Thompson, J., Higgins, D. and Gibson, T., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4690.
Thompson, J.D., Plewniak, F. and Poch, O., 1999. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27: 2682-2690.
Using Genetic Algorithms for Pairwise and Multiple Sequence Alignments

Cédric Notredame
Information Génétique et Structurale, CNRS-UMR 1889
31 Chemin Joseph Aiguier, 13 006 Marseille
Email: cedric.notredame@igs.cnrs-mrs.fr

1 Introduction

1.1 Importance of Multiple Sequence Alignment

The simultaneous alignment of many nucleic acid or amino acid sequences is one of the most commonly used techniques in sequence analysis. Given a set of homologous sequences, multiple alignments are used to help predict the secondary or tertiary structure of new sequences [51]; to help demonstrate homology between new sequences and existing families; to help find diagnostic patterns for families [6]; to suggest primers for the polymerase chain reaction (PCR); and as an essential prelude to phylogenetic reconstruction [19]. These alignments may be turned into profiles [25] or hidden Markov models (HMMs) [27, 9] that can be used to scour databases for distantly related members of the family. Multiple alignment techniques can be divided into two categories: global and local techniques. When making a global alignment, the algorithm attempts to align the sequences chosen by the user over their entire length. Local alignment algorithms automatically discard portions of sequences that do not share any homology with the rest of the set. They constitute a greater challenge since they increase the number of decisions made by the algorithm. Most multiple alignment methods are global, leaving it to the user to decide on the portions of sequences to be incorporated in the multiple alignment. To aid that decision, one often uses local pairwise alignment programs such as Blast [3] or Smith and Waterman [56]. In this chapter, we will focus on global alignment methods, with a special emphasis on the alignment of protein and RNA sequences. Despite its importance, the automatic generation of an accurate multiple sequence alignment remains one of the most challenging problems in bioinformatics.
The reason behind that complexity can easily be explained. A multiple alignment is meant to reconstitute the relationships (evolutionary, structural and functional) within a set of sequences that may have been diverging for millions and sometimes billions of years. To be accurate, this reconstitution would require an in-depth knowledge of the evolutionary history and structural properties of these sequences. For obvious reasons, this information is rarely available, and generic empirical models of protein evolution [18, 28, 8] based on sequence similarity must be used instead. Unfortunately, these can prove difficult to apply when the sequences are less than 30% identical and lie within the so-called “twilight zone” [52]. Further, accurate optimization methods that use these models can be extremely demanding in computer resources for more than a handful of sequences [12, 62]. This is why most multiple alignment methods rely on approximate heuristic algorithms. These heuristics are usually a complex combination of ad hoc procedures mixed with some elements of dynamic programming. Overall, two key properties characterize them: the optimization algorithm and the criterion (objective function) this algorithm attempts to optimize.

1.2 Standard Optimization Algorithms

Optimization algorithms roughly fall into three categories: exact, progressive and iterative algorithms. Exact algorithms attempt to deliver an optimal or a sub-optimal alignment within some well-defined bounds [40, 57]. Unfortunately, these algorithms have very serious limitations with regard to the number of sequences they can handle and the type of objective function they can optimize. Progressive alignment methods are by far the most widely used [30, 14, 45].
They depend on a progressive assembly of the multiple alignment [31, 20, 58] in which sequences or alignments are added one by one, so that never more than two sequences (or multiple alignments) are aligned simultaneously using dynamic programming [43]. This approach has the great advantage of speed and simplicity combined with reasonable sensitivity, even if it is by nature a greedy heuristic that does not guarantee any level of optimization. Iterative alignment methods depend on algorithms able to produce an alignment and to refine it through a series of cycles (iterations) until no more improvement can be made. Iterative methods can be deterministic or stochastic, depending on the strategy used to improve the alignment. The simplest iterative strategies are deterministic. They involve extracting sequences one by one from a multiple alignment and realigning them to the remaining sequences [7, 24, 29]. The procedure is terminated when no more improvement can be made (convergence). Stochastic iterative methods include HMM training [39], simulated annealing (SA) [37, 36, 33] and evolutionary computation such as genetic algorithms (GAs) [44, 47, 34, 64, 5, 23] and evolutionary programming [11, 13]. Their main advantage is that they allow a clean separation between the optimization process and the evaluation criterion (objective function). It is the objective function that defines the aim of any optimization procedure and, in our case, it is also the objective function that contains the biological knowledge one tries to project into the alignment.

1.3 The Objective Function

In an evolutionary algorithm, the objective function is the criterion used to evaluate the quality (fitness) of a solution (individual). To be of any use, the value that this function associates with an alignment must reflect its biological relevance and indicate the structural or the evolutionary relation that exists among the aligned sequences.
In theory, a multiple alignment is correct if, in each column, the aligned residues have the same evolutionary history or play similar roles in the three-dimensional fold of the RNA or protein. Since evolutionary or structural information is rarely at hand, it is common practice to replace it with a measure of sequence similarity. The rationale behind this is that similar sequences can be assumed to share the same fold and the same evolutionary origin [52], as long as their level of identity is above the so-called "twilight zone" (more than 30% identity over more than 100 residues). Accurate measures of similarity are obtained using substitution matrices [18, 28]. A substitution matrix is a pre-computed table of numbers (for proteins, this matrix is 20×20, covering all possible substitutions among the 20 naturally occurring amino acids) where each possible substitution/conservation receives a weight indicative of its likelihood, as estimated from data analysis. In these matrices, substitutions (conservations) observed more often than one would expect by chance receive positive values, while under-represented mutations are associated with negative values. Given such a matrix, the correct alignment is defined as the one that maximizes the sum of the substitution (conservation) scores. An extra factor is also applied to penalize insertions and deletions (gap penalty). The most commonly used model for that purpose is known as 'affine gap penalties'. It penalizes an insertion/deletion once for its opening (gap opening penalty, abbreviated GOP) and then with a factor proportional to its length (gap extension penalty, abbreviated GEP). Since a gap of any length can be explained by a single mutational event, the aim of that scheme is to make sure that the best-scoring evolutionary scenario involves only a small number of insertions or deletions (indels). This will result in an alignment with few long gaps rather than many short ones.
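As an illustration of the scheme just described, the sketch below scores one pairwise alignment with a toy substitution table and affine gap penalties. The two-letter "matrix" and the GOP/GEP values are invented, and the GOP is charged together with a first GEP at each gap opening, which is one of several common conventions:

```python
# Invented toy substitution table and penalties (not BLOSUM/PAM values).
SUBST = {('A', 'A'): 2, ('G', 'G'): 2, ('A', 'G'): -1, ('G', 'A'): -1}
GOP, GEP = -4, -1   # gap-opening and gap-extension penalties

def affine_score(a, b):
    """Score two equal-length aligned strings (gaps written as '-')."""
    score, in_gap = 0, False
    for x, y in zip(a, b):
        if x == '-' or y == '-':
            # opening a gap costs GOP + GEP; extending it costs GEP only
            score += GEP if in_gap else GOP + GEP
            in_gap = True
        else:
            score += SUBST[(x, y)]
            in_gap = False
    return score

# One gap of length two costs GOP + 2*GEP = -6; three A/A matches give +6.
print(affine_score("AAGGA", "AA--A"))   # 0
```

Because the opening cost dominates, the scheme prefers one long gap over several short ones, as stated above.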
The resulting score can be viewed as a measure of similarity between two sequences (pairwise). This measure can be extended to the alignment of multiple sequences in many ways. For instance, it is common practice to set the score of the multiple alignment to be the sum of the scores of every pairwise alignment it contains (sum of pairs) [1]. While that scoring scheme is the most widely used, its main drawback stems from the lack of an underlying evolutionary scenario. It assumes that every sequence is independent, and this results in an overestimation of the number of substitutions. It is to counterbalance that effect that probability-based schemes were introduced in the context of HMMs. Their purpose is to associate each column of an alignment with a generation probability [39]. Estimations are carried out in a Bayesian context where the model (alignment) probability is evaluated simultaneously with the probability of the data (the aligned sequences). In the end, the score of the complete alignment is set to be the probability that the aligned sequences were generated by the trained HMM. The major drawbacks of this model are its high dependency on the number of sequences being aligned (i.e. many sequences are needed to generate an accurate model) and the difficulty of the training. More recently, new methods based on consistency were described for the evaluation of multiple sequence alignments. Under these schemes, the score of a multiple alignment is a measure of its consistency with a list of pre-defined constraints [42, 46, 10, 45]. It is common practice for these pre-defined constraints to be sets of pairwise, multiple or local alignments. Quite naturally, the main limitation of consistency-based schemes is that they make the quality of the alignment greatly dependent on the quality of the constraints it is evaluated against.
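The sum-of-pairs extension described at the start of this section can be sketched as follows, with an invented per-column scorer standing in for a real substitution matrix and gap model:

```python
from itertools import combinations

def pair_score(x, y):
    """Toy per-position scorer (invented values, not a real matrix)."""
    if x == '-' and y == '-':
        return 0        # a shared gap column carries no signal
    if x == '-' or y == '-':
        return -2       # simple per-position gap penalty
    return 1 if x == y else -1

def sum_of_pairs(msa):
    """Sum the induced pairwise scores over every pair of aligned rows."""
    return sum(
        pair_score(x, y)
        for a, b in combinations(msa, 2)
        for x, y in zip(a, b)
    )

print(sum_of_pairs(["ACG-T", "ACGGT", "A-GGT"]))   # 3
```

Note how every sequence pair contributes independently, which is exactly the independence assumption criticized above.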
An objective function always defines a mathematical optimum, that is to say, an alignment in which the sequences are arranged in such a manner that they yield a score that cannot be improved. The mathematically optimal alignment should never be confused with the correct alignment, the biological optimum. While the biological optimum is by definition correct, a mathematically optimal alignment is biologically only as good as it is similar to the biological optimum. This depends entirely on the quality of the objective function that was used to generate it. There is no limit to the complexity of the objective functions one may design, even if in practice the lack of appropriate optimization engines constitutes a major limitation. What is the use of an objective function if one cannot optimize it, and how is it possible to tell whether an objective function is biologically relevant or not? Evolutionary algorithms come in very handy to answer these questions. They make it possible to design new scoring schemes without having to worry, at least in a first stage, about optimization issues. In the next section, we introduce one of these evolutionary techniques, known as genetic algorithms (GAs). GAs are described along with another closely related stochastic optimization algorithm: simulated annealing. Three examples are reviewed in detail, in which GAs were successfully applied to sequence alignment problems.

2 Evolutionary Algorithms and Simulated Annealing

An evolutionary algorithm is a way of finding a solution to a problem by forcing sub-optimal solutions to evolve through perturbations (mutations and recombination). Most evolutionary algorithms are stochastic in the sense that the solution space is explored in a random rather than ordered manner.
In this context, randomness provides a non-null probability of sampling any potential solution, regardless of the size of the solution space, provided that the mutations allow such an exploration. The drawback of randomness is that not all potential solutions may be visited during the search (including the global optimum). In order to correct for this problem, a large number of heuristics have been designed that attempt to bias the way in which the solution space is sampled. They aim at improving the chances of sampling an optimal solution. For that reason, most stochastic strategies (including evolutionary computation) can be regarded as a tradeoff between greediness and randomness. Two stochastic strategies have been widely used for sequence analysis: simulated annealing and genetic algorithms. Strictly speaking, SA does not belong to the field of evolutionary computation; yet, in practice, it has been one of the major sources of inspiration for the elaboration of the genetic algorithms used in sequence analysis.

2.1 Simulated Annealing

Simulated annealing (SA) [38] was the first stochastic algorithm used to attempt to solve the multiple sequence alignment problem [33, 37]. SA relies on an analogy with physics. The idea is to compare the solving of an optimization problem to a crystallization process (the cooling of a metal). In practice, given a set of sequences, a first alignment is randomly generated. A perturbation is then applied (shifting an existing gap or introducing a new one) and the resulting alignment is evaluated with the objective function. If the new alignment is better than the previous one, it replaces it; otherwise, it replaces it only with a probability that depends on the score difference and on the current temperature. The higher the temperature, the more likely it is that a large unfavorable score difference will be accepted. At every cycle, the temperature decreases slightly until it reaches zero.
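The acceptance rule just described is the classic Metropolis criterion. A minimal sketch, assuming higher scores are better and writing delta for the score difference between the new and the old alignment:

```python
import math
import random

def accept(new_score, old_score, temperature):
    """Metropolis-style acceptance: keep improvements unconditionally;
    accept a worse solution with probability exp(delta / T)."""
    delta = new_score - old_score
    if delta >= 0:
        return True
    if temperature <= 0:
        return False        # frozen system: no worsening moves accepted
    return random.random() < math.exp(delta / temperature)

# A typical cooling schedule multiplies the temperature by a constant
# factor below 1 at every cycle, e.g. T = 0.95 * T, until it is near zero.
```

At high temperature almost any perturbation is kept; as T approaches zero the search degenerates into pure hill climbing.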
From the perspective of an evolutionary algorithm, SA can be regarded as a population with only one individual. Perturbations are similar to the mutations used in evolutionary algorithms. Apart from the population size of one, the main difference between SA and any true evolutionary algorithm is the extrinsic annealing schedule. While the principle is very sound, and its adequacy for multiple alignment optimization and objective function evaluation is obvious, SA suffers from a very serious drawback: it is extremely slow. Most of the studies conducted on simulated annealing and multiple alignments concluded that although it does reasonably well, SA is too slow to be used for ab initio multiple alignments and must be restricted to being used as an alignment improver (i.e. for the improvement of a seed alignment). This serious limitation makes it much harder to use SA as the black box one needs to evaluate newly designed objective functions.

2.2 Genetic Algorithms

It is in an attempt to overcome the limits of SA that evolutionary algorithms were adapted to the multiple sequence alignment problem. Evolutionary algorithms are parallel stochastic search tools. Unlike SA, which maintains a single line of descent from parent to offspring, evolutionary algorithms maintain a population of trials for a given objective function. Evolutionary algorithms are among the most interesting stochastic optimization tools available today. One of the reasons why these algorithms have received so little attention in the context of multiple sequence alignment is probably that implementing an evolutionary algorithm dedicated to multiple alignment is much less straightforward than implementing simulated annealing. In other areas of computational biology, evolutionary algorithms have already been established as powerful tools. These include RNA [26, 55, 48] and protein structure analysis [53, 60, 41].
Among all the existing evolutionary algorithms (genetic algorithms, genetic programming, evolution strategies and evolutionary programming), genetic algorithms have been by far the most popular in the field of computational biology. Although one could argue about who exactly invented GAs, the algorithms we use today were formally introduced by Holland in 1975 [32] and later refined by Goldberg to give the Simple Genetic Algorithm [22]. GAs are based on a loose analogy with the phenomenon of natural selection. Their principle is relatively straightforward. Given a problem, potential solutions (individuals within a population) compete with one another (selection) for survival. These solutions can also evolve: they can be modified (mutations) or combined with one another (crossovers). The idea is that, acting together, variation and selection will lead to an overall improvement of the population via evolution. Most of the concepts developed here about GAs are taken from [22, 16]. Two ingredients are essential to the GA strategy: the selection method and the operators. Selection is established in order to lead the search toward improvement. It means that the best individuals (as judged using the objective function) must be the most likely to survive. To serve the GA's purpose, this selection strategy cannot be too strict. It must allow some variety to be maintained throughout the search, in order to prevent the GA population from converging toward the first local optimum it encounters. Evolution toward the optimal solution also requires the use of operators that modify existing solutions and create diversity (mutations), or that optimize the use of the existing diversity (crossovers) by combining existing motifs into an optimal solution. Given such a crude layout, the potential for variation is infinite, and the study of new GA models is a very active field in its own right.
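To make the interplay of selection and operators concrete, here is a deliberately tiny Simple-GA skeleton in the spirit of [22]. A bit-string toy problem (maximize the number of 1s) stands in for an alignment so that the loop stays short; fitness-proportional selection, one-point crossover and point mutation are all shown, and every parameter value is an arbitrary illustration:

```python
import random

def evolve(pop_size=20, length=16, generations=60, seed=1):
    """Toy Simple GA: maximize the number of 1s in a bit string."""
    rng = random.Random(seed)
    fitness = lambda ind: sum(ind)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # selection: fitness-proportional (roulette-wheel) sampling of parents
        weights = [fitness(ind) + 1 for ind in pop]   # +1 keeps weights positive
        parents = rng.choices(pop, weights=weights, k=pop_size)
        next_pop = []
        for i in range(0, pop_size, 2):
            # one-point crossover between two selected parents
            cut = rng.randrange(1, length)
            a, b = parents[i], parents[i + 1]
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                if rng.random() < 0.5:                # point mutation
                    j = rng.randrange(length)
                    child[j] = 1 - child[j]
                next_pop.append(child)
        pop = next_pop
    return max(fitness(ind) for ind in pop)
```

For sequence alignment, the individuals become gapped alignments and the crossover/mutation steps become gap-shuffling operators, but the selection loop keeps exactly this shape.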
This being said, the main difficulty to overcome when adapting a GA to a problem like multiple sequence alignment is not the choice of a proper model, but rather the design of a well-suited set of operators. This is a well-known problem that has also received some attention in the field of structure prediction, both for proteins [50] and RNA [54]. A simple justification is that the operators (and the problem representation) largely control the manner in which the solution landscape is explored. For instance, the neighborhood of a solution is mostly defined by the exploration capabilities of the operators. Well chosen, they can smooth very rugged landscapes. On the other hand, if they are too 'smart' and too greedy, they may prevent a proper exploration from being carried out. Finding the right trade-off can prove a rather complex task. When applying GAs to sequence alignments, previous work on SA proved instrumental: it provided researchers with a well-tested set of operators perfectly suitable for integration within most evolutionary algorithms.

Attempts to apply evolutionary algorithms to the multiple sequence alignment problem started in 1993, when Ishikawa et al. published a hybrid GA [34] that does not try to optimize the alignment directly, but rather the order in which the sequences should be aligned using dynamic programming. Of course, this limits the algorithm to objective functions that can be used with dynamic programming. Even so, the results obtained that way were convincing enough to prompt the further development of GAs in sequence analysis. The first GA able to deal with sequences in a more general manner was described a few years later by Notredame and Higgins [44], shortly before similar work by Zhang [64]. In these two GAs, the population is made of complete multiple sequence alignments, and the operators have direct access to the aligned sequences: they insert and shift gaps in a random or semi-random manner.
In 1997, SAGA was applied to RNA analysis [47] and parallelized for that purpose using an island model. This work was later reproduced by Anabarasu et al. [5], who carried out an extensive evaluation of this model, using ClustalW as a reference. Over the following years, at least three new multiple sequence alignment strategies based on evolutionary algorithms were introduced [23, 13, 11]. Each of these relies on a principle similar to SAGA's: a population of multiple alignments evolves by selection, combination and mutation. The population is made of alignments, and the mutations are string-processing programs that shuffle the gaps using complex models. The main difference between SAGA and these recent algorithms has been the design of better mutation operators that improve the efficiency and the accuracy of the algorithms. These new results have strengthened the idea that the essence of the adaptation of GAs to multiple sequence alignments is the design of proper operators, reflecting as well as possible the true mechanisms of molecular evolution. In order to expose each of the many ingredients that constitute a GA specialized in sequence alignment, the example of SAGA will now be reviewed in detail.

3 SAGA: A GA Dedicated to Sequence Alignment

3.1 The Algorithm

SAGA is a genetic algorithm dedicated to multiple sequence alignment [44]. It follows the general principles of the simple genetic algorithm (sGA) described by Goldberg [22] and later refined by Davis [17]. In SAGA, each individual is a multiple alignment. The data structure chosen for the internal representation of an individual is a straightforward two-dimensional array where each line represents an aligned sequence and each cell is either a residue or a gap. The population has a constant size and does not contain any duplicates (i.e. identical individuals). The pseudo-code of the algorithm is given in Figure 1. Each of these steps is developed over the next sections.
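As a concrete illustration of this representation (a sketch with names of our own choosing, not SAGA's actual code), an individual can be held as a list of equal-length gapped rows, and the population can enforce the no-duplicate rule:

```python
GAP = "-"

def is_valid_alignment(rows):
    """An individual is well-formed if all rows share the same length."""
    return len({len(r) for r in rows}) == 1

def sequences_of(rows):
    """Recover the ungapped sequences encoded by an individual."""
    return [r.replace(GAP, "") for r in rows]

class Population:
    """Constant-size population that rejects duplicate alignments,
    mirroring SAGA's no-duplicate rule (a toy sketch)."""
    def __init__(self, size):
        self.size = size
        self.members = []
        self._seen = set()

    def try_add(self, rows):
        key = tuple(rows)
        if key in self._seen or len(self.members) >= self.size:
            return False      # duplicate, or population already full
        self._seen.add(key)
        self.members.append(rows)
        return True

pop = Population(size=2)
aln = ["ACG-T", "A-GGT"]
```

Storing the gapped rows directly makes every operator a simple string manipulation, at the price of having to re-check row lengths after each modification.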
Initialization

The challenge of the initialization (also known as seeding) is to generate a population as diverse as possible in terms of 'genotype' and as uniform as possible in terms of scores. In SAGA, generation 0 consists of 100 randomly generated multiple alignments that contain only terminal gaps. These initial alignments are less than twice the length of the longest sequence of the set (longer alignments can be generated later). To create one of these individuals, a random offset is chosen for each sequence (between 0 and the length of the longest sequence); each sequence is shifted to the right according to its offset, and empty spaces are padded with null signs (gaps) in order to give the same length L to all the sequences. Seeding can also be carried out by generating sub-optimal alignments using an implementation of dynamic programming that incorporates some randomness. This is the case in RAGA [47], an implementation of SAGA specialized in RNA alignment.

Evaluation

Fitness is measured by scoring each alignment according to the chosen objective function. The better the alignment, the better its score and the higher its fitness (scores are inverted if the OF is meant to be minimized). To minimize sampling errors, raw scores are turned into a normalized value known as the expected offspring (EO). The EO indicates how many children an alignment is likely to have. In SAGA, EOs are stochastically derived using a predefined recipe, 'remainder stochastic sampling without replacement' [22]. This gives values that are typically between 0 and 2. Only the weakest half of the population is replaced with the new offspring, while the other half is carried over unchanged to the next generation. This practice is known as overlapping generations [16].

Breeding

It is during breeding that new individuals (children) are generated. The EO is used as a probability for each individual to be chosen as a parent.
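Generation-0 seeding by random offsets can be sketched as follows (a simplified illustration; `seed_alignment` is our name, not SAGA's):

```python
import random

def seed_alignment(seqs, rng):
    """SAGA-style generation-0 individual: shift each sequence right by a
    random offset in [0, max_len], then pad on the right so all rows share
    one length. The result contains only terminal gaps and is at most
    twice the length of the longest sequence."""
    max_len = max(len(s) for s in seqs)
    shifted = ["-" * rng.randint(0, max_len) + s for s in seqs]
    width = max(len(r) for r in shifted)
    return [r.ljust(width, "-") for r in shifted]

rng = random.Random(7)
population0 = [seed_alignment(["ACGT", "AGT", "ACGGT"], rng)
               for _ in range(100)]
```

Each call yields a different random individual, giving the diverse generation 0 described above.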
This selection is carried out by weighted-wheel selection without replacement [22], and an individual's EO is decreased by one unit each time it is chosen to be a parent. An operator is also chosen and applied to the parent(s) to create the newborn child. Twenty-two operators are available in SAGA. They all have their own usage probability and can be divided into two categories: mutations, which require only one parent, and crossovers, which require two parents. Since no duplicates are allowed in the population, a newborn child is only accepted if it differs from all the other members of the generation already created. When a duplicate arises, the whole series of operations that led to its creation is canceled. Breeding is over when the new generation is complete, and SAGA proceeds to produce the next generation unless the finishing criterion is met.

Termination

Conditions that could guarantee optimality are not met in SAGA, and there is no valid proof that it may reach a global optimum, even in an infinite amount of time (as opposed to SA). For that reason, an empirical criterion is used for termination: the algorithm terminates when the search has been unable to improve for more than 100 generations. That type of stabilization is one of the most commonly used conditions to stop a GA when working on a population with no duplicates (i.e. a population where all the individuals are different from one another) [17].

3.2 Designing the Operators

As mentioned earlier, the design of an adequate set of operators has been the main point of focus in the work that led to SAGA. According to the traditional nomenclature of genetic algorithms [22], two types of operators coexist in SAGA: crossover and mutation. An operator is designed as an independent program that inputs one or two alignments (the parents) and outputs one alignment (the child). Each operator requires one or more parameters that specify how the operation is to be carried out.
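The weighted-wheel selection without replacement described above can be sketched as follows (our own simplified version; the EO bookkeeping is reduced to a dict mapping individuals to their remaining expected offspring):

```python
import random

def pick_parent(expected_offspring, rng):
    """Weighted-wheel (roulette) choice proportional to each individual's
    remaining expected offspring. The winner's EO is then decreased by one
    unit (clamped at zero), so no individual can be chosen indefinitely:
    selection 'without replacement'."""
    total = sum(expected_offspring.values())
    spin = rng.uniform(0, total)
    acc = 0.0
    for ind, eo_value in expected_offspring.items():
        acc += eo_value
        if spin <= acc:
            expected_offspring[ind] = max(0.0, eo_value - 1.0)
            return ind
    return ind  # float rounding edge case: fall back to the last one

rng = random.Random(1)
eo = {"aln_a": 1.8, "aln_b": 0.9, "aln_c": 0.3}
parents = [pick_parent(eo, rng) for _ in range(3)]
```

Individuals with a high EO dominate the wheel at first, but each selection shrinks their slice, spreading parenthood across the population.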
For instance, an operator that inserts a new gap requires three parameters: the position of the insertion, the index of the sequence to modify and the length of the insertion. These parameters may be chosen completely at random (in some pre-defined range); in that case, the operator is said to be used in a stochastic manner [44]. Alternatively, all but one of the parameters may be chosen randomly, leaving the value of the remaining parameter to be fixed by exhaustive examination of all possible values; the value that yields the best fitness is kept. An operator applied this way is said to be used in semi-hill-climbing mode. Most operators may be used either way (stochastic or semi-hill climbing). For the robustness of the GA, it is also important to make sure that the operators are completely independent from any characteristic of the objective function, unless one is interested in creating a very specific operator for the sake of efficiency.

The Crossovers

Crossovers are meant to generate a new alignment by combining two existing ones. Two types of crossover coexist in SAGA: the one-point crossover, which combines two parents through a single point of exchange (Figure 2a), and the uniform crossover, which promotes multiple exchanges between two parents by swapping blocks between consistent bits (Figure 2b). The uniform crossover is much less disruptive than its one-point counterpart, but it can only be applied if the two parents share some consistency, a condition rarely met in the early stages of the search. Of the two children produced by a crossover, only the fittest is kept and inserted into the new population (if it is not a duplicate). Crossovers are essential for promoting the exchange of high-quality blocks within the population. They make it possible to use existing diversity efficiently. However, the blocks present in the original population represent only a tiny proportion of all the possibilities.
They may not be sufficient to reconstruct an optimal alignment, and since crossovers cannot create new blocks, another class of operators is needed: mutations.

Mutations: Example of the Gap Insertion Operator

SAGA's mutation operators are extensively described in [44]. We will only review here the gap insertion operator, a crude attempt to reconstruct, backward, some of the insertion/deletion events through which a set of sequences might have evolved. When that operator is applied, alignments are modified following the mechanism shown in Figure 3. The aligned sequences are split into two groups. Within each group, every sequence receives a gap insertion at the same position. Groups are chosen by randomly splitting an estimated phylogenetic tree (as given by ClustalW [59]). Both the stochastic and the semi-hill-climbing versions of this operator are implemented. In the stochastic version, the length of the inserted gaps and the two insertion positions are randomly chosen, while in the semi-hill-climbing mode the second insertion position is chosen by exhaustively trying all the possible positions and comparing the scores of the resulting alignments.

Dynamic Scheduling of the Operators

When creating a child, the choice of the operator is just as important as the choice of the parents. Therefore, it makes sense to allow operators to compete for usage, just as the parents do for survival, in order to make sure that useful operators are more likely to be used. Since one cannot tell in advance the good operators from the bad ones, they all initially receive the same usage probability. Later during the run, these probabilities are dynamically reassessed to reflect each operator's individual performance. The recipe used in SAGA is the dynamic scheduling method described by Davis [16]. It allows operators to be added and removed easily, without any need for retuning.
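A stochastic version of the gap insertion operator can be sketched as follows (a simplified illustration: we split the sequences into two groups at random, whereas SAGA splits an estimated phylogenetic tree):

```python
import random

def gap_insertion(rows, rng, gap_len=1):
    """Stochastic gap-insertion mutation: split the sequences into two
    groups, then insert a gap of the same length at one random position
    per group. Since every row gains exactly `gap_len` gap characters,
    all rows keep a common length."""
    n = len(rows)
    group1 = set(rng.sample(range(n), rng.randint(1, n - 1)))
    width = len(rows[0])
    p1, p2 = rng.randint(0, width), rng.randint(0, width)
    child = []
    for i, row in enumerate(rows):
        pos = p1 if i in group1 else p2
        child.append(row[:pos] + "-" * gap_len + row[pos:])
    return child

rng = random.Random(3)
child = gap_insertion(["ACG-T", "A-GGT", "ACGGT"], rng)
```

A semi-hill-climbing variant would fix `p1` at random and try every value of `p2`, keeping the child with the best objective-function score.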
Under this model, an operator has a probability of being used that is a function of its recent efficiency (i.e. the improvement generated over the last 10 generations). The credit an operator gets when performing an improvement is also shared with the operators that came before and may have played a role in this improvement. Thus, each time a new individual is generated, if it yields some improvement over its parents, the operator that was directly responsible for its creation gets the largest part of the credit (e.g. 50%); then the operator(s) responsible for the creation of the parents get their share of the remaining credit (50% of the remaining credit); and so on. This propagation of credit goes on for some specified number of generations (e.g. 4). Every 10 generations, results are summarized for each operator and the usage probabilities are reassessed on the basis of the accumulated credit. To avoid the early loss of some operators, each of them has a minimum usage probability higher than 0. It is common practice to set these minimal usage probabilities so that they sum to 0.5; to that effect, one can use for each operator a minimum probability of 1/(2 × number of operators). A very interesting property of this scheme is that it ensures operators are used only when they are needed. For instance, uniform crossovers are generally more efficient than their one-point counterparts. Unfortunately, they cannot be properly used in the early stages of the optimization because at that point there is not enough consistency within the population. The dynamic scheduling adapts very well to that situation by initially giving a high usage probability to the one-point crossover, and by shifting that credit to the uniform crossover once the population has become consistent enough to support its usage. It is interesting to notice that these two operators compete with one another even though the GA does not explicitly know that they both belong to the crossover category.
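The reassessment step can be sketched as follows (our own simplified version: the propagation of credit to ancestor operators is omitted, and the 0.5 floor is spread evenly, as described above):

```python
def reassess_probabilities(credit, floor_total=0.5):
    """Turn accumulated per-operator credit into usage probabilities.

    A total probability mass of `floor_total` is reserved and spread
    evenly as per-operator minima (i.e. 1/(2 * n_operators) each when
    floor_total is 0.5), so no operator's probability ever drops to 0;
    the remaining mass is split proportionally to recent credit."""
    ops = list(credit)
    n = len(ops)
    floor = floor_total / n
    total = sum(credit.values())
    probs = {}
    for op in ops:
        share = credit[op] / total if total > 0 else 1.0 / n
        probs[op] = floor + (1.0 - floor_total) * share
    return probs

# Hypothetical credit accumulated over the last scheduling window:
probs = reassess_probabilities({"one_point_xover": 8.0,
                                "uniform_xover": 2.0,
                                "gap_insertion": 0.0})
```

An operator that earned no credit (here `gap_insertion`) keeps its minimum probability and can still prove itself in later windows.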
3.3 Parallelisation of SAGA

Long running times were SAGA's main limitation. This became especially acute when aligning very long sequences such as ribosomal RNAs (>1000 nucleotides). It is common practice to use parallelisation in order to alleviate such problems. The technique applied to SAGA is specific to GAs and is known as an island parallelisation model [21]. Instead of having a single GA running, several identically configured GAs run in parallel on separate processors. Every 5 generations they exchange some of their individuals. The GAs are arranged on the leaves and the nodes of an N-branched tree, and the population exchange is unidirectional, from the leaves to the root of the tree (Figure 4). By default, the individuals migrating from one GA to another are those having the best score. The GA node they come from keeps a copy of them, but they replace low-scoring individuals in the accepting GA [44]. Initially implemented in RAGA, the RNA version of SAGA, this model was later extended to SAGA, using a 3-branched tree with a depth of 3 that requires 13 GAs. These processes are synchronous and wait for each other to reach the same generation number before exchanging populations. This distributed model benefits from the explicit parallelisation and is about 10 times faster than a non-parallel version (i.e. about 80% of the maximum speedup expected when distributing the computation over 13 processors). It also benefits from the new constraints imposed by the tree topology on the structure of the population. It seems to be the lack of feedback that makes it possible to retain within the population a much higher degree of diversity than a single unified population could afford. The terminal leaves behave as a diversity reservoir and give the parallel GA a much higher accuracy than a non-parallel version with the same overall population.
Nonetheless, these preliminary observations remain to be firmly established through more thorough benchmarking.

4 Applications: Choosing an Objective Function

The main motivation behind SAGA's design was the creation of a robust platform, or black box, on which any OF one could think of could be tested in a seamless manner. Such a black box allows discriminating between the functions that are biologically relevant and those that are not. For instance, let us consider the weighted sums of pairs. This function is one of the most widely used. It owes its popularity to the fact that algorithmic methods exist that allow its approximate optimization [43, 40]. Yet we know this function is not very meaningful from a biological point of view [4]. The three main limitations that have caught biologists' attention are the crude modeling of insertions/deletions (gaps), the assumed independence of each position, and the fact that the evaluation cannot be made position dependent. Thanks to SAGA, it was possible to design new objective functions that make use of more complex gap penalties, take into account non-local dependencies or use position-specific scoring schemes, and to ask whether this increased sophistication results in an improvement of the alignments' biological quality. The following sections review three classes of objective functions that were successfully optimized using SAGA [44, 47, 46].

4.1 The Weighted Sums of Pairs

MSA [40] is an algorithm that makes it possible to deliver an optimal (or a very close suboptimal) multiple sequence alignment using the sums-of-pairs measure. This sophisticated heuristic performs multi-dimensional dynamic programming in a bounded hyper-space. It is possible to assess the level of optimization reached by SAGA by comparing it to MSA while using exactly the same objective function.
The sums-of-pairs principle is to associate a cost with each pair of aligned residues in each column of an alignment (substitution cost), and another, similar cost with the gaps (gap cost). The sum of these costs yields the global cost of the alignment. Major variations involve: (i) using different sets of costs for the substitutions (PAM matrices [18] or BLOSUM tables [28]); (ii) different schemes for the scoring of gaps [1]; (iii) different sets of weights associated with each pair of sequences [2]. Formally, one can define the cost of a multiple alignment A as:

    ALIGNMENT COST(A) = Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} W_{i,j} × COST(A_i, A_j)    (1)

where N is the number of sequences, Ai is the aligned sequence i, COST(Ai, Aj) is the alignment score between the two aligned sequences Ai and Aj, and Wi,j is the weight associated with that pair of sequences. The COST includes the sum of the substitution costs, as given by a substitution matrix, and the cost of the insertions/deletions, using a model with affine gap penalties (a gap-opening penalty and a gap-extension penalty). Two schemes exist for scoring gaps: natural affine gap penalties and quasi-natural affine gap penalties [1]. Quasi-natural gap penalties are the only scheme that the MSA program can efficiently optimize. This is unfortunate, since these penalties are known to be biologically less accurate than their natural counterparts [1] because of a tendency to over-estimate the number of gaps. Under both schemes, terminal gaps are penalized for extension but not for opening.

It is common practice to validate a new method by comparing the alignments it produces with references assembled by experts. In the case of multiple alignments, one often uses structure-based sequence alignments, which are regarded as the best standard of truth available [24]. For SAGA, validation was carried out using 3Dali [49]. Biological validation should not be confused with the mathematical validation also required for an optimization method.
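Equation (1) can be computed directly from an alignment. The sketch below uses a toy match/mismatch substitution scheme and a simple affine gap model on each pairwise projection (terminal gaps and the quasi-natural subtleties are ignored; all names and scores are illustrative, not MSA's actual parameters):

```python
from itertools import combinations

def pairwise_cost(a, b, match=0, mismatch=1, gop=3, gep=1):
    """Cost of one pair of aligned rows: substitution costs plus affine
    gap penalties. Columns where both rows hold a gap are skipped, as in
    a pairwise projection."""
    cost, in_gap = 0, False
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            cost += gep + (0 if in_gap else gop)   # open + extend
            in_gap = True
        else:
            cost += match if x == y else mismatch
            in_gap = False
    return cost

def sp_cost(rows, weights=None):
    """Weighted sums-of-pairs cost of an alignment (lower is better)."""
    n = len(rows)
    w = weights or {(i, j): 1.0 for i, j in combinations(range(n), 2)}
    return sum(w[i, j] * pairwise_cost(rows[i], rows[j])
               for i, j in combinations(range(n), 2))

total = sp_cost(["ACG-T", "A-GGT", "ACGGT"])  # 8 + 4 + 4 = 16
```

With all weights set to 1.0, each single-residue gap costs gop + gep = 4 in each projection where it appears against a residue.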
In the case of SAGA, both validations were carried out simultaneously, and a summary of the results obtained when optimizing the sums of pairs is shown in Table 1. First, SAGA was used to optimize the sums of pairs with quasi-natural gap penalties, using MSA-derived alignments as a reference. In two thirds of the cases, SAGA reached the same level of optimization as MSA. In the remaining test sets, SAGA outperformed MSA, and in every case that improvement correlated with an improvement of the alignment's biological quality, as judged by comparison with a reference alignment. Although they fall short of a demonstration, these figures suggest that SAGA is an adequate optimization tool that competes well with the most sophisticated heuristics. In a second aspect of that validation, SAGA was used to align test cases too large to be handled by MSA, using as an objective function the weighted sums of pairs with natural gap penalties. ClustalW was the non-stochastic heuristic used as a reference. As expected, the use of natural penalties led to some improvement over the optimization reached by ClustalW, and that mathematical improvement was also correlated with a biological improvement. Altogether, these results are indicative of the versatility of SAGA as an optimizer and of its ability to optimize functions that are beyond the scope of standard dynamic-programming-based algorithmic methods.

4.2 Consistency-Based Objective Functions: The COFFEE Score

Ultimately, a multiple alignment aims at combining within a single unifying model every piece of information known about the sequences it contains. However, it may be the case that a part of this information is not as reliable as one may expect. It may also be the case that some elements of information are not compatible with one another. The model will reveal these inconsistencies and require decisions to be made in a way that takes into account the overall quality of the alignment.
A new objective function can be defined that measures the fit between a multiple alignment and a list of weighted elements of information. Of course, the relevance of that objective function will depend greatly on the quality of the pre-defined list. This list can take whatever form one wishes. For instance, a convenient source is a list of pairwise alignments [46, 45] that, given a set of N sequences, contains all the N² possible pairwise alignments. In the context of COFFEE (Consistency-based Objective Function For alignmEnt Evaluation), that list of alignments is named a library, and the COFFEE function measures the level of consistency between a multiple alignment and its corresponding library. Evaluation is made by comparing each pair of aligned residues observed in the multiple alignment to the list of residue pairs that constitute the library. During the comparison, residues are identified only by their index within the sequences. The consistency score is equal to the number of pairs of residues that are simultaneously found in the multiple alignment and in the library, divided by the total number of pairs observed in the multiple sequence alignment. The maximum is 1, but the real optimum depends on the level of consistency found within the library. To increase the biological relevance of this function, each pair of residues is associated with a weight indicative of the quality of the pairwise alignment it comes from (a measure of the percentage identity between the two sequences). The COFFEE function can be formalized as follows. Given N aligned sequences S1...SN in a multiple alignment, Ai,j is the pairwise projection (obtained from the multiple alignment) of the sequences Si and Sj, LEN(Ai,j) is the number of ungapped columns in this alignment, SCORE(Ai,j) is the overall consistency between Ai,j and the corresponding pairwise alignment in the library, and Wi,j is the weight associated with this pairwise alignment.
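The consistency computation just described can be sketched as follows (an illustration with our own names; in practice the library would come from an independent set of pairwise alignments rather than, as in this toy usage, from the alignment itself):

```python
from itertools import combinations

def residue_pairs(row_i, row_j):
    """Aligned residue pairs in the pairwise projection of two MSA rows,
    with residues identified only by their index within each sequence."""
    pairs, ri, rj = set(), 0, 0
    for x, y in zip(row_i, row_j):
        if x != "-" and y != "-":
            pairs.add((ri, rj))
        ri += x != "-"
        rj += y != "-"
    return pairs

def coffee_score(rows, library, weights):
    """COFFEE-style consistency: weighted fraction of the residue pairs
    aligned in the MSA that are also found in the library. `library` and
    `weights` map a sequence-index pair (i, j) to a pair set / a float."""
    num = den = 0.0
    for i, j in combinations(range(len(rows)), 2):
        msa_pairs = residue_pairs(rows[i], rows[j])
        num += weights[i, j] * len(msa_pairs & library[i, j])
        den += weights[i, j] * len(msa_pairs)
    return num / den if den else 0.0

rows = ["ACG-T", "A-GGT"]
library = {(0, 1): residue_pairs(rows[0], rows[1])}   # self-consistent toy
score = coffee_score(rows, library, {(0, 1): 1.0})    # -> 1.0
```

Because the denominator counts the ungapped columns of each projection, an alignment fully consistent with its library scores exactly 1.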
    COFFEE score = [Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} W_{i,j} × SCORE(A_{i,j})] / [Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} W_{i,j} × LEN(A_{i,j})]

If we compare this function to the weighted sums of pairs described earlier, we find that the main difference is the library, which replaces the substitution matrix and provides a position-dependent means of evaluation. It is also interesting to note that, under this formulation, an alignment having an optimal COFFEE score is equivalent to a Maximum Weight Trace alignment using a 'pairwise alignment graph' [35]. Table 2 shows some of the results obtained using SAGA/COFFEE on 3Dali. For that experiment, the library of pairwise alignments had been generated using ClustalW alignments, and the resulting alignments proved to be of a higher biological quality than those obtained with the alternative methods available at the time. Eventually, these results were convincing enough to prompt the development of a fast non-GA-based method for the optimization of the COFFEE function. That new algorithm, named T-Coffee, was recently made available to the public [45].

4.3 Taking Non-Local Interactions Into Account: RAGA

So far, we have reviewed the use of SAGA for sequence analysis problems that consider every position as independent from the others. While that approximation is acceptable when the sequence signal is strong enough to drive the alignment, this is not always the case when dealing with sequences that have a lower information content than proteins but carry explicit structural information, such as RNA or DNA. To illustrate one more usage of GAs, it is interesting to examine a case where SAGA was used to optimize an RNA structure superimposition in which the OF takes into account local and non-local interactions altogether. RNA was chosen because its fold, largely based on Watson and Crick base pairings [63], generates characteristic structures (stem-loops) that are easy to predict and analyze [65].
Since the pairing potential of two RNA bases can be predicted with reasonable accuracy, the evaluation of an alignment can easily take into account structure (Se) and sequence (Pr) similarities altogether. The version of SAGA in which that new function is implemented is named RAGA (RNA Alignment by Genetic Algorithm) [47]. In RAGA, the OF evaluates the alignment of two RNA sequences, one with a known secondary structure (the master) and one that is homologous to the master but whose exact secondary structure is unknown (the slave). It can be formalized as follows:

    Alignment score = Pr + (λ × Se) − Gap penalty    (2)

where λ is a constant (1-3) and the gap penalty is the sum of the affine gap penalties within the alignment. Pr is simply the number of identities. Se uses the secondary structure of the master sequence and evaluates the stability of the folding it induces onto the slave sequence. If two bases form a base pair (part of a stem) in the master, then the two 'slave' bases they are aligned to should be able to form a Watson and Crick base pair as well. Se is the sum of the scores of these induced pairs. The energetic model used in RAGA is very simplified: it assigns 3 to GC pairs and 2 to UA and UG pairs.

Assessing the accuracy and the efficiency of RAGA is a problem very similar to the one encountered when analyzing SAGA. In this case, the reference alignments were chosen from mitochondrial ribosomal small-subunit RNA sequence alignments established by experts [61]. The human sequence was used as a master and realigned by RAGA to seven other homologous mitochondrial sequences used as slaves. Evaluation was made by comparing the optimized pairwise alignments to those contained in the reference alignment. The results in Table 3 indicate very clearly that a proper optimization took place and that the secondary structure information was efficiently used to enhance the alignment quality.
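Equation (2) can be sketched for a single master/slave pair as follows (an illustrative toy with our own names; RAGA's actual implementation is more involved). `stems` lists the paired master residue indices, and the induced slave pairs are scored with the simplified model above (GC = 3, AU = 2, GU = 2):

```python
def raga_score(master, slave, stems, lam=2.0, gop=3.0, gep=1.0):
    """Pr (identities) + lambda * Se (base pairs induced on the slave by
    the master's known stems) - affine gap penalties."""
    PAIR = {frozenset("GC"): 3, frozenset("AU"): 2, frozenset("GU"): 2}
    induced = {}          # master residue index -> aligned slave base
    mi = 0                # current master residue index
    pr, gap_pen, in_gap = 0.0, 0.0, False
    for x, y in zip(master, slave):
        if x != "-" and y != "-":
            pr += x == y                  # identity contributes to Pr
            induced[mi] = y
            in_gap = False
        elif x != y:                      # exactly one side is a gap
            gap_pen += gep + (0.0 if in_gap else gop)
            in_gap = True
        if x != "-":
            mi += 1
    se = sum(PAIR.get(frozenset((induced.get(i, "-"),
                                 induced.get(j, "-"))), 0)
             for i, j in stems)
    return pr + lam * se - gap_pen

# Toy hairpin: master positions (0,6) and (1,5) form GC stems.
score = raga_score("GGAAUCC", "GGA-UCC", stems=[(0, 6), (1, 5)])
```

Here Pr = 6 identities, Se = 3 + 3 from the two induced GC pairs, and one single-base gap costs 4, giving 6 + 2 × 6 − 4 = 14.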
This is especially noticeable for very divergent sequences that do not contain enough information at the primary level for an accurate alignment to be determined on that basis alone. It is also interesting to point out that RAGA can take into account elements of the tertiary structure known as pseudo-knots, which were successfully added to the objective function. These elements, which are beyond the scope of most dynamic-programming-based methods, lead to even more accurate alignment optimization [47].

5 Conclusion: GA versus Heuristic Methods

Section 4 of this chapter illustrates three situations in which GAs proved able to solve very complex optimization problems with a reasonable level of accuracy. On its own, this clearly indicates the importance and the interest of these methods in the field of sequence analysis. Yet, when applied to this type of problem, GAs suffer from two major drawbacks: they are very slow and unreliable. By unreliable, we mean that, given a set of sequences, a GA may not deliver the same answer twice, owing to the stochastic nature of the optimization process and to the difficulty of the optimization. This may be a great cause of concern to the average biologist, who expects to use a multiple alignment as a prediction tool and possibly as a decision aid for the design of expensive wet-lab experiments. How severe is this problem? If we consider the protein test cases analyzed here, SAGA reaches its best score in half of the runs on average. For RAGA, maybe because the solution space is more complex, this proportion goes down to 20%. If one is only interested in validating a new objective function, this is not a major source of concern, since even in the worst cases the sub-optimal solutions are within a few percent of the best solution found. However, this instability is not unique to GAs and is not as severe as the second major drawback: efficiency.
Although much more practical than SA, GAs' slowness means that they cannot really be expected to become part of any of the very large projects that require millions of alignments to be routinely made over a few days [15]. More robust, if less accurate, techniques are required for that purpose. Is the situation hopeless, then? The answer is definitely no, since two important fields of application exist for which GAs are uniquely suited. The first is the analysis of rare and very complex problems for which no other alternative is available, such as the folding of very long RNAs. The second is more general: GAs provide us with a unique way of probing very complex problems with little concern, at least in the first stages, for the algorithmic issues involved. It is quite remarkable that even with a very simple GA one can quickly ask very important questions and decide whether a thread of investigation is worth being pursued or should simply be abandoned. The COFFEE project is a good example of such a cycle of analysis. It followed a three-step process: (i) an objective function was first designed without any concern for the complexity of its optimization or the algorithmic issues; (ii) SAGA was used to evaluate the biological relevance of that function; (iii) this validation was convincing enough to prompt the conception of a new dynamic programming algorithm, much faster and appropriate for this function [45]. This non-GA-based algorithm was named T-Coffee (Tree-based COFFEE). The respective development times of these two projects make a good case for the use of SAGA: the COFFEE project took four months to carry out, while completion of the T-Coffee project required more than a year and a half of algorithm development and software engineering.
$YDLODELOLW\ SAGA, RAGA, COFFEE and T-Coffee are all available free of charge from the author either via Email (cedric.notredame@igs.cnrs-mrs.fr) or via the WWW (http://igs-server.cnrsmrs.fr/~cnotred).  $FNQRZOHGJHPHQWV The author wishes to thank Dr Hiroyuki Ogata and Dr Gary Fogel for very helpful comments and for an in-depth review of the manuscript.  5HIHUHQFHV [1] S. F. Altschul, *DS FRVWV IRU PXOWLSOH VHTXHQFH DOLJQPHQW, J. Theor. Biol., 138 (1989), pp. 297-309. [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] S. F. Altschul, R. J. Carroll and D. J. Lipman, :HLJKWV IRU GDWD UHODWHG E\ D WUHH, Journal of Molecular Biology, 207 (1989), pp. 647-653. S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, %DVLF ORFDO DOLJQPHQW VHDUFK WRRO, Journal of Molecular Biology, 215 (1990), pp. 403-410. S. F. Altschul and D. J. Lipman, 7UHHV VWDUV DQG PXOWLSOH ELRORJLFDO VHTXHQFH DOLJQPHQW, SIAM J. Appl. Math., 49 (1989), pp. 197-209. L. A. Anabarasu, 0XOWLSOH VHTXHQFH DOLJQPHQW XVLQJ SDUDOOHO JHQHWLF DOJRULWKPV, , 7KH 6HFRQG $VLD3DFLILF &RQIHUHQFH RQ 6LPXODWHG (YROXWLRQ 6($/ , Canberra,australia, 1998. A. Bairoch, P. Bucher and K. Hofmann, 7KH 3526,7( GDWDEDVH LWV VWDWXV LQ , Nucleic Acids Research, 25 (1997), pp. 217-221. G. J. Barton and M. J. E. Sternberg, $ VWUDWHJ\ IRU WKH UDSLG PXOWLSOH DOLJQPHQW RI SURWHLQ VHTXHQFHV FRQILGHQFH OHYHOV IURP WHUWLDU\ VWUXFWXUH FRPSDULVRQV, Journal of Molecular Biology, 198 (1987), pp. 327-337. S. A. Benner, M. A. Cohen and G. H. Gonnet, 5HVSRQVH WR %DUWRQ V OHWWHU &RPSXWHU VSHHG DQG VHTXHQFH FRPSDULVRQ, Science, 257 (1992), pp. 1609-1610. P. Bucher, K. Karplus, N. Moeri and K. Hofmann, $ IOH[LEOH PRWLI VHDUFK WHFKQLTXH EDVHG RQ JHQHUDOL]HG SURILOHV, Comput Chem, 20 (1996), pp. 3-23. K. Bucka-Lassen, O. Caprani and J. Hein, &RPELQLQJ PDQ\ PXOWLSOH DOLJQPHQWV LQ RQH LPSURYHG DOLJQPHQW, Bioinformatics, 15 (1999), pp. 122-30. L. Cai, D. Juedes and E. 
Liakhovitch, Evolutionary computation techniques for multiple sequence alignment, Congress on Evolutionary Computation, 2000, pp. 829-835.
[12] H. Carrillo and D. J. Lipman, The multiple sequence alignment problem in biology, SIAM J. Appl. Math., 48 (1988), pp. 1073-1082.
[13] K. Chellapilla and G. B. Fogel, Multiple sequence alignment using evolutionary programming, Congress on Evolutionary Computation, 1999, pp. 445-452.
[14] F. Corpet, Multiple sequence alignment with hierarchical clustering, Nucleic Acids Res., 16 (1988), pp. 10881-10890.
[15] F. Corpet, F. Servant, J. Gouzy and D. Kahn, ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons, Nucleic Acids Res., 28 (2000), pp. 267-269.
[16] L. Davis, The Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, 1991.
[17] P. J. Davis and R. Hersh, The Mathematical Experience, Birkhauser, Boston, 1980.
[18] M. O. Dayhoff, Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, DC, USA, 1978.
[19] J. Felsenstein, PH…

…1400 MSAs altogether) and on 6 out of 10 BaliBase categories. The total results confirm M-Coffee to be, on average, the best performer on the three datasets. Further analysis of individual datasets (Table 3) also reveals that, on average, M-Coffee is about twice as likely as any of the individual methods to deliver the most accurate MSA (1104 versus 614). In terms of CPU time, M-Coffee is very similar to the standard T-Coffee, with the difference that it does not require the estimation of the pairwise library. For instance, if we consider 1bxkA-1he2A, a standard Prefab dataset of 50 sequences, 200 amino acids long and of 47% average identity, the default T-Coffee requires 270 s to align that dataset on a standard PC (Pentium 2 GHz, 500 MB RAM), while M-Coffee8 requires 180 s on a similar machine.

1698 Nucleic Acids Research, 2006, Vol. 34, No. 6

Table 2.
The CS accuracy performance of M-Coffee8 and various individual methods on the HOMSTRAD, Prefab and BaliBase references

Dataset               M-Coffee8  ClustalW  Dialign-T  FINSI  Muscle 6  PCMA   POA    Probcons  T-Coffee
Homstrad              67.75*     61.15     57.92      64.22  66.04     63.73  51.9   66.41     65.37
Prefab <10%           27.19      18.25     15.51      24.86  24.14     25.53  9.09   24.81     23.41
Prefab 10 to <20%     59.80*     43.27     44.11      58.76  54.76     55.96  32.26  56.21     55.28
Prefab 20 to <30%     84.58*     74.79     75.28      83.76  82.09     81.47  64.42  82.85     82.39
Prefab 30 to <40%     92.54*     87.27     85.62      91.81  90.42     89.84  79.96  91.68     91.51
Prefab 40 to <100%    97.05*     94.91     96.07      96.92  96.17     95.03  94.30  96.20     96.68
Prefab total          72.91*     61.68     62.05      72.01  69.56     69.76  52.61  70.54     69.97
BaliBase Set: 11      43.18*     22.68     25.32      38.95  34.37     37.45  11.18  39.55     32.68
BaliBase Set: 12      85.91*     71.43     72.57      82.68  84.80     82.61  51.05  84.80     83.00
BaliBase Set: 20      43.12      21.68     29.20      45.85  36.49     44.83  13.37  37.78     39.68
BaliBase Set: 30      59.19*     25.48     35.19      57.59  41.04     58.15  7.89   47.26     47.48
BaliBase Set: 40      58.17      39.04     44.75      60.02  48.42     53.83  14.42  51.25     55.58
BaliBase Set: 50      59.81      33.69     44.25      57.69  50.56     59.88  21.63  55.25     57.31
BaliBase Set: S11     59.50      40.76     33.34      50.63  59.37     44.76  31.37  58.45     47.61
BaliBase Set: S12     86.59      79.05     76.20      84.02  86.95     82.91  68.14  87.05     83.75
BaliBase Set: S2      56.76      44.37     36.90      53.85  55.78     51.85  35.24  54.46     49.78
BaliBase Set: S3      69.41*     49.69     47.31      63.83  63.14     64.10  36.14  65.03     64.45
BaliBase Set: S5      60.60      43.27     45.47      57.73  60.33     56.73  28.47  59.80     55.67
BaliBase total        62.02      42.83     44.59      59.34  56.47     57.92  29.00  58.24     56.10

HOMSTRAD was evaluated with aln_compare, Prefab with Qscore and BaliBase with BaliScore. Methods significantly better (P < 0.05) than the next best are marked with an asterisk.

Table 3.
Individual dataset analysis

Subset                 M-Coffee8 better  M-Coffee8 worse  P (Wilcoxon signed)  Best single method
Homstrad               139               65               0.000                ProbCons
Prefab <10%            49                37               0.16                 PCMA
Prefab 10 to <20%      326               226              0.000                Finsi
Prefab 20 to <30%      278               132              0.000                Finsi
Prefab 30 to <40%      64                35               0.003                ProbCons
Prefab 40 to <100%     62                25               0.002                Finsi
Prefab total           779               455              0.000                /
BaliBase Set: 11       19                5                0.002                ProbCons
BaliBase Set: 12       26                7                0.008                Finsi
BaliBase Set: 20       16                14               0.967                PCMA
BaliBase Set: 30       16                5                0.013                Finsi
BaliBase Set: 40       24                10               0.333                PCMA
BaliBase Set: 50       12                4                0.078                Muscle 6
BaliBase Set: S11      12                15               0.793                ProbCons
BaliBase Set: S12      13                11               0.437                Muscle 6
BaliBase Set: S2       21                13               0.397                ProbCons
BaliBase Set: S3       19                6                0.024                Muscle 6
BaliBase Set: S5       8                 5                0.623                /
BaliBase total         186               95               0.002                /
Total                  1104              615
Total versus ProbCons  1249              615                                   ProbCons

The data are the same as in Table 2. On each subset, M-Coffee8 is compared with the best performing single method. Columns 2 and 3 indicate the number of times M-Coffee8 is better/worse than the best single method on that subset. The last two lines give the total for the table (Total) and the result of a comparison against ProbCons, the best individual method.

… remains more accurate than most individual methods (including the duplicated one). These results suggest that the combination procedure is a rather robust process, able to cope with a significant amount of noise. A potential problem with meta-methods is their tendency to homogenize a field of research by competing unfairly against the individual methods they are made of. In the case of M-Coffee, it is interesting to stress the importance of original and independent individual methods, as illustrated by the method tree. It is also worth pointing out that our analysis reveals several method convergences (Figure 1) that may not be entirely obvious to a non-specialist basing their judgement on the methods' technical descriptions. Overall, M-Coffee will keep performing well, and improving, as long as independent methods continue to be produced.
Such a concept resonates strongly with the notions of 'crowds' and 'mobs', and with the observation that a group of non-expert people can arrive at more accurate decisions than a small number of 'experts' (37). Crowds are described as having the potential to be wise, but only as long as the crowd members are independent and do not form a mob. Mobs are consistent but easily led to the wrong decision.

ACKNOWLEDGEMENTS

We are especially grateful to Martin Vingron for his advice in using the variance/covariance weighting system. We thank Prof. Jean-Michel Claverie (head of IGS) for useful discussions and material support, and Fabrice Armougom, Sebastien Moretti, Olivier Poirot and Vladimir Saudek for their help in maintaining and debugging the T-Coffee package. C.N. was supported by CNRS (Centre National de la Recherche Scientifique), Sanofi-Aventis Pharma SA, Marseille-Nice Génopole and the French National Genomic Network (RNG). Part of this work is funded by Science Foundation Ireland. Funding to pay the Open Access publication charges for this article was provided by the Centre National de la Recherche Scientifique.

Conflict of interest statement. None declared.

REFERENCES

1. Notredame,C. (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics, 3, 131–144.
2. Wallace,I.M., Blackshields,G. and Higgins,D.G. (2005) Multiple sequence alignments. Curr. Opin. Struct. Biol., 15, 261–266.
3. Hogeweg,P. and Hesper,B. (1984) The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J. Mol. Evol., 20, 175–186.
4. Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838.
5. Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
6. Wallace,I.M., O'Sullivan,O. and Higgins,D.G.
(2005) Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics, 21, 1408–1414.
7. Notredame,C. and Higgins,D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24, 1515–1524.
8. Gotoh,O. (1990) Consistency of optimal sequence alignments. Bull. Math. Biol., 52, 509–525.
9. Kececioglu,J.D. (1993) Lecture Notes in Computer Science. Springer-Verlag, Vol. 684, pp. 106–119.
10. Vingron,M. and Argos,P. (1991) Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol., 218, 33–43.
11. Morgenstern,B., Frech,K., Dress,A. and Werner,T. (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14, 290–294.
12. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217.
13. Lee,C., Grasso,C. and Sharlow,M.F. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics, 18, 452–464.
14. Katoh,K., Kuma,K.I., Toh,H. and Miyata,T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518.
15. Do,C.B., Mahabhashyam,M.S., Brudno,M. and Batzoglou,S. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340.
16. Pei,J., Sadreyev,R. and Grishin,N.V. (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 19, 427–428.
17. Thompson,J.D., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88.
18. Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469–2471.
19. Cuff,J.A., Clamp,M.E., Siddiqui,A.S., Finlay,M. and Barton,G.J. (1998) JPred: a consensus secondary structure prediction server.
Bioinformatics, 14, 892–893.
20. Allen,J.E. and Salzberg,S.L. (2005) JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics, 21, 3596–3603.
21. Bucka-Lassen,K., Caprani,O. and Hein,J. (1999) Combining many multiple alignments in one improved alignment. Bioinformatics, 15, 122–130.
22. Karplus,K. and Hu,B. (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 17, 713–720.
23. Thompson,J.D., Koehl,P., Ripp,R. and Poch,O. (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins, 61, 127–136.
24. Lassmann,T. and Sonnhammer,E.L. (2002) Quality assessment of multiple alignment programs. FEBS Lett., 529, 126–130.
25. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
26. Chenna,R., Sugawara,H., Koike,T., Lopez,R., Gibson,T.J., Higgins,D.G. and Thompson,J.D. (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res., 31, 3497–3500.
27. Notredame,C., Holm,L. and Higgins,D.G. (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14, 407–422.
28. Morgenstern,B. (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211–218.
29. Lenhof,H.P., Morgenstern,B. and Reinert,K. (1999) An exact solution for the segment-to-segment multiple sequence alignment problem. Bioinformatics, 15, 203–210.
30. Subramanian,A.R., Weyer-Menkhoff,J., Kaufmann,M. and Morgenstern,B. (2005) DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics, 6, 66.
31. Katoh,K., Misawa,K., Kuma,K. and Miyata,T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.
Nucleic Acids Res., 30, 3059–3066.
32. Grasso,C. and Lee,C. (2004) Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics, 20, 1546–1556.
33. O'Sullivan,O., Suhre,K., Abergel,C., Higgins,D.G. and Notredame,C. (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol., 340, 385–395.
34. Altschul,S.F., Carroll,R.J. and Lipman,D.J. (1989) Weights for data related by a tree. J. Mol. Biol., 207, 647–653.
35. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Appl. Biosci., 10, 19–29.
36. Eyrich,V.A., Marti-Renom,M.A., Przybylski,D., Madhusudhan,M.S., Fiser,A., Pazos,F., Valencia,A., Sali,A. and Rost,B. (2001) EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242–1243.
37. Surowiecki,J. (2004) The Wisdom of Crowds. Abacus, London.

Published online 17 April 2008  Nucleic Acids Research, 2008, Vol. 36, No. 9 e52  doi:10.1093/nar/gkn174

R-Coffee: a method for multiple alignment of non-coding RNA

Andreas Wilm1, Desmond G. Higgins1 and Cédric Notredame2,*

1The Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Ireland and 2Centre for Genomic Regulation (CRG), Dr Aiguader, 88, 08003 Barcelona, Spain

Received December 20, 2007; Revised March 14, 2008; Accepted March 25, 2008

ABSTRACT

R-Coffee is a multiple RNA alignment package, derived from T-Coffee, designed to align RNA sequences while exploiting secondary structure information. R-Coffee uses an alignment-scoring scheme that incorporates secondary structure information within the alignment. It works particularly well as an alignment improver and can be combined with any existing sequence alignment method.
In this work, we used R-Coffee to compute multiple sequence alignments combining the pairwise output of sequence aligners and structural aligners. We show that R-Coffee can improve the accuracy of all the sequence aligners. We also show that the consistency-based component of T-Coffee can improve the accuracy of several structural aligners. R-Coffee was tested on 388 BRAliBase reference datasets and on 11 longer Cmfinder datasets. Altogether, our results suggest that the best protocol for aligning short sequences (less than 200 nt) is the combination of R-Coffee with the pairwise RNA structural aligner Consan. We also show that the simultaneous combination of the four best sequence alignment programs with R-Coffee produces alignments almost as accurate as those obtained with R-Coffee/Consan. Finally, we show that R-Coffee can also be used to align longer datasets beyond the usual scope of structural aligners. R-Coffee is freely available for download, along with documentation, from the T-Coffee web site (www.tcoffee.org).

INTRODUCTION

A number of recent discoveries have cast new light on the importance of RNA, revealing a functional scope much broader than realized only a few years ago. Small non-coding RNAs (ncRNAs) are actively involved in a wide range of cell processes, including gene regulation, cell differentiation, genome maintenance, RNA maturation and protein synthesis (1,2). The ncRNA big picture could change even further in the coming years, as suggested by a recent report of the ENCODE consortium (3) showing an unexpected level of ncRNA transcription across the entire human genome. While the problem of aligning sequences has been regularly addressed over the last 40 years (4), delivering accurate alignments for ncRNAs remains a challenging task for at least two main reasons. First of all, RNA molecules have a low chemical complexity compared to proteins: just a four-letter alphabet.
This limited information content makes it hard to use sequence similarity as an estimator of the biological relevance of RNA alignments. The most notable consequence is the limited sensitivity of RNA alignments, and it is generally accepted that the RNA twilight zone (i.e. the level of identity below which pairwise alignments become uninformative) is close to 70%, as opposed to 25% for proteins (5–7). The second reason for the difficulty in aligning ncRNAs comes from their rate and pattern of evolution. In general, functional ncRNAs have a well-defined structure and their evolution seems to be mainly constrained to retain a specific folding, mostly based on Watson-Crick base pairs. Maintaining such a structure can be achieved through compensatory mutations, a phenomenon that explains why sequences can diverge considerably while still coding for the same structure (8). Therefore, sequence identity alone is a very crude measure of biological similarity, as it does not reflect very well the level of structural conservation. Because of these problems, it is highly desirable to take RNA secondary structure into account while aligning ncRNA sequences, in order to ensure optimal usage of the positional interdependence. Sankoff's algorithm, published 20 years ago (9), does exactly this, but suffers from enormous runtime and memory requirements. Given two sequences of length L, the pairwise alignment requires O(L^6) time and O(L^4) space, while its extension to N sequences is exponential: O(L^(3N)) in time and O(L^(2N)) in space. Only a few simplified implementations exist, usually constrained to pairwise alignment (10–13). Recently, a number of multiple alignment versions

*To whom correspondence should be addressed.
Tel: +34 93 316 02 71; Fax: +34 93 316 00 99; Email: cedric.notredame@crg.es

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

have been published (11,14–18), which employ various techniques to reduce run-time and memory consumption, for example by limiting the length or types of structure motifs, or by using banding techniques during the dynamic programming stage [for review, see (19)]. The most accurate of these heuristics are restricted to sequences shorter than approximately 200 residues, a limitation that explains why it is often more practical to use regular sequence aligners when dealing with larger sequences such as ribosomal RNA, or RNA motifs embedded in long genomic sequences. Most of these aligners treat RNA sequences like regular DNA and rely on an identity-based scoring scheme only suitable for closely related sequences. For instance, ClustalW has long been used for establishing reference collections of ribosomal RNA alignments (20). Following manual curation and visual inspection of the conserved secondary structures, these alignments have been widely used to infer phylogenetic relationships between most living organisms. Taking secondary structures into account may not, however, improve alignment accuracy across the entire spectrum of known ncRNAs. For instance, secondary structure-based alignments will not improve the comparison of mature miRNAs or of mRNAs that are not structurally conserved. In this work, we address the problem of RNA multiple sequence alignment by taking advantage of the T-Coffee framework (21).
T-Coffee is a multiple sequence alignment method able to combine the output of different sequence alignment packages. It takes as input a collection of alignments (pairwise or multiple) and outputs a multiple sequence alignment containing all the sequences. The input, which is referred to as a 'library', can consist of alternative and possibly inconsistent alignments of the same sequences. The purpose of the algorithm is to generate a final alignment that is as consistent as possible with the original input alignments. The main advantage of this procedure is its flexibility. For instance, in the original T-Coffee, the library was compiled from pairwise local and global alignments of the sequences. In M-Coffee (22) the compilation was made using alternative multiple sequence alignments, while in 3D-Coffee (23) or Expresso (24) the library is derived from structure-based pairwise alignments. This simple protocol can easily be built on top of any existing method, as illustrated by two RNA alignment packages: Marna (25) and T-Lara (19). Both packages focused on the development of a novel pairwise RNA alignment algorithm, which was then used to generate an alignment library fed to T-Coffee in order to produce a multiple alignment. In the present work we decided to go further and modify the library processing/extension algorithm so that it could take advantage of known and predicted secondary structures. This is done when compiling the library and while evaluating the matching score of two residues. This novel strategy forms the core of R-Coffee. Our primary goal was not to produce a stand-alone method, but rather a novel component that can seamlessly be added on top of any existing alignment method. We demonstrate here that it is possible to improve the alignment quality of most existing methods by means of R-Coffee, with only minor computational overhead.
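The library compilation step described above can be sketched as follows. This is a simplified illustration of the idea rather than the actual T-Coffee data structures: the library is reduced to a bare dictionary, and sequence weighting and multiple-alignment input are omitted.

```python
# Sketch of compiling a T-Coffee-style primary library from a set of
# (possibly inconsistent) pairwise alignments produced by different
# aligners. Data structures are illustrative, not T-Coffee's own.

def percent_identity(a, b):
    """Percent identity over positions where both gapped strings
    carry a residue (integer, as a simple alignment score proxy)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != '-' and y != '-']
    if not pairs:
        return 0
    return 100 * sum(x == y for x, y in pairs) // len(pairs)

def add_alignment(library, seq_i, seq_j, gapped_i, gapped_j):
    """Add every aligned residue pair to the library, weighted by the
    percent identity of the alignment it came from."""
    weight = percent_identity(gapped_i, gapped_j)
    ri = rj = 0  # ungapped residue indices
    for x, y in zip(gapped_i, gapped_j):
        if x != '-' and y != '-':
            key = ((seq_i, ri), (seq_j, rj))
            # Alternative alignments of the same pair accumulate weight.
            library[key] = library.get(key, 0) + weight
        ri += x != '-'
        rj += y != '-'
    return library
```

Feeding the same sequence pair several times, once per input aligner, is precisely what lets consistent residue pairings accumulate weight while disputed ones stay weak.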
SYSTEMS AND METHODS

Reference alignments and evaluation

We used two different benchmark sets: BRAliBase 2.0 (5), the standard RNA reference alignment dataset, and Cmfinder (26), a smaller dataset specifically designed for testing local analysis of long sequences. BRAliBase is a collection of RNA reference alignments especially designed for benchmarking RNA alignment methods. We only used its multiple alignment component, made of 388 multiple sequence alignments. These datasets are evenly distributed between 35% and 95% average sequence identity. Each of these datasets was originally produced by extracting sub-alignments from larger seed alignments coming from four RNA families (tRNA, group II intron, 5S rRNA and U5 RNA). Two of these were seed alignments obtained from the Rfam database (27). This procedure may appear slightly circular, as it involves comparing sequence-based reference alignments with other sequence-based alignments. In order to address this issue, BRAliScore, the benchmarking scoring scheme, was designed in such a way that it depends not only on the similarity between the reference and the evaluated alignment but also on the intrinsic structural conservation of the target alignment [see also (6)]. This tradeoff illustrates the difficulty of establishing a gold standard for RNA analysis. The main problem comes from the lack of sufficient experimentally validated RNA structures, in contrast to protein sequence analysis, where hundreds of accurate 3D structures exist. The BRAliScore combines two measures: the Sum of Pairs Score (SPS) and the Structural Conservation Index (SCI) (28). The SPS is the number of residue pairs identically aligned in the target and the reference, divided by the number of pairs in the reference. It was measured using a variant of compalignp [based on Sean Eddy's compalign; see also (6)] adapted to restrict the evaluation to a pre-defined core region. The SCI is a reference-independent measure.
It is defined as the ratio between the free energy of the MSA consensus structure [as calculated by RNAalifold (30)] and the average free energy of all single sequences of the MSA [as calculated by RNAfold (29)]. A value of 0 indicates the lack of a conserved structure, 1 corresponds to a perfect agreement between the energies of the single sequences and the consensus energy, while values higher than 1 indicate a very good agreement supported by additional co-variation. The BRAliScore is the product of the SCI and the SPS score. This combination can lead to problems when either the SPS or the SCI is close to 0. In practice, however, this situation rarely arises, and the combination of these two scores provides a very robust measure, less sensitive than the SPS to the actual accuracy of the reference alignment. To test for statistical differences between pairs of methods, we applied the Wilcoxon signed rank test, as in (6). All analyses were carried out using tools provided at http://www.biophys.uni-duesseldorf.de/bralibase/. Our second dataset is named after the RNA motif finder program Cmfinder (26). It contains Rfam sequences embedded in 200 nt of their original flanking genomic

Table 1.
Programs used for benchmarking and as input for T/R-Coffee

Program     Reference  Version     Structure  Sankoff  Alignment mode
ClustalW    (33)       1.83        N          N        Multiple
Consan      (10)       1.2         Y          Y        Pairwise
Dynalign    (12)       Dec-06      Y          Y        Pairwise
Foldalign   (13)       2.0.3       Y          Y        Pairwise
FoldalignM  (15)       1.0.1       Y          Y        Multiple
Mafft       (35)       5.861       Y          N        Multiple
Marna       (25)       Jan-07      Y          N        Multiple (T-Coffee)
M-Locarna   (17)       0.99        Y          Y        Multiple
Murlet      (14)       Mar-06      Y          Y        Multiple
Muscle      (32)       3.6         N          N        Multiple
Pcma        (45)       2           N          N        Multiple
Pmcomp      (11)       Jun-04      Y          Y        Pairwise
Pmmulti     (11)       Jun-04      Y          Y        Multiple
Poa         (46)       2           N          N        Multiple
Proalign    (47)       0.5.a3      N          N        Multiple
Probcons    (34)       1.1         N          N        Multiple
Prrn        (48)       SCC 3.0.a   N          N        Multiple
Rnasampler  (44)       1.3         Y          Y        Multiple
Stemloc     (16)       Dart 0.19b  Y          Y        Multiple
Stral       (49)       0.5.4       Y          N        Multiple
T-Lara      (19)       1.31        Y          N        Multiple (T-Coffee)
T-Coffee    (21)       5.19        N          N        Multiple

This table lists all the packages evaluated, along with their version numbers (or download date). The Structure column indicates whether the considered package uses predicted secondary structures (Y for yes, N for no). The Sankoff column indicates whether the package is a heuristic implementation of the original Sankoff algorithm. The Alignment mode column indicates whether the package performs pairwise or multiple alignment, or whether it is based on the T-Coffee package. Command-line parameters, in the order listed in the original table: -type = dna; -m mixed80.mod; Len2-len1 + 5; 0.4 5 20 2 1 0; -global; -score_matrix global.fmat; ginsi/fftns; mlocarna-p; -seqtype rna; blosum80.mat; -multiple –slow; -global; -o lara.params; -dp_mode myers_miller_pair_wise. Most programs were used as in the BRAliBase alignment benchmark publications (5) and (6).

regions, randomly distributed between the 5'- and the 3'-end of the ncRNA sequence (i.e. x nucleotides on the 5'-end, y nucleotides on the 3'-end, with x and y randomly chosen so that x + y = 200). The unaligned datasets were kindly provided by the Cmfinder authors.
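The SPS/SCI combination behind the BRAliScore, described in the evaluation section above, can be sketched as follows. This is an illustrative Python version, not the official BRAliBase scripts: the SCI is taken as a precomputed number (from RNAfold/RNAalifold), and core-region masking, as done by the adapted compalignp, is omitted.

```python
# Sketch of the BRAliScore combination: SPS (agreement with the
# reference alignment) multiplied by a precomputed SCI value.

def aligned_pairs(msa):
    """Return the set of residue pairs aligned in an MSA
    (sequences given as equal-length gapped strings)."""
    pairs = set()
    nseq = len(msa)
    pos = [0] * nseq  # ungapped residue index per sequence
    for col in range(len(msa[0])):
        col_res = []
        for s in range(nseq):
            if msa[s][col] != '-':
                col_res.append((s, pos[s]))
                pos[s] += 1
        for a in range(len(col_res)):
            for b in range(a + 1, len(col_res)):
                pairs.add((col_res[a], col_res[b]))
    return pairs

def sps(target, reference):
    """Fraction of the reference's aligned residue pairs
    reproduced in the target alignment."""
    ref = aligned_pairs(reference)
    return len(aligned_pairs(target) & ref) / len(ref) if ref else 0.0

def braliscore(target, reference, sci):
    """sci: Structural Conservation Index, precomputed externally."""
    return sps(target, reference) * sci
```

Because the SCI factor is reference-independent, a target alignment that matches the reference poorly but preserves a conserved consensus structure is penalized less harshly than by the SPS alone.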
We limited our choice to datasets having fewer than 40 sequences, thus generating 11 reference alignments (9 to 35 sequences, lengths between 167 and 368 nt). The average level of identity within the core regions of these alignments ranges from 31% to 73%. These characteristics make the Cmfinder dataset a difficult target, especially because of the sequence length and the inclusion of flanking regions. These datasets are also closer to 'real life' situations, which often involve discovering an RNA motif within poorly characterized sequences. The presence of flanking genomic regions potentially embedded in the Cmfinder datasets made it impossible to systematically use the SCI component of the BRAliScore. Instead, we used the Sum of Pairs score (SPS) and restricted the scoring to the Rfam core region of the alignment. Note that most available RNA alignment benchmark sets are based on Rfam alignments. These alignments are by no means a gold standard (especially not the 'full' Rfam alignments) and, unlike most protein alignment benchmarks, are not based on 3D superposition. For example, the BRAliBase benchmark set was created from four RNA families, two of which were full Rfam alignments (U5, g2intron) and two of which were Rfam seed alignments (tRNA, 5S), i.e. hand-curated and thus more likely to be of high quality. The Cmfinder datasets are exclusively based on Rfam seed alignments. The low number of quality alignments especially suited for benchmarking multiple RNA alignment programs (i.e. equally distributed over a wide sequence identity range, etc.) is a notorious problem. New RNA alignment benchmarks with a high number of RNAs, using expert hand-curated alignments based on structural superposition [e.g. from the Comparative RNA web site (31)], would constitute a useful advance in this area.

Alignment programs

To test and benchmark R-Coffee, we compared a variety of programs with different features (Table 1).
These programs can be roughly divided into three categories: pairwise structural aligners, multiple structural aligners and regular multiple sequence aligners. Pairwise structural aligners like Consan (10) are heuristic approximations of the original Sankoff algorithm. Their heavy computational requirements limit them to short sequences. The second category includes structural aligners extended to deal with multiple sequences, like FoldalignM (15) or Stemloc (16). Like their pairwise counterparts, they use structure and sequence information during the alignment and are therefore restricted to short datasets. The third category is made of regular multiple sequence alignment programs like Muscle (32) or ClustalW (33). These programs do not rely on any kind of structural modeling, although some of them [like Probcons (34) and Mafft (35)] have parameters optimized for BRAliBase, i.e. program parameters trained on BRAliBase alignments by the respective programs' authors. These last two categories of packages can be used to align either multiple sequence datasets or pairs of sequences. Most programs were used as described in (5) and (6). Marna (25), Pmmulti (11) and Stemloc (16) were not able to align all test sets of BRAliBase. In particular, Marna cannot align sequences that contain IUPAC characters, and the ability of Stemloc to align sequences seems to depend on the size of the main memory. We therefore had to exclude these packages from the comparison, although it should be noted that they produced accurate alignments on the datasets they could align (data not shown). In the case of T-Lara we did not use the pairwise alignments, as T-Lara already uses T-Coffee. Instead, we used R-Coffee as a drop-in replacement for T-Coffee. We used the standalone versions of all packages to compute multiple alignments for all the reference datasets. We also used them in combination with either T-Coffee or R-Coffee.
All programs were run on a Quad-Xeon 3 GHz machine with 6 GB RAM running Red Hat Enterprise Linux.

Original T-Coffee strategy

T-Coffee is a versatile MSA package that allows the combination of many pairwise (or multiple) sequence alignments into one unique final model. The principle is fairly straightforward. Given a set of sequences, a collection of pairwise alignments is computed. This collection can be redundant (several alternative alignments for each pair of sequences) or not, and is compiled into a data structure called the primary library. The primary library contains the list of all the pairs of aligned residues observed in the alignment collection. Each of these pairs receives a weight equal to the score of the alignment it came from (in practice, the percent identity is used). These weights are then re-estimated in a process named library extension. The purpose of the new weights (extended weights) is to reflect the level of consistency between each pair of aligned residues and the rest of the library. High-scoring pairs are those in very good agreement with the rest of the pairs, and their high score ensures that they easily find their way into the final alignment. R-Coffee uses the Myers and Miller algorithm (command-line option: -dp_mode = myers_miller_pair_wise) to align pairs of sequences or profiles, rather than the current T-Coffee default (-dp_mode = cfasta_pair_wise), which uses a banded dynamic programming implementation extensively tuned for proteins. The Myers and Miller setting corresponds to the T-Coffee algorithm as described in the original publication (21).

Adaptation of T-Coffee to use RNA structural information

The novel RNA-specific mode of T-Coffee described here has been designed to use secondary structure predictions. The current design supports an arbitrary number of structural predictions, and each sequence can be associated with one or more secondary structure predictions that do not need to be in agreement.
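The library extension step described above can be sketched as follows. This is a deliberate simplification, assuming the library is a plain dictionary of weighted residue pairs; the real T-Coffee extension differs in its weighting and normalization details.

```python
# Sketch of T-Coffee-style library extension: the weight of each
# residue pair grows with the support it receives through residues of
# a third sequence (if A-B and B-C are in the library, A-C gains
# weight). Illustrative only, not the actual T-Coffee algorithm.

def extend(library):
    """library: dict ((seq, res), (seq, res)) -> weight, with keys
    ordered by sequence index. Returns the extended library."""
    # Index library entries by residue for fast lookup.
    partners = {}
    for (r1, r2), w in library.items():
        partners.setdefault(r1, {})[r2] = w
        partners.setdefault(r2, {})[r1] = w
    extended = dict(library)
    for (r1, r2), w in library.items():
        for mid, w1 in partners[r1].items():
            # 'mid' must belong to a third sequence and align with
            # both r1 and r2; the support added is the weaker of the
            # two links through the intermediate residue.
            if mid[0] not in (r1[0], r2[0]) and mid in partners[r2]:
                extended[(r1, r2)] += min(w1, partners[r2][mid])
    return extended
```

Pairs confirmed by many intermediate sequences end up with high extended weights, which is what makes them likely to survive into the final alignment.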
It is also possible to associate no structural information with some sequences. In practice, however, we expect the best results to be obtained when using at least one secondary structure prediction for each sequence in the dataset. These structural predictions are passed to R-Coffee using a data structure similar to the T-Coffee primary library, named a structural library. In this structural library, each line indicates a pair of nucleotides predicted to be base-paired. Like its primary sequence counterpart, this structural library can be redundant, contain conflicting pairs or lack data for some pairs. RNA structures were computed using either a global or a local prediction method. Global structure predictions were obtained with RNAfold (29), which finds a structure with Minimal Folding Energy (MFE). When using an MFE structure as input, each predicted base pair was directly added to the structural library without any further filtering. This global MFE-based prediction has two major limitations: the algorithm is computationally demanding when applied to very long sequences, and its prediction accuracy decreases with sequence length (36,37). When dealing with long sequences, a sensible alternative is to use local RNA structure prediction methods such as RNAplfold (38). RNAplfold predicts local pair probabilities for base pairs within a certain span (the default is 100 nt). The program outputs base-pair probabilities rather than one precise structure and, in order to reduce noise, we excluded pairs exhibiting a low thermodynamic probability. We determined a suitable probability threshold by varying the filtering threshold between 0.0 and 0.8 (in steps of 0.1) and estimating the accuracy of the resulting R-Coffee alignments (Figure 2). We found 0.3 to be the optimal threshold, although our results indicate a relative stability of the system around this value (flat peak).
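The probability filtering step described above amounts to a simple cutoff over predicted base pairs. A minimal sketch, assuming a hypothetical (i, j, probability) input format (RNAplfold's real output format differs):

```python
# Sketch: build a structural library from local base-pair probability
# predictions, keeping only pairs at or above the 0.3 cutoff used in the
# paper. The (i, j, probability) triples are an assumed input format.
PROB_CUTOFF = 0.3

def structural_library(base_pair_probabilities):
    """Return the set of (i, j) base pairs kept after filtering."""
    return {(i, j) for i, j, p in base_pair_probabilities if p >= PROB_CUTOFF}
```

Pairs below the cutoff are simply never entered into the library; redundant or conflicting pairs from several predictions can coexist, as the text allows.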
The structural pairs thus gathered are then fed to R-Coffee, the version of T-Coffee using the R-score (see below). The structural libraries used here contain unweighted structure pairs, although it is, in principle, possible to apply a weighting scheme to these pairs, possibly reflecting the thermodynamic stability or the likelihood of each considered pair. For testing purposes we also used random structures as input. These structures were computed by shuffling the input sequences before predicting the structures with RNAfold/RNAplfold as described earlier. For shuffling we used the program shuffle from Sean Eddy's squid package.

The R-score: a novel T-Coffee scoring scheme

The original T-Coffee algorithm was modified in order to incorporate structural information within the library compilation process. This novel evaluation procedure is named the R-score and gives its name to R-Coffee, with the letter R standing for RNA. The R-score requires the secondary structures of the considered sequences to be pre-computed, and it involves two modifications of the original T-Coffee algorithm: one when compiling the pairwise alignment library and the other when evaluating the score for aligning two residues.

The new library compilation procedure involves extending the original T-Coffee library with any residue pair not observed within the pairwise alignments but whose relevance is suggested by the secondary structure predictions (Figure 1). For instance, let A↔X be two nucleotides predicted to form a secondary structure element in sequence 1 and B↔Y two other paired nucleotides in sequence 2. In the standard T-Coffee procedure, if the primary alignment of sequences 1 and 2 contains the aligned pair A–B, this pair will be added as an entry to the library and associated with a weight equal to the average identity of the alignment of sequences 1 and 2 (the weights are added if several alternative alignments contribute the same pair). The R-Coffee library procedure goes further and incorporates the pair X–Y into the library (with the weight of A–B) even if it was not aligned in any of the input library alignments. The rationale is that if the alignment of A–B is correct and if the predicted structures are correct as well, then the pair X–Y should be aligned, and it therefore makes sense to add it to the library (if X–Y is already part of the primary library, its weight is increased by the A–B weight). Whenever more than one structure has been provided for a sequence, the secondary structure library may be ambiguous and provide several alternative base pairings for one or both residues (e.g. A↔X and A↔W in one sequence and B↔Y and B↔Z in the other). In this case, the update considers a combination of all the potential structure-induced aligned pairs (i.e. X–Y, X–Z, W–Y, W–Z) and increases their primary weight with that of A–B.

Figure 1. R-Coffee's RNA-extension. The two Gs correspond to a pair of matched residues observed in the input pairwise alignment. This gets incorporated in the library as a constraint. Both of these nucleotides are predicted to be base-paired (Bp) with two Cs that have not been found aligned. The RNA extension involves incorporating the associated constraint (C matched to C), based on the information contained in the provided structures. This structure-based extension is one of the two main ingredients of the R-Coffee scoring scheme.

The R-score also uses a new evaluation procedure. The regular T-Coffee scoring scheme computes the matching score of a given residue pair A–B by summing over the scores of all the residue triplets including A, B and a third residue x from a third sequence. This can be formalized as follows:

    Tscore(A,B) = Σ_x MIN(Weight(A,x), Weight(B,x))    (1)

where Weight(A,x) is a primary weight and x is any residue reported aligned both to A and B within the primary library. The R-score of that same pair is then defined as:

    Rscore(A,B | A↔X, B↔Y) = MAX(Tscore(A,B), Tscore(X,Y))    (2)

where X pairs with A and Y pairs with B, as indicated by the structural library (A↔X, B↔Y). Whenever the structural library is ambiguous (i.e. A↔X, A↔Z, B↔Y, B↔W), the final score is estimated by considering all the resulting combinations:

    Rscore(A,B | A↔X, A↔Z, B↔Y, B↔W) = MAX(Tscore(A,B), Tscore(X,Y), Tscore(X,W), Tscore(Z,Y), Tscore(Z,W))    (3)

The R-score, like the regular T-Coffee scoring scheme, is then used as a position-specific substitution matrix while computing an alignment. R-Coffee uses the progressive alignment strategy described in the original T-Coffee publication and inspired by the ClustalW implementation. Sequences are all aligned two by two, using a standard identity-based matrix and the Myers and Miller implementation of dynamic programming. These alignments are then used to derive a distance matrix that is turned into a Neighbor-Joining tree (39). This tree is used as a guide tree to define the order in which the sequences are aligned to create the multiple alignment. These alignments are made using the R-score as a position-specific scoring scheme and the Myers and Miller pairwise algorithm. Apart from the use of the Myers and Miller pairwise alignment option, all the other T-Coffee parameters have been left at their original default values.

Availability

R-Coffee is part of the T-Coffee package, an open-source freeware distributed under the GPL license and available at no cost, along with documentation, from www.tcoffee.org. R-Coffee can be compiled on most platforms (UNIX, Mac OS X and Windows).

RESULTS AND DISCUSSION

R-Coffee is an RNA multiple sequence alignment method able to use RNA secondary structure information while computing a multiple sequence alignment. One of the key properties of R-Coffee is its low computational complexity. Given predicted structures, R-Coffee can compute structure-based sequence alignments with a time and space complexity similar to that reported for T-Coffee or Probcons [in the order of O(N^3 L^2) for N sequences of length L]. Nonetheless, the computation of the predicted structures can be a limiting factor. For example, global Minimal Folding Energy methods like RNAfold (29) can be quite demanding on long sequences, and their prediction quality decreases with sequence length (36,37). Our first concern was therefore to check whether the replacement of RNAfold with the less demanding RNAplfold (38) could prove useful. RNAplfold is able to predict the fold of long sequences thanks to its local structure prediction algorithm. In practice, this is achieved by restricting the computation to the local neighborhood of each nucleotide (the default is a span of 100 nt). RNAplfold outputs base-pairing probabilities rather than a single secondary structure. We therefore determined an optimal threshold for filtering out unreliable base pairs. We did so by extensive testing on the BRAliBase dataset (see 'Material and methods' section and Figure 2). The cutoff value thus determined (0.3) was used throughout the remaining experiments.

Figure 2. R-Coffee/RNAplfold base-pair probability threshold optimization. Base pairs predicted by RNAplfold above a certain probability threshold were used as input for R-Coffee. Then all BRAliBase sets were aligned and the average alignment accuracy (BRAliscore) calculated. The optimal threshold was determined to be 0.3.

Figure 3. Effect of the RNA-extension on T-Coffee's performance on BRAliBase 2.0. The plot shows the alignment accuracy as a function of the sequence identity. Scores are averaged over 5% sequence identity bins. Standard T-Coffee is compared to R-Coffee using structure input from RNAfold and RNAplfold, as well as random structures.
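The R-score defined above reduces to a MAX over the consistency scores of the aligned pair and of every structure-induced pair. A minimal sketch, in which `tscore` stands in for the consistency score and `struct_partners` maps each residue to its predicted pairing partners (the names and data layout are our assumptions, not R-Coffee's internals):

```python
# Sketch of the R-score evaluation. `tscore(a, b)` is assumed to return
# the consistency (T-Coffee) score of aligning residues a and b;
# `struct_partners` maps a residue to the residues the structural
# library predicts it to pair with (possibly several, possibly none).
def rscore(tscore, struct_partners, res_a, res_b):
    """MAX over Tscore(A,B) and Tscore(X,Y) for every structure-induced
    pair, X being a predicted partner of A and Y a partner of B."""
    candidates = [tscore(res_a, res_b)]
    for x in struct_partners.get(res_a, ()):
        for y in struct_partners.get(res_b, ()):
            candidates.append(tscore(x, y))
    return max(candidates)
```

When a residue has no predicted partner, the score falls back to the plain T-Coffee score; when the library is ambiguous, all partner combinations are considered, exactly as in the ambiguous case described in the text.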
Given this cutoff value, we systematically compared the BRAliscore obtained by R-Coffee when using RNAfold and RNAplfold structural libraries. Both structural libraries give similar results. Interestingly, this correlation is high regardless of whether the considered sequences are closely or distantly related (Figure 3). The mean BRAliscore for the two methods is the same (0.64), with 53% of the 388 BRAliBase datasets having their BRAliscore within 5% of each other when using the RNAfold or the RNAplfold structural library. We therefore decided to use RNAplfold as the default provider of secondary structure predictions for the rest of the analysis. This allows R-Coffee to deal with sequences of arbitrary size. In order to check the effect of the accuracy of the predicted structures, we also tested R-Coffee using random structure predictions as input. The performance then returns to the default T-Coffee accuracy (Figure 3), i.e. alignment quality does not get worse compared with default T-Coffee, but clearly decreases compared with genuine structure predictions. Altogether, these results suggest that it makes little difference in accuracy whether we use RNAfold or RNAplfold for secondary structure prediction in R-Coffee. They also confirm the effectiveness of incorporating structural information within the alignment procedure. We wish to note here that, although we limited our analysis to these two approaches, the flexibility of R-Coffee's RNA extension allows incorporating and combining any kind of structure prediction. Alternatives include using RNAfold's partition function with an applied threshold (as done with RNAplfold here) or using methods with a higher selectivity like Contrafold (40). One could also include, for example, sub-optimal structure or pseudoknot predictions (41). Next, we examined the merits of R-Coffee in comparison with other methods.
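The randomized control described above (shuffle each input sequence, then fold the shuffled copy) preserves nucleotide composition while destroying the structural signal. A minimal stand-in sketch; the paper used `shuffle` from Sean Eddy's squid package, whereas `random.shuffle` here is only an illustration of the same idea:

```python
# Sketch of the shuffled-sequence control: a random permutation keeps the
# nucleotide composition but destroys any genuine structure signal.
# random.shuffle is a stand-in for the squid `shuffle` program.
import random

def shuffled_control(sequence, seed=None):
    """Return a random permutation of `sequence`."""
    rng = random.Random(seed)
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)
```

Folding such shuffled copies yields the "random structures" whose effect on accuracy is shown in Figure 3.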
It should be stressed here that our primary goal was not to produce a stand-alone method, but rather to use R-Coffee as a novel component that can seamlessly be combined with any existing RNA alignment method. We therefore focused our efforts on the evaluation of the combination between R-Coffee and other established methods. In order to determine the baseline of our analysis, we ran common sequence alignment methods on the 388 BRAliBase datasets (top part of Table 2). Our results are relatively consistent with previous reports (42,43) of accuracy on protein sequence alignments: Mafft (35), Probcons (34) and Muscle (32) deliver the best alignments. The default T-Coffee is notably inaccurate with RNA (5), most likely because it uses, by default, a banded dynamic programming heavily tuned on protein sequences. The second part of Table 2 (structural aligners) is also consistent with previous reports and confirms that RNA alignment methods making use of structural information have a higher accuracy than sequence aligners. Our results show that FoldalignM (15), Rnasampler (44), T-Lara (19) and Murlet (14) clearly outperform all the regular sequence alignment methods, with more than five points difference between the best structure-based alignment methods (FoldalignM/Rnasampler) and their best non-structure-based counterpart (Mafft ginsi).

Table 2. BRAliBase evaluations

| Method      | BRAliscore: Default | +T-Coffee | +R-Coffee | Net improvement: +T-Coffee | +R-Coffee |
|-------------|------|------|------|------|------|
| T-Coffee    | 0.59 | /    | 0.63 | /    | 125  |
| Poa         | 0.62 | 0.65 | 0.70 | 48   | 154  |
| Pcma        | 0.62 | 0.64 | 0.67 | 34   | 120  |
| Prrn        | 0.64 | 0.61 | 0.66 | −63  | 45   |
| ClustalW    | 0.65 | 0.65 | 0.69 | −7   | 83   |
| Proalign    | 0.66 | 0.68 | 0.71 | 30   | 128  |
| Mafft fftns | 0.68 | 0.68 | 0.72 | 17   | 68   |
| Probcons    | 0.69 | 0.67 | 0.71 | −74  | 51   |
| Muscle      | 0.69 | 0.69 | 0.73 | −17  | 42   |
| Mafft ginsi | 0.70 | 0.68 | 0.72 | −49  | 39   |
| M-Coffee4   | 0.71 | /    | 0.74 | /    | 84   |
| M-Locarna   | 0.66 | 0.69 | 0.71 | 101  | 133  |
| Stral       | 0.71 | 0.70 | 0.72 | −4   | 19   |
| Murlet      | 0.73 | 0.70 | 0.72 | −132 | −73  |
| Rnasampler  | 0.75 | 0.70 | 0.71 | −101 | −95  |
| FoldalignM  | 0.75 | 0.76 | 0.76 | 72   | 76   |
| Dynalign    | /    | 0.62 | 0.62 | /    | /    |
| Foldalign   | /    | 0.62 | 0.77 | /    | /    |
| T-Lara      | /    | 0.74 | 0.73 | /    | /    |
| Consan      | /    | 0.79 | 0.79 | /    | /    |

Each line in the table corresponds to the evaluation of the package listed in the Method column. The BRAliscore section indicates the average BRAliscore performance of the package. The Default column indicates the score obtained by the considered package on its own. The +T-Coffee column indicates the average BRAliscore of the corresponding package combined with T-Coffee, and the +R-Coffee column that of the same package combined with R-Coffee. The slash (/) indicates values that could not be computed, either because the method only produces pairwise alignments (Dynalign, Foldalign and Consan), or because the method is a derivative of or uses T-Coffee (e.g. T-Lara). The Net Improvement section indicates the net improvement over the stand-alone methods.

The most straightforward way to embed these methods within R/T-Coffee is to use each individual method to generate libraries of pairwise alignments. This protocol merely requires computing a pairwise alignment for each pair of sequences within a dataset and using the resulting alignments as a primary library for either T-Coffee or R-Coffee. The structural libraries were computed once on the entire dataset and then re-used. This protocol was used on all the aligners with the exception of T-Lara, for which we followed the combination protocol described by T-Lara's authors. It involves compiling partial T-Coffee libraries with Lara (i.e. libraries restricted to aligned stems) and combining them with the default T-Coffee libraries made of global and local pairwise alignments. That same protocol was used when combining Lara with R-Coffee.

We first evaluated the effect of using the regular T-Coffee to compute an MSA with pairwise libraries generated either with regular sequence or structural aligners. The results are displayed in the +T-Coffee column of Table 2. For each T-Coffee/method X combination (X being any of the tested methods), we calculated the average BRAliscore and the Net Improvement (NI), the absolute improvement induced by combining that method with T-Coffee. It is defined as the number of test cases where the T-Coffee/X combination outperforms method X alone, minus the number of test cases where method X outperforms the combination:

    NI = #(T-Coffee/X outperforms X) − #(X outperforms T-Coffee/X)    (4)

The NI provides a guide as to whether one of the methods outperforms another. Results in Table 2 are easier to interpret when the regular sequence aligners and the structural aligners are considered separately. The regular aligners show little benefit from the T-Coffee combination of their pairwise output (+T-Coffee column), probably because these methods already make an efficient use of their sequence information, or at least use it as efficiently as T-Coffee could. This is not a surprising result, since most of these methods either use a T-Coffee-inspired consistency-based scoring scheme (Mafft g/linsi, Probcons) or a sophisticated iterative method (Muscle, Prrn) to improve the original progressive MSA. R-Coffee, on the other hand, provides a clear improvement to all the regular sequence alignment methods tested here (Table 2, +R-Coffee column). This improvement holds regardless of the metric used (BRAliscore or Net Improvement). The results obtained when combining R/T-Coffee with structural aligners follow a similar, albeit less marked, pattern.
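The NI measure defined above is a simple win/loss tally over test cases; ties contribute nothing to either side. A minimal sketch:

```python
# Sketch of the Net Improvement: the number of test cases won by the
# combination minus the number won by the stand-alone method.
# Scores are per-test-case benchmark scores (e.g. BRAliscores).
def net_improvement(combo_scores, standalone_scores):
    wins = sum(c > s for c, s in zip(combo_scores, standalone_scores))
    losses = sum(s > c for c, s in zip(combo_scores, standalone_scores))
    return wins - losses
```

A positive NI means the combination wins on more test cases than it loses, independently of how large each individual score difference is.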
When added on top of the structural aligners, T-Coffee improves two methods out of five and R-Coffee improves three out of five. These observations are fairly consistent with the underlying principles of the alignment programs (sequence and structural aligners). They suggest that the potential benefits of using R-Coffee come as much from the T-Coffee consistency-based scoring scheme as from the R-extension. The relatively small benefit coming from the R-extension in this case also makes sense if one considers that the structural aligners already use structural information and are therefore less likely to benefit from the incorporation of RNAplfold predictions than their sequence-based counterparts. This is especially true when combining T-Coffee with Consan. It is worth mentioning, however, that the R-scoring scheme outperforms similar T-Coffee combinations in most cases, with five methods out of nine being improved when switching from the T-Coffee to the R-Coffee combination and four methods remaining unchanged. Altogether, the data collected in Table 2 strongly suggest that consistency-based scoring schemes provide an efficient framework for making the best out of pairwise alignment methods. T/R-Coffee/Foldalign and T/R-Coffee/Consan provide the best illustration of this concept (bottom of Table 2). Consan is computationally too expensive to be easily extended to MSAs; yet, a straightforward combination with R-Coffee results in a method that outperforms all the other methods analyzed in this work (Tables 2 and 3). Figure 4 shows a detailed performance plot on BRAliBase and compares R-Coffee/Consan with the best sequence alignment method (Mafft ginsi) and FoldalignM. This plot shows that R-Coffee/Consan performs better than FoldalignM across the full range of sequence identities, even if the difference is not statistically significant (Table 3).
It is important to point out that the shape of this curve is a side effect of the two components that make up the BRAliscore (the SCI, its structural component, and the SPS, its sequence component). High levels of sequence identity naturally result in high-scoring alignments. At the other end of the spectrum, at low identity levels, numerous compensating base-pair mutations can result in high scores, because they are taken into account by the SCI (see also Reference alignments and Evaluation). Nonetheless, and across the whole identity spectrum, our data support well the idea that R-Coffee/Consan is probably the most accurate RNA MSA method currently available for the kind of datasets found in BRAliBase (i.e. less than 150 nt).

Table 3. Net Improvement of R-Coffee/Consan and RM-Coffee4 over other programs on BRAliBase

| Method      | versus R-Coffee/Consan | versus RM-Coffee4 |
|-------------|--------|--------|
| Poa         | 241*** | 217*** |
| T-Coffee    | 241*** | 199*** |
| Prrn        | 232*** | 198*** |
| Pcma        | 218*** | 151*** |
| Proalign    | 216*** | 150**  |
| Mafft fftns | 206*** | 148*   |
| ClustalW    | 203*** | 136*** |
| Probcons    | 192*** | 128*   |
| Mafft ginsi | 170*** | 115    |
| Muscle      | 169*** | 111    |
| M-Locarna   | 234*** | 183**  |
| Stral       | 169*** | 62     |
| FoldalignM  | 146    | 61     |
| Murlet      | 130*   | −12    |
| Rnasampler  | 129*   | −27    |
| T-Lara      | 125*   | −30    |

This table indicates the relative performance of the methods listed in the Method column in comparison with the R-Coffee/Consan and RM-Coffee4 combinations, expressed as net improvement. Asterisks indicate statistically significant differences according to Wilcoxon tests (* P = 0.05; ** P = 0.01; *** P = 0.001). The upper part of the table contains sequence aligners only, the lower part structural alignment programs. Within these sections, programs are sorted by net improvement.

Table 4. Cmfinder dataset comparison

| Method       | SPS: Default | +T-Coffee | +R-Coffee | Net improvement: +T-Coffee | +R-Coffee |
|--------------|------|------|------|----|----|
| ClustalW     | 0.54 | 0.57 | 0.58 | 5  | 5  |
| Mafft ginsi  | 0.64 | 0.64 | 0.64 | −1 | 2  |
| Mafft fftns  | 0.60 | 0.64 | 0.64 | 6  | 6  |
| Muscle       | 0.32 | 0.40 | 0.42 | 4  | 8  |
| Pcma         | 0.49 | 0.55 | 0.58 | 8  | 8  |
| Poa          | 0.31 | 0.38 | 0.42 | 4  | 8  |
| Proalign     | 0.40 | 0.39 | 0.41 | −4 | −2 |
| Probcons     | 0.50 | 0.45 | 0.51 | −3 | 2  |
| Prrn         | 0.43 | 0.54 | 0.56 | 3  | 4  |
| M-Locarnap   | 0.53 | 0.63 | 0.63 | 6  | 5  |
| T-Coffee     | 0.54 | /    | 0.53 | /  | 2  |
| R/M-Coffee4  | /    | 0.63 | 0.65 | /  | 0  |

Each line in the table corresponds to the evaluation of the package listed in the Method column. The SPS section indicates the averaged sum-of-pairs scores (applied to the Rfam core alignment) measured on the considered package; the +T-Coffee column is the same score measured on the package combined with T-Coffee, and the +R-Coffee column corresponds to that same package combined with R-Coffee. The slash (/) indicates values that could not be computed because the method is a derivative of T-Coffee (T-Coffee and M-Coffee). The Net Improvement section indicates the net improvement for the corresponding combinations.

Figure 4. Comparison of R-Coffee/Consan and RM-Coffee with other programs. The plot shows the alignment accuracy on BRAliBase 2.0 as a function of the sequence identity. Scores are averaged over 5% sequence identity bins. We included the best stand-alone sequence aligner (MAFFT ginsi), one of the two best structural aligners (FoldalignM), the best R-Coffee combination (R-Coffee/Consan) and RM-Coffee4, which combines the pairwise alignments of Probcons, MAFFT ginsi/fftns and Muscle by means of R-Coffee.

We next assessed whether R-Coffee is also useful for aligning long sequences. We analyzed the Cmfinder dataset, made of Rfam alignments embedded within surrounding genomic sequences of varying lengths. None of the structural aligners except M-Locarna (17) was able to run on all the 11 datasets, and the analysis was therefore restricted to regular sequence aligners (Table 4). With the notable exception of Muscle (32), the ranking in this table is not dramatically different from that in Table 2. The behavior of these methods when combined with T- or R-Coffee is also similar.
When considering the 10 sequence aligners combined with T-Coffee, we observed an improvement for 7 methods out of 10. This figure rises to 9 out of 10 for the combination with R-Coffee. Although these results are based on too small a dataset (11 alignments) to be considered statistically significant, they are in very good agreement with those reported on BRAliBase in Table 2 and confirm R-Coffee's ability to improve over most sequence alignment methods. The main practical problem with using R-Coffee is that, to reach its highest level of accuracy, it requires the installation of RNA alignment packages, which may be extremely greedy with memory and CPU usage. We therefore checked whether a simpler alternative could be better suited for more modest computational configurations, or for high-throughput applications. In a previous paper, Wallace et al. reported and characterized a novel mode of T-Coffee named M-Coffee (22). M-Coffee is a meta-aligner that combines alternative multiple sequence alignment methods into one consensus alignment. This combination usually results in an improvement over the constituent methods. We used the M-Coffee approach to combine the four best regular (i.e. non-structure-based) alignment methods and tested them on BRAliBase. Following the strategy outlined in the original M-Coffee paper, we incorporated the sequence aligners in order of decreasing performance and kept the combination with the highest average score. This protocol resulted in RM-Coffee4, a combination of Muscle, Probcons, Mafft ginsi and Mafft fftns fed to T-Coffee (M-Coffee4) or R-Coffee (RM-Coffee4). The results (Tables 2 and 3, Figure 4) are unambiguous and indicate that RM-Coffee4 clearly outperforms all the sequence alignment methods while delivering the best BRAliBase alignments one may obtain without using a structural aligner.
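The greedy selection protocol described above (rank the aligners, then keep the best-scoring prefix combination) can be sketched as follows. `evaluate_combo` stands in for the expensive step of running the meta-aligner on the merged libraries and scoring the result on the benchmark; it, and all names here, are assumptions of this sketch rather than the M-Coffee implementation.

```python
# Sketch of the M-Coffee-style selection used to build RM-Coffee4:
# add aligners in order of decreasing stand-alone score and keep the
# prefix combination whose benchmark average is highest.
def best_combination(aligners, standalone_score, evaluate_combo):
    ranked = sorted(aligners, key=standalone_score, reverse=True)
    best_combo, best_score = None, float("-inf")
    for k in range(1, len(ranked) + 1):
        candidate = ranked[:k]           # best k aligners so far
        score = evaluate_combo(candidate)
        if score > best_score:
            best_combo, best_score = candidate, score
    return best_combo, best_score
```

This explores only prefix combinations of the ranked list (n candidates instead of 2^n subsets), which is the trade-off that keeps the protocol cheap.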
These results were not confirmed on the 11 Cmfinder datasets (Table 4), either because this dataset is too small to reveal the trend or because of the negative effect of Muscle on RM-Coffee4 on this specific dataset.

CONCLUSION

We have presented a modified version of the T-Coffee (21) multiple sequence alignment method, named R-Coffee, designed for delivering highly accurate multiple ncRNA alignments. R-Coffee is a heuristic able to take advantage of secondary structure predictions carried out beforehand. It is best described as an alignment improver, and we show in this work that it can effectively improve all sequence alignment packages, taken off the shelf and without tuning. Among all the combinations tested here, one clearly outperformed the alternatives: the combination of R-Coffee and Consan (10). Most of these tests were carried out on the BRAliBase reference datasets (5). We also checked whether R-Coffee was able to deal with datasets of longer sequences, combining a mixture of related and unrelated segments. For that purpose, we used a dataset designed for the Cmfinder algorithm (26). We found that the R-Coffee combination improved, to a greater or lesser extent, all the tested alignment methods. The combined observations made on the BRAliBase and Cmfinder datasets suggest that the R-Coffee scoring scheme is able to make effective use of predicted RNA secondary structures in order to improve accuracy over most regular sequence aligners. This strategy also works when applied to structural aligners, although less dramatically than with regular sequence aligners. These results confirm the strength of consistency-based scoring schemes over regular alignment methods. They suggest that most pairwise alignment methods can usefully be incorporated in a consistency-based framework such as T-Coffee.
Our results also indicate that the meta-method approach originally described for M-Coffee (22) can be applied to R-Coffee, and that whenever the computation of highly accurate structure-based RNA pairwise alignments is not feasible, one may obtain alignments of reasonable quality by combining purely sequence-based alignments via R-Coffee. Further progress will also require the assembly of more demanding reference datasets, especially for long sequences. Such datasets are hard to assemble because RNA structural information is scarce compared with protein structural information. RNA alignment remains a rapidly developing field. With an increasing number of novel biological functions associated with still poorly characterized RNA genes, there is an ever-growing need for methods allowing the accurate comparison of RNA sequences and the identification of distant homologues. Any improvement in alignment accuracy is likely to have a large impact. In this context, R-Coffee can easily be improved further. The flexible way in which secondary structures are fed to the program allows a seamless combination of data from heterogeneous sources. It is important to point out that not all the possibilities supported by the current software implementation have yet been explored. Most notably, we have not yet fully exploited the possibility of associating more than one predicted structure with each sequence. These alternative structures could either be suboptimal structures, or the output of alternative structure prediction programs, such as ContraFold or Rfold. One could also combine structure predictions of any kind, including local, global or even tertiary interactions like pseudoknots, with experimentally verified structures. The possibility of combining data from various sources is, perhaps, the major strength of R-Coffee.

ACKNOWLEDGEMENTS

We thank Iain M. Wallace for useful discussions and all authors for their assistance with using their programs.
This work was partly supported by funding from the Science Foundation Ireland. C.N. thanks the Centre for Genomic Regulation for support and funding. Funding to pay the Open Access publication charges for this article was provided by the Centre de Regulació Genòmica (CRG). Conflict of interest statement. None declared.

REFERENCES

1. Zamore,P.D. and Haley,B. (2005) Ribo-gnome: The Big World of Small RNAs. Science, 309, 1519–1524.
2. Costa,F.F. (2007) Non-coding RNAs: lost in translation? Gene, 386, 1–10.
3. Birney,E., Stamatoyannopoulos,J.A., Dutta,A., Guigo,R., Gingeras,T.R., Margulies,E.H., Weng,Z., Snyder,M., Dermitzakis,E.T., Thurman,R.E. et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447, 799–816.
4. Levenshtein,V.I. (1966) Binary codes capable of correcting deletions, insertions, and reversals. Cybern. Control Theory, 10, 707–710.
5. Gardner,P.P., Wilm,A. and Washietl,S. (2005) A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res., 33, 2433–2439.
6. Wilm,A., Mainz,I. and Steger,G. (2006) An enhanced RNA alignment benchmark for sequence alignment programs. Algorithms Mol. Biol., 1, [Epub ahead of print].
7. Thompson,J.D., Plewniak,F. and Poch,O. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27, 2682–2690.
8. van Nimwegen,E., Crutchfield,J.P. and Huynen,M. (1999) Neutral evolution of mutational robustness. Proc. Natl Acad. Sci. USA, 96, 9716–9720.
9. Sankoff,D. (1985) Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 45, 810–825.
10. Dowell,R. and Eddy,S. (2006) Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7, 400.
11. Hofacker,I.L., Bernhart,S.H.F. and Stadler,P.F. (2004) Alignment of RNA base pairing probability matrices. Bioinformatics, 20, 2222–2227.
12. Mathews,D.H.
(2005) Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics, 21, 2246–2253.
information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3, 2.
32. Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
33. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
34. Do,C.B., Mahabhashyam,M.S.P., Brudno,M. and Batzoglou,S. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340.
35. Katoh,K., Kuma,K.-i., Toh,H. and Miyata,T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518.
36. Doshi,K., Cannone,J., Cobaugh,C. and Gutell,R. (2004) Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics, 5, 105.
37. Dowell,R. and Eddy,S.R. (2004) Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5, 71.
38. Bernhart,S.H., Hofacker,I.L. and Stadler,P.F. (2006) Local RNA base pairing probabilities in large sequences. Bioinformatics, 22, 614–615.
39. Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.
40. Do,C.B., Woods,D.A. and Batzoglou,S. (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22, e90–e98.
41. Reeder,J., Hochsmann,M., Rehmsmeier,M., Voss,B. and Giegerich,R. (2006) Beyond Mfold: recent advances in RNA bioinformatics. J. Biotechnol., 124, 41–55.
42.
Carroll,H., Beckstead,W., O’Connor,T., Ebbert,M., Clement,M., Snell,Q. and McClellan,D. (2007) DNA reference alignment benchmarks based on tertiary structure of encoded proteins. Bioinformatics, 23, 2648–2649. 43. Edgar,R.C. and Batzoglou,S. (2006) Multiple sequence alignment. Curr. Opin. Struct. Biol., 16, 368–373. 44. Xu,X., Ji,Y. and Stormo,G.D. (2007) RNA sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics, 23, 1883–1891. 45. Pei,J., Sadreyev,R. and Grishin,N.V. (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 19, 427–428. 46. Lee,C., Grasso,C. and Sharlow,M.F. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics, 18, 452–464. 47. Loytynoja,A. and Milinkovitch,M.C. (2003) A hidden Markov ¨ model for progressive multiple alignment. Bioinformatics, 19, 1505–1513. 48. Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol, 264, 823–838. 49. Dalli,D., Wilm,A., Mainz,I. and Steger,G. (2006) STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics, 22, 1593–1599. 13. Havgaard,J.H., Lyngso,R.B., Stormo,G.D. and Gorodkin,J. (2005) Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics, 21, 1815–1824. 14. Kiryu,H., Tabei,Y., Kin,T. and Asai,K. (2007) Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics, 23, 1588–1598. 15. Torarinsson,E., Havgaard,J.H. and Gorodkin,J. (2007) Multiple structural alignment and clustering of RNA sequences. Bioinformatics, 23, 926–932. 16. Holmes,I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics, 6, 73. 17. Will,S., Reiche,K., Hofacker,I.L., Stadler,P.F. and Backofen,R. 
(2007) Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol., 3, e65. 18. Meyer,I.M. and Miklos,I. (2007) SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework. PLoS Comput. Biol., 3, e149. 19. Bauer,M., Klau,G.W. and Reinert,K. (2007) Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics, 8, 271. 20. Wuyts,J., Perriere,G. and Van de Peer,Y. (2004) The European ribosomal RNA database. Nucleic Acids Res., 32, D101–D103. 21. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. 22. Wallace,I.M., O’Sullivan,O., Higgins,D.G. and Notredame,C. (2006) M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res., 34, 1692–1699. 23. O’Sullivan,O., Suhre,K., Abergel,C., Higgins,D.G. and Notredame,C. (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol., 340, 385–395. 24. Armougom,F., Moretti,S., Poirot,O., Audic,S., Dumas,P., Schaeli,B., Keduas,V. and Notredame,C. (2006) Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee. Nucleic Acids Res., 34, W604–W608. 25. Siebert,S. and Backofen,R. (2005) MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics, 21, 3352–3359. 26. Yao,Z., Weinberg,Z. and Ruzzo,W.L. (2006) CMfinder-a covariance model based RNA motif finding algorithm. Bioinformatics, 22, 445–452. 27. Griffiths-Jones,S., Moxon,S., Marshall,M., Khanna,A., Eddy,S.R. and Bateman,A. (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res., 33, D121–D124. 28. Washietl,S., Hofacker,I.L. and Stadler,P.F. 
© 2001 Oxford University Press, Nucleic Acids Research, 2001, Vol. 29, No. 1, 55–57

A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3

Sabine Dietmann, Jong Park, Cédric Notredame1, Andreas Heger, Michael Lappe and Liisa Holm*

Structural Genomics Group, EMBL-EBI, Cambridge CB10 1SD, UK and 1Structural and Genetic Information, CNRS UMR 1889, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France

Received October 6, 2000; Accepted October 17, 2000

ABSTRACT
The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank (PDB). The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities. Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to: (i) supersecondary structural motifs (attractors in fold space), (ii) the topology of globular domains (fold types), (iii) remote homologues (functional families) and (iv) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new.
In September 2000, the Dali classification contained 10 531 PDB entries comprising 17 101 chains, which were partitioned into five attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99 582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures several-fold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and a comprehensive library of explicit multiple alignments of distantly related protein families.

INTRODUCTION
Improved methods of protein engineering, crystallography and NMR spectroscopy have led to a surge of new protein structures deposited in the Protein Data Bank (PDB), and a number of derived databases that organise this data into hierarchical classification schemes or in terms of structural neighbourhoods have appeared on the World Wide Web (1–4). We maintain the Dali Domain Dictionary and FSSP database with continuous weekly updates. Because many structural similarities are between substructures (domains), i.e. parts of structures, protein chains are decomposed into domains using the criteria of recurrence and compactness (5). Each domain is assigned a Domain Classification number D.C.l.m.n.p representing fold space attractor region (l), globular folding topology (m), functional family (n) and sequence family (p). The discrete classification presents views that are free of redundancy and simplify navigation in protein space. The structural classification is explicitly linked to sequence families with associated functional annotation, resulting in a rich network of biologically interesting relationships that can be browsed online. In particular, structure-based alignments increase our understanding of the more distant evolutionary relationships (Fig. 1).
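The four-level D.C.l.m.n.p numbering described above maps naturally onto a small record type. As a minimal sketch (the class, parser and example identifier below are ours, not part of the Dali software), one could represent and compare classification numbers like this:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainClassification:
    """Four-level Dali Domain Classification number D.C.l.m.n.p."""
    attractor: int          # l: fold-space attractor region
    fold_type: int          # m: globular folding topology
    functional_family: int  # n: remote homologues (functional family)
    sequence_family: int    # p: homologues with >25% sequence identity

def parse_dc(dc: str) -> DomainClassification:
    """Parse a string such as 'D.C.1.2.3.4' (hypothetical example)."""
    prefix, c, l, m, n, p = dc.split(".")
    assert (prefix, c) == ("D", "C"), "expected a D.C.l.m.n.p identifier"
    return DomainClassification(int(l), int(m), int(n), int(p))

# Two domains share a fold type exactly when levels l and m agree:
same_fold = lambda a, b: (a.attractor, a.fold_type) == (b.attractor, b.fold_type)
```

This makes the nesting of the hierarchy explicit: agreement on a prefix of the fields corresponds to membership in the same node at the matching level of the classification.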
A MAP OF FOLD SPACE
The central concept underlying the classification is a 'map of fold space'. This map is based on exhaustive neighbouring of all protein structures in the PDB. The all-against-all structure comparison is carried out using the Dali program. As a result of the exhaustive comparisons, each structure in the PDB is positioned in an abstract, high-dimensional space according to its structural similarity score to all other structures. The graph of structural similarities (between domains) is partitioned into clusters at four different levels of granularity. Coarse-grained overviews yield few clusters with many members that share broad architectural similarities, while fine-grained clustering yields many clusters within which structural similarities between members can extend to atomic detail due to functional constraints, for example, in binding sites. Continuing the practice from the FSSP database, fold types are defined by agglomerative clustering so that the members of a fold type have average pairwise Z-scores above 2. The threshold has been chosen empirically to group together structures with topological similarity. Dali Domain Dictionary version 3 introduces two new levels to the fold classification, one above and one below the fold type abstraction. The top level of the fold classification corresponds to secondary structure composition and supersecondary structural motifs. We have previously identified five attractor regions in fold space (1). We partition fold space so that each domain is assigned to one of attractors I–V, which are represented by archetype structures, using a shortest-path criterion. Structures which are disconnected from other structures are assigned to class X. Domains which are not clearly closer to one attractor than another are assigned to the mixed class Y. Currently, class Y comprises about one-sixth of the representative domain set. In the future, some of these may be assigned to emerging new attractors.

Figure 1. Unification of the histone deacetylase and arginase families. Reuse and adaptation of existing structural frameworks for new cellular functions is widespread in protein evolution. Histone deacetylase and arginase are unified at the functional family level of the classification despite very little overall sequence similarity. The supporting evidence comes from structural and functional similarity. (A) Structure comparison of arginase (left, 1rlaA) (10) and histone deacetylase (right, 1c3pA) (11) yields a high Z-score of 12. Superimposition by Dali, drawing by Molscript (12). (B) Joint structural, evolutionary and functional information for two segments around the active site. Structurally aligned positions are shaded. Arginase has a binuclear metal centre where residues D124, H126 and D234 bind one manganese ion and residues H101, H128 and H232 the other. The former site is structurally equivalent to the zinc binding site of histone deacetylase made up of residues D168, H170 and D258. Sequence variability from multiply-aligned sequence neighbours in HSSP (asterisk, values 10 or larger; 0, invariant) is shown above and the secondary structure summary from DSSP (E, B: beta-sheet; S: bend; T, G: hydrogen-bonded turns) is shown below the amino acid sequences.

AN EVOLUTIONARY CLASSIFICATION
The other new level of the classification infers plausible evolutionary relationships from strong structural similarities that are accompanied by functional or sequence similarities. Conceptually, this functional family level is equivalent to the 'superfamily' level of scop (2). The computational discrimination between physically convergent (analogous) and evolutionarily related, divergent (homologous) proteins has received much attention recently (6–8). Structural similarity alone is insufficient to draw a line between the two classes. For example, lysozymes exhibit extreme structural divergence in regions supporting the active site, while coiled coils and beta-barrels are simple, geometrically constrained topologies which are believed to have emerged several times in protein evolution. To address the evolutionary classification problem, we have chosen to analyse functional and sequence-motif attributes on top of structural similarity in a numerical taxonomy. The more functional features two proteins have in common, the more likely it is that they do so due to common descent rather than by chance. Currently, our feature set includes common sequence neighbours (overlap of PSI-BLAST families), analysis of 3D clusters of identically conserved residues, enzyme classification (E.C. numbers) and keyword analysis of biological function. A neural network assigns weights to these qualitatively different features. The neural network was trained against the superfamily to fold transition in a manual fold classification (2). To unify families, we exploit the empirical observation that Dali's intramolecular distance comparison measure gives higher scores to pairs of homologues than to analogues. In practice, we require that functional families are nested within fold families in the fold dendrogram: functional families are branches of the fold dendrogram where all pairs have a high average neural network prediction for being homologous. The threshold for unification was chosen empirically and is conservative. Five hundred and four functional families unify two or more sequence families. Unified families have functional residues or sequence motifs that map to common sites in the 3D context of a fold. The strongest evidence is usually obtained for unifying enzyme catalytic domains. In some cases the expert system fails to capture enough evidence for unification of domains which are believed to be homologous, such as within the varied set of helix–turn–helix motif containing DNA binding domains where several functional families are defined at the same fold type level.

A LIBRARY OF STRUCTURE-BASED MULTIPLE ALIGNMENTS OF REMOTE HOMOLOGUES
The Dali Domain Classification can be browsed interactively at http://www2.ebi.ac.uk/dali. The server is implemented on top of a MySQL database. The classification may be entered from the top of the hierarchy, or the user may make a query about a protein identifier or a node in the classification hierarchy. Multiple structural alignments including attributes of the proteins are generated on the fly for any user selection of structural neighbours. Precomputed alignments are available for each functional family. The T-Coffee program (9) is used to generate genuine consensus alignments of multiple structures from the library of pairwise Dali alignments. A reliability score is computed to indicate well defined regions (the structural core) and regions where structural equivalences are ambiguous. Technically, T-Coffee improves alignment quality in a few known cases of functional families where active site residues were inconsistently aligned in some of the pairwise Dali comparisons. Scientifically, the definition of functional families and reliable multiple structure alignments for each opens the door to sensitive sequence database searches using position-specific profiles, and to benchmarking the alignment accuracy of threading predictions.

*To whom correspondence should be addressed. Tel: +44 1223 494454; Fax: +44 1223 494470; Email: holm@ebi.ac.uk

ACKNOWLEDGEMENT
S.D. and J.P. were supported by EU contract BIO4-CT96-0166.

REFERENCES
1. Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–603. 2. Hubbard,T.J., Ailey,B., Brenner,S.E., Murzin,A.G. and Chothia,C.
(1999) SCOP: a Structural Classification of Proteins database. Nucleic Acids Res., 27, 254–256. 3. Orengo,C.A., Pearl,F.M., Bray,J.E., Todd,A.E., Martin,A.C., Lo Conte,L. and Thornton,J.M. (1999) The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res., 27, 275–279. Updated article in this issue: Nucleic Acids Res. (2001), 29, 223–227. 4. Marchler-Bauer,A., Addess,K.J., Chappey,C., Geer,L., Madej,T., Matsuo,Y., Wang,Y. and Bryant,S.H. (1999) MMDB: Entrez’s 3D structure database. Nucleic Acids Res., 27, 240–243. 5. Holm,L. and Sander,C. (1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96. 6. Russell,R.B., Saqi,M.A., Bates,P.A., Sayle,R.A. and Sternberg,M.J. (1998) Recognition of analogous and homologous protein folds–assessment of prediction success and associated alignment accuracy using empirical substitution matrices. Protein Eng., 11, 1–9. 7. Kawabata,T. and Nishikawa,K. (2000) Protein structure comparison using the Markov transition model of evolution. Proteins, 41, 108–122. 8. Wood,T.C. and Pearson,W.R. (1999) Evolution of protein sequences and structures. J. Mol. Biol., 291, 977–995. 9. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. 10. Bewley,M.C., Jeffrey,P.D., Patchett,M.L., Kanyo,Z.F. and Baker,E.N. (1999) Crystal structures of Bacillus caldevelox arginase in complex with substrate and inhibitors reveal new insights into activation, inhibition and catalysis in the arginase superfamily. Structure, 7, 435–438. 11. Finnin,M.S., Donigian,J.R., Cohen,A., Richon,V.M., Rifkind,R.A., Marks,P.A., Breslow,R. and Pavletich,N.P. (1999) Structure of a histone deacetylase homologue bound to the TSA and SAHA inhibitors. Nature, 401, 188–193. 12. Kraulis,P. (1991) MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. Appl. Crystallogr., 24, 946–950. 
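The fold-type definition used in the Dali classification above (agglomerative clustering of the structure-similarity graph, merging only while the average pairwise Z-score of a merge stays above 2) can be sketched as average-linkage clustering. The function and the toy scores below are our own illustration, not the Dali implementation:

```python
import itertools

def fold_types(z, names, threshold=2.0):
    """Average-linkage agglomerative clustering of pairwise Z-scores.

    z: dict mapping frozenset({a, b}) -> Dali-style Z-score (toy values here).
    Merging stops when no pair of clusters has an average
    inter-cluster Z-score above the threshold.
    """
    clusters = [{n} for n in names]

    def avg_z(c1, c2):
        pairs = [z.get(frozenset({a, b}), 0.0) for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the best pair of clusters to merge next.
        (i, j), best = max(
            ((pair, avg_z(clusters[pair[0]], clusters[pair[1]]))
             for pair in itertools.combinations(range(len(clusters)), 2)),
            key=lambda t: t[1])
        if best <= threshold:
            break  # no merge would keep the average Z above the threshold
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

On a toy score table where domains A and B superimpose well (Z = 8) and C matches neither (Z = 0.5), this groups A with B and leaves C on its own, mirroring how fold types keep topologically similar structures together.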
W604–W608 Nucleic Acids Research, 2006, Vol. 34, Web Server issue doi:10.1093/nar/gkl092

Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee

Fabrice Armougom, Sébastien Moretti, Olivier Poirot, Stéphane Audic, Pierre Dumas1, Basile Schaeli1, Vladimir Keduas and Cédric Notredame*

Laboratoire Information Génomique et Structurale, CNRS UPR2589, Institute for Structural Biology and Microbiology (IBSM), Parc Scientifique de Luminy, 163 Avenue de Luminy, FR-13288 Marseille cedex 09, France and 1Laboratoire de systèmes périphériques, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland

Received February 14, 2006; Revised March 1, 2006; Accepted March 7, 2006

ABSTRACT
Expresso is a multiple sequence alignment server that aligns sequences using structural information. The user only needs to provide sequences. The server runs BLAST to identify close homologues of the sequences within the PDB database. These PDB structures are used as templates to guide the alignment of the original sequences using structure-based sequence alignment methods like SAP or Fugue. The final result is a multiple sequence alignment of the original sequences based on the structural information of the templates. An advanced mode makes it possible to either upload private structures or specify which PDB templates should be used to model each sequence. Providing the suitable structural information is available, Expresso delivers sequence alignments with accuracy comparable with structure-based alignments. The server is available on http://www.tcoffee.org/.

INTRODUCTION
Over the past years, multiple sequence alignments (MSAs) have become one of the most widely used tools in biology along with database search methods. MSAs are needed for profile analysis, phylogenetic reconstruction, structure prediction and a wealth of minor but important applications such as PCR primer design or sequence reconciliation.
The ever-growing reliance on MSAs is even more pronounced now that hundreds of complete genomes are being made available. This newly opened window on evolution provides an ideal context for MSAs to fulfill their potential as key tools in functional genomics. Unfortunately, MSA packages are not yet accurate enough to deliver on all their promises, and the sharp increase in the number of methods recently published (25 novel programs over the last 5 years) well illustrates the community's expectation for improvement. MSAs are not always good enough for large-scale analysis: while immense progress has been made in accurately aligning sets of sequences with >40% average identity, recent benchmarks published with the MAFFT 5 package (1) reveal that state-of-the-art methods still fail to reliably align distantly related sequences. In the so-called 'Twilight zone' (2), sequences with <20% identity cannot be aligned with >30% average accuracy (as judged by comparison with reference alignments). So far, the most convincing solution to this problem has been to supplement sequences with structural information (3). The reason why structure-based MSAs are more accurate is not so much a consequence of better algorithms as an effect of the evolutionary stability of structures. Structures evolve more slowly than sequences (4), and even when sequences have diverged beyond recognition it is often possible to establish homology (i.e. common ancestry) on the basis of 3D fold comparisons (3). The increasing availability of structural data (5) means that relying on structure-based methods for sequence analysis has become much more realistic than it used to be. However, sequences are still being determined much faster than structures, thus creating a context where methods able to efficiently combine sequences and structures into accurate MSAs are needed.
To the best of our knowledge, only six algorithms have been designed that are able to make use of secondary (6,7) or tertiary (8–10) structure information. In the context of this work, we used 3D-Coffee (11) for its ability to combine the output of several methods into one unique model. 3D-Coffee is based on the T-Coffee algorithm (12), a heuristic method that uses a progressive algorithm to compute an MSA having a high consistency with a collection of pre-computed pairwise alignments (the library).

*To whom correspondence should be addressed. Tel: +33 491 825 427; Fax: +33 491 825 420; Email: cedric.notredame@gmail.com

© The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

In 3D-Coffee, the principle is the same except that the library's pairwise alignments are derived from structural superpositions using methods like Sap (13), Lsqman (14) or possibly any alternative structure alignment package [for a review see (15)]. When using combinations of structures and sequences, 3DCoffee can also incorporate structure-sequence (threading) alignment methods like Fugue (16) to ease the diffusion of structural information onto the sequences. 3D-Coffee has been available via the web server 3DCoffee@igs for >2 years (17).
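The library and consistency ideas behind T-Coffee and 3D-Coffee can be caricatured in a few lines. In this simplified sketch (our own, not the T-Coffee code), the library records a weight for every residue pair proposed by a pairwise method, whether sequence- or structure-based, and the consistency score of a candidate pair also credits support routed through a third sequence:

```python
from collections import defaultdict

# Library: frozenset-free canonical key ((seq, pos), (seq, pos)) -> weight.
# In 3D-Coffee such entries come from sequence aligners (e.g. lalign) and
# structure superposition methods (e.g. SAP); here they are made-up toy values.
library = defaultdict(float)

def add_pair(a, i, b, j, weight):
    key = tuple(sorted(((a, i), (b, j))))
    library[key] += weight  # methods agreeing on a pair reinforce it

def direct(a, i, b, j):
    return library.get(tuple(sorted(((a, i), (b, j)))), 0.0)

def extended(a, i, b, j, sequences, lengths):
    """Consistency score: direct weight plus support through third sequences."""
    score = direct(a, i, b, j)
    for c in sequences:
        if c in (a, b):
            continue
        for k in range(lengths[c]):
            # Route a->c->b: both links must exist for the triplet to count.
            score += min(direct(a, i, c, k), direct(c, k, b, j))
    return score
```

In T-Coffee proper it is this library extension that lets a structural superposition between two sequences influence where a third, structure-less sequence is placed, which is how structural information diffuses through the MSA.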
The original implementation made it possible to combine sequences and structures using the most advanced T-Coffee options through a simple web interface. Although it provides access to most of the T-Coffee inline functions, this server requires the user to explicitly specify which structural template is to be associated with each sequence. This specification, made through a cumbersome procedure of sequence renaming, was complicated and impractical for non-specialists. The novel version of 3D-Coffee@igs is named Expresso because it makes it possible for non-specialists to rapidly and automatically benefit from the strength of 3D-Coffee. The term Expresso also conveys the notion of aroma extraction and concentration, a notion that resonates with the way structures are 'expressed' within the MSA. In Expresso, we implemented an automated identification of suitable structural templates via a BLAST search against the PDB database. 3D-Coffee uses the selected structures to assemble a genuine structure-based MSA during a process that merely looks like a standard sequence alignment procedure from the user's point of view. Provided the appropriate structural information is available, Expresso is significantly more accurate than regular homology-based methods and its alignments are often indistinguishable from reference structure-based alignments (11).

METHODS
Selection of the Structural Template
The core idea of Expresso is to reliably identify structures that can be used as templates for the sequences (source) one wishes to align. The rationale is that any alignment carried out on the templates can easily be transposed onto the source sequences as long as the source and the template are highly homologous. The most basic and important step in Expresso is a BLAST search of the source sequences against PDB, in order to identify suitable templates.
A BLAST match is considered a suitable template if it displays a minimum of 60% sequence identity with the source sequence and a minimum coverage of 70% (i.e. 70% of the source sequence residues matched). These rather conservative criteria were chosen to limit the template selection to close homologues whose alignment with the source is entirely non-ambiguous. No effort is made to identify structures with special conformations or resolutions, although this could easily be added to the pipeline. However, whenever the automatic procedure appears inappropriate, the user can explicitly declare the source-template association using the advanced mode of the server.

Integration of the Structural Template
Once every sequence with a structural homologue has been assigned its template, 3DCoffee undertakes the library computation step. It applies a collection of pre-defined pairwise alignment methods to every pair of sequences. The methods are either sequence-based (e.g. lalign) or structure-based (e.g. SAP). When using structural methods, a structure-based alignment of the templates is first computed. The two source sequences are then aligned to their respective templates, and the induced pairwise alignment of the two sources is integrated within the library (Figure 1). The accuracy of this delicate process relies on a high level of identity between the source and the template sequence, hence the stringency of the original BLAST search.

Alignment computation
Once the library assembly step is finished, the MSA is assembled in a progressive fashion, using the standard T-Coffee algorithm. The default mode of the server for running T-Coffee is t_coffee -in Mslow_pair,Msap_pair,Mlalign_id_pair -template_file SCRIPT_blast.pl, where SCRIPT_blast.pl is a stand-alone script that BLASTs every source sequence against PDB in order to identify suitable structural templates (if they exist).
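The template filter described above (at least 60% identity and 70% coverage of the source sequence) is easy to reproduce over any parsed BLAST output. The record type and function names below are hypothetical; this sketches the selection criteria, not the server's actual SCRIPT_blast.pl:

```python
from typing import List, NamedTuple, Optional

class Hit(NamedTuple):
    pdb_id: str       # candidate template chain, e.g. "1abcA" (made-up id)
    identity: float   # percent identity of the BLAST match
    aligned: int      # number of source residues matched by the hit

def select_template(hits: List[Hit], source_len: int,
                    min_identity: float = 60.0,
                    min_coverage: float = 0.70) -> Optional[str]:
    """Return the best PDB template passing Expresso-style thresholds, or None."""
    suitable = [h for h in hits
                if h.identity >= min_identity
                and h.aligned / source_len >= min_coverage]
    if not suitable:
        return None  # the sequence will be aligned from sequence alone
    # Prefer the most identical, then the best-covering hit.
    return max(suitable, key=lambda h: (h.identity, h.aligned)).pdb_id
```

A hit with 95% identity covering 90 of 100 residues passes, while a 55% identity hit is rejected regardless of coverage, which is exactly the conservative behaviour the server aims for.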
USING EXPRESSO
Default mode
The server can be accessed at http://www.tcoffee.org/, by clicking on the Expresso link, either advanced or regular. To use the regular mode, one simply needs to cut and paste FASTA sequences. No special precaution is needed to name the sequences.

Advanced mode
The advanced mode of the server offers many more possibilities and guides the user with a series of bulleted points:
- Cut and paste your sequences.
- Upload your PDB structures. Should be used when some of the structures are not in the public domain. When uploading a PDB template, the associated source sequence is automatically generated using the SEQRES field. PDB files must follow the standard PDB format and the server requires a TITLE, a HEADER, an ATOM and a SEQRES section.
- Select the methods. The default selection corresponds to 3DCoffee. Further structure alignment methods will soon be added, along with new multiple sequence alignment packages. Users are welcome to suggest the incorporation of any public domain method.
- PDB template selection. By default no template is used in the advanced mode. Users should check the SCRIPT box to automatically fetch the templates with BLAST, or specify the source to template correspondences in the box below. The format for doing so is indicated in the corresponding section.

Figure 1. Computation of a template-based library. Structural templates are assigned to each original source sequence and these templates are used to generate a structure-based sequence alignment. The final library alignment is generated by aligning each source sequence with its template, thus generating a template-based alignment of the two sources.

Figure 2 shows a typical output, computed on the HOMSTRAD thioredoxin family (18). The first alignment (Figure 2a) was computed using the standard T-Coffee protocol, while the other (Figure 2b) is an Expresso MSA computed using the regular mode.
In the T-Coffee alignment, 15% of the columns are correctly aligned (as judged by comparison with the HOMSTRAD reference alignment), while in the Expresso MSA, 49% of the columns appear to be correct. Figure 2c shows which template was selected for each sequence. When selecting the template, no attempt is made to match the source sequence name with the template name, which sometimes results in apparent discrepancies (1aaza modelled with 1de2A). While in most cases these arbitrary choices should not affect the output, better control can be achieved by specifying the template/sequence correspondence in the advanced mode.

Figure 2. Computation of an Expresso alignment. (a) Default T-Coffee alignment of the thioredoxin HOMSTRAD dataset. Red portions have a high reliability and are expected to be more accurate than the rest. Blue and green portions are the least consistent. Consistency is estimated from a sequence-based T-Coffee library. In this MSA, 15% of the columns are similar to the reference HOMSTRAD MSA. (b) Expresso alignment. Consistency is now estimated from a library computed using template-based alignments. In this alignment, 49% of the columns are similar to the HOMSTRAD reference MSA. (c) Automatic template assignment.

CONCLUSION AND FUTURE DEVELOPMENTS
Expresso is an improved version of the original 3DCoffee@igs server. Structures are now fetched automatically and used to guide the alignment. This procedure can result in a dramatic improvement of the sequence alignment when homologous PDB structures are available. From the user's point of view, Expresso is a regular multiple sequence alignment server that seamlessly includes structural information in MSAs, allowing non-specialists to benefit from the power of structure-based sequence alignment without having to address all the technical issues it implies.
Future developments will involve a gradual extension of the methods available for combination in the advanced section. We strongly encourage users to send us their feedback.

ACKNOWLEDGEMENTS
We thank Prof. Jean-Michel Claverie (head of IGS) for stimulating scientific discussions and material support. We also thank Prof. Roger Hersch (EPFL) for useful advice on code optimization. The development of the server was supported by CNRS (Centre National de la Recherche Scientifique), Sanofi-Aventis Pharma SA, Marseille–Nice Génopole and the French National Genomic Network (RNG). Funding to pay the Open Access publication charges for this article was provided by CNRS. Conflict of interest statement. None declared.

REFERENCES
1. Katoh,K., Kuma,K., Toh,H. and Miyata,T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518. 2. Sander,C. and Schneider,R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68. 3. Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–603. 4. Lesk,A.M. and Chothia,C. (1980) How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol., 136, 225–270. 5. Kouranov,A., Xie,L., de la Cruz,J., Chen,L., Westbrook,J., Bourne,P.E. and Berman,H.M. (2006) The RCSB PDB information portal for structural genomics. Nucleic Acids Res., 34, D302–D305. 6. Heringa,J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341–364. 7. Simossis,V.A. and Heringa,J. (2005) PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res., 33, W289–W294. 8. Zhang,Z., Lindstam,M., Unge,J., Peterson,C. and Lu,G.
(2003) Potential for dramatic improvement in sequence alignment against structures of remote homologous proteins by extracting structural information from multiple structure alignment. J. Mol. Biol., 332, 127–142. 9. Ren,T., Veeramalai,M., Tan,A.C. and Gilbert,D. (2004) MSAT: a multiple sequence alignment tool based on TOPS. Appl. Bioinformatics, 3, 149–158. 10. Kleinjung,J., Romein,J., Lin,K. and Heringa,J. (2004) Contact-based sequence alignment. Nucleic Acids Res., 32, 2464–2473. 11. O’Sullivan,O., Suhre,K., Abergel,C., Higgins,D.G. and Notredame,C. (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol., 340, 385–395. 12. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. 13. Taylor,W.R. and Orengo,C.A. (1989) Protein structure alignment. J. Mol. Biol., 208, 1–22. 14. Kleywegt,G.J. and Jones,T.A. (1999) Software for handling macromolecular envelopes. Acta. Crystallogr. D Biol. Crystallogr., 55, 941–944. 15. Kolodny,R., Koehl,P. and Levitt,M. (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol., 346, 1173–1188. 16. Shi,J., Blundell,T.L. and Mizuguchi,K. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol., 310, 243–257. 17. Poirot,O., Suhre,K., Abergel,C., O’Toole,E. and Notredame,C. (2004) 3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment. Nucleic Acids Res., 32, W37–W40. 18. de Bakker,P.I., Bateman,A., Burke,D.F., Miguel,R.N., Mizuguchi,K., Shi,J., Shirai,H. and Blundell,T.L. (2001) HOMSTRAD: adding sequence information to structure-based alignments of homologous protein families. Bioinformatics, 17, 748–749. 
Education

Recent Evolutions of Multiple Sequence Alignment Algorithms

Cédric Notredame

An ever-increasing number of biological modeling methods depend on the assembly of an accurate multiple sequence alignment (MSA). These include phylogenetic trees, profiles, and structure prediction. Assembling a suitable MSA is not, however, a trivial task, and none of the existing methods have yet managed to deliver biologically perfect MSAs. Many of the algorithms published in recent years have been extensively described [1-3], and this review focuses only on the latest developments, including meta-methods and template-based alignment techniques.

The purpose of an MSA algorithm is to assemble alignments reflecting the biological relationship between several sequences. Computing exact MSAs is computationally almost impossible, and in practice approximate algorithms (heuristics) are used to align sequences by maximizing their similarity. The biological relevance of these MSAs is usually assessed by systematic comparison with established collections of structure-based MSAs ("gold standards"; for review see [4]). Since only a few sequences have known structures, the accuracy measured on the references is merely an estimation of how well a package may fare on standard datasets.

Gold standards have had a considerable effect on the evolution of MSA algorithms, refocusing the entire methodological development toward the production of structurally correct alignments. Their use has also coincided with a notable algorithmic harmonization, most MSA packages being now based on the "progressive algorithm" [5]. This greedy heuristic assembly algorithm involves estimating a guide tree (a rooted binary tree) from the unaligned sequences, and then incorporating the sequences into the MSA with a pairwise alignment algorithm while following the tree topology. The progressive algorithm is often embedded in an iterative loop in which the guide tree and the MSA are re-estimated until convergence.
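The progressive heuristic just described can be illustrated with a deliberately minimal sketch: toy match/mismatch/linear-gap scoring, and a greedy nearest-neighbour joining order standing in for a real guide tree. Every name and parameter below is invented for the example; production tools use substitution matrices, affine gaps, and proper tree estimation.

```python
# Minimal sketch of progressive alignment: toy scoring, a greedy guide
# order instead of a real guide tree, and no iterative refinement.

MATCH, MISMATCH, GAP = 1.0, -1.0, -2.0

def col_score(col_a, col_b):
    """Average substitution score between two profile columns (gaps ignored)."""
    pairs = [(x, y) for x in col_a for y in col_b if x != '-' and y != '-']
    if not pairs:
        return 0.0
    return sum(MATCH if x == y else MISMATCH for x, y in pairs) / len(pairs)

def align_profiles(A, B):
    """Needleman-Wunsch on two profiles (lists of equal-length row strings);
    returns the merged profile, rows of A first."""
    ca, cb = list(zip(*A)), list(zip(*B))
    n, m = len(ca), len(cb)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + col_score(ca[i - 1], cb[j - 1]),
                           dp[i - 1][j] + GAP, dp[i][j - 1] + GAP)
    cols, i, j = [], n, m            # traceback, merging columns
    while i or j:
        if i and j and dp[i][j] == dp[i - 1][j - 1] + col_score(ca[i - 1], cb[j - 1]):
            cols.append(ca[i - 1] + cb[j - 1]); i -= 1; j -= 1
        elif i and dp[i][j] == dp[i - 1][j] + GAP:
            cols.append(ca[i - 1] + ('-',) * len(B)); i -= 1
        else:
            cols.append(('-',) * len(A) + cb[j - 1]); j -= 1
    cols.reverse()
    return [''.join(row) for row in zip(*cols)]

def pct_id(s, t):
    a, b = align_profiles([s], [t])
    return sum(x == y != '-' for x, y in zip(a, b)) / len(a)

def progressive_align(seqs):
    """Greedy 'guide tree': seed with the closest pair, then repeatedly
    add the sequence most similar to anything already aligned."""
    n = len(seqs)
    ident = {(i, j): pct_id(seqs[i], seqs[j])
             for i in range(n) for j in range(i + 1, n)}
    i, j = max(ident, key=ident.get)
    msa, order = align_profiles([seqs[i]], [seqs[j]]), [i, j]
    left = [k for k in range(n) if k not in order]
    while left:
        k = max(left, key=lambda k: max(ident[tuple(sorted((k, o)))] for o in order))
        msa, order = align_profiles(msa, [seqs[k]]), order + [k]
        left.remove(k)
    rows = [None] * n                # restore the input order of the rows
    for row, idx in zip(msa, order):
        rows[idx] = row
    return rows
```

The greediness is the point: once a gap is introduced at an early step it is never revisited ("once a gap, always a gap"), which is why real packages wrap this core in the iterative refinement loop mentioned above.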
Most MSA packages reviewed here [6-18] follow this canvas, albeit more or less extensively adapted for improved performance [1-3]. The scoring schemes used by the pairwise alignment algorithm are arguably the most influential component of the progressive algorithm. They can be divided into two categories: matrix- and consistency-based. Matrix-based algorithms such as ClustalW [14], MUSCLE [6], and Kalign [19] use a substitution matrix to assess the cost of matching two symbols or two profiled columns. Although profile statistics can be more or less sophisticated, the score for matching two positions depends only on the considered columns or their immediate surroundings.

By contrast, consistency-based schemes incorporate a larger share of information into the evaluation. This result is achieved by using a recipe initially developed for T-Coffee [10] and inspired by Dialign overlapping weights [20]. Its principle is to compile a collection of pairwise global and local alignments (the primary library) and to use this collection as a position-specific substitution matrix during a regular progressive alignment. The aim is to deliver a final MSA as consistent as possible with the alignments contained in the library. Many recent packages have built upon this initial framework. For instance, PCMA [15] decreases T-Coffee's computational requirements by prealigning closely related sequences. ProbCons [7] uses Bayesian consistency and fills the primary library using the posterior decoding of a pair hidden Markov model; the substitution costs are estimated from this library using Bayesian statistics. MUMMALS [17] combines the ProbCons scoring scheme with the PCMA strategy, while including secondary structure predictions in its pair hidden Markov model. The most accurate flavors of MAFFT [8] (i.e., the G-INS-i and L-INS-i modes) use a T-Coffee-like evaluation.
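The library-extension idea behind consistency can be shown with a toy example: the support for pairing residue a = (sequence, position) with residue b combines the direct pairwise evidence with indirect evidence relayed through every third sequence. All weights, pairs, and names below are invented for the illustration and are not T-Coffee's actual data structures.

```python
# Toy illustration of consistency scoring (library extension): direct
# evidence for a residue pair plus the weakest-link support relayed
# through third sequences. All values here are made up.

from collections import defaultdict

def make_library(pairwise):
    """Primary library: weight for each aligned residue pair.
    Input: iterable of ((seqA, seqB), [(i, j), ...], weight)."""
    lib = defaultdict(float)
    for (sa, sb), pairs, weight in pairwise:
        for i, j in pairs:
            lib[(sa, i), (sb, j)] += weight
            lib[(sb, j), (sa, i)] += weight
    return lib

def extended_score(lib, a, b, sequences, max_len=50):
    """Direct weight of (a, b) plus, for each third sequence, the weaker
    of the two links connecting a and b through one of its residues."""
    total = lib.get((a, b), 0.0)
    for c in sequences:
        if c in (a[0], b[0]):
            continue
        for k in range(max_len):     # toy scan over positions of c
            total += min(lib.get((a, (c, k)), 0.0),
                         lib.get(((c, k), b), 0.0))
    return total
```

For example, if residue 0 of A pairs with residue 0 of B with weight 0.8, with residue 0 of C with weight 0.9, and (B, 0) pairs with (C, 0) with weight 0.7, the extended score of (A, 0)-(B, 0) becomes 0.8 + min(0.9, 0.7) = 1.5: the pairing is reinforced because a third sequence agrees with it.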
A majority of studies indicate that consistency-based methods are more accurate than their matrix-based counterparts [4], although they typically require an amount of CPU time N times higher than simpler methods (N being the number of sequences). Most of these methods are available either as downloadable packages or as online Web services (Table 1). The wealth of available methods and their increasingly similar accuracies make it harder than ever to objectively choose one over the others. Consensus methods such as M-Coffee [12] provide an interesting framework to address this problem. M-Coffee is a consensus meta-method based on T-Coffee. Given a sequence dataset, it fills up the library by using various MSA methods to compute alternative alignments. T-Coffee then uses this library to compute a final MSA consistent with the original alignments. When combining eight of the most accurate and distinct MSA packages, M-Coffee produces a better MSA than ProbCons, the best individual method, 67% of the time [12]. Aside from its ease of extension, M-Coffee's main advantage is its ability to estimate the local consistency between the final alignment and the combined MSAs (CORE index [21]; Figure 1). This useful index has been shown to be well correlated with the MSAs' structural correctness [21,22]. M-Coffee is not, however, the ultimate answer to the MSA problem, and its limited performance on remote homologs suggests that

Editor: Fran Lewitter, Whitehead Institute, United States of America
Citation: Notredame C (2007) Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol 3(8): e123. doi:10.1371/journal.pcbi.0030123
Copyright: © 2007 Cédric Notredame. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abbreviations: MSA, multiple sequence alignment

Cédric Notredame is with Information Génomique et Structurale, CNRS UPR2589, Institute for Structural Biology and Microbiology, Parc Scientifique de Luminy, Marseille, France. E-mail: cedric.notredame@gmail.com

PLoS Computational Biology | www.ploscompbiol.org | August 2007 | Volume 3 | Issue 8 | e123

Table 1. Summary of the Methods Described in the Review

Method         | Score       | Templates  | PreFab     | HOMSTRAD   | Server
ClustalW [14]  | Matrix      | —          | 61.80 [12] | —          | http://www.ebi.ac.uk/clustalw/
Kalign [18]    | Matrix      | —          | 63.00 [18] | —          | http://msa.cgb.ki.se/
MUSCLE [6]     | Matrix      | —          | 68.00 [16] | 45.0 [9]   | http://www.drive5.com/muscle/
T-Coffee [10]  | Consistency | —          | 69.97 [12] | 44.0 [9]   | http://www.tcoffee.org/
ProbCons [7]   | Consistency | —          | 70.54 [12] | —          | http://probcons.stanford.edu/
MAFFT [8]      | Consistency | —          | 72.20 [12] | —          | http://align.genome.jp/mafft/
M-Coffee [12]  | Consistency | —          | 72.91 [12] | —          | http://www.tcoffee.org/
MUMMALS [17]   | Consistency | —          | 73.10 [16] | —          | http://prodata.swmed.edu/mummals/
DbClustal [24] | Matrix      | Profiles   | —          | —          | http://bips.u-strasbg.fr/PipeAlign/
PRALINE [9]    | Matrix      | Profiles   | —          | 50.2 [9]   | http://zeus.cs.vu.nl/programs/pralinewww/
PROMALS [16]   | Consistency | Profiles   | 79.00 [16] | —          | http://prodata.swmed.edu/promals/
SPEM [28]      | Matrix      | Profiles   | 77.00 [28] | —          | http://sparks.informatics.iupui.edu/Softwares-Services_files/spem.htm
Expresso [13]  | Consistency | Structures | —          | 71.9 [11]a | http://www.tcoffee.org/
T-Lara [29]    | Consistency | Structures | —          | —          | https://www.mi.fu-berlin.de/w/LiSA/

Validation values were compiled from several sources, and selected for comparability. PreFab validations were made using PreFab version 3. HOMSTRAD validations were made on datasets having less than 30% identity. The source of each value is indicated by the accompanying reference citation.
(a) The Expresso value comes from a slightly more demanding subset of HOMSTRAD (HOM39) made of sequences less than 25% identical.
doi:10.1371/journal.pcbi.0030123.t001

further improvement using only sequence information remains an elusive goal. Progress is nonetheless needed, and, at this point, the most promising approach is probably to incorporate within the datasets any information likely to improve the alignments, such as structural and homology data. Template-based alignment methods [13] follow this approach. Structural extension was initially described by Taylor [23]. The principle is fairly straightforward (Figure 2) and involves identifying with BLAST a structural template in the Protein Data Bank for each sequence, aligning the templates using a structure superposition method, and mapping the original sequences onto their templates' alignment. The resulting sequence alignments are compiled in the primary library and used by a consistency-based method to compute the final MSA. Homology extension was originally introduced in the DbClustal package [24] and works along the same lines, using a profile rather than a structure. PSI-BLAST is used to build a profile for each sequence, and these profiles are used as templates to generate better sequence alignments, thanks to the evolutionary information they contain. The only difference between homology and structure extension is the templates' nature and the associated alignment method. This generic approach can easily be extended to any kind of template. For instance, Expresso [13] uses SAP [25,26] and FUGUE [27] to align structural templates identified by a BLAST search against the Protein Data Bank. PROMALS [16], PRALINE [9], and SPEM [28] make a profile-profile alignment with PSI-BLAST profiles used as templates. In PRALINE and PROMALS, the profile can be complemented with a secondary structure prediction in an attempt to improve the alignment accuracy. PROMALS uses ProbCons Bayesian consistency to fill its library with the posterior decoding of a pair hidden Markov model. T-Lara [29] uses

doi:10.1371/journal.pcbi.0030123.g001
Figure 1.
Typical Output of M-Coffee. This output was obtained on the kinase1_ref5 BaliBase dataset by combining MUSCLE, MAFFT, POA, Dialign-T, T-Coffee, ClustalW, PCMA, and ProbCons with M-Coffee. Correctly aligned residues (as judged from the reference) are uppercase; incorrect ones are lowercase. The color of each residue indicates the agreement of the individual MSAs with respect to the alignment of that specific residue. Red indicates residues aligned in a similar fashion among all the individual MSAs; blue indicates very low agreement between MSAs. Dark yellow, orange, and red residues can be considered to be reliably aligned.

doi:10.1371/journal.pcbi.0030123.g002
Figure 2. Framework of a Template-Based Method. Structural templates are first identified, mapped onto the sequences, and aligned using SAP. The sequence-template mapping is then used to guide the alignment of the original sequences. This alignment is integrated into the library that is used to compute the final MSA.

RNA secondary structure predictions as templates and fills a T-Coffee library with the Lara pairwise algorithm. With the exception of PRALINE and SPEM, which use a regular progressive algorithm, most template-based methods described here are consistency-based (some of them taking advantage of T-Coffee's modular structure). Their main advantage is increased accuracy. Recent benchmarks on PROMALS (Table 1) show that homology extension results in a ten-point improvement over existing methods. Likewise, structure-based methods such as Expresso produce alignments much closer to the structural references than do any of their sequence-based counterparts. One must, however, be careful not to over-interpret validation values like that given for Expresso in Table 1, since both the reference and the Expresso alignments were computed using the same structural information.
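The transitive step at the heart of structural extension (sequence to template, template to template, back to sequence) reduces to a simple composition of mappings. In the sketch below, map_a and map_b stand in for BLAST sequence-to-template hits and t_align for a SAP-style structural superposition; all positions are invented for the illustration.

```python
# Sketch of the template-based transfer: two sequences are aligned to
# each other *through* their structural templates. The mappings stand in
# for BLAST (sequence -> template) and SAP (template -> template) output.

def transfer_alignment(map_a, map_b, t_align):
    """map_a: position in sequence A -> position in template A.
    map_b: same for sequence B.
    t_align: dict, template A position -> template B position.
    Returns the induced (seq A pos, seq B pos) aligned pairs."""
    inv_b = {t: s for s, t in map_b.items()}   # template B -> sequence B
    pairs = []
    for sa, ta in sorted(map_a.items()):
        tb = t_align.get(ta)
        if tb is not None and tb in inv_b:
            pairs.append((sa, inv_b[tb]))
    return pairs
```

For example, transfer_alignment({0: 10, 1: 11, 2: 12}, {0: 5, 1: 7}, {10: 5, 11: 6, 12: 7}) pairs sequence residues 0-0 (through template positions 10 and 5) and 2-1 (through 12 and 7); template position 11 aligns to a template B position with no counterpart in sequence B, so residue 1 of A stays unpaired. Pairs obtained this way populate the primary library exactly like ordinary pairwise alignments would.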
This over-interpretation caveat raises the broader issue of method validation and benchmarking. A recent study [4] shows that, with the exception of artificial datasets, benchmarks carried out on most reference databases tend to deliver compatible results. It also suggests that the best methods have become indistinguishable, except when considering remote homologs (less than 25% identity). Unfortunately, remote homologs are poorly suited to generating reference alignments, owing to the fact that their superposition often yields alternative sequence alignments that are structurally equivalent [30]. However, one can bypass the reference alignment stage by directly comparing the evaluated alignment to some idealized 3-D superposition. Such an alignment-independent evaluation has been described and used by several authors [17,31,32]. Another trend, not well accounted for by current reference collections, is the alignment of very large datasets. While many new methods incorporate special algorithms for aligning several hundred sequences [6,8,18], current reference databases do not allow the evaluation of very large datasets, thus making it unclear how the published accuracies scale with the number of sequences. While this last issue could probably be satisfyingly solved in the current benchmarking framework, another problem remains that is much harder to address. All the existing validation approaches have in common their reliance on the "one size fits all" assumption that structurally correct alignments are the best possible MSAs for modeling any kind of biological signal (evolution, homology, or function). A report on profile construction [33] has recently challenged this view by showing that structurally correct alignments do not necessarily result in better profiles.
Likewise, it may be reasonable to ask whether better alignments always result in better phylogenetic trees and, more systematically, to question and quantify the relationship between the accuracy of MSAs and the biological relevance of any model drawn upon them.

In this review, I have presented some of the latest additions to the MSA computation arsenal. An interesting milestone has been the development of meta-methods able to seamlessly combine the output of several methods. Aside from easing the user's work, the main advantage of these consensus methods is probably the local estimation of reliability they provide (Figure 1). Using this estimation to filter out unreliable regions has already proven useful in homology modeling [34] and could probably be exploited further. The main improvement reported here, however, is probably the notion of template-based alignment. Template-based alignment is more than a trivial extension of consistency-based methods. Under this new model, the purpose of an MSA is not to squeeze a dataset and extract all the information it may contain, but rather to use the dataset as a starting point for exploring and retrieving all the related information contained in public databases. This information is to be used not only for mapping purposes, but also for driving the MSA computation. Such a usage of sequence information makes template-based methods a real paradigm shift and a major step toward global biological data integration.

Acknowledgments
The author thanks the two anonymous reviewers for suggesting several missing references.
Author contributions. CN analyzed the data and wrote the paper.
Funding. CN is funded and supported by the Centre National de la Recherche Scientifique, France.
Competing interests. The author has declared that no competing interests exist.

References
1. Edgar RC, Batzoglou S (2006) Multiple sequence alignment. Curr Opin Struct Biol 16: 368-373.
2.
Wallace IM, Blackshields G, Higgins DG (2005) Multiple sequence alignments. Curr Opin Struct Biol 15: 261-266.
3. Gotoh O (1999) Multiple sequence alignment: Algorithms and applications. Adv Biophys 36: 159-206.
4. Blackshields G, Wallace IM, Larkin M, Higgins DG (2006) Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol 6: 321-339.
5. Hogeweg P, Hesper B (1984) The alignment of sets of sequences and the construction of phylogenetic trees: An integrated method. J Mol Evol 20: 175-186.
6. Edgar RC (2004) MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5: 113.
7. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 15: 330-340.
8. Katoh K, Kuma K, Toh H, Miyata T (2005) MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33: 511-518.
9. Simossis VA, Heringa J (2005) PRALINE: A multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res 33: W289-W294.
10. Notredame C, Higgins DG, Heringa J (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302: 205-217.
11. O'Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C (2004) 3DCoffee: Combining protein sequences and structures within multiple sequence alignments. J Mol Biol 340: 385-395.
12. Wallace IM, O'Sullivan O, Higgins DG, Notredame C (2006) M-Coffee: Combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res 34: 1692-1699.
13. Armougom F, Moretti S, Poirot O, Audic S, Dumas P, et al. (2006) Expresso: Automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee. Nucleic Acids Res 34: W604-W608.
14. Thompson J, Higgins D, Gibson T (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4690.
15. Pei J, Sadreyev R, Grishin NV (2003) PCMA: Fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 19: 427-428.
16. Pei J, Grishin NV (2007) PROMALS: Towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics 23: 802-808.
17. Pei J, Grishin NV (2006) MUMMALS: Multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 34: 4364-4374.
18. Lassmann T, Sonnhammer EL (2005) Kalign: An accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6: 298.
19. Lassmann T, Sonnhammer EL (2006) Kalign, Kalignvu and Mumsa: Web servers for multiple sequence alignment. Nucleic Acids Res 34: W596-W599.
20. Morgenstern B, Dress A, Werner T (1996) Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci U S A 93: 12098-12103.
21. Notredame C, Abergel C (2003) Using multiple alignment methods to assess the quality of genomic data analysis. In: Andrade M, editor. Bioinformatics and genomes: Current perspectives. Wymondham (United Kingdom): Horizon Scientific Press. pp. 30-50.
22. Lassmann T, Sonnhammer EL (2005) Automatic assessment of alignment quality. Nucleic Acids Res 33: 7120-7128.
23. Taylor WR (1986) Identification of protein sequence homology by consensus template alignment. J Mol Biol 188: 233-258.
24. Thompson JD, Plewniak F, Thierry J, Poch O (2000) DbClustal: Rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res 28: 2919-2926.
25. Taylor WR, Orengo CA (1989) Protein structure alignment. J Mol Biol 208: 1-22.
26. Taylor WR (1999) Protein structure comparison using iterated double dynamic programming. Protein Sci 8: 654-665.
27. Shi J, Blundell TL, Mizuguchi K (2001) FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 310: 243-257.
28. Zhou H, Zhou Y (2005) SPEM: Improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics 21: 3615-3621.
29. Bauer M, Klau G, Reinert K (2005) Multiple structural RNA alignment with Lagrangian relaxation. Lect Notes Comput Sci 3692: 303-314.
30. Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS (2000) ProSup: A refined tool for protein structure alignment. Protein Eng 13: 745-752.
31. O'Sullivan O, Zehnder M, Higgins D, Bucher P, Grosdidier A, et al. (2003) APDB: A novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics 19: i215-i221.
32. Armougom F, Moretti S, Keduas V, Notredame C (2006) The iRMSD: A local measure of sequence alignment accuracy using structural information. Bioinformatics 22: e35-e39.
33. Griffiths-Jones S, Bateman A (2002) The use of structure information to increase alignment accuracy does not aid homologue detection with profile HMMs. Bioinformatics 18: 1243-1249.
34. Claude JB, Suhre K, Notredame C, Claverie JM, Abergel C (2004) CaspR: A web server for automated molecular replacement using homology modelling. Nucleic Acids Res 32: W606-W609.

A novel, noncanonical mechanism of cytoplasmic polyadenylation operates in Drosophila embryogenesis

Olga Coll, Ana Villalba, Giovanni Bussotti, et al.

Genes Dev.
2010, 24: 129-134. doi:10.1101/gad.568610.
Supplemental material: http://genesdev.cshlp.org/content/suppl/2009/12/29/24.2.129.DC1.html
Copyright © 2010 by Cold Spring Harbor Laboratory Press

RESEARCH COMMUNICATION

A novel, noncanonical mechanism of cytoplasmic polyadenylation operates in Drosophila embryogenesis

Olga Coll,1 Ana Villalba,1 Giovanni Bussotti,2 Cédric Notredame,2 and Fátima Gebauer1,3

1Gene Regulation Programme, Centre de Regulació Genòmica (CRG-UPF), 08003 Barcelona, Spain; 2Bioinformatics Programme, Centre de Regulació Genòmica (CRG-UPF), 08003 Barcelona, Spain

Cytoplasmic polyadenylation is a widespread mechanism to regulate mRNA translation that requires two sequences in the 3′ untranslated region (UTR) of vertebrate substrates: the polyadenylation hexanucleotide and the cytoplasmic polyadenylation element (CPE). Using a cell-free Drosophila system, we show that these signals are not relevant for Toll polyadenylation but that, instead, a "polyadenylation region" (PR) is necessary. Competition experiments indicate that PR-mediated polyadenylation is required for viability and is mechanistically distinct from the CPE/hexanucleotide-mediated process. These data indicate that Toll mRNA is polyadenylated by a noncanonical mechanism, and suggest that a novel machinery functions for cytoplasmic polyadenylation during Drosophila embryogenesis. Supplemental material is available at http://www.genesdev.org.
Received July 20, 2009; revised version accepted November 23, 2009.

Oocyte maturation and early development in many organisms occur in the absence of transcription. Developmental progression at these times depends largely on differential translation of maternal mRNAs, and cytoplasmic polyadenylation is a major component of this control. In general, mRNAs with a short poly(A) tail remain translationally silent, while elongation of the poly(A) tail in the cytoplasm results in translational activation. Most of the accumulated knowledge on cytoplasmic polyadenylation derives from studies in oocytes of Xenopus (for review, see Belloc et al. 2008; Radford et al. 2008). Two cis-acting sequences in the 3′ untranslated region (UTR) of substrate mRNAs are essential for this process: the conserved polyadenylation hexanucleotide, also required for nuclear polyadenylation, with the structure A(A/U)UAAA, and the U-rich cytoplasmic polyadenylation element (CPE), which generally consists of U4-5A1-3U. The hexanucleotide is recognized by the multisubunit complex CPSF (cleavage and polyadenylation specificity factor), and the CPE is recognized by CPEB, a protein with a dual function that acts as a switch between translational repression and activation. In immature oocytes, CPEB represses translation by recruiting a set of factors that functionally block the two ends of the mRNA. On the one hand, CPEB recruits Maskin (or 4E-T in growing oocytes), which in turn binds to eIF4E and prevents its recognition by eIF4G during translation initiation (Stebbins-Boaz et al. 1999; Minshall et al. 2007). On the other hand, CPEB recruits the deadenylase PARN, which keeps the poly(A) tail short (Kim and Richter 2006).

[Keywords: CPE; hexanucleotide; polyadenylation; Toll]
3Corresponding author. E-mail: fatima.gebauer@crg.es; fax 34-93-3969983.
Article is online at http://www.genesdev.org/cgi/doi/10.1101/gad.568610.
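The two canonical signals defined above lend themselves to a simple motif scan. The sketch below encodes the hexanucleotide A(A/U)UAAA and a strict U4-5A1-3U CPE as regular expressions; a real analysis would allow the looser CPE variants found in vivo, and the UTR fragment is invented for the example.

```python
# Illustrative scan for the two canonical cis-elements: the CPE
# (here strictly U4-5 A1-3 U) and the hexanucleotide A(A/U)UAAA.
# The UTR fragment is invented; real CPEs are more loosely defined.

import re

CPE = re.compile(r"U{4,5}A{1,3}U")
HEX = re.compile(r"A[AU]UAAA")

def find_elements(utr):
    """Return (start, matched string) for every CPE and hexanucleotide hit."""
    return {"CPE": [(m.start(), m.group()) for m in CPE.finditer(utr)],
            "hexanucleotide": [(m.start(), m.group()) for m in HEX.finditer(utr)]}

utr = "GCUUUUAUGCAAUAAAGC"      # invented 3' UTR fragment
hits = find_elements(utr)
```

On this fragment the scan reports one CPE ("UUUUAU" at position 2) and one hexanucleotide ("AAUAAA" at position 10), the arrangement, CPE upstream of the hexanucleotide, typical of vertebrate cytoplasmic polyadenylation substrates.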
Upon meiotic maturation, CPEB phosphorylation results in eviction of PARN and enhanced recruitment of CPSF. Together, CPEB and CPSF recruit the cytoplasmic poly(A) polymerase GLD-2, leading to elongation of the poly(A) tail and translational activation (Barnard et al. 2004). The distance between the CPE and the hexanucleotide dictates the timing and extent of polyadenylation (Piqué et al. 2008). Additional elements reported to function early during oocyte maturation are the U-rich polyadenylation response elements (PREs), which bind the protein Musashi (Charlesworth et al. 2002, 2006). CPEB belongs to a conserved family of four members in vertebrates that, in addition to acting in oocyte maturation, contribute to the regulation of local protein synthesis at synapses that underlies long-term changes in synaptic plasticity (for review, see Richter 2007). In Drosophila, the CPEB1 homolog Orb plays a role in mRNA localization and regulates the polyadenylation of oskar and cyclin B mRNAs during oogenesis (Chang et al. 1999; Castagnetti and Ephrussi 2003; Benoit et al. 2005). Orb2, the homolog of CPEB2-4, is required for long-term memory, but its role in cytoplasmic polyadenylation has not been demonstrated (Keleman et al. 2007). Other conserved factors that contribute to cytoplasmic polyadenylation during Drosophila oogenesis are the canonical poly(A) polymerase Hiiragi and the GLD-2 homolog Wispy (Wisp) (Juge et al. 2002; Benoit et al. 2008; Cui et al. 2008). Cytoplasmic polyadenylation also occurs during embryogenesis, but the sequences and factors responsible for polyadenylation at these times remain poorly understood. In Drosophila, translation of the transcripts encoding Bicoid, Toll, and Torso is activated by cytoplasmic polyadenylation in early embryogenesis, and this activation is required for appropriate axis formation (Sallés et al. 1994; Schisa and Strickland 1998).
How this polyadenylation occurs is intriguing, as Orb is barely detectable in early embryos (Vardy and Orr-Weaver 2007). Furthermore, no cis-acting elements responsible for cytoplasmic polyadenylation have been described yet in this organism. Therefore, an important question is whether similar signals, factors, and mechanisms operate for cytoplasmic polyadenylation in different biological settings. To address this question, we used an in vitro cytoplasmic polyadenylation system derived from Drosophila early embryos. Using Xenopus cyclin B1 (CycB1) mRNA as a substrate, we found that the canonical cytoplasmic polyadenylation signals, the CPE and the hexanucleotide, function in Drosophila. Surprisingly, however, these sequences are not necessary for Toll polyadenylation. Rather, a region of the 3′ UTR that we term the "polyadenylation region" (PR) is required. Consistently, competition experiments indicate that PR-mediated polyadenylation is mechanistically distinct from the CPE/hexanucleotide-mediated process, implying that a novel machinery for cytoplasmic polyadenylation operates during Drosophila embryogenesis.

GENES & DEVELOPMENT 24:129-134 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 0890-9369/10; www.genesdev.org

Results and Discussion

A Drosophila cell-free system for cytoplasmic polyadenylation

Cytoplasmic polyadenylation has been observed in oocytes and early embryos of Drosophila. Maximal polyadenylation of Toll and bicoid mRNAs occurs at ~90 min of development (Sallés et al. 1994). To study cytoplasmic polyadenylation, we first tested whether extracts from 90-min embryos could recapitulate this process. Similar extracts obtained from nonstaged embryos have been shown previously to support translation (Gebauer et al. 1999).
Because translation is the consequence of cytoplasmic polyadenylation for Toll and bicoid, we tested whether polyadenylation of these substrates in embryonic extracts could occur under translation conditions. We incubated nonadenylated full-length bicoid mRNA and the 3′ UTR of Toll in staged 90-min embryo extracts. In addition, we used the mRNA encoding the ribosomal protein Sop as a negative control, as this transcript contains a canonical hexanucleotide and has been shown to undergo nuclear but not cytoplasmic polyadenylation (e.g., see Benoit et al. 2008). After incubation, we measured the length of the poly(A) tail by the PCR-based poly(A) test (PAT) assay. We found that both bicoid and Toll RNAs gained a poly(A) tail of ~150 nucleotides (nt) while sop mRNA remained nonadenylated, indicating that this system recapitulates the cytoplasmic polyadenylation process (Fig. 1A). Toll mRNA was selected for further studies because it was consistently polyadenylated more efficiently than bicoid mRNA in the cell-free system.

To evaluate whether cytoplasmic polyadenylation resulted in increased translation, we first tested the correlation between both processes in a time-course experiment. We fused the 3′ UTR of Toll to the firefly luciferase ORF to yield the Luc-toll transcript. Translation of this transcript closely paralleled polyadenylation of the Toll 3′ UTR (Fig. 1B, cf. the left panel and the Luc-toll curve in the right panel). In addition, treatment of the mRNA with the chain elongation inhibitor cordycepin (3′-deoxyadenosine) dramatically reduced the efficiency of translation, decreasing it to the levels of nonadenylated luciferase mRNA (Fig. 1B, right panel). These data show that both cytoplasmic polyadenylation and polyadenylation-dependent translation can be recapitulated in 90-min Drosophila embryo extracts.

The canonical cytoplasmic polyadenylation signals function in Drosophila

The cis-acting elements for cytoplasmic polyadenylation in Drosophila are unknown.
To test whether the CPE and the hexanucleotide were recognized as polyadenylation elements in this organism, we analyzed the polyadenylation of the best-characterized vertebrate substrate, CycB1. The 3′ UTR of this transcript contains three CPEs, one of them overlapping with the hexanucleotide (Fig. 1C), and has been shown to undergo strong polyadenylation at a late time during Xenopus oocyte maturation and in early embryos (Groisman et al. 2002; Piqué et al. 2008). The 3′ UTR of CycB1 was polyadenylated in the Drosophila cell-free system and was sufficient to confer polyadenylation when fused to a CAT reporter (Fig. 1C). Importantly, mutation of the hexanucleotide (Fig. 1C, left panel) or the CPEs (Fig. 1C, right panel, CPE0) completely abrogated polyadenylation, indicating that the vertebrate cytoplasmic polyadenylation signals function in Drosophila. These results suggest that a canonical cytoplasmic polyadenylation machinery exists in Drosophila embryos. Intriguingly, however, poly(A) tail length control must occur in the absence of Orb, which is undetectable in early embryos, and PARN, which is not conserved in Drosophila.

Figure 1. The canonical cytoplasmic polyadenylation signals function in Drosophila. (A) Cytoplasmic polyadenylation in Drosophila embryo extracts. Nonadenylated full-length sop and bcd mRNAs, as well as the 3′ UTR of Toll, were incubated in 90-min embryo extracts, and the poly(A) tail was measured using the PAT assay. A schematic representation of the oligonucleotides used in this assay is shown in the top panel. For each transcript, a specific 5′ oligonucleotide was combined with either a specific 3′ primer to reveal the size of the nonadenylated RNA (3′ lanes), or with an oligo(dT) anchor to visualize the poly(A) tail (dT lanes). Molecular weight markers are also shown. (B) Cytoplasmic polyadenylation promotes translation of Toll mRNA. (Left panel) Polyadenylation time course of the Toll 3′ UTR, as measured by PAT assay. (Right panel) A firefly luciferase reporter containing the 3′ UTR of Toll (Luc-toll) was incubated in embryo extracts for different times, and the efficiency of translation was determined by measuring the luciferase activity. As controls, the translation efficiencies of nonadenylated Luciferase and cordycepin-treated Luc-toll mRNAs were determined. Curves represent the average of five independent experiments. (C) The CPE and the hexanucleotide are functional CPEs in Drosophila. (Left panel) Polyadenylation of radioactively labeled wild-type and hexanucleotide-mutated CycB1 3′ UTRs. The nature of the mutation is indicated in gray lowercase letters. RNAs were separated in a 6% denaturing acrylamide gel and visualized using the PhosphorImager. (Right panel) Polyadenylation of CAT reporter mRNAs containing either a wild-type CycB1 3′ UTR or a derivative with U-to-G transversions in all three CPEs. RNAs were visualized by Northern blot using a probe against the CAT ORF. A schematic representation of the Xenopus CycB1 3′ UTR is shown, with the three CPEs and the polyadenylation hexanucleotide (HN) depicted as gray and white boxes, respectively.

Toll mRNA is polyadenylated in a CPE-independent and hexanucleotide-independent fashion

Toll mRNA contains a canonical hexanucleotide, followed by a putative CPE (Fig. 2A). To determine whether these sequences were responsible for polyadenylation, we analyzed the behavior of Toll mutant derivatives. To visualize the polyadenylated products, we used Northern blots, which allow a more accurate measurement of the efficiency of polyadenylation as compared with PAT assays. Surprisingly, mutation or deletion of the CPE and/or the hexanucleotide did not affect polyadenylation of Toll (Fig. 2A).
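The signal definitions discussed above lend themselves to a simple motif scan. The sketch below is illustrative only: the CPE consensus chosen (UUUUA(A/U)U) is one common vertebrate form among several in the literature, and the example sequences are invented.

```python
import re

# Canonical cytoplasmic polyadenylation signals, RNA alphabet.
# The CPE consensus below is an assumption for illustration; functional
# definitions vary between studies.
HEXANUCLEOTIDE = re.compile(r"AAUAAA")
CPE = re.compile(r"UUUUA[AU]U")

def scan_signals(utr):
    """Return start positions of putative CPEs and hexanucleotides in a 3' UTR."""
    utr = utr.upper().replace("T", "U")  # accept DNA or RNA input
    return {
        "hexanucleotide": [m.start() for m in HEXANUCLEOTIDE.finditer(utr)],
        "cpe": [m.start() for m in CPE.finditer(utr)],
    }

# Invented fragment: a CPE-like stretch followed by AAUAAA.
print(scan_signals("GCUUUUAAUGGCAAUAAAGC"))
# {'hexanucleotide': [12], 'cpe': [2]}

# The single point mutation discussed in the text (AAUAAA to AAGAAA)
# abolishes the hexanucleotide match.
print(scan_signals("GCUUUUAAUGGCAAGAAAGC"))
# {'hexanucleotide': [], 'cpe': [2]}
```

Genome-scale surveys of hexanucleotide incidence, like the EST analysis cited later in the Discussion, amount to running such a scan over large UTR collections.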
Treatment with oligo(dT) and RNase H confirmed that the size increase of Toll upon incubation was due to polyadenylation (Supplemental Fig. 1). That Toll polyadenylation was unaffected by deletion of the CPE and the hexanucleotide was unexpected, as both elements function as polyadenylation signals in Drosophila (Fig. 1C), and a single point mutation in the hexanucleotide (AAUAAA to AAGAAA) is sufficient to hinder cytoplasmic polyadenylation in vertebrates (e.g., see McGrew and Richter 1990). Thus, it appeared that Toll polyadenylation was independent of the CPE and the hexanucleotide. However, the canonical polyadenylation machinery could, in principle, bind to functional variations of these elements that could pass unrecognized by sequence inspection. To exclude this possibility, we performed competition assays. CycB1 effectively competed the polyadenylation of radiolabeled CAT-CycB1 (Fig. 2B, lanes 10–15). Polyadenylation of Toll was readily competed by an excess of cold Toll 3′ UTR but, remarkably, not by an excess of CycB1 (Fig. 2B, lanes 1–9; see also lanes 16–21, showing the same reactions as lanes 4–9 taken 15 min later). These data argue that polyadenylation of Toll is independent of the CPE and the hexanucleotide. Addition of excess Toll competitor also destabilized the Toll substrate, while addition of CycB1 did not (Fig. 2B, lanes 16–21). In addition, nonadenylated CAT-CycB1 was often stabilized in the presence of CycB1 competitor (Fig. 2B, lanes 13–15), suggesting that the Drosophila extracts can be used to monitor both stability and adenylation changes, but that these two processes are not necessarily linked.

A proximal region in the Toll 3′ UTR directs noncanonical cytoplasmic polyadenylation

To identify the elements of Toll that were responsible for cytoplasmic polyadenylation, we performed mutational analysis. The distal 40% of the Toll 3′ UTR could be deleted without significant consequences for cytoplasmic polyadenylation (Fig. 3A, fragment Δ3).
Further deletions progressively reduced the efficiency of polyadenylation (fragments Δ4 and Δ5). A region of 183 nt within the first half of the 3′ UTR was sufficient to provide detectable levels of polyadenylation (fragment Δ6), while other regions of the 3′ UTR were not (fragments Δ5 and Δ7). We refer to the Δ6 fragment as the PR. Importantly, deletion of the PR from an otherwise wild-type 3′ UTR severely blocked polyadenylation and translation of Toll, indicating that the PR is essential for expression of this mRNA (Fig. 3B). Although the PR is necessary for polyadenylation, other sequences within the Toll 3′ UTR seem to influence both the efficiency of polyadenylation and the length of the poly(A) tail. Deletions downstream from the PR reduce the polyadenylation efficiency, while deletions upstream of the PR reduce the length of the poly(A) tail (Fig. 3A, cf. fragments Δ3, Δ4, and Δ6). This illustrates the complexity and fine regulation of the process, which is likely to be mediated by an interplay of multiple activities. Auxiliary elements located both upstream of and downstream from the polyadenylation signal have also been described for nuclear polyadenylation (Chen and Wilusz 1998; Zarudnaya et al. 2003), including sequences and factors that mediate hexanucleotide-independent polyadenylation (Venkataraman et al. 2005). Elements other than the canonical CPE have been shown previously to stimulate cytoplasmic polyadenylation in other organisms. In Xenopus oocytes, the U-rich PRE and the TCS (translational control sequence) stimulate the polyadenylation of a number of mRNAs early after progesterone induction (Charlesworth et al. 2002, 2004; Wang et al. 2008). Similarly, poly(U) and poly(C) sequences promote polyadenylation during Xenopus embryogenesis (Simon et al. 1992; Paillard et al. 2000), and undefined elements other than the CPE and the hexanucleotide direct polyadenylation of lamin B1 mRNA in Xenopus embryos (Ralle et al. 1999).
However, no direct evidence exists that these elements function independently of the canonical polyadenylation machinery.

To confirm that the PR is responsible for noncanonical polyadenylation, we performed competition assays. Polyadenylation of Toll was competed with the PR, but not with the distal 119-nt fragment of the Toll 3′ UTR that contained the CPE and the hexanucleotide (Fig. 4A, lanes 1–6). Conversely, the PR did not compete polyadenylation of CAT-CycB1 (Fig. 4A, lanes 7–10), while polyadenylation of this transcript was efficiently competed by an excess of CycB1, as well as by any fragment of Toll containing the CPE and the hexanucleotide, including the full-length Toll 3′ UTR (Fig. 4A, lanes 11–18). Consistent with the results of the polyadenylation assays, the PR competed translation of a Toll reporter but not of a CycB1 reporter (Supplemental Fig. 2). The PR competed both polyadenylation and translation less efficiently than the full-length Toll 3′ UTR, in agreement with its lower polyadenylation efficiency. The competition results cannot be explained by different affinities of the same factors for the PR compared with the canonical sequences, because the PR does not compete CycB1 mRNA polyadenylation, nor does CycB1 compete Toll polyadenylation. Thus, we conclude that polyadenylation of Toll is driven by a complex that binds to the PR and differs from the canonical machinery in at least one limiting component.

Previously, a region of the Toll 3′ UTR that lies downstream from the PR (located between nucleotides 582 and 774) was shown to promote polyadenylation of this transcript (Schisa and Strickland 1998). In our hands, this region could not compete for polyadenylation of Toll (data not shown), suggesting that it does not function for polyadenylation in early embryos. Nevertheless, it should be noted that translation of Toll is also required at later times in development, where these sequences and/or the canonical signals could become relevant.

To map more finely the sequences within the PR that were responsible for polyadenylation, we first searched for regions ultraconserved among Drosophilids. In addition, we looked for sequence words within the PR significantly overrepresented in the 3′ UTRs of Drosophila melanogaster transcripts. We found two ultraconserved regions and two related sequence words distributed along the PR (Supplemental Fig. 3A,B, observe shadowed regions and words within red boxes). Fragments of the PR containing or lacking these sequences were used in functional competition assays (Supplemental Fig. 3C). The results suggest the existence of a complex element responsible for polyadenylation of Toll that is not associated with a simple linear sequence. We speculate that a structure—or multiple redundant, interdependent linear sequences—within the PR is necessary for polyadenylation.

Interestingly, the conserved region at the 3′ end of the PR consisting of TGTTATCTGTAAGC behaved as a stabilization element. All fragments containing this region destabilized Toll when added in excess, while fragments lacking it did not (Supplemental Fig. 3C). Importantly, fragment 9 (F9) strongly destabilized Toll but did not compete for polyadenylation, showing that polyadenylation and stability of Toll are separable processes.

Figure 2. Polyadenylation of Toll mRNA is independent of the CPE and the hexanucleotide. (A, top panel) Schematic representation of the Toll 3′ UTR (1256 nt) and mutant derivatives. The location and sequence of the putative CPE and the hexanucleotide are detailed. (Bottom panel) Polyadenylation of these mRNAs was measured by Northern blot. (B) Polyadenylation competition assays. Polyadenylation of a 32P-labeled Toll 3′ UTR or CAT-CycB1 after addition of excess Toll or CycB1 3′ UTRs. RNAs were separated in a 1% denaturing agarose gel. Input (i) RNA probes are also shown. The percentage of polyadenylated transcript with respect to total transcript within each lane is indicated (%pA). Lanes 16–21 in the bottom panel show samples of the same reactions as lanes 4–9 taken 15 min later.

Figure 3. Elements required for cytoplasmic polyadenylation of Toll mRNA. (A, left panel) Schematic representation of Toll 3′ UTR deletion derivatives. Nucleotide positions are shown according to the annotated Drosophila Toll sequence, taking as reference the first nucleotide of the 3′ UTR. The location of unique restriction sites and the PR is indicated. Restriction sites at both ends of Toll belong to the vector in which this sequence was cloned. Typical patterns of these RNAs are shown in the middle panel, before (−) or after (+) incubation in the Drosophila extracts. The sizes of the respective poly(A) tails are indicated. (Right panel) Quantification of the efficiency of polyadenylation, measured as the percentage of RNA that was polyadenylated versus the total signal (polyadenylated and nonadenylated) within the ''+'' lane. Values represent the average of at least three independent experiments. (B) The PR is essential for polyadenylation and translation of Toll. (Left panel) Polyadenylation time course of a wild-type Toll 3′ UTR or a mutant derivative lacking the PR. (Right panel) Translation efficiencies of reporter mRNAs containing either wild-type or ΔPR Toll 3′ UTRs. Renilla luciferase mRNA was cotranslated as an internal control. The firefly values were corrected for Renilla expression, and the data are represented relative to the translation of Luc-toll mRNA at the last time point. Curves represent the average of three independent experiments using a single batch of extract.

Figure 4. The PR is required for noncanonical cytoplasmic polyadenylation in vitro and in vivo. (A) Polyadenylation competition experiments using the RNAs schematically represented in the top panel as competitors. Experiments were performed as indicated in the legend for Figure 2B. (B) Excess PR disrupts polyadenylation of endogenous Toll and reduces viability. (Left panel) Embryos (0–30 min) were microinjected with different amounts of the PR or an unrelated RNA of similar length (159 nt) at a concentration of 10 ng/µL. Samples were collected 1 h after injection, the RNA was extracted, and the poly(A) tail length was tested by PAT assay. Amplified products were visualized by Southern blot using a random-primed probe against the Toll 3′ UTR. The results for two independent injections are shown. (Right panel) Embryos were microinjected with RNA solutions at different concentrations or with water as control, and survival was scored as the percentage of hatched embryos as indicated in the Materials and Methods. The average of at least three independent injections of 100 embryos per injection is shown.

We next wanted to test the relevance of the PR in vivo. We performed in vivo competition experiments by injecting wild-type early embryos with either the PR or an unrelated RNA of the same length as control. We then tested polyadenylation of endogenous Toll mRNA and survival of injected embryos, as timely expression of Toll is essential for early development. The results showed that the PR specifically competed polyadenylation of endogenous Toll and reduced the viability of early embryos (Fig. 4B). These results indicate that the PR directs noncanonical polyadenylation in vitro and in vivo. The Toll polyadenylation mechanism described here may affect a variety of mRNAs.
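The two quantification rules used in the figure legends, the per-lane %pA and the Renilla-corrected translation efficiency, are simple ratios. A minimal sketch with invented readings (the function names are ours, not the authors'):

```python
def percent_polyadenylated(adenylated_signal, nonadenylated_signal):
    """%pA as in Figs. 2B and 3A: polyadenylated signal over the total
    signal (polyadenylated plus nonadenylated) within one lane."""
    total = adenylated_signal + nonadenylated_signal
    return 100.0 * adenylated_signal / total

def relative_translation(firefly, renilla):
    """Firefly luciferase corrected by the Renilla co-translation control,
    expressed relative to the last time point (as for Luc-toll in Fig. 3B)."""
    corrected = [f / r for f, r in zip(firefly, renilla)]
    return [c / corrected[-1] for c in corrected]

# Invented phosphorimager counts and luminescence readings.
print(percent_polyadenylated(300.0, 700.0))        # 30.0
print(relative_translation([100.0, 400.0, 1000.0],
                           [100.0, 100.0, 100.0]))
```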
Recent in silico EST database analyses indicate an incidence of the hexanucleotide A(A/U)UAAA of 60%–70%, suggesting that a significant fraction of mRNAs lack consensus hexanucleotide signals and may undergo AAUAAA-independent polyadenylation (for review, see MacDonald and Redondo 2002). In addition, activation of Drosophila pgc (polar granule component) mRNA in early embryos appears independent of Orb (Rangan et al. 2009).

Which factors could be involved in polyadenylation of Toll? Mutations in cortex and grauzone were identified genetically to affect polyadenylation of Toll (Lieberfarb et al. 1996). Cortex is an activator of the anaphase-promoting complex, and mutations in this gene prevent the completion of meiosis (Chu et al. 2001). Grauzone, in turn, is a transcription factor necessary for activation of Cortex (Harms et al. 2000). Thus, Cortex and Grauzone may affect polyadenylation of Toll indirectly, by precluding the normal initiation of embryogenesis. Similarly, embryos from wisp mutant mothers are defective in polyadenylation of several maternal mRNAs, including Toll (Cui et al. 2008). Wisp is present until ~2 h of development and, therefore, could be directly involved in polyadenylation of Toll. However, Wisp is also required for expression of Cortex (Benoit et al. 2008), so it is unclear to what extent the observed effects on Toll polyadenylation are due to primary defects in Cortex expression. Direct biochemical dissection using the cell-free polyadenylation system, combined with Drosophila genetics, will allow us to decipher the components of both the CPE-dependent and CPE-independent cytoplasmic polyadenylation machineries.

Materials and methods

Extract preparation

Extracts were prepared from staged 90-min embryos as described in Gebauer et al. (1999). To stage embryos, collecting trays were exchanged every 90 min, three times, and the third batch of trays was used to prepare extracts.
After preparation, extracts were adjusted to 10% glycerol, snap-frozen in liquid nitrogen, and stored at −80°C.

In vitro polyadenylation and translation

Reactions using 90-min embryo extracts were assembled as described in Gebauer et al. (1999), without tRNA. In these reactions, both polyadenylation and translation could be observed. Typically, 0.01 pmol of substrate mRNA was used in the reaction. The use of small amounts of substrate is relevant, as we found that polyadenylation is saturable in this system. To account for batch-to-batch differences in polyadenylation kinetics and efficiency, polyadenylation of each RNA construct was tested in different batches of extract, carrying in parallel the full-length Toll 3′ UTR as a positive control. In some cases, Renilla mRNA was cotranslated as an internal control. After incubation, the translation efficiency was determined by measuring the luciferase activity using the Dual Luciferase Assay System (Promega), and firefly luciferase values were corrected for Renilla expression. Polyadenylation was tested by PAT assay, Northern blot, or direct visualization using radioactively labeled RNAs. For Northern blots, a random-primed probe against the full-length 3′ UTR of Toll was used. Radioactively labeled RNAs were resolved in denaturing 6% acrylamide gels and visualized in a PhosphorImager. PAT assays were performed as described by Sallés and Strickland (1995) after treatment of RNA samples with Turbo DNase (Ambion). The gene-specific oligonucleotides used for these assays are described in the Supplemental Material. To amplify endogenous Toll, RNA was extracted from embryos using Trizol (Invitrogen), and 150–300 ng of total RNA (100 embryos) were used in the reaction. Amplified products were resolved in 1% agarose gels. For competition assays, extracts were preincubated for 10 min on ice with increasing amounts of ApppG-capped RNA competitor. The remaining reagents needed for translation were subsequently added, and the reaction was further incubated for 10 min before adding the radioactively labeled substrate mRNA.

Plasmids and in vitro transcription

DNA constructs are detailed in the Supplemental Material. mRNAs were synthesized as described previously (Gebauer et al. 1999). mRNAs lacked a poly(A) tail and contained a 7mGpppG cap. RNAs used as competitors contained an ApppG cap. Cordycepin was incorporated at the 3′ end of Luc-toll mRNA with yeast poly(A) polymerase (GE Healthcare), following the recommendations of the vendor.

Microinjections

Oregon embryos (0–30 min old) were injected in a ventral–posterior position as described previously (Schisa and Strickland 1998). Embryos were allowed to develop for 72 h at 18°C, and the number of hatched embryos was scored to estimate the percentage of viability. For PAT assays, microinjected embryos were allowed to develop for 1 h before extraction of RNA with Trizol (Invitrogen).

Acknowledgments

We thank Raúl Méndez, Juan Valcárcel, Martine Simonelig, Josep Vilardell, and Antoine Graindorge for critically reading this manuscript and for useful suggestions. We also thank Cornelia de Moor and Raúl Méndez for CycB1 plasmids. This work was supported by grants BMC2003-04108 and BFU2006-01874 from the Spanish Ministry of Education and Science, and grant 2005SGR00669 from the Department of Universities, Information, and Sciences of the Generalitat of Catalunya (DURSI). F.G. is supported by the I3 Program of the Spanish Ministry of Education and Science.

References

Barnard DC, Ryan K, Manley JL, Richter JD. 2004. Symplekin and xGLD-2 are required for CPEB-mediated cytoplasmic polyadenylation. Cell 119: 641–651.
Belloc E, Piqué M, Méndez R. 2008. Sequential waves of polyadenylation and deadenylation define a translation circuit that drives meiotic progression. Biochem Soc Trans 36: 665–670.
Benoit B, Mitou G, Chartier A, Temme C, Zaessinger S, Wahle E, Busseau I, Simonelig M. 2005. An essential cytoplasmic function for the nuclear poly(A) binding protein, PABP2, in poly(A) tail length control and early development in Drosophila. Dev Cell 9: 511–522.
Benoit P, Papin C, Kwak JE, Wickens M, Simonelig M. 2008. PAP- and GLD2-type poly(A) polymerases are required sequentially in cytoplasmic polyadenylation and oogenesis in Drosophila. Development 135: 1969–1979.
Castagnetti S, Ephrussi A. 2003. Orb and a long poly(A) tail are required for efficient oskar translation at the posterior pole of the Drosophila oocyte. Development 130: 835–843.
Chang JS, Tan L, Schedl P. 1999. The Drosophila CPEB homolog, orb, is required for oskar protein expression in oocytes. Dev Biol 215: 91–106.
Charlesworth A, Ridge JA, King LA, MacNicol MC, MacNicol AM. 2002. A novel regulatory element determines the timing of Mos mRNA translation during Xenopus oocyte maturation. EMBO J 21: 2798–2806.
Charlesworth A, Cox LL, MacNicol AM. 2004. Cytoplasmic polyadenylation element (CPE)- and CPE-binding protein (CPEB)-independent mechanisms regulate early class maternal mRNA translational activation in Xenopus oocytes. J Biol Chem 279: 17650–17659.
Charlesworth A, Wilczynska A, Thampi P, Cox LL, MacNicol AM. 2006. Musashi regulates the temporal order of mRNA translation during Xenopus oocyte maturation. EMBO J 25: 2792–2801.
Chen F, Wilusz J. 1998. Auxiliary downstream elements are required for efficient polyadenylation of mammalian pre-mRNAs. Nucleic Acids Res 26: 2891–2898.
Chu T, Henrion G, Haegeli V, Strickland S. 2001. Cortex, a Drosophila gene required to complete oocyte meiosis, is a member of the Cdc20/fizzy protein family. Genesis 29: 141–152.
Cui J, Sackton KL, Horner VL, Kumar KE, Wolfner MF. 2008. Wispy, the Drosophila homolog of GLD-2, is required during oogenesis and egg activation. Genetics 178: 2017–2029.
Gebauer F, Corona DF, Preiss T, Becker PB, Hentze MW. 1999. Translational control of dosage compensation in Drosophila by Sex-lethal: Cooperative silencing via the 5′ and 3′ UTRs of msl-2 mRNA is independent of the poly(A) tail. EMBO J 18: 6146–6154.
Groisman I, Jung MY, Sarkissian M, Cao Q, Richter JD. 2002. Translational control of the embryonic cell cycle. Cell 109: 473–483.
Harms E, Chu T, Henrion G, Strickland S. 2000. The only function of Grauzone required for Drosophila oocyte meiosis is transcriptional activation of the cortex gene. Genetics 155: 1831–1839.
Juge F, Zaessinger S, Temme C, Wahle E, Simonelig M. 2002. Control of poly(A) polymerase level is essential to cytoplasmic polyadenylation and early development in Drosophila. EMBO J 21: 6603–6613.
Keleman K, Krüttner S, Alenius M, Dickson BJ. 2007. Function of the Drosophila CPEB protein Orb2 in long-term courtship memory. Nat Neurosci 10: 1587–1593.
Kim JH, Richter JD. 2006. Opposing polymerase-deadenylase activities regulate cytoplasmic polyadenylation. Mol Cell 24: 173–183.
Lieberfarb ME, Chu T, Wreden C, Theurkauf W, Gergen JP, Strickland S. 1996. Mutations that perturb poly(A)-dependent maternal mRNA activation block the initiation of development. Development 122: 579–588.
MacDonald CC, Redondo JL. 2002. Reexamining the polyadenylation signal: Were we wrong about AAUAAA? Mol Cell Endocrinol 190: 1–8.
McGrew LL, Richter JD. 1990. Translational control by cytoplasmic polyadenylation during Xenopus oocyte maturation: Characterization of cis and trans elements and regulation by cyclin/MPF. EMBO J 9: 3743–3751.
Minshall N, Reiter MH, Weil D, Standart N. 2007. CPEB interacts with an ovary-specific eIF4E and 4E-T in early Xenopus oocytes. J Biol Chem 282: 37389–37401.
Paillard L, Maniey D, Lachaume P, Legagneux V, Osborne HB. 2000. Identification of a C-rich element as a novel cytoplasmic polyadenylation element in Xenopus embryos. Mech Dev 93: 117–125.
Piqué M, López JM, Foissac S, Guigó R, Méndez R.
2008. A combinatorial code for CPE-mediated translational control. Cell 132: 434–448.
Radford HE, Meijer HA, de Moor CH. 2008. Translational control by cytoplasmic polyadenylation in Xenopus oocytes. Biochim Biophys Acta 1779: 217–229.
Ralle T, Gremmels D, Stick R. 1999. Translational control of nuclear lamin B1 mRNA during oogenesis and early development of Xenopus. Mech Dev 84: 89–101.
Rangan P, DeGennaro M, Jaime-Bustamante K, Coux RX, Martinho RG, Lehmann R. 2009. Temporal and spatial control of germ-plasm RNAs. Curr Biol 19: 72–77.
Richter JD. 2007. CPEB: A life in translation. Trends Biochem Sci 32: 279–285.
Sallés FJ, Strickland S. 1995. Rapid and sensitive analysis of mRNA polyadenylation states by PCR. PCR Methods Appl 4: 317–321.
Sallés FJ, Lieberfarb ME, Wreden C, Gergen JP, Strickland S. 1994. Coordinate initiation of Drosophila development by regulated polyadenylation of maternal messenger RNAs. Science 266: 1996–1999.
Schisa JA, Strickland S. 1998. Cytoplasmic polyadenylation of Toll mRNA is required for dorsal-ventral patterning in Drosophila embryogenesis. Development 125: 2995–3003.
Simon R, Tassan JP, Richter JD. 1992. Translational control by poly(A) elongation during Xenopus development: Differential repression and enhancement by a novel cytoplasmic polyadenylation element. Genes & Dev 6: 2580–2591.
Stebbins-Boaz B, Cao Q, de Moor CH, Méndez R, Richter JD. 1999. Maskin is a CPEB-associated factor that transiently interacts with eIF-4E. Mol Cell 4: 1017–1027.
Vardy L, Orr-Weaver TL. 2007. The Drosophila PNG kinase complex regulates the translation of cyclin B. Dev Cell 12: 157–166.
Venkataraman K, Brown KM, Gilmartin GM. 2005. Analysis of a noncanonical poly(A) site reveals a tripartite mechanism for vertebrate poly(A) site recognition. Genes & Dev 19: 1315–1327.
Wang YY, Charlesworth A, Byrd SM, Gregerson R, MacNicol MC, MacNicol AM. 2008.
A novel mRNA 3′ untranslated region translational control sequence regulates Xenopus Wee1 mRNA translation. Dev Biol 317: 454–466.
Zarudnaya MI, Kolomiets IM, Potyahaylo AL, Hovorun DM. 2003. Downstream elements of mammalian pre-mRNA polyadenylation signals: Primary, secondary and higher-order structures. Nucleic Acids Res 31: 1375–1386.

Sociological Methods & Research
http://smr.sagepub.com

How Much Does It Cost? Optimization of Costs in Sequence Analysis of Social Science Data
Jacques-Antoine Gauthier, Eric D. Widmer, Philipp Bucher and Cédric Notredame
Sociological Methods & Research 2009; 38: 197
DOI: 10.1177/0049124109342065
The online version of this article can be found at: http://smr.sagepub.com/cgi/content/abstract/38/1/197
Published by SAGE Publications: http://www.sagepublications.com
Downloaded from http://smr.sagepub.com at Unithèque cantonale et universitaire de Lausanne on October 26, 2009

How Much Does It Cost? Optimization of Costs in Sequence Analysis of Social Science Data

Jacques-Antoine Gauthier, University of Lausanne, Switzerland
Eric D. Widmer, University of Geneva, Switzerland
Philipp Bucher, Swiss Institute of Bioinformatics and Swiss Institute for Experimental Cancer Research, Lausanne, Switzerland
Cédric Notredame, Centre National de la Recherche Scientifique, Marseille, France, and Centre for Genomic Regulation, Barcelona, Spain

Sociological Methods & Research, Volume 38, Number 1, August 2009, 197-231
© 2009 SAGE Publications

One major methodological problem in the analysis of sequence data is the determination of the costs from which distances between sequences are derived. Although this problem is currently not optimally dealt with in the social sciences, it has some similarity with problems that have been solved in bioinformatics over the past three decades. In this article, the authors propose an optimization of substitution and deletion/insertion costs based on computational methods. The authors provide an empirical way of determining costs for cases, frequent in the social sciences, in which theory does not clearly promote one cost scheme over another. Using three distinct data sets, the authors tested the distances and cluster solutions produced by the new cost scheme in comparison with solutions based on cost schemes associated with other research strategies. The proposed method performs well compared with other cost-setting strategies, while it alleviates the justification problem of cost schemes.

Keywords: sequence analysis; optimal matching; trajectories; empirical cost optimization

Authors' Note: Please address correspondence to Jacques-Antoine Gauthier, University of Lausanne, SSP–MISC, Bâtiment de Vidy, CH-1015 Lausanne, Switzerland; e-mail: JacquesAntoine.Gauthier@unil.ch.

Optimal matching analysis (OMA) has emerged since the 1990s as a main methodological innovation in the social sciences for finding patterns in sequences of social events (Abbott and Tsay 2000).
It is based on the assumption that successions of social statuses or events constitute stories throughout the life course that can be measured in a set of data (Abbott 1984, 1990a, 1990b, 1995a, 2001). Usual measures of distance, such as the Euclidean distance, are ineffective for many kinds of sequential data, for example, when sequence lengths differ (Kruskal 1983; Abbott 1995b, 2001). Therefore, multivariate statistical methods falling within the framework of dynamic programming procedures and stemming from molecular biology (e.g., Needleman and Wunsch 1970) have been adapted to the study of social trajectories (Abbott and Hrycak 1990; Erzberger and Prein 1997; Giele and Elder 1998; Wilson 1998; Aisenbrey 2000; Rohwer and Pötter 2002) and embodied in various software packages (TDA, Optimize, and CLUSTALG). One problem identified as major in this set of methods, however, lies in the cost schemes on which empirical analyses are based. As a matter of fact, optimal matching methods decompose the total difference between any two sequences into a collection of individual elementary differences using substitution, deletion, and insertion operations (Kruskal 1983). The determination of the costs attributed to those operations is the subject of an ongoing debate in the social sciences (Abbott and Tsay 2000; Wu 2000), since the setting of costs is in most cases not based on explicit and strong theoretical stances. For example, given a pair of sequences to be aligned, one can wonder whether it is the same to substitute 1 year of full-time employment with either 1 year of part-time employment or 1 year of being exclusively at home. If it is not, we should consider weighting the costs of those operations so that they contribute differently to the final alignment of the two sequences.
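The decomposition just described, in which the total difference between two sequences is the minimal summed cost of substitutions and insertions/deletions, is the dynamic program of Needleman and Wunsch (1970). Below is a minimal sketch; the employment states (F, P, H), the cost values, and the sequences are invented for illustration.

```python
def oma_distance(a, b, sub_cost, indel_cost):
    """Optimal-matching distance between two state sequences via
    Needleman-Wunsch dynamic programming: the minimal total cost of
    substitutions and insertions/deletions turning a into b."""
    n, m = len(a), len(b)
    # dist[i][j] holds the cost of aligning a[:i] with b[:j].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dist[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
                dist[i - 1][j] + indel_cost,   # delete a[i-1]
                dist[i][j - 1] + indel_cost,   # insert b[j-1]
            )
    return dist[n][m]

# Hypothetical yearly states: F = full-time, P = part-time, H = at home.
# A theory-driven scheme might make F<->P cheaper than F<->H.
costs = {("F", "H"): 2.0, ("F", "P"): 1.0, ("H", "P"): 1.0}
sub = lambda x, y: 0.0 if x == y else costs[tuple(sorted((x, y)))]
print(oma_distance("FFPH", "FPHH", sub, indel_cost=1.5))  # 2.0
```

With an identity matrix, `sub` would simply return one constant for any mismatch, which is exactly the first cost-setting strategy discussed in the next section.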
Some scholars emphasize the large impact that cost setting has on the final results of their analysis (Rohwer and Pötter 2002), whereas others take the opposite stance, underlining its marginal impact on similarity scores among sequences (Levine 2000). However, most argue for both sensitivity and stability of the effect of cost variations on the results of the analysis (Abbott and Hrycak 1990).4 Therefore, researchers in the social sciences are left wondering to what extent the final results of their analyses are reproducible and valid. This article first describes the usual solutions proposed by social scientists in regard to the problem of the determination of costs in sequence analysis. Then, it proposes a method that computationally derives costs from the empirical data, based on state-of-the-art approaches in bioinformatics (Henikoff and Henikoff 1992; Müller and Vingron 2000; Ng, Henikoff, and Henikoff 2000; Yu and Altschul 2005). The proposed algorithm is then tested on three distinct social science data sets. We further discuss the consequences of the results for empirical analyses of sequence data in the social sciences.

Gauthier et al. / How Much Does It Cost?

How Are Costs Determined in the Social Sciences?

The issue of costs concerns two operations in sequence analysis: substitution and insertion/deletion. Because this stage of sequence analysis is critical for further results, all publications that use OMA provide some sense of how costs are set, but with unequal degrees of detail. Based on a literature review in the field, we found five strategies regarding the setting of substitution costs as they are used in the social sciences. A first strategy is to set all substitution costs to a constant, that is, to use an identity matrix (Dijkstra and Taris 1995; Rohwer and Trappe 1997; Pentland et al.
1998; Wilson 1998; Schaeper 1999; Billari 2001). Those using this strategy argue that they have no rational way to set costs otherwise. This strategy is used largely when no theoretical rationale is available for supporting the setting of costs. It has been criticized, however, for its inability to reflect unequal differences between a given set of social characteristics, on one hand, and the distribution of those different positions, on the other. Abbott and Hrycak (1990) gave the example of occupational positions such as senior executive, first-level supervisor, and line worker, whose unequal proximities constant costs cannot express. They proposed that in this case substituting or inserting the rarest one should be more costly. A second research strategy uses differentiated costs following theoretical intuitions concerning the "social weight" of substituting one status with another (Chan 1995; Erzberger and Prein 1997; Halpin and Chan 1998; Blair-Loy 1999; Giuffre 1999; Schaeper 1999; Scherer 2001; Widmer, Levy, et al. 2003). For instance, Chan (1995) underlined that decisions about costs have to be grounded in theoretically important divisions between social classes. One may agree in principle with this and comparable statements, but the social sciences are currently characterized by various contradictory theories rather than by a common theoretical framework such as evolutionary theory in biology (Grauer and Li 2000; Turner 2001; Giddens, Duneier, and Applebaum 2003). Therefore, backing costs with theoretical statements often proves difficult because of the large number of alternatives, depending on the theory chosen. Also, because results from sequence analysis are used to support and contradict theoretical statements at the same time, there is some circularity in building the costs on the same theoretical statements that they are supposed to help prove or disprove.
This is as true for research on social classes as for other research areas in the social sciences. A third strategy consists of applying some empirical coding scheme based on common sense or face value. Aisenbrey (2000) set the substitution costs according to a hierarchical ordering of the statuses that constitute the sequences. Abbott and Forrest (1986, 1989), for instance, categorized the statuses of sequences according to the number of steps up the hierarchy necessary to put them under a common heading. The substitution cost is computed as the ratio of this number to the total number of steps possible. Applying the "garbage can model" to estimate the institutional influence on the textbook publishing process in physics and sociology by means of sequence comparison techniques, Levitt and Nass (1989) based the setting of their substitution costs on a list of topics and subtopics used in structuring textbooks. The cost was set to 1 for a change from one topic to another (e.g., stratification vs. ideology) and to 0.5 for a change between subtopics of the same topic (race vs. gender as substructures of stratification). Studying the structure of sociological articles across time, Abbott and Barman (1997) defined two levels of elementary states of sociological articles. Level 1 comprised statuses such as "introductory," "hypotheses," and "literature," whereas Level 2 encompassed subdivisions such as "topic," "state of affairs," "questions," and "author's theory/assertion" for the introductory heading. A substitution cost of 1 was attributed to subheadings falling under different headings and 0.25 to subheadings falling under the same heading. In all cases reviewed, the setting of costs is not done on strong theoretical bases, but rather on rules that make empirical sense considering the problem at hand.
Fourth, some authors set costs based on a combination of common sense (the third strategy) with the likelihood of transitions between statuses in the empirical data (Abbott and Hrycak 1990; Stovel, Savage, and Bearman 1996; Stovel and Bolan 2004). For instance, in their programmatic study of musicians' careers, Abbott and Hrycak (1990) first distinguished for each musician nine spheres of activity (court, town, church, etc.) and 15 positions (vocalist, composer, Kapellmeister, etc.). Among the 135 combinations, they finally kept 35 different occupational positions as statuses in a musician's career. To set the costs of substitution, they proposed that a change in both sphere and position is more drastic than a change in only one of the two. They set to 0.75 the cost for a change within either a sphere or a position. The cost was set to 1 when the change occurred on both levels. Second, in order to take into account the fact that some pairs of occupational positions seem to be closely connected with mobility (i.e., they often lie on the same career line), they combined the distance matrix, based on mobility, with a position/sphere dissimilarity matrix. This matrix was constructed by classifying all moves in all careers according to their frequency. The final substitution matrix is then a linear combination of the corresponding entries of the two matrices. An alternative to the development of substitution costs is represented by the use of transition costs, estimated directly on collections of trajectories. Such an option is available in the TDA software, but to our knowledge, no empirical results based solely on this way of determining substitution costs have yet been published. In a transition matrix, low costs indicate pairs of symbols that are likely to co-occur in a specific life trajectory (such as work and retirement).
In a substitution matrix, low substitution costs indicate symbols that are likely to occur simultaneously in two different trajectories. A low substitution cost does not imply any transition, but rather an equivalence of some sort between the two considered statuses. While transition matrices are ideal for analyzing individual strings and identifying trajectory anomalies, they are much less suitable for comparisons of alternative trajectories, which rely on the comparison of symbols occurring simultaneously in different trajectories. Some scholars have used costs based on transitions, combined with some additional criteria. Stovel et al. (1996) derived the substitution costs from an analysis of the complete transition matrix reporting the distribution of work transitions of all workers of Lloyds Bank over the period 1890 through 1970. They then distinguished costs for positions and for branch changes and combined them. Considering residential trajectories, Stovel and Bolan (2004) used a similar strategy. They first constructed a place-type variable (nine categories) based on a continuum ranging from small rural towns to large metropolitan cities. This theoretically based distinction was then combined with the empirical distribution of the frequency of all possible transitions among types of places. The substitution matrix was constructed as a repeated adjustment between the initial theoretical model and the empirical transition rates. In contrast to the previous three strategies, this strategy marks a significant improvement as it is at least partially empirically driven. There are, however, various problems with the solutions currently proposed. First, all reviewed solutions are at least partially driven by intuition or face value, or by some kind of theoretical stance.
Second, the choice of simple frequencies (or a linear function of them) to weight the substitution cost is supported neither by any formalized computational method nor by any statistical theoretical grounds. Third, even in cases where "pseudo" or intuitive iterative methods are used to set the substitution costs (cf. Stovel and Bolan 2004), no formal rules are presented that justify the solution chosen by the researchers. Fourth, none of those models succeeds in giving a systematic and fully empirically driven procedure for setting substitution costs. Finally, no attempts are made to optimize costs based on the empirical data at hand. In the fifth strategy, some researchers acknowledge having used a mix of several if not all approaches listed above, insisting on the exploratory dimension of the process and the fact that guidelines are few and rather fuzzy (Rohwer and Pötter 2002). To summarize, researchers in the field have underlined that the issue of the determination of costs in OMA remains presently open.

An Alternative: Deriving the Cost Empirically

To develop a method for cost setting that is more systematic and reliable than the ones currently existing in the social sciences, one should get back to the basics of sequence alignment. Given two strings I and J, a penalty for insertions and deletions (called INDEL), and a cost matrix C, where C_{S_i S_j} is the cost for aligning S_i, the ith symbol of I, against S_j, the jth symbol of J, the score of the optimal alignment can be computed using the following recursion:

OMA(i, j) = Best { OMA(i − 1, j − 1) + C_{S_i S_j},  OMA(i − 1, j) + INDEL,  OMA(i, j − 1) + INDEL }   (1)

In a general sequence comparison perspective, one considers that a substitution is equivalent to a deletion followed by an insertion.
Therefore, the value of an INDEL is often arbitrarily set to half of that of a substitution (Kruskal 1983). Each line in equation (1) corresponds to the optimal match score of two substrings. For instance, OMA(i − 1, j − 1) corresponds to the optimal match score of a subsequence containing the symbols 1 to i − 1 of Sequence 1, against a subsequence containing the symbols 1 to j − 1 of the second sequence. As such, this equation defines a recursion in which the score of any alignment OMA(i, j) can be estimated by considering an optimal extension of the three shorter alignments OMA(i − 1, j), OMA(i − 1, j − 1), and OMA(i, j − 1). Considering that each of these shorter alignments is already an optimal matching of the associated substrings, we know that OMA(i, j) is optimal. This strategy relies on the assumption that each position is independent and that the alignment scores are additive. The alignment of Sequences A and B in Figure 1 is produced by applying the recursion in equation (1) and iteratively filling up the OMA(i, j) array until the optimal matching score OMA(I, J) is obtained (Kruskal 1983). By recording the results of all the comparisons made at each step of the recursion, it is possible to trace back the optimal scores from the cell OMA(I, J), thus generating an alignment, as shown in Figure 1, where an identity substitution matrix has been used. Such a matrix assigns the value 0 to the matching of two identical letters and the value −2 to the substitution of two different letters.

Figure 1. Example of Optimal Matching Score Computation and Alignment
Insertions or deletions occurring at one extremity of the alignment take the value −1 (terminal INDEL) and the value −2 when they are used within the alignment (internal INDEL). In the OMA(i, j) array of Figure 1, the traceback is indicated in bold. Starting from the bottom-right corner of the array, vertical moves correspond to an INDEL in Sequence A, horizontal moves to an INDEL in Sequence B, and diagonal moves to a match or a substitution. One of the main issues that arises in equation (1) concerns the estimation of the substitution costs (C_ij). This issue is also central in biology. Given 20 amino acids, some so similar that they are almost interchangeable while others are very different, one cannot simply use any a priori substitution matrix; some modeling is required. Dayhoff, Schwartz, and Orcutt (1978) addressed this problem in the 1970s using a data-driven empirical approach. They manually aligned sets of highly similar, same-length sequences of amino acids and counted the number of mutations tolerated by evolution. A mutation is characterized by the presence of two different amino acids at the same position of the alignment. In a general sequence comparison perspective, this is called a substitution (Kruskal 1983). In this context, highly similar sequences are defined as those having more than 80 percent identity, where the percentage of identity is calculated by dividing the number of positions in the alignment in which the same letter appears in both sequences (identities) by the length of the alignment, as shown in equation (2). All positions with a gap in either sequence are nonidentities; thus, only the alignment of two identical sequences yields 100 percent identity:

Percentage of Identity = W = Number of Identical Matches / Alignment Length   (2)

Selecting sequences with a high percentage of identity for computing data-driven costs of substitution prevents biases due to uncontrolled heterogeneity.
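The recursion of equation (1) and the fill of the OMA array can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: for simplicity it applies a single uniform INDEL value instead of distinguishing terminal from internal indels, and all function and variable names are our own.

```python
def oma_score(seq_i, seq_j, cost, indel):
    """Optimal matching score via the dynamic-programming recursion of
    equation (1): each cell is the best extension of the three shorter
    alignments (diagonal = match/substitution, vertical/horizontal = INDEL)."""
    n, m = len(seq_i), len(seq_j)
    # oma[i][j] holds the optimal score for the prefixes of lengths i and j
    oma = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        oma[i][0] = i * indel            # leading deletions
    for j in range(1, m + 1):
        oma[0][j] = j * indel            # leading insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            oma[i][j] = max(
                oma[i - 1][j - 1] + cost(seq_i[i - 1], seq_j[j - 1]),
                oma[i - 1][j] + indel,   # deletion in seq_j
                oma[i][j - 1] + indel,   # insertion in seq_j
            )
    return oma[n][m]

# Identity cost scheme from Figure 1: 0 for a match, -2 for a substitution
identity_cost = lambda a, b: 0.0 if a == b else -2.0
```

For instance, `oma_score("AB", "B", identity_cost, -2.0)` matches B against B at no cost and pays one INDEL for the unmatched A.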
For instance, the alignment of Figure 1 displays 57 percent identity (four identical pairs of letters found over the seven positions of the alignment). Finally, for each pairwise alignment, the relative frequency of substitutions occurring between two particular amino acids is compared to what would be expected by chance alone. These values are computed as log odds and tabulated into a data-driven substitution matrix, as in equation (3):

Dayhoff Cost(a, b) = log( f_ab / (f_a × f_b) )   (3)

In equation (3), f_ab is the relative frequency with which the symbols a and b have actually matched at the same position of a given set of pairwise alignments, while f_a × f_b is the product of the relative frequencies of a and b in the same data set and therefore an estimation of the probability of seeing a and b aligned throughout all the alignments of the data set. If we consider f_ab to be an estimate of the probability of finding a and b matched in the data set, then it becomes possible to estimate the ratio of those two probabilities (their odds) and evaluate the extent to which a given substitution (match) between two symbols is over- or underrepresented in the alignments. The most notable property of log odds is to yield negative scores for events observed less often than expected by chance. In the context of optimal matching, this amounts to having a cost matrix that penalizes unexpected matches with negative values while expected matches or identities are rewarded with positive values. Since, in an alignment, two identical symbols do not systematically match, the Dayhoff cost for the match of two identical symbols is often different from zero. In biology, the matching of different pairs of identical symbols can thus be associated with different positive values. The rationale is that in biology, all conservations are not equally important.
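Equation (3) can be illustrated with a short sketch that tabulates log-odds costs from a list of matched symbol pairs. As a simplifying assumption, the symbol frequencies f_a are estimated from the same set of pairs; the function name is hypothetical.

```python
import math
from collections import Counter

def dayhoff_costs(aligned_pairs):
    """Data-driven log-odds costs (equation 3): compare the observed
    match frequency f_ab with the chance expectation f_a * f_b."""
    pair_counts = Counter(tuple(sorted(p)) for p in aligned_pairs)
    sym_counts = Counter(s for p in aligned_pairs for s in p)
    total_pairs = len(aligned_pairs)
    total_syms = 2 * total_pairs
    costs = {}
    for (a, b), n in pair_counts.items():
        f_ab = n / total_pairs                 # observed match frequency
        f_a = sym_counts[a] / total_syms       # symbol frequencies
        f_b = sym_counts[b] / total_syms
        costs[(a, b)] = math.log(f_ab / (f_a * f_b))
    return costs
```

On a toy set of pairs dominated by ('a', 'a') matches, the self-match cost comes out positive and larger than the mismatch cost, as the log-odds rationale predicts.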
In the social sciences, however, the decision was made early to set conservation costs to 0 and substitution to variable costs. This model suggests that all social statuses are equally conserved, regardless of their nature. This may or may not be true. For instance, one may ask whether the social cost should be the same for matching years spent as unemployed or years spent on the labor market. The equality of these statuses cannot be ruled out as long as it has not been formally demonstrated. For the time being, the proposed algorithm sticks to the mainstream procedure in the social sciences, but it would be trivial to adapt it so that different costs may be used for different types of identities. To get a cost of zero for the substitution of two identical symbols, we use a normalized cost (N_cost) that is derived from the cost defined in equation (3) as follows:

N_cost(a, b) = Dayhoff Cost(a, b) − ( Dayhoff Cost(a, a) + Dayhoff Cost(b, b) ) / 2   (4)

In equation (4), Dayhoff Cost refers to the original Dayhoff cost (equation [3]) that is positive and maximized for identities while yielding lower (often negative) values for mismatches. A substitution matrix based on N_costs has the same properties as the Dayhoff cost matrix except that it yields a null cost for the alignment of two identical sequences, a convenient property for cluster analysis based on a distance matrix. In biology, it is common practice to use log-odds matrices as a scoring scheme when applying the optimal matching algorithm. The main reason is that the versatility of the log-odds method makes it possible to discriminate between different types of mismatches in an objective and quantitative fashion.
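A minimal sketch of the normalization in equation (4), applied to a small Dayhoff-style matrix whose values are invented purely for illustration:

```python
def n_cost(dayhoff, a, b):
    """Normalized cost (equation 4): subtract the average of the two
    self-match costs so that identities cost exactly zero."""
    return (dayhoff[tuple(sorted((a, b)))]
            - (dayhoff[(a, a)] + dayhoff[(b, b)]) / 2)

# Hypothetical log-odds values, for illustration only
toy = {('a', 'a'): 2.0, ('b', 'b'): 1.0, ('a', 'b'): -0.5}
```

With these toy values, `n_cost(toy, 'a', 'a')` is 0 and `n_cost(toy, 'a', 'b')` is negative, as the text requires.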
As substitution and INDEL operations are mutually dependent, using cost matrices as defined in equation (3) or (4) calls for setting the value of the INDELs according to the cost matrix at hand, as shown in equation (5):

INDEL = [ Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} C_ij ] / [ (N² − N) × 0.5 ]   (5)

In equation (5), the cost for not matching a symbol (INDEL) was estimated using the Thompson formula (Thompson, Higgins, and Gibson 1994), where INDEL is set to the average substitution cost of the substitution matrix (i.e., the matrix average ignoring the values on the main diagonal). It is possible to distinguish two kinds of indels: the internal ones, which occur between two given symbols, and the terminal ones, which come at the end of the shorter sequence to make its length equal to that of the longer one. In the context of this work, we simply attributed the INDEL value of equation (5) to internal indels only and lowered it to INDEL/2 for terminal ones, thus making it easier for indels to be terminal rather than internal. Given a collection of sequences, the main difficulty is the proper estimation of an appropriate cost matrix. Using reference alignments is possible but may require some a priori knowledge. In the case of Dayhoff, using reference alignments was possible because closely related sequences were available whose alignment could be assembled in an unambiguous manner (i.e., without INDEL). In the social sciences, reference alignments are not available, and a strategy must therefore be worked out to generate them in a systematic and unbiased fashion. Over the last 15 years, several techniques have been introduced in biology aimed at training position-specific substitution matrices through iterative sequence-alignment procedures (Lawrence et al. 1993; Hughey and Krogh 1996; Altschul et al. 1997; Bateman et al. 1999).
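Equation (5) reduces to averaging the off-diagonal entries of the substitution matrix, since (N² − N) × 0.5 is exactly the number of unordered symbol pairs. A hedged sketch, assuming the matrix is keyed on sorted symbol tuples and the symbol list is sorted:

```python
def thompson_indel(matrix, symbols):
    """Equation (5): set INDEL to the average off-diagonal substitution
    cost; (n * n - n) * 0.5 counts the unordered pairs i < j."""
    n = len(symbols)
    total = sum(matrix[(symbols[i], symbols[j])]
                for i in range(n - 1) for j in range(i + 1, n))
    return total / ((n * n - n) * 0.5)
```

For a three-symbol alphabet the denominator is (9 − 3) × 0.5 = 3, i.e. the three unordered pairs.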
PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool), one of the most popular tools in biology, is the one whose principle most resembles the one developed here. In PSI-BLAST, a biological sequence is first compared to all the others in the database, using an off-the-shelf substitution matrix. The best alignments thus generated are selected according to their percentage of identity and used to update the matrix in a process that goes on, cycle after cycle, until successive cycles fail to modify the matrix, in which case the algorithm is said to have reached convergence. The proposed strategy is directly adapted from this iterative method and is outlined in the pseudocode shown in Figure 2.

Figure 2. Pseudocode Describing the Iterative Training Procedure

0 - Set the CURRENT matrix to the identity matrix
I - Optimal matching of the sequences with the CURRENT matrix
II - Estimation of the NEW matrix on the alignments produced in (I):
      Measure the percent identity on the alignments
      Select alignments yielding more than 60% identity
      Count the matches/mismatches on the selected alignments
      Weight the counts of each alignment with its percent identity
      Compute the NEW matrix
III - Comparison of the NEW matrix and the CURRENT matrix:
      If CURRENT == NEW, terminate
      Else set CURRENT to NEW and proceed to I

In this context, the matrix can be viewed as a model used for generating optimal matches of the sequences. In other words, a correct matrix must be able to generate alignments similar to those it was estimated from. This equivalence is sought in the iteration procedure, in which matrices and alignments are alternately generated until they both become invariant, suggesting an equivalent information content.
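The loop of Figure 2 can be condensed into a small runnable toy. This sketch makes several simplifying assumptions relative to the text: a single uniform INDEL (no Thompson formula and no terminal/internal distinction), smoothing applied only to the match frequency, and convergence tested by exact matrix equality. All names are ours, not the SALTT package's.

```python
import math
from collections import Counter
from itertools import combinations

def align(s, t, cost, indel=-1.0):
    """Needleman-Wunsch fill plus traceback; returns two gapped strings."""
    n, m = len(s), len(t)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * indel
    for j in range(1, m + 1):
        M[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + cost(s[i - 1], t[j - 1]),
                          M[i - 1][j] + indel, M[i][j - 1] + indel)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i and j and M[i][j] == M[i - 1][j - 1] + cost(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i and M[i][j] == M[i - 1][j] + indel:
            a.append(s[i - 1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); j -= 1
    return ''.join(reversed(a)), ''.join(reversed(b))

def train(seqs, threshold=0.6, max_cycles=10):
    """Toy version of the Figure 2 loop: start from an identity matrix,
    align all pairs, keep and weight alignments by percent identity,
    re-estimate log-odds costs, and stop when the matrix is invariant."""
    matrix = {}
    cost = lambda a, b: matrix.get(tuple(sorted((a, b))),
                                   0.0 if a == b else -2.0)
    for cycle in range(max_cycles):
        pair_w, sym_w = Counter(), Counter()
        tot_pair_w = tot_sym_w = 0.0
        for s, t in combinations(seqs, 2):
            a, b = align(s, t, cost)
            w = sum(x == y for x, y in zip(a, b)) / len(a)  # percent identity
            if w < threshold:
                continue                       # discard unreliable alignments
            for x, y in zip(a, b):
                if x != '-' and y != '-':      # count matched pairs, weighted
                    pair_w[tuple(sorted((x, y)))] += w
                    sym_w[x] += w; sym_w[y] += w
                    tot_pair_w += w; tot_sym_w += 2 * w
        new = {p: math.log((n / tot_pair_w + 0.001) /
                           ((sym_w[p[0]] / tot_sym_w) * (sym_w[p[-1]] / tot_sym_w)))
               for p, n in pair_w.items()}
        if new == matrix:                      # convergence: matrix invariant
            return matrix, cycle
        matrix = new
    return matrix, max_cycles
```

On a trivial data set of identical sequences the procedure converges after a single re-estimation, with a positive self-match cost and no entry for pairs that were never matched.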
Overall, this amounts to generating matrices whose purpose is to optimally summarize the information contained in the sequences. In this context, the alignments and the matrix can be viewed as two alternative models of the relationships among the sequences. The convergence is meant to ensure that these two models are equivalent.

Empirical Cost Matrix Estimation of Social Science Data

Given a set of sequences of social statuses and a preestimated matrix (in this case, an identity matrix), pairwise alignments are generated with the OMA algorithm. This can be done either by exhaustively considering all possible pairs of sequences or by restricting the training procedure (cf. Figure 2) to a random subset of the sequences if computation time is an issue. When computing matrix statistics from these alignments, the main caveat is the uneven alignment quality. While mismatches measured on almost identical strings can be expected to be meaningful substitutions, matches and mismatches measured on poorly matched strings may be suspicious. Dealing with low-quality alignments is a delicate issue in the social sciences as well as in biology. The simplest approach to deal with this limitation is to ignore alignments with a low percentage of identity, as done in PSI-BLAST (discussed previously). For instance, in the context of this article, we excluded all the alignments yielding less than 60 percent identity (equation [2]). Such a conservative threshold ensures the quality of the considered alignments and therefore the relevance of the observed substitutions. Furthermore, based on strategies developed in biology, we also applied an extra weight to the selected alignments in order to ensure that the best alignments contribute more to the final matrix.
This extra step is similar to the selection made for empirically estimating the costs of substitution (equation [2]), but it specifically helps smooth the convergence of the iterative process and also guarantees a stronger contribution of the most reliable alignments. We again used the percentage of identity (as measured on the alignment) as a weighting scheme. This parameter is often regarded as a good indicator of correctness and was successfully used by Notredame, Holm, and Higgins (1998) to design local scoring schemes. We therefore used equation (6) to estimate the weighted relative frequency f_ab of symbol a matching symbol b in the alignments:

f_ab = [ Σ_{i=1}^{S} W_i × N_i(a, b) ] / [ Σ_{i=1}^{S} W_i × L_i ]   (6)

In equation (6), W_i is the weight associated with the alignment i, L_i is the length of that alignment, and N_i(a, b) is the number of pairs ab matched in that alignment. The term N_i(a, b) indifferently represents identical matches if a and b are the same symbol or mismatches if a and b are different symbols. The weight W_i is meant to increase the contribution of trustworthy alignments, thus speeding the convergence process and decreasing the amount of noise contributed by spurious alignments. In the case of social science data, in which sequence patterns are shaped following less stringent rules than biological sequences and therefore show more diversity, this approach allows us at the same time to take a greater variability of sequences into account and to limit the influence of outliers. To prevent null frequencies (which would make the log odds undefined) caused by a rare mismatch or match, a small value (0.001) is added to every frequency.
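Equation (6) can be sketched directly. The representation of an alignment as a (matched pairs, weight, length) triple is an assumption made for the example; the function name is ours.

```python
def weighted_freq(alignments, a, b):
    """Equation (6): relative frequency of symbol a matched with symbol b,
    weighted by each alignment's percent identity W_i.  `alignments` is a
    list of (pairs, weight, length) triples, a toy representation."""
    target = tuple(sorted((a, b)))
    num = sum(w * sum(1 for p in pairs if tuple(sorted(p)) == target)
              for pairs, w, length in alignments)
    den = sum(w * length for pairs, w, length in alignments)
    return num / den
```

For two alignments of weights 0.8 and 0.6 and lengths 5 and 4, containing one and two a/b matches respectively, equation (6) gives (0.8 × 1 + 0.6 × 2) / (0.8 × 5 + 0.6 × 4) = 0.3125.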
Given frequencies tabulated for every possible pair of symbols, the substitution matrix is then computed using equation (3). This matrix is the new matrix that will be used in the next training round (cf. Figure 2). The iteration procedure is meant to optimize the cost matrix so that it summarizes as accurately as possible the information contained in the alignments from which it is estimated. That procedure is complete when a matrix is able to generate alignments with statistical properties similar to those it originates from. That convergence can easily be measured by estimating the difference Δ between two successive matrices in the evaluation procedure, using, for instance, the root mean square difference between them:

Δ = sqrt( Avg[ (M1(a, b) − M2(a, b))² ] )   (7)

The iterative procedure is stopped when Δ becomes equal to 0. However, this procedure is merely an attempt to reach optimality, with no proven guarantee (Hughey and Krogh 1996). In this context, the simplest criterion to ensure optimality is to check that alternative trainings converge on similar matrices, as indicated by low Δ values, as in equation (7). To validate this, we randomly selected 10 sets of 100 sequences in the test data set and trained the corresponding matrices, keeping the intermediate matrices obtained at every cycle. Figure 3 shows the average Δ measured between all these matrices against the iteration number. Given a data set of 100 sequences 40 symbols long, Figure 3 shows the typical profile of several training procedures. The Δ is an estimation of the difference between the matrices of two successive rounds (low Δs indicate highly similar matrices). While Δ values tend to decrease over cycles, increasing values (peaks) usually result from the training procedure escaping a local minimum.
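The convergence measure of equation (7) amounts to a root-mean-square difference over matrix entries; a minimal sketch, assuming matrices stored as dictionaries and treating missing entries as 0:

```python
import math

def matrix_delta(m1, m2):
    """Equation (7): root-mean-square difference between two successive
    cost matrices; a value of 0 signals that training has converged."""
    keys = set(m1) | set(m2)
    return math.sqrt(sum((m1.get(k, 0.0) - m2.get(k, 0.0)) ** 2
                         for k in keys) / len(keys))
```

Identical matrices yield 0; a single entry differing by 2 yields a delta of 2.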
Each curve in Figure 3 corresponds to one matrix estimation run. For each run, a set of 1,000 sequence pairs was chosen randomly (out of the 100*100 possible pairs) and kept through all the iterations. The results suggest that the estimation procedure is insensitive to this initial choice, with convergence systematically occurring in 5 to 6 cycles and final matrices highly correlated.5 Altogether, this high correlation and the constant number of cycles suggest an efficient and robust training procedure.

Figure 3. Value of Δ Against the Number of Iterations of the Training Procedure of Substitution Costs for 10 Randomly Selected Sets of Sequences From the Swiss Household Panel Data

All the procedures described here have been encapsulated in a sequence analysis package called SALTT (Search Algorithm for Life Trajectories and Transitions). It can be compiled and installed on any UNIX-like platform, including Linux, Cygwin, and Mac OS X. The package and its documentation are distributed under the General Public License and available free of charge from the authors (Notredame et al. 2005).

Criteria for Comparing Outcomes From Various Cost-Setting Strategies

To test whether the proposed solution provides more adequate results than previous methods, one may consider some criteria specific to the training procedure as well as a set of criteria widely used to estimate how well sequence analysis and cluster analysis perform. The first and simplest criterion to establish the validity of the proposed strategy is to apply it to biological sequences and train matrices that can be compared to standard biological matrices. We have done so on a
well-known collection of 500 related human sequences known as the kinome (Manning et al. 2002). The procedure delivered a substitution matrix highly correlated with a standard point accepted mutation (PAM) matrix, in which all the known mutational preferences between amino acids could easily be recognized. We then do the same by comparing three distinct sets of social science data representing the same sequential reality. Then, the training procedure is evaluated by testing its ability to correctly identify the closeness of two different symbols, using solely the information contained in the data. To do so, we use a set of sequences to compare the reference cost produced by the training procedure for the substitution of two given symbols (e.g., AB) with the cost produced by the training procedure for the same substitution when one of the symbols (e.g., A) has been randomly split into two new symbols (M and N) not belonging to the alphabet. As the symbols M and N are actually "hidden A"s, we expect the training procedure to determine the substitution costs AB, MB, and NB as equivalent. Figure 4 shows for two given sequences how a symbol is randomly split into two new symbols not belonging to the original alphabet.

Figure 4. Random Splitting of Symbol A into Two New Symbols (M and N) Not Belonging to the Original Alphabet

Seq1:  AABDEEBADBEDA
Seq2:  BDAEAEAADBADA
        A -> M or N
Seq1b: MNBDEEBNDBEDM
Seq2b: BDNEMEMNDBNDM

Testing the Quality of the Clustering

A third set of criteria pertains to quality testing of cluster analysis. One of the main difficulties with clustering methods lies in the determination of the number of clusters really present in the data (Milligan and Cooper 1985, 1987). There is no perfect method to establish this number, but several indicators may be used to help decide (Everitt 1979; Bock 1985; Hartigan 1985; Milligan and Cooper 1985; SAS Institute 2004).
For Milligan and Cooper (1987), there are two categories of tests concerning the quality of cluster analysis. The first considers that internal criteria are able to validate the results of the clustering, that is, to justify the number of clusters chosen. The second one uses external criteria. Such criteria represent information that is external to the cluster analysis and was not used at any other point in the cluster analysis (Milligan and Cooper 1987). In terms of internal criteria, Milligan and Cooper (1985) have evaluated and compared 30 statistics known as stopping rules that help in deciding how many "real" clusters are present in the data. The availability of such indices in the main statistical software packages (such as SAS or SPSS) is of course a nonnegligible element of choice concerning which criteria to use. Two of the most efficient indices among the 30 that Milligan and Cooper (1985, 1987) have evaluated are part of the SAS software. The first one is a pseudo-F developed by Calinski and Harabasz (1974); it represents an approximation of the ratio between intercluster and intracluster variance. The second index is expressed as Je(2)/Je(1) (Duda and Hart 1973) and may be transformed into a pseudo-t2.6 The third criterion we used is R2, which expresses the size of the experimental effect. It is reasonable to look for a consensus among the three criteria (Nargundkar and Olzer 1998; SAS Institute 2004).
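The Calinski-Harabasz pseudo-F mentioned above is the ratio of between-cluster to within-cluster variance, each scaled by its degrees of freedom. A minimal sketch for one-dimensional observations; the function name and sample values are illustrative, not from the article:

```python
def pseudo_f(clusters):
    """Calinski-Harabasz pseudo-F for a partition of 1-D observations.
    `clusters` is a list of lists of numeric values."""
    points = [x for c in clusters for x in c]
    n, k = len(points), len(clusters)
    grand = sum(points) / n
    # Between-cluster sum of squares: cluster sizes times squared mean offsets.
    between = sum(len(c) * (sum(c) / len(c) - grand) ** 2 for c in clusters)
    # Within-cluster sum of squares: deviations from each cluster's own mean.
    within = sum(sum((x - sum(c) / len(c)) ** 2 for x in c) for c in clusters)
    return (between / (k - 1)) / (within / (n - k))

# Two well-separated clusters yield a much higher pseudo-F than a poor split.
good = pseudo_f([[1.0, 1.1, 0.9], [9.0, 9.2, 8.8]])
bad = pseudo_f([[1.0, 1.1, 9.2], [9.0, 0.9, 8.8]])
print(good > bad)  # True
```

A local peak of this statistic over a sequence of cluster solutions is the stopping-rule signal described in the text.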
We can then define the stopping rule for a statistically optimal cluster solution as a local peak of the pseudo-F (high ratio between inter- and intracluster variance), associated with a low value of the pseudo-t2 that increases at the next fusion, and a marked drop of the overall R2.7 Generally, a cluster solution is said to be statistically optimal when the number of classes is kept constant across strategies, when the intercluster variance is highest, and when the intracluster variance is lowest. Put another way, clusters should exhibit two properties, external isolation and internal cohesion (Punj and Stewart 1983). Therefore, using comparative scree plots is a straightforward way of dealing with the issue of testing cluster solutions drawn from distances based on various cost schemes, including the computationally derived one. A given cluster solution is retained for analysis only if at least two among those three criteria (pseudo-F, pseudo-t2, and R2) support its validity. External criteria refer to the extent to which clusters drawn from the data correlate with either independent variables or outcomes (Milligan and Cooper 1987). Clusters that do not associate with these variables are of little help in social research, as the ultimate goal of social sciences is explanation rather than description. A third criterion is more intuitive: To what extent are empirical clusters easily comprehended, based on prior knowledge of the phenomenon and the central hypothesis of the research? This criterion can be approached by using experts and computing interreliability estimates. The procedure in that case is as follows: Provide cluster solutions based on the various cost schemes, and have a set of raters decide independently which is their favorite solution.
Then one may compute interrater reliability and see which coding scheme comes up first in the list. Given the importance of the debate concerning the influence of sociostructural factors on the occupational trajectories of women in the sociological field, and the availability of high-quality data on occupational status during entire life courses, we test these methods on data sets addressing this topic.

Description of the Test Samples

Considering the fact that women's labor market participation is more diverse than that of men (Myrdal and Klein 1956; Levy 1977; Mott 1978; Elder 1985; Moen 1985; Höpflinger, Charles, and Debrunner 1991; Moen and Yu 2000; Blossfeld and Drobnic 2001; Krüger and Levy 2001; Levy, Widmer, and Kellerhals 2002; Moen 2003; Widmer, Kellerhals, and Levy 2003; Bird and Krüger 2005; Levy, Gauthier, and Widmer 2006), and in order to facilitate comparisons between the data sets, for each database we selected only women who were married or living with a partner at the time of the interview. Moreover, in order to maximize the quality of the data, we retained only the trajectories that had less than 10 percent of missing values.

Sample Test 1: Social Stratification, Cohesion, and Conflict in Contemporary Families

The first sample of occupational trajectories is drawn from a retrospective questionnaire of the study Social Stratification, Cohesion, and Conflict in Contemporary Families (SCF), which was conducted in 1998 with 1,400 individuals living as couples in Switzerland (Widmer, Kellerhals, and Levy 2003; Widmer, Kellerhals, and Levy 2004). Respondents were asked to provide information about every year of their occupational trajectory from age 16 onward to age 64.
Every year of the trajectories was coded using a seven-category code scheme: full-time employment, part-time employment, positive interruption (sabbatical, trip abroad, etc.), negative interruption (unemployment, illness, etc.), housework, retirement, and full-time education. Data were right truncated, as most individuals had not yet reached the age of 64 at the time of the interview. Sociostructural indicators (socioeconomic status of the orientation family, educational level, number of children, and income) were measured for the time of the interviews only. The final sample size was 564 women.

Sample Test 2: The Swiss Household Panel

Since 1999, the Swiss Household Panel (SHP) has collected data on a representative sample of private households in Switzerland on a yearly basis.8 In its third wave, the SHP included a retrospective questionnaire sent to 4,217 households (representing 8,913 individuals). For reasons of validity, the analysis of the subsample of individuals who answered the retrospective questionnaire was restricted to those aged 30 and older, decreasing the sampled female population to 1,935. The SHP asked respondents to provide information on their educational and occupational status from their birth to the present. Each change in status is associated with a starting year and an ending year. We recoded these the same way as for Sample Test 1. Sociostructural indicators comparable to those in Sample 1 were also obtained. This sample included 1,107 women.

Sample Test 3: Female Job Histories From the Wisconsin Longitudinal Study

The Wisconsin Longitudinal Study (WLS) is a long-term study of a random sample of 10,317 men and women who graduated from Wisconsin high schools in 1957. This data set is for public use and available at the University of Wisconsin–Madison Web site (http://www.ssc.wisc.edu/wlsresearch).
The female job histories of 1957–1992 were constructed by Sheridan (1997) from the 1957, 1964, 1975, and 1992 WLS data collections. The data also include social background, youthful aspirations, schooling, military service, family formation, labor market experiences, and social participation. The "female job histories" data concern 5,042 women born in 1938 and 1939. We could retain only three main occupational statuses, namely, full-time paid work, part-time paid work, and full-time housewife. There were 2,243 women in this sample.

Results

Production of Data-Driven Costs of Substitution

From a sociological point of view, we could expect a relative stability of the costs of substitution from one set of sequences to another, the occupational trajectories of contemporary Swiss and North American women being to a certain extent comparable, at least with regard to the influence of the birth of children on the reduction or cessation of paid work. The individual sequences of occupational statuses are built by attributing a single symbol (a code corresponding to a given occupational status) to each year of life of the respondents.9 Table 1 compares the different costs of substitution, either set arbitrarily to identity, set following theoretical arguments concerning differences among types and rates of occupational activities (for details, see Widmer, Levy, et al. 2003), or obtained by means of a training procedure on the different databases. Table 1 shows that the training procedure produces costs that are more differentiated than identity costs. The range of costs is also broader, partly because the procedure is sensitive to very rare substitutions. The stability of the trained costs of substitution from one database to another confirms the ability of the training to produce meaningful cost schemes.
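The year-by-year coding described above (one symbol per year of life, derived from status spells that have a starting and an ending year, as in the SHP data) can be sketched as follows. The function name, the one-letter status codes, and the years are hypothetical illustrations, not taken from the article:

```python
def spells_to_sequence(spells, start_year, end_year, missing="X"):
    """Expand status spells (status, first_year, last_year) into a
    year-by-year sequence of one-letter status codes.
    Years not covered by any spell are coded as missing."""
    years = {}
    for status, first, last in spells:
        for y in range(first, last + 1):
            years[y] = status
    return "".join(years.get(y, missing) for y in range(start_year, end_year + 1))

# F = full-time, H = housework, P = part-time (hypothetical one-letter codes)
spells = [("F", 1980, 1984), ("H", 1985, 1989), ("P", 1990, 1992)]
print(spells_to_sequence(spells, 1980, 1994))  # FFFFFHHHHHPPPXX
```

Right truncation, as in the SCF data, simply shows up here as a run of missing codes at the end of the sequence.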
The training procedure reflects some relations between the different statuses that are sociologically relevant. Compared to identity costs, which cannot be differentiated between men and women, the trained costs reveal, for example, the relative ease (the low costs) with which women in the samples go from paid work to housework. The comparison of knowledge-based costs and trained costs of substitution shows a high similarity between the two sets of values: knowledge-based costs are correlated with trained costs at .68 (p < .01) for SCF data, at .63 (p < .01) for SHP data, and at .73 (p < .05) for WLS data. Table 2 shows Pearson's coefficient of correlation between the costs by method of cost setting and database. Table 2 shows that the trained costs of substitution are more strongly associated with each other from one data set to another than they are with costs set either to identity or to knowledge-based values. On the other hand, even if they remain relatively high, the associations between trained costs on one side and identity or knowledge-based costs on the other are systematically weaker than those between trained costs. This confirms the stability of the results stemming from the training procedure and explains, at least partly, the slightly but systematically different (and more highly correlated) results it provides compared to the two other strategies (identity and knowledge based).

Validation of the Training Procedure

An important issue in the use of a computerized data-based determination of substitution costs is to assess the extent to which this procedure is able to process information in a sociologically relevant way.
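The correlations between cost matrices reported here are ordinary Pearson correlations over the paired lists of substitution costs. A minimal sketch; the two five-value cost vectors are illustrative, not taken from Table 1:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of costs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (made-up) costs for the same five substitutions under two schemes
knowledge = [0.8, 1.0, 0.8, 1.0, 0.3]
trained = [0.6, 0.7, 0.7, 0.9, 0.5]
print(round(pearson(knowledge, trained), 2))  # 0.81
```

Flattening each cost matrix into such a vector (one entry per pair of statuses) and correlating the vectors is all that Table 2 requires.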
Three different tests were used.

Table 1. Comparisons of Identity, Knowledge-Based, and Trained Costs of Substitution for Three Data Sets: SCF, SHP, and WLS

For each possible substitution of occupational status (Full-Time * Part-Time, Full-Time * Negative Interruption, Full-Time * Positive Interruption, Full-Time * At Home, Full-Time * Retirement, Full-Time * Education, Full-Time * Missing, Part-Time * Negative Interruption, Part-Time * Positive Interruption, Part-Time * At Home, Part-Time * Retirement, Part-Time * Education, Part-Time * Missing, Negative Interruption * Positive Interruption, Negative Interruption * At Home, Negative Interruption * Retirement, Negative Interruption * Education, Negative Interruption * Missing, Positive Interruption * At Home, Positive Interruption * Retirement, Positive Interruption * Education, Positive Interruption * Missing, At Home * Retirement, At Home * Education, At Home * Missing, Retirement * Education, Retirement * Missing, Education * Missing) and for insertion or deletion, the table gives five costs: the identity cost, the knowledge-based cost, and the trained costs for the SCF, SHP, and WLS data sets. Identity costs are 1.0 for every substitution except those involving the Missing status (0.3), and an insertion or deletion costs 0.5 under every cost-setting scheme.

Note: SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.
The first one referred to the ability of the procedure to evaluate the closeness of a symbol belonging to the alphabet with an unknown symbol not belonging to it. The second one focused on the degree of agreement between classifications of social trajectories made by specialists in the field compared with classifications of the same data based on identity, knowledge-based, and trained costs of substitution. The third one consisted of measuring the extent to which clusters drawn from the data correlate with some independent sociostructural variables or outcomes.

Table 2. Pearson's Correlation Between Cost Matrices, by Method of Cost Setting and (Full) Data Sets

                   Identity   Knowledge   SCF Trained   SHP Trained   WLS Trained
Identity             1.00       .98***      .66***        .61***        .71*
Knowledge based      .98***    1.00         .68***        .63***        .73*
SCF trained          .66***     .68***     1.00           .96***        .97***
SHP trained          .61***     .63***      .96***       1.00           .94***
WLS trained          .71*       .73*        .97***        .94***       1.00

Note: UNIX command line to produce the trained matrix: saltt -e '-in dataset.dat -action +pavie_seq2pavie_mat _TGEPF50_THR60_TWE04_SAMPLE50000_'. SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study. *p < .05. ***p < .001.

Identifying the Proximity of Unknown Symbols

A first way of validating the training procedure consists of measuring the extent to which it is able to unravel the proximity of two given symbols, based on no other information than the data itself. We tested this for the SCF set of sequences by randomly replacing a given symbol of the sequences' alphabet A = {A, B, C, D, E, F, G, X}, which corresponds in this case to an occupational status, with two symbols that did not belong to the original alphabet of that set of sequences, that is, symbols whose actual identity was hidden.
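The random splitting of a symbol into two hidden substitutes (Figure 4) can be sketched as follows. The sequences are those of Figure 4; the function name and the fixed seed are our own illustrative choices:

```python
import random

def split_symbol(sequences, target, substitutes=("M", "N"), seed=0):
    """Randomly replace every occurrence of `target` with one of two
    new symbols not belonging to the original alphabet."""
    rng = random.Random(seed)
    return ["".join(rng.choice(substitutes) if s == target else s
                    for s in seq) for seq in sequences]

seqs = ["AABDEEBADBEDA", "BDAEAEAADBADA"]
hidden = split_symbol(seqs, "A")
print(hidden)  # every A is now an M or an N; all other symbols are untouched
```

Retraining the cost matrix on the modified sequences then lets one check whether the procedure assigns near-identical costs to AB, MB, and NB.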
Using the training procedure, we then compared the original costs for substituting, for example, Symbol A with Symbol B, with the costs we obtained after having randomly replaced every A with either the hidden symbol M or N (cf. Figure 4). In a second run, we did the same by replacing each B with the hidden symbol O or P. We finally got five different expressions of the same initial substitution (in this example, AB = NB = MB = AO = AP), each associated with a specific cost. This procedure was applied to all pairs of symbols of the data set in turn. If we consider Ei and Ej to be respectively the ith and jth elements of the original alphabet, and their two random substitutes to be, respectively, S1(Ei), S2(Ei) and S1(Ej), S2(Ej), there are five costs of substitution to take into account if we consider only the substitutions involving at least one symbol belonging to the original alphabet. Under these conditions, as they are actually different expressions of the same initial substitution, we should expect those five trained costs to be identical, or at least close to each other. To compare all those values in a synthetic way for the entire alphabet, we computed a standardized difference between the trained cost of substitution associated with a given pair of symbols belonging to the original alphabet and the trained costs of substitution between one of those original symbols and the substitute of the other one, as shown in equation (8):

Std Difference = [(cost(Ei[S1(Ej)]) − cost(EiEj)) + (cost([S1(Ei)]Ej) − cost(EiEj))] / (2 × cost(EiEj)).   (8)

The proximity of the five substitution costs associated with a given original pair of symbols and their substitutes was compared in two ways, using either the first substitute of that pair of symbols (as shown in equation [8]) or the second one (where S2 replaces S1 in equation [8]).
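Equation (8) translates directly into code. A minimal sketch; the three cost values passed in are hypothetical, not trained values from the article:

```python
def std_difference(cost_orig, cost_sub_j, cost_sub_i):
    """Standardized difference of equation (8): compares the trained cost
    of an original pair (Ei, Ej) with the two costs obtained when one
    member of the pair is replaced by its hidden substitute."""
    return ((cost_sub_j - cost_orig) + (cost_sub_i - cost_orig)) / (2 * cost_orig)

# Hypothetical trained costs: original pair vs. pairs with hidden substitutes
print(std_difference(0.50, 0.53, 0.51))  # about 0.04, i.e. a 4 percent deviation
```

Values near zero indicate that the procedure treats a hidden substitute just like the original symbol, which is what Table 3 summarizes across the whole alphabet.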
All those values are tabulated in Table 3. Its lower part contains the standardized differences between the substitutions of Ei, Ej, and their first substitute (cost EiEj compared to Ei[S1(Ej)] and [S1(Ei)]Ej), whereas the upper part contains the values associated with their second substitute (cost EiEj compared to Ei[S2(Ej)] and [S2(Ei)]Ej). Table 3 shows clearly that the training procedure identifies very precisely the closeness of two distinct, but actually identical, symbols.10 Among the 56 different costs of substitution in Table 3, 49 (87 percent) show a difference not larger than 10 percent compared with the original cost. The larger differences may be attributed to the fact that the training procedure is relatively sensitive to rare symbols. For example, symbols C, D, F, and X together represent only about 2 percent of the total symbols used in the sequences. The great majority of the hidden costs differing notably from their original costs involve such rare symbols. The difference is maximal when it concerns two rare symbols. The ability of the training procedure to identify the similarity of two unknown symbols based on the data set at hand is one of the main strengths of this way of setting costs of substitution. Even if it stays relatively close to identity costs of substitution, this procedure takes into account the real relations of the different symbols present in the sequences and is therefore highly informative. On one hand, it avoids particular relationships remaining undetermined; on the other hand, it works as a predictive tool in the sense
that two different symbols with low substitution costs can be predicted to substitute easily for one another in real life.

Table 3. Standardized Difference Between the Original Trained Costs of Substitution and Their Substitutes

     A (%)  B (%)  C (%)  D (%)  E (%)  F (%)  G (%)  X (%)  Relative Frequency (%)
A      —      0      7      6      0      6     10      0      33.5
B      8      —      0      6    -10     10      0      0      19.5
C      7      0      —      5     -7      0      0    -15       0.5
D      6      6      0      —     -6      0      6      6       1.0
E      0      0     -7      0      —     13      8      0      31.0
F      0     25      9     11     13      —      0      0       0.1
G     10      7      0      0      8      0      —     -7      14.0
X      0      7    -15    -28      0      0     -7      —       0.4

Note: Rows and columns are given the name of a symbol belonging to the alphabet, although each cell of the table compares the substitution costs of three pairs of symbols (the original one and two substitutes) according to equation (8).

Automatic Versus Classification by Judges

Another way to validate the training procedure is to test the extent to which automatic classification succeeds in replicating a classification of sequences made by experts on a small subset of well-identified sequences. To do so, we extracted a sample of 100 occupational trajectories of women from each data set. Four judges were asked to classify them in a number of clusters that corresponded to previous empirical findings (Widmer, Levy, et al. 2003; Levy, Gauthier, and Widmer 2006) and to theoretical schemes (i.e., Kohli 1986). In each case, we retained only the sequences that were classified the same way by at least three (out of four) judges. The interrater agreement lies between 83 percent and 88 percent. To keep the computation procedures as parsimonious as possible, we first exactly replicated with SALTT the results we obtained with TDA using two different cost settings (identity and knowledge based). That allowed us to produce optimal alignments and to compare the distance matrices for the three strategies (identity, knowledge based, and training) from within SALTT.
For each set of sequences, we ran three optimal matching analyses: the first one using identity costs of substitution (for details, see above), the second one using knowledge-based costs, and the third one using costs stemming from the training procedure. A distance matrix was computed for each set of sequences and for each strategy and then entered into a cluster analysis. Table 4 shows the degree of association, khi2 and λ (Goodman and Kruskal 1979; Olszak and Ritschard 1995), between the clusters made by the judges and those stemming from automatic classification.

Table 4. Association (khi2 and λ Symmetric) Between Judges and Automatic Classification, by Method of Cost Setting

SCF
  Identity * Judges:        khi2 = 213.2454* (df = 16), λ symmetric = 0.8034 (ASE = 0.0458)
  Knowledge Based * Judges: khi2 = 206.1951* (df = 16), λ symmetric = 0.8120 (ASE = 0.0440)
  Trained * Judges:         khi2 = 224.5436* (df = 16), λ symmetric = 0.8291 (ASE = 0.0434)
SHP
  Identity * Judges:        khi2 = 213.4108* (df = 16), λ symmetric = 0.7500 (ASE = 0.0582)
  Knowledge Based * Judges: khi2 = 228.4631* (df = 16), λ symmetric = 0.7705 (ASE = 0.0623)
  Trained * Judges:         khi2 = 235.1387* (df = 16), λ symmetric = 0.7797 (ASE = 0.0602)
WLS
  Identity * Judges:        khi2 = 143.9678* (df = 9), λ symmetric = 0.7037 (ASE = 0.0684)
  Knowledge Based * Judges: khi2 = 148.6864* (df = 9), λ symmetric = 0.7196 (ASE = 0.0677)
  Trained * Judges:         khi2 = 143.2652* (df = 9), λ symmetric = 0.7037 (ASE = 0.0677)

Note: ASE = asymptotic standard error; SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study. *p < .001.

Table 4 shows that the results provided by a trained matrix lead to significant associations with the classification by judges for the three data sets considered. For the Wisconsin study, results are about the same when using either identity or trained costs of substitution.
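Assuming the symmetric association measure reported in Table 4 is the symmetric Goodman-Kruskal lambda (the article cites Goodman and Kruskal 1979), it can be computed from the contingency table crossing the judges' clusters with the automatic ones. The table below is illustrative, not the article's data:

```python
def lambda_symmetric(table):
    """Symmetric Goodman-Kruskal lambda for a contingency table
    (list of rows of counts) crossing two classifications."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    sum_row_max = sum(max(row) for row in table)        # predicting columns from rows
    sum_col_max = sum(max(col) for col in zip(*table))  # predicting rows from columns
    num = sum_row_max + sum_col_max - max(row_tot) - max(col_tot)
    den = 2 * n - max(row_tot) - max(col_tot)
    return num / den

# Near-perfect agreement between judges (rows) and automatic clusters (columns)
table = [[40, 2, 1], [3, 30, 1], [0, 2, 21]]
print(round(lambda_symmetric(table), 2))  # 0.84
```

Lambda measures the proportional reduction in prediction error; a value of 1.0 means each classification perfectly predicts the other.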
Trained costs never lead to a weaker association (λ symmetric) with the judges' classifications than identity costs or knowledge-based costs for the SCF and SHP data sets. Results are less straightforward concerning the WLS data, with knowledge-based costs performing slightly better than trained costs. The fact that the Wisconsin data are less differentiated (sequences with only three different statuses, as opposed to seven in the other databases, and respondents all about the same age) may explain why trained costs do not lead to a markedly different solution than the two alternative strategies. In all cases, the associations are quite high and significant, suggesting the ability of the method to provide meaningful cost schemes. Given the fact that the reference classification based on the judges' responses was very consensual and based on predefined categories, the results of that test express the ability of the procedure to differentiate clear-cut, sociologically relevant categories out of the data, rather than to evaluate the extent to which those results and the underlying costs of substitution reflect the theoretical and subjective conceptual frame of an expert.

Association With External Criteria

A third validation procedure consisted of measuring the extent to which clusters drawn from the data correlate with either independent sociostructural variables or outcomes (Milligan and Cooper 1987; Rapkin and Luke 1993), an approach that seemingly few studies have used so far (Milligan and Cooper 1987). Clusters that do not associate with these variables are of little help, as the ultimate goal of social sciences is explanation rather than description.
For each strategy, the three stopping-rule criteria aimed at determining the number of clusters in the data (pseudo-t2, pseudo-F, and R2) suggested the presence of three clusters in the SCF and SHP data and of four clusters in the WLS data. A closer look at the data reveals that those clusters correspond precisely to typical female trajectories, as described elsewhere (Moen 1985; Höpflinger et al. 1991; Erzberger and Prein 1997; Widmer, Levy, et al. 2003; Levy et al. 2006), namely, trajectories characterized by full-time employment, part-time employment, and full-time housework. In the Wisconsin data, the clusters are the same, but with a fourth one representing a return to the labor market after a period at home. Such a cluster also appears when the clusters of the SCF and SHP data are further subdivided. The greater homogeneity of the WLS data in terms of age of respondents and completeness of the sequences (no right truncations) may explain the better visibility (consensus between stopping-rule criteria) of that fourth cluster, which is also documented in the literature (Widmer, Levy, et al. 2003; Levy et al. 2006).
We first ran a multinomial logistic regression11 on each data set (SCF, SHP, and WLS), using cluster membership (which represents in this case types of occupational trajectories) as the response variable and, as predictors, a set of indicators of social positioning (socioeconomic position of the orientation family, level of education, number of children, age, and household income) generally considered (cf. the description of the samples) as intervening variables in shaping female occupational trajectories. To be consistent with the stopping-rule criteria, that is, a consensus between pseudo-t2, pseudo-F, and R2, we retained in this first step the three-cluster solutions that those criteria pointed out for each data set. As they are more homogeneous, they represent about the same social reality in each data set and therefore remain sociologically relevant. We then performed the tests on the five-cluster solutions for the SCF and SHP data to check the efficiency of the different cost-setting methods on other empirically founded classifications (Widmer, Levy, et al. 2003; Levy et al. 2006).

Table 5. khi2 Values of the Likelihood Ratio Test by Database and Cost-Setting Method

                                    Identity             Knowledge Based      Trained
Data Sets                    df     khi2     p > khi2    khi2     p > khi2    khi2     p > khi2
Set 1: SCF data, 3 clusters  272    290.19   .2143       280.87   .3428       288.60   .2339
Set 1: SCF data, 5 clusters  596    553.02   .8956       547.03   .9250       522.11   .9867
Set 2: SHP data, 3 clusters  404    568.36   < .0001     562.04   < .0001     512.34   < .0002
Set 2: SHP data, 5 clusters  808    897.67   .0150       863.32   .0865       740.12   .9574
Set 3: WLS data, 4 clusters  258    307.35   .0189       323.81   .0034       288.37   .0939

Note: SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.
We felt justified in doing this because two new clusters emerged from further subdivision of the first three clusters defined by the proposed criteria (R2, pseudo-F, and pseudo-t2). Table 5 shows the likelihood ratio test applied to those multinomial regressions. The likelihood tests compare a given model with the saturated one (a model that exactly replicates the data), meaning in this case that the smaller the value of khi2 (i.e., the larger the p value), the better the model fits the data.12 One can read from Table 5 that the trained costs of substitution allow building a model that fits the data better in all cases compared to identity costs, and in four out of five cases compared to knowledge-based costs. Put another way, clusters produced by trained costs of substitution are more sensitive to predictors than clusters produced by either identity costs or knowledge-based costs. This is true, although not with the same strength, for the three sets of sequences. The fit is significantly better (i.e., the model stemming from trained costs does not differ significantly from the saturated model, whereas the two others do) in two cases and with two data samples.

Discussion

Setting costs of substitution in the process of aligning sequences of social statuses is controversial because it may significantly influence the results of the analysis. We propose a method to determine costs of substitution empirically, which we tested using three distinct sets of social science data. The training procedure that we present appears to be, to our knowledge, the only one that is exclusively empirically grounded and optimized.
First, we considered the correlation between the substitution matrices for a given alphabet across three data sets of the social sciences realm representing the same social reality (sequences of occupational statuses along the life course) and three cost-setting strategies. The training procedure leads to results that are very similar to those stemming from the two other methods (substitution costs represented as an identity matrix or following some knowledge-based weighting). In this sense, cost variability did not appear to modify the general results of the analysis. Nevertheless, the costs stemming from the training procedure may claim a greater legitimacy, as they reflect the actual relationships of the symbols considered. That legitimacy is reinforced by the very high correlation existing between the substitution matrices stemming from the training procedure applied to the three data sets at hand. In this sense, the values of the trained cost matrices may even be considered as an a posteriori validation of the use of the alternative costs of substitution (knowledge based or identity) found in the literature for the specific case of occupational trajectories. Moreover, the training procedure shows some interesting features that should be further explored, such as the possibility to differentiate specific substitution costs according, for example, to gender. The ability of the trained costs to provide a clustering that is better associated with a sociologically unequivocal reference classification than either the identity costs or the knowledge-based costs illustrates the ability of the training procedure to discover structural features of the data that are sociologically relevant.
Second, based on likelihood ratio tests of multinomial logistic regressions, we compared the associations between cluster solutions (response variables) and a set of relevant sociostructural variables (intervening variables) for the three cost-setting strategies across the three data sets at hand. Here again, the training procedure led to better results than the identity and the knowledge-based costs did. That is, the data-driven costs of substitution contributed to classifications that fit better with widely recognized sociological models of women's labor market participation than the two other strategies. Taking into account the actual structure of the data provides models that fit better with external factors than undifferentiated or knowledge-based cost schemes. Finally, the ability of the training procedure to discover certain actual internal relationships of the data, and therefore to offer an efficient and empirically grounded way to determine costs of substitution, is demonstrated in another way: it is able to accurately identify the closeness of two formally identical, but artificially differentiated, substitution costs (here, between two occupational statuses). Moreover, the degree of closeness between the substitution costs is also informative about the relative proximity of the symbols and the sociological reality they represent. The training procedure offers significant improvements compared to the methods generally used until now in the social sciences. By revealing every symmetric relationship among those symbols, this procedure avoids assigning a cost based on prior knowledge that would later appear to be erroneous when compared to the actual data.
The results show that for any pair of symbols of a given alphabet, the trained substitution costs remain remarkably similar from one data set to another. This means that those costs reflect important information concerning the actual (in this case, social) significance of the symbols constituting the sequences and do not represent merely abstract values varying from data set to data set (or from one training session to another). Therefore, these costs also constitute a predictive feature, in the sense that two different symbols with low substitution costs can be predicted to easily substitute for one another in real life. Identifying these low substitution costs therefore makes it possible to predict situations likely to occur in similar contexts, at similar ages, and at similar frequencies.

Gauthier et al. / How Much Does It Cost?

In comparison with approaches based on transition costs, which are computed within each single sequence taken separately, the proposed method determines substitution costs by looking for a match or mismatch at each specific position throughout all pairs of sequences. In this sense, the latter method is based on richer information and gives greater importance to time (i.e., to age and social age) and to the relations between sequences than cost schemes based on transition rates. There is, on one hand, a constant and clear similarity between the results stemming from the three cost-setting strategies (identity, knowledge based, and training) and, on the other hand, a significant improvement in the tests of internal and external validity of the results provided by the training procedure. The conditions under which the method is most appropriate remain to be systematically tested. The experiments presented in this article point in several directions.
First, the method provides strong leverage when few or no theoretical arguments can be advanced in support of a particular cost scheme, or when conflicting theories propose different cost schemes. In other words, it is best suited for an exploratory research design. Second, this method is ideal whenever too many statuses have been used to characterize the data. We show, for instance, in this study that the proposed procedure reveals the identity between two statuses that may have been coded separately. Finally, the cost estimation provides a means for quantifying the relationships among symbols; as such, it can be used to identify and discover equivalences among categories. In itself, this means of quantification may prove to be a useful investigative tool for the social sciences. There are several limitations to the solution proposed in this article. First, the method deals poorly with symbols occurring rarely in sequences. Whenever this happens, the estimations of substitution costs are less accurate and more variable. Second, a key property of the optimal matching algorithm is that it relies on the assumption that the events defining a life trajectory are chronologically ordered and collinear among the considered sequences. This is, of course, a simplification, but it seems to hold reasonably well when considering sequences with a high percentage of identity. However, if recurrent subsequences were to be found scattered across different periods of life, they could probably be recovered using techniques related to the one we describe here, such as Gibbs sampling (Lawrence et al. 1993; Abbott and Barman 1997) or the local alignment algorithm (Smith and Waterman 1981). Third, this algorithm, like other optimal matching algorithms, assumes the independence of each position constituting a sequence. This may be an oversimplification, as one can argue that life trajectories are not homogeneous.
They may be substructured into smaller units (life stages, transitions, turning points, specific life events, etc.), whose sizes may vary but should be kept intact in the alignments. This issue is likely to arise when comparing very distinct sequences. When this situation occurs, it may be worthwhile to modify the proposed algorithm. Nevertheless, the issue remains how to automatically identify meaningful boundaries defining those subsequences. In biology, multiple sequence alignments have been used successfully to identify the exact extent of subsequences conserved across related sequences (Notredame, Higgins, and Heringa 2000). It is certainly worthwhile to explore the potential of this method in the social sciences.

Notes

1. This freeware is available from the Ruhr-Universität Bochum Web site at http://steinhaus.stat.ruhr-uni-bochum.de/tda.html.
2. This freeware is available from the University of Chicago Web site at http://home.uchicago.edu/aabbott/om.html.
3. This freeware is available from the Strasbourg Bioinformatics Platform Web site at ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX.
4. "Thus, while substitution must be carefully handled, it is not a supersensitive task whose errors will be compounded by later stages in the analysis" (Abbott and Hrycak 1990:176).
5. Student's t tests performed on the 10 values generated by the training procedures for each substitution cost reveal that those values do not differ from the mean (p < .0001, df = 9).
6. Hotelling's T2 is a statistical measure of the multivariate distance of each observation from the center of the data set. It is an analytical way to find the most extreme points in the data.
7. This is the ratio between interclass variance and total variance.
8. This data set is for public use.
Access to the data is provided by the Swiss Household Panel (SHP) Web site at http://www.swisspanel.ch.
9. Given the availability of the data, the range considered is 16 to 65 years old for Social Stratification, Cohesion, and Conflict in Contemporary Families and SHP data, and 20 to 56 years old for Wisconsin Longitudinal Study data.
10. Spearman correlation coefficient = .734 (p < .01).
11. We used PROC CATMOD of the SAS software.
12. At p ≤ .05, the tested model is not statistically different from the saturated one.

References

Abbott, Andrew. 1984. "Event Sequence and Event Duration: Colligation and Measurement." Historical Methods 17:192-204.
Abbott, Andrew. 1990a. "Conceptions of Time and Events in Social Science Methods: Causal and Narrative Approaches." Historical Methods 23:140-50.
Abbott, Andrew. 1990b. "A Primer on Sequence Methods." Organization Science 1:375-92.
Abbott, Andrew. 1995a. "A Comment on 'Measuring the Agreement Between Sequences.'" Sociological Methods & Research 24:232-43.
Abbott, Andrew. 1995b. "Sequence Analysis: New Methods for Old Ideas." Annual Review of Sociology 21:93-113.
Abbott, Andrew. 2001. Time Matters: On Theory and Method. Chicago: University of Chicago Press.
Abbott, Andrew and Emily Barman. 1997. "Sequence Comparison Via Alignment and Gibbs Sampling: A Formal Analysis of the Emergence of the Modern Sociological Article." Sociological Methodology 27:47-87.
Abbott, Andrew and John Forrest. 1986. "Optimal Matching Methods for Historical Sequences." Journal of Interdisciplinary History XVI:471-94.
Abbott, Andrew and Alexandra Hrycak. 1990. "Measuring Resemblance in Sequence Data: An Optimal Matching Analysis of Musicians' Careers." American Journal of Sociology 96:144-85.
Abbott, Andrew and Angela Tsay. 2000.
"Sequence Analysis and Optimal Matching Methods in Sociology." Sociological Methods & Research 29:3-33.
Aisenbrey, Silke. 2000. Optimal Matching Analyse: Anwendungen in den Sozialwissenschaften (Optimal Matching Analysis: Applications in the Social Sciences). Opladen, Germany: Leske + Budrich.
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jianzhi Zhang, Zhu Zhang, Webb Miller, and David Lipman. 1997. "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs." Nucleic Acids Research 25:3389-3402.
Bateman, Alex, Ewan Birney, Richard Durbin, Sean R. Eddy, Robert D. Finn, and Erik L. Sonnhammer. 1999. "Pfam 3.1: 1313 Multiple Alignments and Profile HMMs Match the Majority of Proteins." Nucleic Acids Research 27:260-62.
Billari, Francesco C. 2001. "Sequence Analysis in Demographic Research and Applications." Canadian Studies in Population 28:439-58.
Bird, Katherine and Helga Krüger. 2005. "The Secret of Transitions: The Interplay of Complexity and Reduction in Life Course Analysis." Pp. 173-94 in Towards an Interdisciplinary Perspective on the Life Course, vol. 10, edited by R. Levy, P. Ghisletta, J.-M. Le Goff, D. Spini, and E. Widmer. Amsterdam: Elsevier JAI.
Blair-Loy, Mary. 1999. "Career Patterns of Executive Women in Finance: An Optimal Matching Analysis." American Journal of Sociology 104:1346-97.
Blossfeld, Hans-Peter and Sonja Drobnic. 2001. Careers of Couples in Contemporary Society: From Male Breadwinner to Dual-Earner Families. New York: Oxford University Press.
Bock, Hans H. 1985. "On Some Significance Tests in Cluster Analysis." Journal of Classification 2:77-108.
Calinski, Tadeusz and Joachim Harabasz. 1974. "A Dendrite Method for Cluster Analysis." Communications in Statistics 3:1-27.
Chan, Tak Wing. 1995. "Optimal Matching Analysis: A Methodological Note on Studying Career Mobility." Work and Occupations 22:467-90.
Dayhoff, Margaret O., Robert M. Schwartz, and Bruce C. Orcutt. 1978.
"A Model of Evolutionary Change in Proteins." Pp. 345-52 in Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, edited by M. O. Dayhoff. Washington, DC: National Biomedical Research Foundation.
Dijkstra, Wil and Toon Taris. 1995. "Measuring the Agreement Between Sequences." Sociological Methods & Research 24:214-31.
Duda, Richard O. and Peter E. Hart. 1973. Pattern Classification and Scene Analysis. New York: John Wiley.
Durbin, Richard, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.
Elder, Glen H. 1985. Life Course Dynamics: Trajectories and Transitions, 1968-1980. Ithaca, NY: Cornell University Press.
Erzberger, Christian and Gerald Prein. 1997. "Optimal-Matching-Technik: Ein Analyseverfahren zur Vergleichbarkeit und Ordnung individuell differenter Lebensverläufe" (Optimal Matching Technique: An Analytical Procedure for Comparing and Ordering Individually Different Life Courses). ZUMA-Nachrichten 40:52-80.
Everitt, Brian S. 1979. "Unresolved Problems in Cluster Analysis." Biometrics 35:169-81.
Forrest, John and Andrew Abbott. 1989. "The Optimal Matching Method for Studying Anthropological Sequence Data: An Introduction and Reliability Analysis." Journal of Quantitative Anthropology 1:151-70.
Giddens, Anthony, Mitchell Duneier, and Richard P. Applebaum. 2003. Introduction to Sociology. New York: W. W. Norton.
Giele, Janet Z. and Glen H. Elder. 1998. Methods of Life Course Research: Qualitative and Quantitative Approaches. Thousand Oaks, CA: Sage.
Giuffre, Katherine A. 1999. "Sandpiles of Opportunity: Success in the Art World." Social Forces 77:815-32.
Goodman, Leo A. and William H. Kruskal. 1979. Measures of Association for Cross Classification. New York: Springer.
Graur, Dan and Wen-Hsiung Li. 2000. Fundamentals of Molecular Evolution. Sunderland, MA: Sinauer.
Halpin, Brendan and Tak Wing Chan. 1998. "Class Careers as Sequences: An Optimal Matching Analysis of Work-Life Histories." European Sociological Review 14:111-30.
Hartigan, John A. 1985. "Statistical Theory in Clustering." Journal of Classification 2:63-76.
Henikoff, Steven and Jorja G. Henikoff. 1992. "Amino Acid Substitution Matrices From Protein Blocks." Proceedings of the National Academy of Sciences 89:10915-19.
Höpflinger, François, Maria Charles, and Annelies Debrunner. 1991. Familienleben und Berufsarbeit (Family Life and Professional Work). Zürich, Switzerland: Seismo.
Hughey, Richard and Anders Krogh. 1996. "Hidden Markov Models for Sequence Analysis: Extension and Analysis of the Basic Method." Computer Applications in the Biosciences 12:95-107.
Kohli, Martin. 1986. "The World We Forgot: A Historical Review of the Life Course." Pp. 271-303 in Later Life: The Social Psychology of Aging, edited by V. W. Marshall. London: Sage.
Krüger, Helga and René Levy. 2001. "Linking Life Courses, Work and the Family: Theorizing a Not So Visible Nexus Between Women and Men." Canadian Journal of Sociology 26:145-66.
Kruskal, Joseph. 1983. "An Overview of Sequence Comparison." Pp. 1-44 in Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, edited by D. Sankoff and J. Kruskal. Toronto, Canada: Addison-Wesley.
Lawrence, Charles E., Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, and John C. Wootton. 1993. "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment." Science 262:208-14.
Levine, Joel H. 2000. "But What Have You Done for Us Lately?" Sociological Methods & Research 29:34-40.
Levitt, Barbara and Clifford Nass. 1989. "The Lid on the Garbage Can: Institutional Constraints on Decision Making in the Technical Core of College-Text Publishers." Administrative Science Quarterly 34:190-207.
Levy, René. 1977. Der Lebenslauf als Statusbiographie: Die weibliche Normalbiographie in makrosoziologischer Perspektive (The Life Course as a Sequence of Statuses: The Female Standard Biography in a Macrosociological Perspective). Stuttgart, Germany: Enke.
Levy, René, Jacques-Antoine Gauthier, and Eric Widmer. 2006. "Entre contraintes institutionnelle et domestique: Les parcours de vie masculins et féminins en Suisse" (Between Institutional and Domestic Constraints: The Life Courses of Women and Men in Switzerland). Revue canadienne de sociologie 31:461-89.
Levy, René, Eric Widmer, and Jean Kellerhals. 2002. "Modern Family or Modernized Family Traditionalism? Master Status and the Gender Order in Switzerland." Electronic Journal of Sociology 6(4).
Manning, Gerard, David B. Whyte, Ricardo Martinez, Tony Hunter, and Sucha Sudarsanam. 2002. "The Protein Kinase Complement of the Human Genome." Science 298:1912-34.
Milligan, Glenn W. and Martha C. Cooper. 1985. "An Examination of Procedures for Determining the Number of Clusters in a Data Set." Psychometrika 50:159-79.
Milligan, Glenn W. and Martha C. Cooper. 1987. "Methodology Review: Clustering Methods." Applied Psychological Measurement 11:329-54.
Moen, Phyllis. 1985. "Continuities and Discontinuities in Women's Labor Force Activity." Pp. 113-55 in Life Course Dynamics: Trajectories and Transitions, 1968-1980, edited by G. H. Elder. Ithaca, NY: Cornell University Press.
Moen, Phyllis. 2003. It's About Time: Couples and Careers. Ithaca, NY: Cornell University Press.
Moen, Phyllis and Yan Yu. 2000. "Effective Work/Life Strategies: Working Couples, Work Conditions, Gender, and Life Quality." Social Problems 47:291-326.
Mott, Frank L. 1978. Women, Work and Family.
Lexington, MA: Lexington Books.
Müller, Tobias and Martin Vingron. 2000. "Modeling Amino Acid Replacement." Journal of Computational Biology 7:761-76.
Myrdal, Alva and Viola Klein. 1956. Women's Two Roles: Home and Work. London: Routledge.
Nargundkar, Satish and Timothy J. Olzer. 1998. "An Application of Cluster Analysis in the Financial Services Industry." Presented at the sixth annual conference of the South East SAS Users Group, September 13-15, Norfolk, VA.
Needleman, Saul B. and Christian D. Wunsch. 1970. "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins." Journal of Molecular Biology 48:443-53.
Ng, Pauline C., Jorja G. Henikoff, and Steven Henikoff. 2000. "PHAT: A Transmembrane-Specific Substitution Matrix. Predicted Hydrophobic and Transmembrane." Bioinformatics 16:760-66.
Notredame, Cédric, Philipp Bucher, Jacques-Antoine Gauthier, and Eric Widmer. 2005. T-Coffee/saltt: User Guide and Reference Manual. Lausanne: Swiss Institute of Bioinformatics. Retrieved from http://www.tcoffee.org/saltt.
Notredame, Cédric, Desmond G. Higgins, and Jaap Heringa. 2000. "T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment." Journal of Molecular Biology 302:205-17.
Notredame, Cédric, Liisa Holm, and Desmond G. Higgins. 1998. "COFFEE: An Objective Function for Multiple Sequence Alignments." Bioinformatics 14:407-22.
Olszak, Michael and Gilbert Ritschard. 1995. "The Behavior of Nominal and Ordinal Partial Association Measures." Statistician 44:195-212.
Pentland, Brian T., Malu Roldan, Ahmed A. Shabana, Louise L. Soe, and Sidne G. Ward. 1998. "Lexical and Sequential Variety in Organizational Processes." School of Labor and Industrial Relations, Michigan State University, East Lansing. Unpublished manuscript.
Punj, Girish and David W.
Stewart. 1983. "Cluster Analysis in Marketing Research: Review and Suggestions for Application." Journal of Marketing Research 20:134-48.
Rapkin, Bruce D. and Douglas A. Luke. 1993. "Cluster Analysis in Community Research: Epistemology and Practice." American Journal of Community Psychology 21:247-77.
Rohwer, Götz and Ulrich Pötter. 2002. TDA User's Manual. Bochum, Germany: Ruhr-Universität Bochum. Retrieved from http://www.stat.ruhr-uni-bochum.de/pub/tda/doc/tman63/tman-pdf.zip.
Rohwer, Götz and Heike Trappe. 1997. "Describing Life Courses: An Illustration Based on NLSY Data." Pp. 30 in POLIS Project Conference. Florence, Italy: European University Institute.
SAS Institute, Inc. 2004. SAS/STAT User's Guide. Cary, NC: Author.
Schaeper, Hildegard. 1999. "Erwerbsverläufe von Ausbildungsabsolventinnen und -absolventen: Eine Anwendung der Optimal-Matching-Technik" (Employment Histories After Completion of Vocational Education and Training: An Application of the Optimal Matching Technique). Sonderforschungsbereich 186, Universität Bremen, Germany.
Scherer, Stefani. 2001. "Early Career Patterns: A Comparison of Great Britain and West Germany." European Sociological Review 17:119-44.
Sheridan, Jennifer T. 1997. "The Effects of the Determinants of Women's Movement Into and Out of Male-Dominated Occupations on Occupational Sex Segregation." CDE Working Paper 97-07, Department of Sociology, University of Wisconsin, Madison.
Smith, Temple F. and Michael S. Waterman. 1981. "Identification of Common Molecular Subsequences." Journal of Molecular Biology 147:195-97.
Stovel, Katherine and Marc Bolan. 2004. "Residential Trajectories: Using Optimal Alignment to Reveal the Structure of Residential Mobility." Sociological Methods & Research 32:559-98.
Stovel, Katherine, Michael Savage, and Peter Bearman. 1996. "Ascription Into Achievement: Models of Career Systems at Lloyds Bank, 1890-1970." American Journal of Sociology 102:358-99.
Thompson, Julie D., Desmond G. Higgins, and Toby J. Gibson. 1994. "CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice." Nucleic Acids Research 22:4673-80.
Turner, Jonathan H. 2001. "Sociological Theory Today." Pp. 1-17 in Handbook of Sociological Theory, edited by J. H. Turner. New York: Kluwer Academic.
Widmer, Eric, Jean Kellerhals, and René Levy. 2003. Couples contemporains: Cohésion, régulation et conflits (Contemporary Couples: Cohesion, Regulation, Conflicts). Zürich: Seismo.
Widmer, Eric, Jean Kellerhals, and René Levy. 2004. "Quelle pluralisation des relations familiales?" (What Pluralization of Family Relations?). Revue française de sociologie 45:37-67.
Widmer, Eric, René Levy, Alexandre Pollien, Raphaël Hammer, and Jacques-Antoine Gauthier. 2003. "Entre standardisation, individualisation et sexuation: une analyse des trajectoires personnelles en Suisse" (Between Standardization, Individualization and Gendering: An Analysis of Personal Life Courses in Switzerland). Revue suisse de sociologie 29:35-67.
Wilson, W. Clarke. 1998. "Activity Pattern Analysis by Means of Sequence-Alignment Methods." Environment and Planning A 30:1017-38.
Wu, Lawrence L. 2000. "Some Comments on 'Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect.'" Sociological Methods & Research 29:41-64.
Yu, Yi-Kuo and Stephen F. Altschul. 2005. "The Construction of Amino Acid Substitution Matrices for the Comparison of Proteins With Non-Standard Compositions." Bioinformatics 21:902-11.

Jacques-Antoine Gauthier is a senior lecturer in sociology at the University of Lausanne and a member of the Center for Life Course and Lifestyle Studies (Pavie).
He has worked in the fields of health, addiction, and family sociology. His latest publications have appeared in the Canadian Journal of Sociology, the European Journal of Operational Research, and the Swiss Journal of Sociology.

Eric D. Widmer is a professor of sociology at the University of Geneva, with an appointment at the Center for Life Course and Lifestyle Studies (Pavie). His long-term interests include life course research, family research, and social networks. His latest publications have appeared in the Journal of Social and Personal Relationships, the European Sociological Review, and the Journal of Marriage and Family.

Philipp Bucher is a group leader at the Swiss Institute for Experimental Cancer Research and a founding member of the Swiss Institute of Bioinformatics. His long-term interests include the development of algorithms for the analysis of molecular sequences and the application of these algorithms in various areas of biomedical research. His latest publications have appeared in PLoS Computational Biology and Nucleic Acids Research.

Cédric Notredame is a group leader at the Centre for Genomic Regulation in Barcelona (Spain) and a research investigator for the Centre National de la Recherche Scientifique (France). The focus of his work is the development and improvement of multiple sequence alignment algorithms. His latest publications have appeared in the Journal of Molecular Biology and Nucleic Acids Research. He is also the coauthor, with J. M. Claverie, of a popular introductory textbook in bioinformatics, Bioinformatics for Dummies (New York: Wiley, 2003).

MULTICHANNEL SEQUENCE ANALYSIS APPLIED TO SOCIAL SCIENCE DATA

Jacques-Antoine Gauthier* Eric D.
Widmer,† Philipp Bucher,‡ and Cédric Notredame**

Applications of optimal matching analysis in the social sciences are typically based on sequences of specific social statuses that model the residential, family, or occupational trajectories of individuals. Despite the broadly recognized interdependence of these statuses, few attempts have been made to systematize the ways in which optimal matching analysis should be applied multidimensionally, that is, in an approach that takes multiple trajectories into account simultaneously. Based on methods pioneered in the field of bioinformatics, this paper proposes a method of multichannel sequence analysis (MCSA) that extends the usual optimal matching analysis (OMA) to multiple life spheres simultaneously. Using data from the Swiss Household Panel (SHP), we examine the types of trajectories obtained using MCSA. We also consider a random data set and find that, by locally aligning distinct life trajectories simultaneously, MCSA offers an alternative to the sole use of ex post summed distance matrices. Moreover, MCSA reduces the complexity of the typologies it produces without making them less informative. It is more robust to noise in the data, and it provides more reliable alignments than two independent OMAs.

We thank the editor and the anonymous reviewers for helpful comments and suggestions. Direct correspondence to Jacques-Antoine Gauthier, University of Lausanne, Faculty of Social and Political Sciences (SSP), research center Methodology, Inequalities and Social Change (MISC), Bâtiment de Vidy, CH-1015 Lausanne, Switzerland. Email: Jacques-antoine.gauthier@unil.ch.
*University of Lausanne
†University of Geneva
‡Swiss Institute of Bioinformatics and the École Polytechnique Fédérale de Lausanne
**Centre for Genomic Regulation, Barcelona, and Centre National de la Recherche Scientifique, Marseille

1. INTRODUCTION

Most multivariate analyses using longitudinal data are based on hard causal models in which one or several independent variables predict the future actualization of some state of a dependent variable. Optimal matching analysis (OMA) offers a more descriptive perspective that does not emphasize the causal priority of some variables over others but aims at elaborating a systemic view of the social phenomena that develop over time. However, most applications of OMA have been limited to one dimension at a time, a serious shortcoming for empirical analyses. This paper develops a multichannel sequence analysis (MCSA) that enables researchers to describe individual trajectories on several dimensions simultaneously. Various empirical studies (Elder 1985; Clausen 1986; Kohli 1986; Levy 1991, 1996; Giele and Elder 1998; Heinz and Marshall 2003; Mortimer and Shanahan 2003; Levy et al. 2005; Macmillan 2005) emphasize the multidimensionality of life trajectories based on social, psychological, and biological factors that interact over time (Wetzler and Ursano 1988; Spruijt and de Goede 1997; Repetti, Taylor, and Seeman 2002; Lesesne and Kennedy 2005). A major problem with research on life trajectories, however, lies in the fact that the researcher is confronted with a variety of unequally linked sequences unfolding at various speeds (Abbott 1992). Life course studies therefore require the integration of seemingly heterogeneous trajectories into a unique empirical model (Levy et al. 2005), an ambitious task that even regression-based models cannot accomplish (Esser 1996). In this perspective, Abbott (2001:151) insists on using sequential data as multicase narratives to uncover patterns of careers rather than looking for causal models. Many social scientists have used OMA to model life trajectories. Nevertheless, since its emergence in the social sciences, OMA has neglected the multidimensionality of life trajectories.
Social scientists have always lacked a standard approach for undertaking multidimensional sequence analysis of life trajectories. To fill this gap, the present study proposes multichannel sequence analysis (MCSA), a computational approach that makes practical improvements to optimal matching algorithms at three levels.1 First, it systematizes approaches for dealing with multidimensionality using OMA. Second, it accounts simultaneously for local interdependencies among the different social statuses present at each point of the alignment process for all channels.2 Third, it offers practical improvements toward visualizing parallel processes occurring in various life spheres, a key element in describing and interpreting sets of individual trajectories (Tufte 1997; Müller et al. 2008; Piccarreta and Lior 2010). In this study, we first present the quantitative methods currently available in the social sciences for dealing with the multidimensionality of the life course and describe in detail how the proposed method works. To this end, we also briefly present an example of substantive results produced by MCSA using social science data. We then illustrate the potential of MCSA by testing its validity and reliability using various formal criteria. Finally, we use random data to compare several multidimensional approaches using OMA.

2. QUANTITATIVE APPROACHES TO LIFE HISTORIES

There are a few methodological options for dealing with the multidimensionality of life trajectories. The most often used are event history analysis (EHA; Blossfeld and Rohwer 1995) and sequence analysis (SA; e.g., Sankoff and Kruskal 1983; Abbott and Hrycak 1990), while some attempts have been made with latent class methods (Macmillan and Eliason 2003) and life history graphs (Butts and Pixley 2004). The latter focuses on internal configurations of the life course to reveal general sociological patterns.
It uses a formal definition of life history that applies social network analysis at an intrapersonal level. Individual histories are expressed as structural relationships between life spells, such as centrality, betweenness, or closeness (Wasserman and Faust 1994). Life history graphs (LHG) take multidimensionality into account, but they put little emphasis on time, as they focus on the overlap of life spells for a given individual. The use of latent class methods (LCM) is based on transition probabilities. It allows identifying subsamples characterized by typical (i.e., most probable) configurations of family and occupational roles over time. Unfortunately, methodological limitations make it difficult for LCM to consider more than a few time points and to represent life courses at an individual level. Event history analysis estimates time-to-event or risk functions concerning, for instance, the occurrence of specific events, such as divorce or entry into the labor market, that are then used as dependent variables in different regression models (e.g., Kaplan-Meier or Cox). EHA provides strong information at the population level. However, the information concerning the unfolding of an individual life history is limited to a dichotomous variable (the occurrence or not of a given event).

1. The computations presented in this paper are encapsulated in the program SALTT (Search Algorithm for Life Trajectories and Transitions), an open-source freeware program written in C (Notredame, Bucher, Gauthier, and Widmer 2005). The package and its documentation can be downloaded from http://www.tcoffee.org/saltt/. Recently, TraMineR, a package of the R software environment, has allowed performing OMA and MCSA (Gabadinho et al. 2008). Otherwise, computations are made using SAS (SAS Institute 2004).
2. By "channel," we mean each sequence of statuses that constitutes the multidimensionality under study.
A major strength of these methods is that they allow statistical testing of the model, but they show limited sensitivity to temporality. Overall, LHG, LCM, and EHA are insufficient to account for the multidimensionality of life trajectories because they fail to "take a narrative approach to social reality" (Abbott 2001:185). In contrast, sequence analysis techniques and, more specifically, optimal matching analysis take the entire sequences of statuses held by individuals over a given period of time (e.g., family, occupational) as the analytical unit to find chronological patterns of stability and change (George 1993). Thus, each individual life course is modeled as a specific sequence of social statuses that may be expressed as a character string. For instance, the sequence aaaabbcccc may describe the family trajectory of an individual over ten years (e.g., between the ages of 18 and 27), with a standing for living with both biological parents, b for living alone, and c for living in a couple. Basically, OMA involves making pairwise comparisons between individual sequences of statuses to evaluate how similar they are.3 This is accomplished by counting the minimal (weighted) number of elementary operations (known as "costs") of insertion, deletion,4 and substitution that are necessary to transform one sequence into another (Sankoff and Kruskal 1983; Abbott and Hrycak 1990).5 For instance, in Figure 8, shown later in this paper, one has to delete one m and then substitute n for m twice to transform the sequence Ac = mmllm into Bc = nlln. Among all possible ways to transform sequence Ac into sequence Bc, the one associated with the smallest cost is obtained through dynamic programming (Needleman and Wunsch 1970) and is called the optimal distance between the two sequences.6

The alignment of two lives-as-sequences takes into account both the relative position of specific statuses in each individual trajectory and the process of their unfolding over time. Moreover, as the modeling of the sequences is limited only by the number of time points and the number of possible characters, the potential individual variability of sequences rapidly becomes huge. Thus, the distance computed by OMA summarizes in an elegant manner the extent to which life courses are similar: the smaller the distance between two life trajectories, the more similar they are. Once all pairwise alignments are computed, the researcher performs a cluster analysis on the resulting distance matrix to reveal types of individual trajectories.7 Eventually, the typologies stemming from these two latter steps may be used as categorical variables in secondary analyses (cross-tabulations, regressions, and so forth).

We now turn to the more general issue of the extent to which OMA can be systematically and straightforwardly applied to multidimensional trajectories. Our goal is to evaluate the ability of OMA to adequately model two main tenets of the life course paradigm. The first one (the principle of linked lives) states that individuals participate simultaneously in various social spheres and that the corresponding positions they hold in each are interdependent, as is the case between family and occupational careers (e.g., Heinz 2003). The second tenet (lifelong development) emphasizes the fact that the interdependence of multiple social participations at an individual level may vary continuously over time (Elder et al. 2003). To develop a methodology that corresponds to these two tenets, we investigate the options currently available and define the prerequisites for simultaneously modeling distinct life sphere alignments, while also taking their interdependence into account.

3. OPTIMAL MATCHING ANALYSIS

Due to methodological or computational limitations, sequence analysis in the social sciences has until recently focused mainly on (1) one-dimensional social trajectories; (2) statuses belonging to different social spheres recoded prior to data processing; or (3) summed interindividual distances measured independently on distinct one-dimensional trajectories. In doing this, the measurement of the similarity between two pairs of trajectories does not take full account of the possible interactions that may occur at some points of these linked sequences during the alignment process. Three different strategies have been used in OMA to measure life trajectories along several dimensions. The first consists of using typologies from one-dimensional analyses (e.g., occupational trajectories) as response variables in a logistic regression model that includes indicators of other trajectories (e.g., number of children) as predictor variables (Widmer et al. 2003; Levy, Gauthier, and Widmer 2006). This approach to a large extent disregards the longitudinal information provided by the predictor variables.

MULTICHANNEL SEQUENCE ANALYSIS 5

Footnote 3: There are promising techniques for multiple sequence alignment, whereby all sequences are simultaneously compared to all others, but these techniques are poorly suited to large samples and divergent sequences (Claverie and Notredame 2003).
Footnote 4: Insertion and deletion are equivalent and are referred to as indel.
Footnote 5: The question of the costs necessary to align sequences is a central methodological debate in the use of OMA by social scientists (e.g., Abbott and Tsay 2000; Levine 2000; Wu 2000). Recently, significant advances toward empirical, data-based cost-setting offer objective means of defining the relationships between the elements to be compared (Gauthier et al. 2009; Aisenbrey and Fasang 2010).
Footnote 6: For a closer description of the algorithm, see, for example, Kruskal (1983).
Footnote 7: The general principle of cluster analysis consists of grouping individuals according to a systematic rule. In this paper, we use the hierarchical Ward's algorithm, which aims to minimize the intragroup and to maximize the intergroup variance of interindividual distances.
A second strategy involves retrospectively combining the results obtained from various independent OMA into distinct types of trajectories (e.g., Han and Moen 1999). Since this approach sums interindividual distances from consecutive OMA, it is akin to cross-analyzing typologies stemming from distinct one-dimensional analyses of the same individuals. The main problem with such an approach is that it does not accurately take into account the local or temporal interdependence of the trajectories under study, because the respective types they belong to are modeled independently of one another. Moreover, this combination of typologies produced by cross-tabulating the categorical variables stemming from one-dimensional OMA may lead to an overestimation of the number of relevant types, with many types being poorly populated and therefore noninformative. In particular, the approach suffers from a lack of parsimony and a potential sensitivity to noisy data, as we demonstrate below. Furthermore, as each dimension is analyzed and clustered independently, it is impossible to use regular clustering quality estimates to decide on the number of types present in the data (Mojena 1977; Milligan and Cooper 1985, 1987). Moreover, the data in each dimension may not be equally reliable or informative. While the combination of alternative channels may compensate for this inequality, the separate treatment of dimensions will lead to spurious alignments, which may then result in the creation of artificial typologies. A third and more interesting strategy is based on combining two or more alphabets (e.g., Stovel, Savage, and Bearman 1996; Abbott and Hrycak 1990; Blair-Loy 1999; Pollock 2007; Dijkstra and Taris 1995;8 Elzinga 2003). An alphabet is a collection of characters bijectively associated with an ensemble of distinct statuses and characteristic of a given life course dimension (e.g., family, occupational, residential).
For this purpose, an extended alphabet is generated by combining the individual alphabets associated with specific channels. There is, however, a problem associated with this strategy: since it allows many possibilities for estimating the substitution costs associated with the extended alphabet, it becomes increasingly difficult to justify the choice of a given cost scheme as the number of categories grows larger and more heterogeneous. Furthermore, depending on the number of channels, the extended alphabet becomes uncomfortably large (Han and Moen 1999). Take, for instance, two channels with three statuses each. Family statuses are given a specific code for singlehood, marriage, or divorce/widowhood. Occupational status is recorded as "at home," "part-time," or "full-time." In this scenario, there is no rationale for deciding a priori how to set the costs stemming from the combination of "at home"/"marriage" versus "single"/"part-time," or any other combination of statuses. Moreover, each dimension's local contribution to the overall interindividual distance, as well as the particular unfolding of each set of linked trajectories, remains hidden or unknown.

Footnote 8: These authors use a different algorithm from that of Needleman and Wunsch (1970), on which many OMA are based.

4. MULTICHANNEL SEQUENCE ANALYSIS

The multidimensional approach we have developed is based on the assumption that taking this local contribution into account differs from using an extended alphabet, since each dimension differentially influences the final alignment. Therefore, a systematic approach is needed in order to deal with these issues. We propose a multidimensional approach in which (1) the dimensions under study are used simultaneously during the alignment process; (2) no enumeration of an extended alphabet is needed; and (3) the combination of cost estimations is as explicit as possible and is dealt with using a standard parameter.
Given an alphabet containing a finite number of characters, take two sequences I and J based on characters belonging to that alphabet.9 Consider the costs associated with insertions and deletions (henceforth called indels and abbreviated d), as well as the substitution costs given by a cost matrix C, where C_{s_i s_j} is the cost of aligning s_i, the ith character of I, against s_j, the jth character of J.10 In this paper, for simplification purposes, we set a cost of one for all substitutions involving two different characters. The substitution of a character with itself yields a cost of zero, and the cost associated with an indel is set to half that of a substitution.11 The optimal alignment score can then be computed using the following recursion:

F(i, j) = min { F(i − 1, j − 1) + C_{s_i s_j};  F(i − 1, j) + d;  F(i, j − 1) + d }    (1)

Each line in equation (1) defines a possible optimal match score of two subsequences, depending on whether it is less costly at this point to substitute, delete, or insert characters in order to align the subsequences. For instance, F(i − 1, j − 1) corresponds to the optimal match score of the subsequence containing characters 1 to i − 1 of sequence I against the subsequence containing characters 1 to j − 1 of sequence J. As such, this equation defines a recursion in which the score of any alignment F(i, j) can be estimated by considering an optimal extension of the three shorter alignments F(i − 1, j), F(i − 1, j − 1), and F(i, j − 1). Considering that each of these shorter alignments was already an optimal matching of the associated substrings, F(i, j) will also be optimal (Durbin et al. 2002:20).12

We take the OMA concept a step further and extend it to the use of different information sources associated with individual trajectories. We name this extension multichannel sequence analysis (MCSA). In MCSA, each individual is associated with two or more distinct channels, each tapping a distinct life trajectory within a specific sphere (e.g., occupation, family, housing, location, health) by means of a specific alphabet. The channels associated with a given individual are synchronized so that, for example, the xth character of the family channel and the xth character of the occupational channel correspond to the same year for that individual. For instance, given two individuals A and B, one can express the MCSA example given in Figure 8 as two bidimensional sequences: A = {(m, z), (m, t), (l, t), (l, t), (m, t)} and B = {(n, y), (l, z), (l, z), (n, z)}, where each doublet in parentheses characterizes the situation at a given time point; the first and second positions in the doublet correspond to the channels of family and occupational participation, respectively.

Footnote 9: In practice, most algorithms are based on existing sets of characters, such as the English alphabet (26 characters) or the ASCII character table (127 characters, which may be complemented with the 128 extended ASCII codes). Taking into account a greater number of characters is not a limitation per se but may require some programming.
Footnote 10: In the context of this paper, we consider that the matching of identical characters yields a null score and that mismatches are associated with same-sign, nonzero, finite costs, although other cost schemes may be found, notably in biology (Durbin et al. 2002).
Footnote 11: We choose this option to simplify the exposition. Many other weighting schemes for substitutions and/or insertions/deletions may be considered (Thompson et al. 1994; Durbin et al. 2002; Widmer et al. 2003). We propose elsewhere a method that estimates differentiated costs on an empirical basis (Gauthier et al. 2009).
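As an illustration only (ours, not the authors' SALTT program), the recursion in equation (1) can be sketched in Python with the unitary cost scheme described above: substitutions between distinct characters cost 1, matches cost 0, and an indel costs d = 0.5.

```python
# Sketch of the optimal matching distance of equation (1).
# Assumed cost scheme (from the text): substitution 1, match 0, indel 0.5.
def oma_distance(a, b, d=0.5):
    n, m = len(a), len(b)
    # F[i][j] = optimal score of aligning a[:i] with b[:j]
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d              # a[:i] against an empty prefix
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else 1.0
            F[i][j] = min(F[i - 1][j - 1] + sub,   # substitution or match
                          F[i - 1][j] + d,         # deletion
                          F[i][j - 1] + d)         # insertion
    return F[n][m]

# The example from the text: transforming Ac = mmllm into Bc = nlln takes
# one indel plus two substitutions, i.e., 0.5 + 2 * 1 = 2.5.
print(oma_distance("mmllm", "nlln"))  # → 2.5
```

The table F is the standard Needleman-Wunsch dynamic program; only the boundary values and the minimum over the three moves depend on the chosen cost scheme.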
Once defined for an individual, these doublets remain the same throughout the alignment procedure. Optimal matching analysis is based on a recursive algorithm that parses a pair of sequences from the first to the last element and estimates an optimal score at each point of the alignment (Sankoff and Kruskal 1983). A given optimal solution for two substrings of the sequences Ac and Bc does not imply that the optimal solution will be the same for any extension of these substrings. The optimal distance is given only after the algorithm has been applied to the entire sequences to be aligned.13 Therefore, our goal is to analyze multiple social participations while taking into account what each pair of nested individual statuses contributes over time to the overall similarity between two individual life courses. The method is general in the sense that it can use as many channels as needed, the only condition being their synchronization. In practice, taking synchronized channels into account within an OMA framework, as defined by equation (1), is relatively straightforward and only requires adapting the substitution costs C_{s_i s_j} and the indel terms so that they reflect the relationship between equivalent channels. The multichannel version of the substitution term can be expressed as follows:

C_{s_i s_j} = (1 / N_c) Σ_{c=1}^{N_c} C^c_{s_i s_j}    (2)

While a single cost matrix is used to match two individual life trajectories in standard OMA, our approach considers two or more channels per individual and uses one cost matrix for each channel. These cost matrices are standard and can be generated using any appropriate strategy, such as unitary, knowledge-based, or data-based (Gauthier et al. 2009;14 Aisenbrey and Fasang 2010). For instance, in equation (2), a channel-specific cost matrix (C^c) is associated with each channel.

Footnote 12: Of course, this strategy relies on the assumption that each position is independent and that the alignment scores are additive.
This matrix controls the cost of matching any character in the channel in question with any counterpart character for another individual. Formally, given two individuals A and B, each associated with two channels c and d, C^c_{s_i s_j} will be the cost associated with matching the ith character of channel c for individual A with the jth character of channel c for individual B. Likewise, C^d_{s_i s_j} will be the cost of matching the ith character of channel d for individual A with the jth character of channel d for individual B. Eventually, the contributions of channels c and d are averaged to yield the final cost associated with the matching of positions i and j for the two individuals, where N_c stands for the number of channels.15 Costs for the insertion/deletion (indel) of each channel are averaged in the same way. An alternative, used below, is to define the indel cost as the average off-diagonal value (AOD) of the corresponding substitution matrix (Thompson et al. 1994). This procedure can be extended to any number of channels. In the above example, as the cost matrices are unitary, matching the doublets (m, t) with (l, z) is more costly than matching (l, t) with (l, z), as the latter doublets share a common character. Hence, the optimal MCSA alignment presented in Figure 8 inserts a doublet of indels in order to match the most similar doublets.16 Eventually, the raw score of this bidimensional alignment is computed as 2 ∗ indels + 2 ∗ (mismatch/mismatch) + 2 ∗ (match/mismatch) = 2 ∗ 0.5 + 2 ∗ 2 + 2 ∗ 1 = 7.

Footnote 13: For instance, using the cost scheme presented above, aligning Ac = {m} with Bc = {l} implies either a substitution (mismatch) or two insertions/deletions, whereas aligning Ac' = {ml} (Ac plus the character l) with Bc' = {l} calls for an insertion/deletion followed by a match.
Footnote 14: In this paper, we present an empirical method for defining substitution costs using a data-based iterative procedure.
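Continuing the sketch above under the same assumptions, equation (2) replaces the single substitution cost with the average of the channel-specific costs; the indel cost of a whole doublet is likewise averaged (here 0.5). On the bidimensional sequences A and B from the text, this averaged version yields 3.5, that is, the raw (summed) score of 7 divided by N_c = 2.

```python
# Sketch of multichannel OMA (equation (2)): the cost of matching two
# multichannel positions is the mean of the unitary per-channel costs.
def mcsa_distance(a, b, d=0.5):
    n, m = len(a), len(b)

    def sub_cost(u, v):
        # u and v are tuples holding one character per channel
        return sum(0.0 if x == y else 1.0 for x, y in zip(u, v)) / len(u)

    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = min(F[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)
    return F[n][m]

# Bidimensional sequences from the text (family channel, occupational channel):
A = [("m", "z"), ("m", "t"), ("l", "t"), ("l", "t"), ("m", "t")]
B = [("n", "y"), ("l", "z"), ("l", "z"), ("n", "z")]
print(mcsa_distance(A, B))  # → 3.5
```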
There are several ways to compute a distance from an optimal pairwise alignment. We may use the raw score provided by the algorithm,17 or the percentage of identity (PID) between the two sequences (National Centre for Biotechnology Information 2004; May 2004). PID corresponds to the number of aligned identical characters divided by the length of the longer sequence (see examples in Figure 8). It is an interesting measure, as it is approximately normally distributed (Doolittle 1981) and gives a useful indication concerning the common structure of two sequences (Raghava and Barton 2006).

Footnote 15: To simplify the exposition, we have set the combination of the substitution costs at one point of the aligned sequences to the average value of the substitution costs involved at this point. Future developments should implement alternative ways of dealing with the relationship between local scores.
Footnote 16: Matching two doublets leads to either two matches, one match and one mismatch, or two mismatches. Following the cost scheme used, the resulting costs may be quite differentiated.
Footnote 17: When the lengths of the sequences to align differ, the resulting distance between them is normalized by dividing it by the length of the longer sequence.

5. EMPIRICAL ILLUSTRATION

To test this method and illustrate its strengths with an empirical example, we use data from the Swiss Household Panel (Tillmann and Zimmermann 2004). Its third wave includes a retrospective questionnaire that asks respondents to provide information on their educational, family, and occupational statuses from birth to the year of the interview. Each change in status is therefore associated with a starting date and an ending date.
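Returning to the PID measure described above, a minimal sketch (ours, not part of SALTT) computes it from a gapped alignment; the example reuses one optimal alignment of Ac = mmllm and Bc = nlln, written out by hand with "-" as the gap character.

```python
# Percentage of identity (PID): aligned identical characters divided by
# the length of the longer (ungapped) sequence.
def pid(aligned_a, aligned_b):
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != "-")
    longer = max(len(aligned_a.replace("-", "")),
                 len(aligned_b.replace("-", "")))
    return identical / longer

# One optimal alignment of mmllm and nlln: the two l's match, the second m
# is deleted, and the remaining m's are substituted by n's.
print(pid("mmllm", "n-lln"))  # → 0.4
```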
Every year of an occupational trajectory is coded using a seven-category scheme: full-time employment; part-time employment; positive interruption, such as a sabbatical or a trip abroad; negative interruption, such as unemployment or illness; full-time housework; retirement; and full-time education. A ten-category scheme is used for family trajectories: living with a biological father and mother; living with only one biological parent, either mother or father; living with one biological parent and her or his partner; living alone; living with a partner; living with a partner and one's own biological child; living with a partner and a nonbiological child; living with one's own biological child without a partner; living with friends; and other situations. A 12-category scale is used to describe education-to-work trajectories.18

Given that individual life trajectories require substantial time to differentiate from one another, and in order to build sequences that are as complete and informative as possible, we consider here only the individuals aged 45 and older who answered the retrospective questionnaire (N = 2,212). As we further restrict our sample to individuals whose trajectories contain less than 50% missing data, our final data set contains 1,847 individuals (54% women, 46% men) characterized by two sequences of statuses describing their family and occupational lives from birth to age 45.19 Technically, MCSA may be applied to any number of sequences, without restriction regarding censored or incomplete data. However, from a sociological point of view, it makes sense to compare life course sequences that have about the same size, which results here in the fact that the youngest cohorts are not taken into account.20

Our choice of an empirical example on which to test MCSA reflects the importance of debates about the influence of sociostructural factors on the divergent occupational and family trajectories of women and men (Moen 1985; Höpflinger and Debrunner 1991; Sheridan 1997; Levy, Gauthier, and Widmer 2006), as well as the availability of high-quality data on family and occupational statuses over entire lives. We use one channel to describe occupational statuses and another to model family trajectories over time. The interindividual distance matrix is computed by means of MCSA.21 We then run a cluster analysis on that matrix using Ward's hierarchical method to reveal coherent types of individual trajectories. We use stopping rules in order to estimate the relevant number of clusters to retain (e.g., Milligan and Cooper 1985; SAS Institute 2004).22 We eventually decided to keep three clear-cut bidimensional types of individual trajectories, in addition to one residual category not presented here.

Footnote 18: This scale is a combination of the seven categories of educational attainment in the classification of the Swiss Federal Statistical Office, which range from compulsory education to a university degree, and five post-educational occupational statuses (full-time, part-time, household, unemployment, other).
Footnote 19: When comparing education-to-work trajectories, we use sequences ranging from age 0 to 25, with 2,153 individuals aged 25 or older.
Footnote 20: More generally, we do not know the extent to which missing cases are missing at random or not. As occurs frequently with survey data, we may expect slight selection biases toward, for example, age, sex, occupation, or nationality (e.g., see Groves et al. 2004).
Footnote 21: We use a unitary substitution cost matrix for both channels; insertion or deletion costs are set to half of one substitution cost.
Footnote 22: We retain three criteria among those tested by Milligan and Cooper: (1) the pseudo F, which represents an approximation of the ratio between the intercluster and intracluster variance of sequences and measures the separation between all clusters at the current level; (2) Je(2)/Je(1) (Duda and Hart 1973), which may be transformed into a pseudo T2, an index that measures the separation between the two most recently joined clusters; and (3) the R squared, which expresses the size of the experimental effect. It is reasonable to look for consensus among the three criteria (Nargundkar and Olzer 1998; SAS Institute 2004). In the present study, a given cluster solution was retained for analysis only if at least one of these three criteria supported its validity.

Figures 1, 2, and 3 present three contrasting illustrations of the visualization possibilities offered by a multichannel approach to individual life courses, namely simultaneous local analysis and parallel visualization of interdependent social trajectories (e.g., see Tufte 1997).

FIGURE 1. "Parental and non-full-time employment" bidimensional trajectories (26%).

The first bidimensional type of trajectories (Figure 1, 26% of respondents) includes individuals who experience a quick transition to parenthood. After a long stay with their two biological parents, they live a few years alone or with a partner before entering a long and stable period of parental life in a nuclear family. The associated occupational trajectories of the same individuals show a short period of full-time work after completing education, followed by a long period out of the job market or working part-time. Women are significantly overrepresented in this type, which we label "parental and non-full-time employment" trajectories; indeed, 92% of individuals belonging to this type are female.

FIGURE 2. "Nonparental and full-time employment" bidimensional trajectories (24%).

The second type (Figure 2, 24% of the sample) brings together people who experienced a long stay in a family of orientation composed of two biological parents, followed by a relatively long period of predominantly single living and/or childless conjugal life.
The occupational trajectories of this type consist nearly exclusively of full-time activity. We name this nongendered type "nonparental and full-time employment" trajectories. In contrast to the first type, the proportions of men and women in this type are roughly equal.

FIGURE 3. "Parental and full-time employment" bidimensional trajectories (30%).

The third type (Figure 3, 30% of the sample) comprises a large majority of men (92%) who follow family trajectories similar to those presented in Figure 1 and whose employment activity is stable and full-time. Further decomposition of the residual category (not presented here) reveals interesting minority patterns, such as conjugal trajectories associated with non-full-time occupational activities (7%, women overrepresented), or parental trajectories combined with the long-term full-time paid work of individuals who experienced their own parents' separation during childhood (7%, no gender bias).

6. VALIDATION

In the following sections, which use the data from the Swiss Household Panel, we test the extent to which MCSA produces more consistent results than regular OMA according to three criteria: (1) parsimony (it reduces complexity), (2) reliability (it takes advantage of channel interdependence), and (3) robustness (it resists noise and distortion).

6.1. Reduction of Complexity

Based on the two distinct sequences of statuses for each individual in our data set, three distance matrices are produced. Two of them correspond to one-dimensional analyses performed separately on family and occupational trajectories, whereas the third stems from MCSA applied simultaneously to both trajectories and corresponds to the empirical example presented above.23 We then run a cluster analysis on these matrices, using Ward's hierarchical method (Ward 1963). The number of clusters actually present in the data is estimated using the stopping rules presented above.
For both one-dimensional types of trajectories (family and occupational), the presence of three or five clusters is supported in the data. The same procedure suggests the presence of four clusters in the distance matrix resulting from MCSA. If we cross-combine the solutions stemming from the one-dimensional sequence analyses to build ex post multidimensional trajectories, we find typologies ranging from nine to 25 types each,24 whereas an MCSA performed on the same data drastically reduces complexity, as the stopping rules indicate the presence in the data of only four types of bidimensional trajectories.25 In the following, we will not consider the respective semantic value of these typologies but will focus instead on the extent to which this reduction is associated with a loss of information. To measure the ability of multichannel analysis to both reduce complexity and preserve information, we cross-tabulate the four-cluster multidimensional typology stemming from MCSA with the corresponding cross-combinations of one-dimensional OMA described above. The Goodman-Kruskal lambda statistic is a measure of "proportionate reduction in error" (PRE), which reflects the percentage by which knowledge of the independent variable reduces errors in predicting the dependent variable (Goodman and Kruskal 1979; Siegel and Castellan 1988; Olzak and Ritschard 1995; Confais, Grelet, and Le Guen 2005). This statistic varies between 0 (absolute independence) and 1 (perfect association).

Footnote 23: For this exploratory analysis, in order to focus on the specific features of MCSA, we use two unitary matrices of substitution, and the cost of insertion/deletion is set to half that of a substitution.
Footnote 24: Cross-combinations of three or five types of family trajectories with three or five types of occupational trajectories form, respectively, nine, 15, 15, and 25 distinct types of family-and-occupational trajectories.
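As a minimal illustration of this PRE logic (ours; the paper's tables were computed with SAS), the asymmetric lambda that predicts the row category once the column category is known can be computed from a contingency table as follows:

```python
# Goodman-Kruskal asymmetric lambda (rows predicted from columns):
# the proportional reduction of prediction errors obtained by guessing
# the modal row within each column instead of the modal row overall.
def lambda_rows_given_cols(table):
    n = sum(sum(row) for row in table)
    base_errors = n - max(sum(row) for row in table)
    cond_errors = n - sum(max(col) for col in zip(*table))
    # base_errors is zero only if all cases fall in a single row
    return (base_errors - cond_errors) / base_errors

# Toy 2 x 3 table in which each column concentrates its cases in one row,
# so the column category predicts the row category perfectly.
perfect = [[10, 0, 5],
           [0, 8, 0]]
print(lambda_rows_given_cols(perfect))  # → 1.0
```

Conversely, a table whose rows are identical gives lambda = 0: knowing the column does not improve the prediction of the row at all.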
When lambda (R|C)26 has a value of 1, it means that each column of the table has only one cell different from zero. To efficiently reduce the complexity of a contingency table, we should capture the maximum information available in the rows, with minimum overlapping from one row to another in the same column, as schematically presented in Figure 4. This is exactly what we get from cross-tabulating MCSA with the combined OMA, that is, many cells with no cases or very few cases, and many cells with high column percentages and no cells in the same column with comparably high scores.27

FIGURE 4. Schematic representation of the association between the cross-combination of family (F) and occupational (O) monochannel typologies containing three types each (respectively F1, F2, F3 and O1, O2, O3) and the MCSA typology containing four types based on the same data. [Schematic table: rows MCSA1 to MCSA4, columns F1*O1 to F3*O3; X = most cases of a column are concentrated in a single cell; Y = an important proportion of cases is distributed over more than one cell of the same column.]

Table 1 shows the degree of association (lambda and contingency coefficients) between family and occupational types of trajectories computed either with MCSA (four clusters) or with cross-combined monochannel solutions (three or five clusters, respectively). The contingency coefficients in Table 1 show a strong association between multichannel and cross-combined monochannel solutions. The asymmetric lambda R|C is systematically higher than the asymmetric lambda C|R, indicating that knowing the distribution of the combined monochannel solutions allows for better predictions of the multichannel solution distribution. Put another way, MCSA efficiently reduces the complexity of the data while conserving most of the relevant information. More than 80% of the MCSA solution may be predicted by the cross-combined one-dimensional OMA distributions, whereas the reduction in complexity (i.e., the difference between the number of cells in the cross-combined and multichannel solutions, divided by the number of cells in the cross-combined solution) is, respectively, 56%, 73%, and 84%. The asymptotic standard error (ASE) values are much lower than the lambda values. This means that the 95% confidence interval limits of the lambdas do not contain zero (data not shown), suggesting that these results may be considered statistically significant (SAS technical support, private communication, 2006).

Footnote 25: The stopping rules also reveal a seven-type solution for the MCSA. Its association with the cross-combined monochannels is very similar to that of the four-cluster solution.
Footnote 26: This is called the asymmetric lambda, which predicts the row distribution (R) under the condition that one knows the column distribution (C).
Footnote 27: Due to the size of the contingency tables used in the tests, we decided to provide a schematic example of the situation (Figure 4) and to summarize the results by indicating only the values of the lambdas and the contingency coefficients (Table 1).

6.2. Interdependence

Starting from the results presented in Table 1, we now turn to the extent to which the statistical association between individual trajectories unfolding in distinct social spheres influences the quality and reliability of the MCSA features described above. We therefore first cross-tabulate the categorical variables corresponding to the one-dimensional
typologies of family trajectories with those of occupational trajectories (Table 2).28

TABLE 1
Association Between Categorical Variables (Asymmetric Lambda) Corresponding to Types of Trajectories Stemming from Either MCSA or Cross-Combined One-Dimensional OMA

Association with the MCSA typology (4 clusters):

Combination 1 (9 clusters): Family (3 clusters) ∗ Occupational (3 clusters); contingency table 1: 9 ∗ 4 = 36 cells
  Lambda C|R = 0.4641 (ASE 0.0124); Lambda R|C = 0.7975 (ASE 0.0110); Contingency coefficient = 0.8197
Combination 2 (15 clusters): Family (5 clusters) ∗ Occupational (3 clusters); contingency table 2: 15 ∗ 4 = 60 cells
  Lambda C|R = 0.3436 (ASE 0.0128); Lambda R|C = 0.8237 (ASE 0.0117); Contingency coefficient = 0.8237
Combination 3 (15 clusters): Family (3 clusters) ∗ Occupational (5 clusters); contingency table 3: 15 ∗ 4 = 60 cells
  Lambda C|R = 0.3772 (ASE 0.0128); Lambda R|C = 0.8006 (ASE 0.0113); Contingency coefficient = 0.8238
Combination 4 (25 clusters): Family (5 clusters) ∗ Occupational (5 clusters); contingency table 4: 25 ∗ 4 = 100 cells
  Lambda C|R = 0.2659 (ASE 0.0132); Lambda R|C = 0.8463 (ASE 0.0118); Contingency coefficient = 0.8320

ASE = asymptotic standard error.

Footnote 28: According to our stopping rules, we consider for both types of trajectories a three- and a five-type typology.

Table 2 shows that family and occupational types of trajectories have a strong statistical association. The value of the likelihood ratio chi-square is larger when the number of clusters is greater; the
MULTICHANNEL SEQUENCE ANALYSIS 21 TABLE 2 Association Between Categorical Variables (Likelihood Ratio Chi-square) Corresponding to Types of Family and Occupational Trajectories Stemming from One-dimensional OMA Cross-Tabulated Types of Trajectories Family (3 types) ∗ Occupational (3 types) Family (3 types) ∗ Occupational (5 types) Family (5 types) ∗ Occupational (3 types) Family (5 types) ∗ Occupational (5 types) df 4 8 8 16 LR χ 2 27.5998 32.2821 28.1240 35.4739 p Value <.0001 <.0001 0.0005 0.0034 Family (3 types) = three types of family trajectories; Occupational (5 types) = three types of occupational trajectories; LR χ 2 = likelihood ratio chi-square; df = degree of freedom. N = 1847. significance level stays under the threshold of 0.01 but decreases slightly as the number of types of trajectories increases. From this result we hypothesize that the use of MCSA provides better results when the types of one-dimensional trajectories are statistically associated. Two life spheres are considered interdependent when the types stemming from OMA performed independently on each of the corresponding trajectories are associated. 29 As mentioned earlier, it is the common information implied by interdependence that allows MCSA to reduce the complexity of multidimensional typologies by locally “deducing” a channel’s missing or hidden information. Therefore, in order to test this hypothesis, we focus on other multiple social participations over time—namely, family and education-to-work trajectories. To differentiate education-to-work from occupational trajectories, we limit the period of observation from birth to age 25. 
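The asymmetric lambdas reported in Table 1 are Goodman–Kruskal proportional-reduction-in-error statistics (Goodman and Kruskal 1979). As a minimal illustration of how lambda C|R is obtained (the function name and toy data below are ours, not part of SALTT or of the SAS runs used in the paper):

```python
from collections import Counter

def goodman_kruskal_lambda(pairs):
    """Asymmetric Goodman-Kruskal lambda for predicting the column
    variable C from the row variable R (lambda C|R): the proportional
    reduction in prediction error gained by knowing R."""
    n = len(pairs)
    col_counts = Counter(c for _, c in pairs)
    # Errors made when always predicting the overall modal category of C.
    e_without = n - max(col_counts.values())
    # Errors made when predicting the modal category of C within each row.
    by_row = {}
    for r, c in pairs:
        by_row.setdefault(r, Counter())[c] += 1
    e_with = sum(sum(cnt.values()) - max(cnt.values())
                 for cnt in by_row.values())
    return (e_without - e_with) / e_without

# Toy data in which R predicts C perfectly, so lambda C|R = 1.
data = [("r1", "a"), ("r1", "a"), ("r2", "b"), ("r2", "b"), ("r3", "c")]
print(goodman_kruskal_lambda(data))  # 1.0
```

A lambda of 1 means the predictor removes all prediction errors; the values around 0.80 for lambda R|C in Table 1 mean the cross-combined monochannel types remove about 80% of the errors made when predicting the multichannel type blindly.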
One-dimensional OMA performed on these trajectories, together with the usual stopping rules, indicates a two- or five-cluster solution for the first channel (family trajectories), a three- or five-cluster solution for the second (education-to-work trajectories), and a four-cluster solution for the typology stemming from MCSA.31 Cross-tabulations of the categorical variables based on each one-dimensional typology are also created (Table 3).

29 Association is measured using the likelihood ratio chi-square and asymmetric lambda.
30 To support the comparison with the results presented in Table 2, we measure the association between family and occupational trajectories over a 25-year period and still find similarly high degrees of association. We conclude that the absence of association between family and education-to-work trajectories is therefore not due primarily to the length of the trajectories.

TABLE 3
Association Between Categorical Variables (Likelihood Ratio Chi-square and Asymmetric Lambda) Corresponding to Types of Family and Education-to-Work Trajectories from Either MCSA or Cross-Combined One-Dimensional OMA

Cross-Tabulated Types of Trajectories               LR χ2      p Value   Lambda R|C
Family (2 types) * Educ-work (3 types)               2.7324    0.2551    0.0000
Family (2 types) * Educ-work (5 types)               4.2019    0.3794    0.0000
Family (5 types) * Educ-work (3 types)               5.8219    0.6672    0.0000
Family (5 types) * Educ-work (5 types)               8.1966    0.9428    0.0000
MCSA (4 types) * [Family (2) * Educ-work (3)]      833.3803    <.0001    0.3257
MCSA (4 types) * [Family (2) * Educ-work (5)]      836.3845    <.0001    0.3257
MCSA (4 types) * [Family (5) * Educ-work (3)]     1669.0510    <.0001    0.4397
MCSA (4 types) * [Family (5) * Educ-work (5)]     1675.4887    <.0001    0.4397

Family = family trajectories; Educ-work = trajectories of the transition between education and work; MCSA = multichannel sequence analysis of these trajectories. The number of types considered is indicated in parentheses.
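The likelihood-ratio chi-square used in Tables 2 and 3 can be sketched as follows. The counts below are invented for illustration; the paper's statistics were computed with SAS on N = 1847 trajectories:

```python
import math

def lr_chisquare(table):
    """Likelihood-ratio chi-square (G2) for a two-way contingency table,
    given as a list of rows of observed counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:
                expected = row_tot[i] * col_tot[j] / n
                g2 += 2.0 * obs * math.log(obs / expected)
    # Degrees of freedom for an r x c table: (r - 1)(c - 1).
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g2, df

# Hypothetical 3 x 3 cross-tabulation of family by occupational types;
# df = (3 - 1) * (3 - 1) = 4, as in the first row of Table 2.
g2, df = lr_chisquare([[40, 10, 5], [12, 35, 8], [6, 9, 30]])
```

G2 is compared to a chi-square distribution with df degrees of freedom, which is how the p values of Tables 2 and 3 are obtained.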
The results from Table 3 show that the categorical variables representing these one-dimensional types of trajectories are not statistically associated with one another, whereas the MCSA based on family and education-to-work sequences is, logically, significantly correlated with the cross-combination of these types. Lambda values in this case, however, are much lower than those stemming from the significantly correlated one-dimensional trajectories described above. This means that the percentage reduction in error in predicting the dependent variable given by MCSA is here two to four times lower than the lambda values obtained with more highly correlated trajectories. These results confirm, to a certain extent, that MCSA is more efficient at reducing data complexity when the considered trajectories are interdependent—that is, when they share a certain amount of information.

31 The stopping rules also suggest a six-cluster solution for the MCSA. Its association with the cross-combined monochannels is quite similar to the four-cluster solution.

6.3. Resistance to Noise

The third test comparing MCSA and unidimensional OMA concerns the ability of the two approaches to "resist" noise in the data. In other words, we test the extent to which these procedures are able to identify the same structure in the data when characters in sequences are progressively and randomly replaced by characters that do not belong to the alphabet building the original sequences. From our original data set of family and occupational life trajectories, which contains only valid values, we generate 15 alternative data sets for each type of trajectory. Each of these data sets contains a progressively greater proportion of a randomly assigned unknown status compared to the original data set (from 2% to 30%, in increments of 2%).
The unknown status is associated with the same unitary substitution cost as the other statuses, and the size of the sequences remains the same after the noising process.32 We then run cluster analyses on each of the distance matrices produced by MCSA and OMA for these data sets, and cross-tabulate the typologies stemming from the original data set with those obtained using the increasingly noisy versions of that same data set. For a given type of trajectory, the number of clusters is held constant and corresponds to the types presented above (cf. Section 4.1). The degree of association between typologies (lambda coefficient) is computed for each solution and plotted in Figure 5, which compares the ability of MCSA and cross-combined OMA to identify the original data structure from its degraded signal.

Figure 5 illustrates that the four-cluster multichannel typology resists noise much better than do the other typologies. The lambda values for the former remain stable at approximately 0.85, which indicates a rather strong association with the original solution. For one-dimensional types of trajectories, the lambda values decline rapidly and show greater variation than those of the four-cluster multichannel solution.

32 The "noising" of the data is a random procedure performed by SALTT on each individual sequence.

FIGURE 5. Value of asymmetric lambda, by increasing amount of missing values, for eight types of trajectories. (y axis: Lambda R|C, 0.4–1; x axis: increasing noise/missing values, 0%–30% in 2% steps; series: Multi_4, Multi_7, Combo_9, Combo_15a, Combo_15b, Combo_25, Fam_5, Prof_5.) Multi = multichannel analysis; Combo = cross-combination of monochannel analyses; Fam = types of family trajectories as categorical variables; Prof = types of professional trajectories as categorical variables; _n = number of clusters retained.
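The noising procedure described above (performed in practice by SALTT) can be sketched as follows; the function name, toy alphabet, and toy trajectories are ours:

```python
import random

def add_noise(sequences, rate, unknown="?", seed=0):
    """Return copies of the sequences in which a proportion `rate` of
    randomly chosen positions is replaced by an 'unknown' status that
    does not belong to the original alphabet; sequence length is kept."""
    rng = random.Random(seed)
    noisy = []
    for seq in sequences:
        chars = list(seq)
        k = round(len(chars) * rate)  # number of positions to degrade
        for pos in rng.sample(range(len(chars)), k):
            chars[pos] = unknown
        noisy.append("".join(chars))
    return noisy

# Three toy 8-year trajectories degraded at a 25% noise level
# (the paper uses 2%-30% in 2% increments).
print(add_noise(["AAABBBCC", "AABBBBCC", "ABBBCCCC"], 0.25))
```

Clustering each noised data set and cross-tabulating the result with the original typology then yields the lambda values plotted in Figure 5.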
As the noising of the data occurs before the clustering procedure, each noise level considered on the abscissa of Figure 5 produces a specific cluster solution. This may explain the substantial variation from one noise level to another that is visible in Figure 5, and the peak lambda value in Figure 6 for the three-cluster solution. In the latter case, the noisy sequences lead to a splitting of the two-cluster solutions, which is not the case when clustering the original sequences that serve as references for both graphs.

To focus specifically on the behavior of MCSA with noised data, we compute the values of the asymmetric lambda for the two- to 25-cluster solutions of the original multichannel trajectories for three levels of noise in the same data (10%, 20%, and 30%) and plot them in Figure 6. Figure 6 shows that noise resistance is weakened by the increasing number of clusters and by the level of noise in the sequences; more noise is systematically associated with a weaker lambda for any given cluster solution. We must also address the extent to which the resistance to noise of a given cluster solution suggests the reliability of that solution. For instance, to what extent can we use such a result to select one cluster solution over another?

FIGURE 6. Asymmetric lambda value between a given multichannel cluster solution and its corresponding noised solutions (10%, 20%, and 30% missing). (y axis: Lambda C|R, 0.5–1; x axis: number of clusters, 2–25.)

The best solution is also the one that is more resistant to internal variations, which suggests a more stable and informative data structure.
Comparing the 25-cluster solutions for the typologies based on MCSA (Figure 6) and one-dimensional OMA (Figure 5) shows that the noisy multichannel solution predicts the original solution better than does the combination of one-dimensional OMA, although this difference is small at a noise level of 10%.

6.4. Minimizing the Distortion of Alignments

Considering two distinct dimensions of the individual life course, we use the length variation resulting from pairwise alignment as an indicator of distortion. Minimizing this variation is of special interest because each position in a sequence represents a year of life, which corresponds to a specific age. Given that some statuses and some transitions are more common at certain ages than at others, alignments with greater length variation bias the actual relations between age and social statuses. For instance, Figure 8 exemplifies how MCSA contributes to limiting distortions: the optimal alignment of Channel d alone results in a length of six, whereas when both channels are aligned simultaneously (MCSA), the length of the final alignment is five for both dimensions. In this way, MCSA keeps the chronological order of both trajectories as close to the original as possible without using indels, which allows for a better structural conservation of sequences than do systematic substitutions.

The distortion due to an alignment is defined as the sum of the products of the number of characters shifted multiplied by the size of the shift (in time units), divided by the total number of aligned character pairs. This is a standardized measure that may be used to align sequences of different lengths, although the ones used here are of equal length.

FIGURE 7. Measuring the distortion resulting from a pairwise alignment.

Position:       0123456789
seq1 aligned:   A-BBBBA-CA
seq2 aligned:   AABBB-AB-A
(seq1 original: ABBBBACA)
(seq2 original: AABBBABA)
Figure 7 gives an example of the distortion measurement resulting from a pairwise alignment. Considering the aligned sequence seq1, the three characters 'B' from the original sequence seq1 (positions 2–4) and the character 'C' (position 8) are shifted by one position (time unit) to the right. There are six aligned character pairs in the alignment, so the distortion resulting from the pairwise alignment of seq1 and seq2 is [(3 * 1) + (1 * 1)] / 6 = 4/6 ≈ 0.67.

Our aim is to test whether MCSA provides less distorted alignments than one-dimensional OMA does. Therefore, using SHP data, we compare the age distortion stemming from two separate monochannel alignment procedures for each individual—one for family and the other for occupational trajectories—with the age distortion computed using MCSA based on the same trajectories. From three data sets containing 1,847 family, occupational, and multidimensional trajectories, we obtain 1,704,781 possible alignments for each of them.33 A distortion score is computed for each alignment. To compare the alignments produced by cross-combined one-dimensional OMA and MCSA, we subtract, for each individual, the distortion score stemming from the one-dimensional alignments (family, occupational, or the larger of the two) from that produced by MCSA.

33 Number of alignments = N * (N − 1)/2 = 1847 * 1846/2 = 1,704,781.

TABLE 4
Difference in Distortion Between Multichannel (Reference) and Family, Occupational, as well as Max(Family, Occupational) Pairwise Alignments

Sequences Aligned            Family   Occupational   Max(fam., occup.)
Multichannel is better (−)     28%        26%              48%
No difference (0)              44%        50%              46%
Multichannel is worse (+)      28%        24%               6%
Total                         100%       100%             100%

N = 1,704,781 in each column. Max(fam., occup.) = larger distortion resulting from the alignment of a pair of either family or occupational trajectories for the same individual.
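The distortion measure can be formalized in several ways. The sketch below, which is ours, takes the shift of each aligned character pair to be the distance between the two characters' positions in their original, ungapped sequences, divided by the number of aligned pairs; the worked example in the text additionally counts a character aligned against a gap, so its bookkeeping differs slightly:

```python
def pair_distortion(a1, a2, gap="-"):
    """Distortion of a pairwise alignment: the sum, over aligned
    character pairs, of the positional shift (in time units) between
    the two characters' positions in their original, ungapped
    sequences, divided by the number of aligned character pairs."""
    p1 = p2 = 0          # positions in the original (ungapped) sequences
    shifts = pairs = 0
    for c1, c2 in zip(a1, a2):
        if c1 != gap and c2 != gap:   # an aligned character pair
            shifts += abs(p1 - p2)
            pairs += 1
        if c1 != gap:
            p1 += 1
        if c2 != gap:
            p2 += 1
    return shifts / pairs

# Figure 7's alignment: six aligned pairs, the three B pairs shifted
# by one time unit each.
print(pair_distortion("A-BBBBA-CA", "AABBB-AB-A"))  # 0.5
```

On unshifted alignments the measure is zero, and it grows as indels push matched years of life further apart.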
A resulting negative value indicates that the distortion stemming from MCSA is smaller than that resulting from the one-dimensional alignments.34 Table 4 shows the distortion differences between MCSA (reference) and one-dimensional OMA based on family and occupational trajectories. For each individual, we also consider the larger distortion produced by either alignment.

Table 4 shows that in the majority of cases, MCSA provides alignments that are less distorted than, or as distorted as, those of regular one-dimensional OMA. Since MCSA represents a combination of two alignments, it would be understandable if it produced more distorted alignments than one-dimensional OMA. In fact, MCSA clearly produces better results than the one-dimensional OMA performed on the channel associated with the most distorted alignments: MCSA generates less distorted alignments in approximately 50% of cases, and distortion from MCSA is greater than that from one-dimensional OMA in only 6% of alignments. In other words, MCSA's distortions are almost always smaller than, or equal to, those of two one-dimensional OMA applied separately. By reducing the distortion of sequences in the alignment process, MCSA offers a better conservation of structural and temporal patterns (Lesnard and Saint Pol 2004). Table 5 presents the paired t-test values for these comparisons and shows that MCSA significantly reduces the structural and temporal distortion of aligned sequences (p < 0.0001).

34 The comparison of individual distortion scores equals the distortion score measured on MCSA minus the largest distortion score measured on the alignment of either family or occupational trajectories.

TABLE 5
Paired t-Test on Distortion Values Resulting from Either MCSA or One-Dimensional OMA

Comparison                   Mean     Standard Deviation   t-Test Value   p
Family − MCSA                0.2911        2.4628             154.33      <.0001
Occupational − MCSA          0.1659        1.7331             124.98      <.0001
Max(fam., occup.) − MCSA     1.0948        2.4066             594.00      <.0001

N = 1,704,781.
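The paired t statistics of Table 5 follow the usual definition, t = mean difference / (SD / √N). A minimal sketch (the helper function and toy differences are ours):

```python
import math

def paired_t(diffs):
    """Paired t statistic for a list of per-individual differences
    (e.g., family-alignment distortion minus MCSA distortion)."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Toy differences; a positive t favors the reference (MCSA).
t_toy = paired_t([0.3, 0.1, 0.5, 0.0, 0.2])

# The same formula recovers Table 5's Family - MCSA t value from its
# summary statistics: 0.2911 / (2.4628 / sqrt(1,704,781)) ~ 154.3.
t_family = 0.2911 / (2.4628 / math.sqrt(1704781))
```

With N as large as 1,704,781 even small mean differences yield very large t values, which is why the text also points to the standard deviations when discussing variability.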
This reduction is greatest when comparing MCSA to regular OMA performed on the sequences that produce the greatest distortions. Despite the large number of cases, which improves performance in significance tests, the relatively high standard deviations indicate substantial variability in the data. Moreover, non-paired t-tests on the same data (not shown in Table 5) indicate that MCSA produces significantly smaller standard deviations than does one-dimensional OMA (p < 0.0001).

7. FURTHER VALIDATION ON RANDOM DATA

Having already shown the favorable properties of MCSA compared to regular OMA performed on existing social science data, this section assesses the extent to which MCSA also provides qualitatively similar results when used on random data. To compare various multidimensional approaches using OMA, we use two simulated data sets (N = 2001 pairs of sequences), each corresponding to a specific channel. In these simulated data, the alphabet and length of sequences are kept constant: within each pair, the first sequence has a length of five characters and the second a length of four, and the alphabets of the first and the second channel contain three and four characters, respectively (cf. Figure 8).35

We first use the simulated data to evaluate whether different approaches to multidimensional sequence analysis produce similar results. We compare four ways of computing multidimensionality: the ex-post sum of the distance matrices produced by two independent OMA, MCSA, and two ex-ante recodings of both channels into one unique channel (called "extended 1 and 2" in Figure 8). For aligning pairs of sequences (nested or separately), we use unitary substitution cost matrices.

35 To create the sequences, we use Perl's rand() function, which produces uniformly distributed pseudo-random numbers (Wall et al. 2000).
For the extended alphabet, when comparing two characters of the recoded sequences, the cost is one unit if the two recoded characters have a character in common, and two units if they have no character in common.37 The cost is zero when both pairs of characters are identical for two given individuals. In the first case, the value of the indel is set to the average off-diagonal value (AOD) of the substitution cost matrix, while in the second case, the indel is set to half of this value38 (see Figure 8).

We compare these different approaches using the degree of similarity between all pairs of sequences, which is given by either the raw score of the alignment or the PID. Using the simulated data sets and the cost schemes described above, we compute linear correlation coefficients among alignment scores stemming from the different ways of assessing the distance between multidimensional sequences, as shown in Table 6. The distances produced using either extended alphabets, the ex-post sum of monochannel distances, or MCSA are strongly associated, although not identical. For the two latter methods, the use of either percent identity or raw score produces the same correlations with the other measures of multidimensional distances. Since it otherwise yields the most differentiated correlations, we retain PID to estimate the distance between sequences (May 2004).

The six measures of multidimensional distances between individual trajectories are all based on some linear function of one-dimensional OMA distances. They differ essentially in the timing of the contribution of each channel—that is, either before, during, or after the alignment process. As one can read from Table 6, the results produced by MCSA appear here as a representative common denominator of the other measures.
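The extended-alphabet cost scheme (zero, one, or two units depending on how many channels the combined characters share) can be sketched as follows, using the example of footnote 37; the function name and tuple encoding are ours:

```python
def extended_cost(pair_a, pair_b):
    """Substitution cost between two positions of sequences recoded
    over an extended (combined) alphabet.  Each argument is the tuple
    of statuses observed on the two channels at one position: cost 0
    if the combined characters are identical, 1 if they share the
    status of one channel, 2 if they share none."""
    if pair_a == pair_b:
        return 0
    shared = sum(1 for x, y in zip(pair_a, pair_b) if x == y)
    return 1 if shared >= 1 else 2

# Footnote 37's example: "f" = (m, z), "j" = (m, t), "g" = (n, z).
print(extended_cost(("m", "z"), ("m", "t")))  # 1: "m" in common
print(extended_cost(("m", "t"), ("n", "z")))  # 2: nothing in common
```

The indel cost is then derived from this matrix (the full AOD or half of it, as described above).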
This means that they are at the same time as variable as the five others but less sensitive to the computation options, which is a first indication of the robustness of MCSA.

36 This means that substituting any character with another one has a cost of 1, whereas substituting a character with itself has a cost of 0.
37 For instance, if the recoded character "f" stands for "m" and "z" at the same position in channels 1 and 2, "j" stands for "m" and "t," and "g" stands for "n" and "z," then the cost of substituting "f" and "j" is twofold lower than the cost of substituting "j" and "g."
38 At this point, we did not consider the differentiation between the gap opening penalty (GOP) and the gap extension penalty (GEP; Thompson et al. 1994), or between internal and external gaps.

FIGURE 8. Comparing MCSA to summing two distance matrices or using an extended alphabet (random data sets).

8. DISCUSSION

This paper explores two key points regarding the methodological potential of multichannel sequence analysis (MCSA). First, MCSA offers an overall advantage over conventional OMA, since it allows for the simultaneous analysis of multiple social trajectories without prior recoding of the data. MCSA produces an extended alphabet that corresponds to the combination of the two or more alphabets defining the different types of sequences used in the analysis. The main advantage of MCSA over other extended-alphabet methods (Dijkstra and Taris 1995; Stovel et al.
1996) is that it avoids defining, coding, and weighting all combinations prior to the analysis, and it therefore allows for the use of weighting strategies specific to each dimension (family, occupation, and so forth) considered separately, such as the data-based training procedure (Gauthier et al. 2009). Keeping the specific codification of each trajectory distinct allows for a better interpretation of MCSA's typologies, since nested trajectories are represented as parallel processes associated with substitution matrices that are in themselves informative.

TABLE 6
Similarity Matrix (Linear Correlation Coefficients) Between Six Measures of Two-Dimensional Pairwise Distances Using Random Data Sets (N = 2001)

              PID_MCSA  Sum_PID_OMA  Score_ext1  Score_ext2  PID_ext1  PID_ext2
PID_MCSA        1.000
Sum_PID_OMA     0.803      1.000
Score_ext1      0.898      0.773       1.000
Score_ext2      0.960      0.748       0.826       1.000
PID_ext1        0.754      0.517       0.704       0.798      1.000
PID_ext2        0.739      0.540       0.577       0.898      0.736     1.000

Applied to social science data from the Swiss Household Panel, the illustrative application of MCSA shows that it produces more convincing results than does independent OMA. Moreover, it provides semantically and graphically straightforward patterns of the ways multidimensional social participations unfold over time, a feature that represents one of the central developing fields of sequence analysis.

Second, our results on the same data show that MCSA performs best when the dimensions under study are interdependent. Comparing the analysis of correlated versus uncorrelated monochannels, we find that MCSA leads to a greater reduction of complexity when the trajectories are statistically associated and when the number of clusters is relatively small. This outcome provides a first indication of MCSA's range of applications: It is precisely when nested trajectories are interdependent—that is, when they share information—that MCSA is required.
Additionally, the results show the ability of MCSA to simplify the outputs obtained from regular OMA. This simplification is achieved by dramatically reducing the number of categories involved, while retaining a high proportion of the original information. This means that considering interdependence increases the complexity of the nested trajectories under study, but at the same time reduces the number of relevant combinations, compared to cross-combining results from independent one-dimensional OMA.

We also test the resistance of MCSA to noisy data. It appears that MCSA is less sensitive to increasing noise in the data than are combinations of regular one-dimensional OMA. MCSA uses the interdependence that exists between nested trajectories as an additional source of information to identify relevant multidimensional patterns, even when some of the data are missing.

This paper also examines the issue of sequence distortion produced by the use of insertions and deletions that change sequence length. Carrying out an alignment modifies the correspondence between actual age and the position in the sequence prior to alignment. In measuring the distortion resulting from MCSA and conventional OMA, we find that MCSA is nearly always superior or equivalent to conventional OMA in minimizing this distortion; that is, MCSA performs better by keeping the length of aligned sequences as close as possible to that of the original sequences. We demonstrate the ability of MCSA to produce less distorted alignments and to take the timing of episodes into account more accurately than combinations of conventional OMA. This feature is particularly important when considering not only relative duration but also dimensions such as social age, which take into account the fact that some social statuses or transitions are more common at certain points in life than at others.
In other words, MCSA produces alignments that optimize the relationship between age and social statuses over time. Finally, using random data, we demonstrate that the distances produced by MCSA differ from those produced by either summing pairwise distances from independent OMA or recoding the data prior to analysis. Furthermore, MCSA yields the strongest correlations with the results of alternative measures of multidimensional distances. It therefore appears to be the most representative technique among those that we examined.

Given the number of dimensions that may play a role in the variability of results obtained through either method, this paper provides initial guidance on the potential advantages of MCSA. Further developments of the method, along with in-depth testing, are needed to continue improving MCSA. Our main expectation regarding MCSA is that it significantly reduces the "signal differences" between channels when the channels are related. It is precisely the correlation between channels that allows the alignment procedure to benefit from the information contained in one sequence but not another, and ultimately to produce multichannel alignments that reduce the complexity, distortion, and loss of signal due to such noise or to missing values. This informational asymmetry between channels may vary over time (i.e., it is position-specific), and it may depend on specific stages of the life course or on specific features of social age. In some stages, for instance, occupational status is poorly or not at all informative (e.g., during school years). In such cases, information from the other channel(s) should be given preference. In other words, if one channel is more informative than another at a given point in the sequence, we should rely more heavily on the more informative channel to compute the multichannel alignment. In the same way, if there are missing values on one channel, we should "let the other channel talk" by giving it more weight.
Future developments should implement heuristic procedures to systematize methods for dealing with such information asymmetries between channels. In this paper, we have set the combination of substitution costs at one point to the average value of the two substitution costs involved at that point. An alternative would be to follow some theory-based weighting scheme (e.g., costs set to the highest or the lowest value), or to rely on empirically determined substitution costs. Overall, MCSA presents two main advantages over one-dimensional OMA: It allows both for the discovery of regularities within multidimensional trajectories and for the reduction of the effects of noise, whether due to missing data, poorly recorded information, or heterogeneous information content.

APPENDIX

The computations presented in this paper are encapsulated in the program SALTT (Search Algorithm for Life Trajectories and Transitions), an open-source program written in C (Notredame, Bucher, Gauthier, and Widmer 2005). It can be compiled and installed on any UNIX-like platform, including Linux, Cygwin, and MacOSX. The package and its documentation can be downloaded from: http://www.tcoffee.org/saltt/.

REFERENCES

Abbott, A. 1992. "From Causes to Events: Notes on Narrative Positivism." Sociological Methods and Research 20 (4):428–55.
———. 2001. Time Matters: On Theory and Method. Chicago, IL: University of Chicago Press.
Abbott, A., and A. Hrycak. 1990. "Measuring Resemblance in Sequence Data: An Optimal Matching Analysis of Musicians' Careers." American Journal of Sociology 96 (1):144–85.
Abbott, A., and A. Tsay. 2000. "Sequence Analysis and Optimal Matching Methods in Sociology." Sociological Methods and Research 29 (1):3–33.
Aisenbrey, S., and A. E. Fasang. 2010. "New Life for Old Ideas: The 'Second Wave' of Sequence Analysis Bringing the 'Course' Back Into the Life Course." Sociological Methods and Research 38 (3):420–62.
Blair-Loy, M. 1999.
"Career Patterns of Executive Women in Finance: An Optimal Matching Analysis." American Journal of Sociology 104 (5):1346–97.
Blossfeld, H.-P., and G. Rohwer. 1995. Techniques of Event History Modeling. Mahwah, NJ: Lawrence Erlbaum.
Butts, C., and J. Pixley. 2004. "A Structural Approach to the Representation of Life History Data." Journal of Mathematical Sociology 28 (2):81–124.
Clausen, J. A. 1986. The Life Course: A Sociological Perspective. Toronto: Prentice-Hall.
Claverie, J.-M., and C. Notredame. 2003. Bioinformatics for Dummies. New York: Wiley.
Confais, J., Y. Grelet, and M. Le Guen. 2005. "La Procédure FREQ de SAS: Tests d'indépendance et mesures d'association dans un tableau de contingence." La Revue Modulad 33:188–224.
Dijkstra, W., and T. Taris. 1995. "Measuring the Agreement Between Sequences." Sociological Methods and Research 24 (2):214–31.
Doolittle, R. F. 1981. "Similar Amino Acid Sequences: Chance or Common Ancestry." Science 214 (4517):149–59.
Duda, R. O., and P. E. Hart. 1973. Pattern Classification and Scene Analysis. New York: Wiley.
Durbin, R., S. Eddy, A. Krogh, and G. Mitchison. 2002. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, England: Cambridge University Press.
Elder, G. H., ed. 1985. Life Course Dynamics: Trajectories and Transitions, 1968–1980. Ithaca, NY: Cornell University Press.
Elder, G. H., M. Kirkpatrick Johnson, and R. Crosnoe. 2003. "The Emergence and Development of Life Course Theory." Pp. 3–19 in Handbook of the Life Course, edited by J. T. Mortimer and M. J. Shanahan. New York: Kluwer.
Elzinga, C. H. 2003. "Sequence Similarity: A Non-Aligning Technique." Sociological Methods and Research 31 (4):3–29.
Esser, H. 1996. "What Is Wrong with 'Variable Sociology'?" European Sociological Review 12 (2):159–66.
Gabadinho, A., G. Ritschard, M. Studer, and N. S. Müller. 2008. Mining Sequence Data in R with the TraMineR Package: A User's Guide.
University of Geneva. Retrieved January 21, 2010 (http://mephisto.unige.ch/traminer).
Gauthier, J.-A., E. D. Widmer, P. Bucher, and C. Notredame. 2009. "How Much Does It Cost? Optimization of Costs in Sequence Analysis of Social Science Data." Sociological Methods and Research 38 (1):197–231.
George, L. K. 1993. "Sociological Perspectives on Life Transitions." Annual Review of Sociology 19:353–73.
Giele, J. Z., and G. H. Elder Jr., eds. 1998. Methods of Life Course Research: Qualitative and Quantitative Approaches. Thousand Oaks, CA: Sage.
Goodman, L. A., and W. H. Kruskal. 1979. Measures of Association for Cross Classification. New York: Springer-Verlag.
Groves, R. M., F. J. Fowler, M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau. 2004. Survey Methodology. Wiley Series in Survey Methodology. New York: Wiley.
Han, S.-K., and P. Moen. 1999. "Clocking Out: Temporal Patterning of Retirement." American Journal of Sociology 105 (1):191–236.
Heinz, W. R. 2003. "From Work Trajectories to Negotiated Careers." Pp. 185–204 in Handbook of the Life Course, edited by J. T. Mortimer and M. J. Shanahan. New York: Kluwer.
Heinz, W. R., and V. W. Marshall, eds. 2003. Social Dynamics of the Life Course: Transitions, Institutions, and Interrelations. New York: Aldine de Gruyter.
Höpflinger, F. C., and A. Debrunner. 1991. Familienleben und Berufsarbeit. Zurich, Switzerland: Seismo.
Kohli, M. 1986. "The World We Forgot: A Historical Review of the Life Course." Pp. 271–303 in Later Life: The Social Psychology of Aging, edited by V. W. Marshall. London: Sage.
Kruskal, J. 1983. "An Overview of Sequence Comparison." Pp. 1–44 in Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, edited by D. Sankoff and J. Kruskal. Don Mills, Ontario: Addison-Wesley.
Lesesne, C. A., and C. Kennedy. 2005. "Starting Early: Promoting the Mental Health of Women and Girls Throughout the Life Span." Journal of Women's Health 14 (9):754–63.
Lesnard, L., and T. de Saint Pol. 2006. "Introduction aux méthodes d'appariement optimal (Optimal Matching Analysis)." Bulletin of Sociological Methodology 90:5–25.
Levine, J. H. 2000. "But What Have You Done for Us Lately?" Sociological Methods and Research 29 (1):34–40.
Levy, R. 1991. "Status Passages as Critical Life Course Transition: A Theoretical Sketch." Pp. 87–114 in Theoretical Advances on Life Course Research, edited by W. R. Heinz. Weinheim, Germany: Deutscher Studien Verlag.
———. 1996. "Toward a Theory of Life Course Institutionalization." Pp. 83–108 in Society and Biography, edited by A. Weymann and W. R. Heinz. Weinheim, Germany: Deutscher Studien Verlag.
Levy, R., J.-A. Gauthier, and E. D. Widmer. 2006. "Entre contraintes institutionnelle et domestique: les parcours de vie masculins et féminins en Suisse." Canadian Journal of Sociology 31 (4):461–89.
Macmillan, R., ed. 2005. The Structure of the Life Course: Standardized? Individualized? Differentiated?, Vol. 9. Amsterdam: JAI Press.
Macmillan, R., and S. R. Eliason. 2003. "Characterizing the Life Course as Role Configurations and Pathways: A Latent Structure Approach." Pp. 529–54 in Handbook of the Life Course, edited by J. T. Mortimer and M. J. Shanahan. New York: Kluwer Academic.
May, A. C. W. 2004. "Percent Sequence Identity: The Need to Be Explicit." Structure 12:737–38.
Milligan, G. W., and M. C. Cooper. 1985. "An Examination of Procedures for Determining the Number of Clusters in a Data Set." Psychometrika 50 (2):159–79.
———. 1987. "Methodology Review: Clustering Methods." Applied Psychological Measurement 11 (4):329–54.
Moen, P. 1985. "Continuities and Discontinuities in Women's Labor Force Activity." Pp. 113–55 in Life Course Dynamics: Trajectories and Transitions, 1968–1980, edited by G. H. Elder. Ithaca, NY: Cornell University Press.
Mojena, R. 1977. "Hierarchical Grouping Methods and Stopping Rules: An Evaluation." The Computer Journal 20:359–63.
Mortimer, J. T., and M. J.
Shanahan, eds. 2003. Handbook of the Life Course. New York: Kluwer Academic. MULTICHANNEL SEQUENCE ANALYSIS 37 ¨ Muller, N. S., A. Gabadinho, G. Ritschard, and M. Studer. 2008. “Extracting Knowledge from Life Courses: Clustering and Visualization.” Pp. 176–85 in Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery. Turin, Italy: Springer-Verlag. Nargundkar, S., and T. J. Olzer. 1998. “An Application of Cluster Analysis in the Financial Services Industry.” Presented at the Sixth annual meeting of the South East SAS Users Group (SESUG), Norfolk, Virginia. National Centre for Biotechnology Information (NCBI) 2004. Glossary. Retrived October 15, 2004 (http://www.ncbi.nlm.nih.gov/ Education/BLASTinfo/glossary2.html). Needleman, S. B., and C. D. Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology 48:443–53. Notredame, C., P. Bucher, J.-A. Gauthier, and E. Widmer. 2005. TCoffee/SALTT: User Guide and Reference Manual. Retrieved October 15, 2005. (http://www.tcoffee.org/saltt) Olzak, M., and G. Ritschard. 1995. “The Behaviour of Nominal and Ordinal Partial Association Measures.” The Statistician 44 (2):195–212. Piccarreta, R., and O. Lior. 2010. “Exploring Sequences: A Graphical Tool Based on Multi-Dimensional Scaling.” Journal of the Royal Statistical Society, Series A: Statistics in Society 173 (1):165–84. Pollock, G. 2007. “Holistic Trajectories: A Study of Combined Employment, Housing, and Family Careers Using Multiple Sequence Analysis.” Journal of the Royal Statistical Society, Series A: Statistics in Society 170:167–83. Raghava, G. P. S., and G. Barton. 2006. “Quantification of the Variation in Percentage Identity for Protein Sequence Alignments.” BMC Bioinformatics 7 (1):415. Repetti, R. L., S. E. Taylor, and T. E. Seeman. 2002. 
“Risky Families: Family Social Environments and the Mental and Physical Health of Offspring.” Psychological Bulletin 128(2):330–66. Sankoff, D., and J. Kruskal. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Don Mills, Ontario: AddisonWesley. R SAS Institute. 2004. SAS/STAT 9.1 User’s Guide. Cary, NC: SAS Institute Inc. Sheridan, J. T. 1997. The Effects of the Determinants of Women’s Movement into and out of Male-Dominated Occupations on Occupational Sex Segregation. Madison: Department of Sociology, Center for Demography and Ecology, University of Wisconsin. Siegel, S., and N. J. Castellan. 1988. Nonparametric Statistics for the Behavioural Sciences, 2nd ed. New York: McGraw-Hill. Spruijt, E., and M. de Goede. 1997. “Transitions in Family Structure and Adolescent Well-Being.” Adolescence 32(128):897–911. Stovel, K., M. Savage, and P. Bearman. 1996. “Ascription into Achievement: Models of Career Systems at Lloyds Bank, 1890–1970.” American Journal of Sociology 102(2):358–99. 38 GAUTHIER ET AL. Thompson, J., D. G. Higgins, and T. Gibson. 1994. “CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice.” Nucleic Acids Research 22:4673–80. Tillmann, R., and E. Zimermann. 2004. “Introduction: The Swiss Household Panel and the Nature of This Book.” Pp. 1–25 in Vivre en Suisse 1999–2000 [Living in Switzerland 1999–2000], edited by R. Tillmann and E. Zimermann. Bern, Switzerland: Peter Lang. Tufte, E. R. 1997. Visual Explanation, Images and Quantities, Evidence and Narrative. Cheshire, CO: Graphic Press. Wall, L., T. Christiansen, and J. Orwant. 2000. Programming Perl, 3rd ed. Sebastopol, CA: O’Reilly. Ward, J. H. 1963. “Hierarchical Grouping to Optimize an Objective Function.” Journal of the American Statistical Association 58(301):236–44. Wasserman, S., and K. Faust. 1994. 
Social Network Analysis: Methods and Applications. Cambridge, England: Cambridge University Press. Wetzler, H. P., and R. J. Ursano. 1988. “A Positive Association Between Physical Health Practices and Psychological Well-Being.” The Journal of Nervous and Mental Disease 176 (5):280–83. Widmer, E., R. Levy, A. Pollien, R. Hammer, and J.-A. Gauthier. 2003. “Une analyse exploratoire des insertions professionnelles et familiales: Trajectoires de couples r´ sidant en Suisse.” Revue suisse de Sociologie 29(1):35–67. e Wu, L. L. 2000. “Some Comments on ‘Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect’.” Sociological Methods and Research 29(1):41–64. BIOINFORMATICS Vol. 19 no. 1 2003, pages i1–i7 DOI: 10.1093/bioinformatics/btg1029 APDB: a novel measure for benchmarking sequence alignment methods without reference alignments Orla O’Sullivan 1, Mark Zehnder 3, Des Higgins 1, Philipp Bucher 3, ´ Aurelien Grosdidier 3 and Cedric Notredame 2, 3,∗ 1 Department of Biochemistry, University College, Cork, Ireland, 2 Information ´ ´ Genetique et Structurale, CNRS UMR-1889, 31, Chemin Joseph Aiguier, 13402 Marseille, France and 3 Swiss Institute of Bioinformatics, Chemin des Boveresse, 155, 1066 Epalinges, Switzerland Received on January 6, 2000; revised on Month xx, 2000; accepted on February 20, 2000 Author please check use of A and B heads is correct ABSTRACT Motivation: We describe APDB, a novel measure for evaluating the quality of a protein sequence alignment, given two or more PDB structures. This evaluation does not require a reference alignment or a structure superposition. APDB is designed to efficiently and objectively benchmark multiple sequence alignment methods. Results: Using existing collections of reference multiple sequence alignments and existing alignment methods, we show that APDB gives results that are consistent with those obtained using conventional evaluations. 
We also show that APDB is suitable for evaluating sequence alignments that are structurally equivalent. We conclude that APDB provides an alternative to more conventional methods used for benchmarking sequence alignment packages.

Availability: APDB is implemented in C; its source code and its documentation are available for free on request from the authors.

Contact: cedric.notredame@gmail.com

* To whom correspondence should be addressed.

INTRODUCTION

We introduce APDB (Analyze alignments with PDB), a new method for benchmarking and improving multiple sequence alignment packages with minimal human intervention. We show how it is possible to avoid the use of reference alignments when PDB structures are available for at least two homologous sequences in a test alignment. Using this method it should become possible to systematically benchmark or train multiple sequence alignment methods using all known structures, in a completely automatic manner.

There are strong justifications for improving multiple sequence alignment methods. Many sequence analysis techniques used in bioinformatics require the assembly of a multiple sequence alignment at some point. These include phylogenetic tree reconstruction, detection of remote homologues through the use of profiles or HMMs, secondary and tertiary structure prediction and, more recently, the identification of the nsSNPs (non-synonymous Single Nucleotide Polymorphisms) that are most likely to alter a protein function. All of these important applications demonstrate the need to improve existing multiple sequence alignment methods and to establish their true limits and potential. Doing so is complicated, however, because most multiple sequence alignment methods rely on a complicated combination of greedy heuristic algorithms meant to optimize an objective function. This objective function is an attempt to quantify the biological quality of an alignment.
Almost every multiple alignment package uses a different empirical objective function of unknown biological relevance. In practice, most of these algorithms are known to perform well on some protein families and less well on others, but it is difficult to predict this in advance. It can also be very hard to establish the biological relevance of a multiple alignment of poorly characterized protein families. See Duret and Abdeddaim (2000) and Notredame (2002) for two recent reviews of the wide variety of techniques that have been used to make multiple alignments.

Given such a wide variety of methods and such poor theoretical justification for most of them, the main option for a rational comparison is systematic benchmarking. This is usually accomplished by comparing the alignments produced by various methods with 'reference' alignments of the same sequences assembled by specialists with the help of structural information. Barton and Sternberg (1987) made an early systematic attempt to validate a multiple sequence alignment method using structure based alignments of globins and immunoglobulins. Later on, Notredame and Higgins (1996) used another collection of such alignments assembled by Pascarella and Argos (1992). More recently, it has become common practice to use BAliBASE (Thompson et al., 1999); a collection of multiple sequence alignments assembled by specialists and designed to systematically address the different types of problems that alignment programs encounter, such as the alignment of a distant homologue or long insertions and deletions. In this work, we examined two such reference collections: BAliBASE and HOMSTRAD (Mizuguchi et al., 1998), a collection of high quality multiple structural alignments.

Bioinformatics 19(1) (c) Oxford University Press 2003; all rights reserved.

There are two simple ways to use a reference alignment for the purpose of benchmarking (Karplus and Hu, 2001).
One may count the number of pairs of aligned residues in the target alignment that also occur in the reference alignment and divide this number by the total number of pairs of residues in the reference. This is the Sum of Pairs Score (SPS). The main drawback is that it is not very discriminating and tends to even out differences between methods. The more popular alternative is the Column Score (CS) where one measures the percentage of columns in the target alignment that also occur in the reference alignment. This is widely used and is considered to be a stringent measure of alignment performance. In order to avoid the problem of unalignable sections of protein sequences (i.e. segments that cannot be superimposed), it is common practice to annotate the most reliable regions of a multiple structural alignment and to only consider these core regions for the evaluation. In BaliBase, the core regions make up slightly less than 50% of the total number of alignment columns. Such use of multiple sequence alignment collections for benchmarking is very convenient because of its simplicity. However, a major problem is the heavy reliance on the correctness of the reference alignment. This is serious because, by nature, these reference alignments are at least partially arbitrary. Although structural information can be handled more objectively than sequence information, the assembly of a multiple structural alignment remains a very complex problem for which no exact solution is known. As a consequence, any reference multiple alignment based on structure will necessarily reflect some bias from the methods and the specialist who made the assembly. The second drawback is that given a set of structures there can be more than one correct alignment. This plurality results from the fact that a structural superposition does not necessarily translate unambiguously into one sequence alignment. 
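As an aside, both benchmark scores described above are simple to compute once an alignment is expressed as sets of residue pairs and columns. The following sketch is our own toy illustration, not code from the paper: `sps`, `cs`, and the two-sequence alignments are invented, and the core-region restriction used in BAliBASE is deliberately skipped.

```python
# Sketch: Sum of Pairs Score (SPS) and Column Score (CS) of a test
# alignment against a reference alignment of the same sequences.
# Alignments are lists of equal-length gapped strings; toy data only.

def aligned_pairs(aln):
    """Set of residue pairs ((seq_i, res_i), (seq_j, res_j)) aligned in aln."""
    counters = [0] * len(aln)  # next residue index in each sequence
    pairs = set()
    for col in range(len(aln[0])):
        residues = []
        for s, seq in enumerate(aln):
            if seq[col] != '-':
                residues.append((s, counters[s]))
                counters[s] += 1
        for a in range(len(residues)):
            for b in range(a + 1, len(residues)):
                pairs.add((residues[a], residues[b]))
    return pairs

def columns(aln):
    """List of columns keyed by residue indices, with None marking gaps."""
    counters = [0] * len(aln)
    cols = []
    for col in range(len(aln[0])):
        key = []
        for s, seq in enumerate(aln):
            if seq[col] != '-':
                key.append(counters[s])
                counters[s] += 1
            else:
                key.append(None)
        cols.append(tuple(key))
    return cols

def sps(test, ref):
    """Fraction of reference residue pairs recovered by the test alignment."""
    rp = aligned_pairs(ref)
    return len(aligned_pairs(test) & rp) / len(rp)

def cs(test, ref):
    """Fraction of gap-free reference columns reproduced exactly in test."""
    rc = [c for c in columns(ref) if all(x is not None for x in c)]
    tc = set(columns(test))
    return sum(c in tc for c in rc) / len(rc)

ref = ["GARFIELD", "GARF-ELD"]
test = ["GARFIELD", "GAR-FELD"]
print(sps(test, ref), cs(test, ref))  # both equal 6/7 on this toy pair
```

On this pair, a single displaced gap costs one residue pair and one column, so both scores come out at 6/7; on realistic data CS tends to drop faster, since one misplaced residue invalidates an entire column.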
For instance, if we consider that the residues to be aligned correspond to the residues whose alpha carbons are the closest in the 3-D superposition, it is easy to imagine that sometimes an alpha carbon can be equally close to the alpha carbons of two potentially homologous residues. Most structure based sequence alignment procedures break this tie in an arbitrary fashion, leading to a reference alignment that represents only one possible arrangement of aligned residues. This problem becomes most serious when the sequences one is considering are distantly related (less than 30% identity). Unfortunately, this is also the most interesting level of similarity, where most sequence alignment methods make errors and where it is important to accurately benchmark existing algorithms. The APDB method that we describe in this work has been designed to specifically address this problem and remove, almost entirely, the need for arbitrary decisions when using structures to evaluate the quality of a multiple sequence alignment. In APDB, a target alignment is not evaluated against a reference alignment. Rather, we measure the quality of the structural superposition induced by the target alignment, given any structures available for the sequences it contains. By treating the alignment as the result of some sort of structure superposition, we simply measure the fraction of aligned residues whose structural neighborhoods are similar. This makes it possible to avoid the most expensive and controversial element of the MSA benchmarking methods: the reference multiple sequence alignment. APDB requires just three parameters. This is tiny if we compare it with any reference alignment, where each pair of aligned residues can arguably be considered as a free parameter. In this work we show how the APDB measure was designed and characterized on a few carefully selected pairs of structures.
Among other things we explored its sensitivity to parameter settings and various sequence and structure properties, such as similarity, length, or alignment quality. Finally, APDB was used to benchmark known methods using two popular data sets: BaliBase and Homstrad. These were either used as standard reference alignments or as collections of structures suitable for APDB.

It should be noted that there are several methods for evaluating the quality of structure models and predictions using known structures. The development of these has been driven by the need to evaluate entries in the CASP protein structure prediction competition, and they have been reviewed by Cristobal et al. (2001). These all depend on generating structure superpositions between the model and the target and evaluating the quality of the match using, for example, the RMSD between the two, or using some measure of the number of alpha carbons that superimpose well (e.g. MaxSub by Siew et al. (2000)). In principle, this could also be used to benchmark alignment methods. However, one serious disadvantage is the requirement for a superposition, which is itself a difficult problem. A second disadvantage is the way RMSD measures behave with different degrees of sequence divergence and their sensitivity to local or global alignment differences. We have carefully designed APDB so that on the one hand it remains very simple, but on the other hand it is able to measure the similarity of the structural environments in a manner that lends itself to measuring alignment quality.

SYSTEM AND METHODS

The APDB scoring function

APDB is a measure designed to evaluate how consistent an alignment is with the structure superposition this alignment implies. Let us imagine that A and B are two homologous structures. If the structure of sequence A tells us that the residues X and Z are 9 Å apart, then we expect to find a similar distance between the two residues Y and W of sequence B that are aligned with X and Z. The difference between these two distances is an indicator of the alignment quality.

              _________ 9 Å _________
A  aaaaaaaaaaaXaaaaaaaaaaaaaaaZaaaaaaa
B  bbbbbbbbbbbYbbbbbbbbbbbbbbbWbbbbbbb
              _________ 9 Å? ________

In APDB we take this idea further by measuring the differences of distances between X:Y (X aligned with Y) and Z:W within a bubble of fixed radius centered around X and Y. The bubble makes APDB a local measure, less sensitive than a classic RMSD measure to the existence of non-superposable parts in the structures being considered. Furthermore it ensures that a bad portion of the alignment does not dramatically affect the overall alignment evaluation. The typical radius of this bubble is 10 Å, and it contains 20 to 40 amino acids. We consider two residues to be properly aligned if the distances from these two residues to the majority of their neighbors within the bubble are consistent between the two structures. In other words, we check whether a structural neighborhood is supportive of the alignment of the two residues that sit at its center. This can be formalized as follows:

X:Y is a pair of aligned residues in the alignment.
N is the number of aligned pairs of residues.
d(X,Z) is the distance between the Cα of the two residues X and Z within one structure.
Brad is the radius of the bubble set around residues X and Y (Brad = 10 Å by default).
T1 is the maximum difference of distance between d(X,Z) and d(Y,W) (T1 = 1 Å by default).
T2 is the minimal percentage of residues that must respect the criterion set by T1 for X and Y to be considered correctly aligned (70% by default).
considered_X:Y(Z:W) is equal to 1 if the pair Z:W is in the bubble defined by pair X:Y.
correct_X:Y(Z:W) is equal to 1 if d(X,Z) and d(Y,W) are sufficiently similar, as set by T1.
aligned(X:Y) is equal to 1 if most pairs Z:W in the X:Y bubble are correct, as set by T2.

considered_X:Y(Z:W) = 1 if d(X,Z) < Brad and d(Y,W) < Brad    (1)

correct_X:Y(Z:W) = 1 if d(X,Z) < Brad and d(Y,W) < Brad and |d(X,Z) - d(Y,W)| < T1    (2)

aligned(X:Y) = 1 if [ Σ_Z:W correct_X:Y(Z:W) / Σ_Z:W considered_X:Y(Z:W) ] × 100 > T2    (3)

Finally, the APDB measure for the entire alignment is defined as:

APDB Score = [ Σ_X:Y aligned(X:Y) ] / N    (4)

Given a multiple alignment of sequences with known structures, the APDB score can easily be turned into a sum of pairs score by summing the APDB score of each pair of structures and dividing it by the total number of sequence pairs considered.

Design of a benchmark system for APDB

In order to study the behavior of APDB, we used two established collections of reference alignments: BAliBASE (Thompson et al., 1999) and HOMSTRAD (Mizuguchi et al., 1998). First we extracted 9 structure based pair-wise sequence alignments from HOMSTRAD, which we refer to as HOM 9. These reference alignments were chosen so that their sequence identities (as measured on the HOMSTRAD reference alignments) evenly cover the range 17 to 90%. These alignments are between 200 and 300 residues long and are used for detailed analysis and parameterization of APDB. The PDB names of the pairs of structures are given in the figure legend for Figure 2. Next, in order to assemble a discriminating test set, we selected the most difficult alignments from HOMSTRAD. We chose alignments which had at least 4 sequences and where the average percent identity was 25% or less. This resulted in a selection of 43 alignments, which we refer to as HOM 43. BAliBASE version 1 has 141 alignments divided into 5 reference groups.
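Read together, Equations (1) to (4) amount to a few nested loops over the aligned residue pairs. The sketch below is our own illustration, not the authors' implementation: `pairs` lists aligned residue index pairs (X, Y), `coords_a` and `coords_b` hold Cα coordinates, and the input at the bottom is a toy stand-in for real PDB data.

```python
import math

# Illustrative sketch of the APDB score (Equations 1-4) for one pair of
# aligned structures. Toy data only; a real run would parse Ca coordinates
# from PDB files and the pair list from a sequence alignment.

BRAD = 10.0  # bubble radius, in Angstroms
T1 = 1.0     # maximum difference of distances, in Angstroms
T2 = 70.0    # minimal percentage of correct pairs within the bubble

def apdb_score(pairs, coords_a, coords_b):
    n_aligned = 0
    for X, Y in pairs:
        considered = correct = 0
        for Z, W in pairs:
            if (Z, W) == (X, Y):
                continue
            dxz = math.dist(coords_a[X], coords_a[Z])
            dyw = math.dist(coords_b[Y], coords_b[W])
            if dxz < BRAD and dyw < BRAD:      # Eq. (1): Z:W is in the bubble
                considered += 1
                if abs(dxz - dyw) < T1:        # Eq. (2): the distances agree
                    correct += 1
        # Eq. (3): the neighborhood supports the alignment of X and Y
        if considered and 100.0 * correct / considered > T2:
            n_aligned += 1
    return n_aligned / len(pairs)              # Eq. (4)

# Toy input: two identical eight-residue "structures", aligned one-to-one
coords = [(float(i), 0.0, 0.0) for i in range(8)]
pairs = [(i, i) for i in range(8)]
print(apdb_score(pairs, coords, coords))  # identical structures score 1.0
```

On real data, the per-pair scores would then be averaged over all structure pairs to obtain the sum-of-pairs extension described above.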
We chose all alignments where 2 or more of the sequences had a known structure. This resulted in a subset of 91 alignments from the first 4 reference groups of BAliBASE, which we refer to as BALI 91. Minor adjustments had to be made to ensure consistency between BAliBASE sequences and the corresponding PDB files. The HOM 43 and BALI 91 test sets are available in the APDB distribution.

Comparison of APDB with other standard measures

In order to compare the APDB measure with more conventional measures, we used the Column Score (CS) measure as provided by the aln compare package (Notredame et al., 2000). CS measures the percentage of columns in a test alignment that also occur in the reference alignment. In BAliBASE this measure is restricted to those columns annotated as core regions in the reference. Although alternative measures have recently been introduced (Karplus and Hu, 2001), CS has the advantage of being one of the most widely used and the simplest method available today.

Generation of multiple alignments

We compared the performance of APDB on two different multiple alignment methods. We tested the widely used ClustalW version 1.80 (Thompson et al., 1994). We also tested the more recent T-Coffee version 1.37 (Notredame et al., 2000) using default parameters.

Generation of suboptimal alignments

In order to evaluate the sensitivity of APDB to the quality of an alignment, we used an improved version of the genetic algorithm SAGA (Notredame and Higgins, 1996) in order to generate populations of sub-optimal alignments. In each case a pair of sequences was chosen in HOM 9 and 50 random alignments were generated and allowed to evolve within SAGA so that their quality gradually improved (as measured by their similarity with the HOMSTRAD reference alignment). Ten alignments were sampled at each generation in order to build a collection of alternative alignments with varying degrees of quality. This algorithm was stopped when optimality was reached, thus typically yielding collections of a few hundred alignments.

A second method for generating sub-optimal alignments was based on the PROSUP package (Lackner et al., 2000). PROSUP takes two structures, makes a rigid body superposition and generates all the sequence alignments that are consistent with this superposition, thus producing alternative alignments that are equivalent from a structural point of view. Typically PROSUP yields 5 to 25 alternative alignments within a very narrow range of RMSDs.

RESULTS AND DISCUSSION

Fine tuning of APDB

Three parameters control the behaviour of APDB: Brad (the bubble radius), T1 (the difference of distance threshold) and T2 (the fraction of the bubble neighbourhood that must support the alignment of two residues). We exhaustively studied the tuning effect of each of these parameters using HOM 9 and parameterised APDB so that its behaviour is as consistent as possible with the behaviour of CS on HOM 9. In Figure 1 we show the relationship between CS and APDB for 250 sub-optimal alignments generated by genetic algorithm for one of the 9 test cases from HOM 9, over 4 different settings of Brad, the bubble radius. While the two scoring schemes are in broad agreement, the correlation improves dramatically as Brad increases.

Fig. 1. Tuning of Brad, the bubble radius, using sub-optimal alignments of two sequences from HOM 9. Each graph represents the correlation between CS and APDB for 4 different bubble radius values (Brad of 6, 8, 10 and 12 Å). In each graph, each dot represents a sub-optimal alignment from HOM 9, sampled from the genetic algorithm.

This trend can be summarised using the correlation coefficient measured on each of the graphs similar to those shown in Figure 1. The overall results for all nine HOM 9 test cases are shown in Figure 2. These results clearly show that the behaviour of APDB is best for values of Brad of 10 Å or above.
With these values the level of correlation between CS and APDB increases and so does the agreement across all 9 test cases. We chose 10 Å as the default value in order to ensure a proper behaviour while retaining as much as possible the local character of the measure. Given the default value of 10 Å for Brad, we examined T1 and T2 in a similar fashion and found the most appropriate values to be 1 Å for T1 and 70% for T2.

Fig. 2. Correlation between the Column Score measure (CS) and APDB on HOM 9. Each HOM 9 test set is labelled according to its average percent sequence identity as measured on the HOMSTRAD reference. The horizontal axis indicates the value of Brad. The vertical axis indicates the correlation coefficient between CS and APDB as measured on a population of sub-optimal alignments similar to the ones in Figure 1. Each dot indicates a correlation coefficient measured on one HOM 9 test set, using the indicated value of Brad. Each HOM 9 test set is an alignment between two sequences whose PDB names are as follows: 17: 2gar versus 1fmt, 18: 1jfl versus 1b74, 33: 1isi versus 11be, 43: 2cev versus 1d3v, 52: 1aq0 versus 1ghs, 63: 2gnk versus 2pii, 71: 1hcz versus 1cfm, 82: 1dvg versus 1qq8, 89: 1k25 versus 1qme.

Sensitivity of APDB to sequence and structure similarity

It is important to verify that the behaviour of APDB remains consistent across a wide range of sequence similarity levels. It is especially important to make sure that when two different alignments of the same sequences are evaluated, the best one (as judged by comparison with the HOMSTRAD reference) always gets the best APDB score. In order to check for this, we used the genetic algorithm to generate sub-optimal alignments for each test case in HOM 9. In each case, we gathered a collection of 250 sub-optimal alignments with CS scores of 0-40%, 41-60%, 61-80% and 81-100%. The CS score measures the agreement between an alignment and its reference in HOMSTRAD.
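The grouping just described, a population of sub-optimal alignments binned by CS score before the APDB scores are averaged per bin, can be sketched as follows. The (cs, apdb) score pairs below are invented stand-ins, not measurements from the paper.

```python
from statistics import mean

# Sketch of the binning behind Figure 3: sub-optimal alignments are grouped
# by their CS score and the APDB score is averaged within each bin.
# `population` holds hypothetical (cs, apdb) score pairs, not real data.

BINS = [(0, 40), (41, 60), (61, 80), (81, 100)]

def average_apdb_by_cs_bin(population):
    """Map each CS bin to the mean APDB score of the alignments it holds."""
    out = {}
    for lo, hi in BINS:
        scores = [apdb for cs, apdb in population if lo <= cs <= hi]
        if scores:
            out[(lo, hi)] = mean(scores)
    return out

population = [(12, 20.0), (35, 30.0), (50, 45.0), (55, 50.0),
              (70, 62.0), (78, 70.0), (90, 85.0), (99, 93.0)]
print(average_apdb_by_cs_bin(population))
```

If APDB tracks alignment quality, the per-bin averages should increase monotonically from the lowest CS bin to the highest, which is exactly the never-crossing-lines pattern reported for Figure 3.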
We then measured the average APDB score in each of these collections. Each of these measures corresponds to a dot in Figure 3, where vertically aligned series of dots correspond to different measures made on the same HOM 9 test set. Figure 3 clearly shows that regardless of the percent identity within the HOM 9 test set being considered, alignments with better CS scores always correspond to a better APDB score (this results in the lines never crossing one another on Fig. 3). We did a similar analysis using the RMSD as measured on the HOMSTRAD alignment in place of sequence identity. The behaviour was the same and clearly indicates that APDB gives consistent results regardless of the structural similarity between the structures being considered.

Fig. 3. Estimation of the sensitivity of APDB to sequence identity. On this graph, each set of vertically aligned dots corresponds to a single HOM 9 test set. The 9 HOM 9 test sets are arranged according to their average identity (17-89%, see Figure 2). Each dot represents the average APDB score of a population of 250 sub-optimal alignments (generated by genetic algorithm) with a similar CS score (binned in four groups representing CS of <40%, 41-60%, 61-80% and 81-100%) generated for one of the 9 HOM 9 test sets.

Suitability of APDB for analysing sub-optimal alignments

Collections of sub-optimal alignments for each of the nine HOM 9 test sets were generated using SAGA and evaluated for their CS scores and APDB scores. These results were pooled and are displayed on the graph shown in Figure 4. This figure indicates good agreement between the CS and the APDB score regardless of the level of optimality within the alignment being considered. It suggests that APDB is informative over the complete range of CS values. It also confirms that APDB is not 'too generous' with sub-optimal alignments.

Fig. 4. Correlation between CS and APDB on the complete HOM 9 test set. Each dot corresponds to a sub-optimal alignment of one of the HOM 9 test cases, generated by genetic algorithm. For each alignment the graph plots the APDB score against its CS counterpart.

We also checked whether sequence alignments that are structurally equivalent obtain similar APDB scores even if they are different at the sequence level. For this purpose, we used PROSUP (Lackner et al., 2000). Given a pair of structures, PROSUP generates several alignments that are equally good from a structure point of view (similar RMSD), but can be very different at the sequence level (different Column Score). We manually identified two such test sets in HOMSTRAD and the results are summarized in Table 1. For each of these two test sets, we selected in the output of PROSUP two alignments (aln1 and aln2) to which PROSUP assigns similar RMSDs.

Table 1. Evaluating PROSUP suboptimal alignments with APDB

Set  St1    St2    ALN   RMSD     CS     APDB
1    1e96B  1a17   aln1  1.45 Å   100.0  80.2
1    1e96B  1a17   aln2  1.50 Å    65.6  80.7
2    1cd8   1qfpa  aln1  2.95 Å   100.0  18.7
2    1cd8   1qfpa  aln2  2.95 Å    55.1  17.9

Set indicates the test set index, St1 and St2 indicate the two structures being aligned by PROSUP, ALN indicates the alignment being considered, RMSD shows the RMSD associated with this alignment, CS indicates its CS score, with the CS score of aln1 alignments being set to 100 because they are used as references. APDB indicates the APDB score.

In both test sets, using aln1 as a reference for the CS measure leads to the conclusion that aln2 is mostly incorrect (cf. CS column of Table 1). This is not true, since these alignments are structurally equivalent as indicated by their RMSDs. In such a situation, APDB behaves much more appropriately and gives each couple aln1/aln2 scores that are nicely consistent with their RMSD, thus indicating that APDB can equally well reward two suboptimal alignments when these are equivalent from a structural point of view.
aln1 is used as a reference and therefore gets a CS score of 100, while the CS score of the second alignment (aln2) is computed by direct comparison with its aln1 counterpart.

Using APDB to benchmark alignment methods

Table 2 shows the average CS and APDB scores for the test sets in each of the four Bali 91 categories being considered here and in HOM 43. The highest scores in all cases, for both measures, come from the reference column (the last column). This is desirable providing the reference alignments really are consistent with the underlying structures. If we now compare the columns two by two, we find that every variation on CS from one column to another agrees with the corresponding variation of APDB. For instance in row 1 (Bali 91 Ref1), when T-Coffee/CS is lower than ClustalW/CS, T-Coffee/APDB is also lower. This observation is true for the whole table, regardless of the pair of results being considered. When considering the 134 alignments one by one, this observation remains true in more than 70% of the cases.

Table 2. Correlation between APDB and CS on BaliBase and Homstrad

Set      N    ClustalW       T-Coffee       Reference
              CS     APDB    CS     APDB    CS     APDB
B91 R1   35   70.1   59.9    67.7   58.3    100    64.7
B91 R2   23   32.7   26.6    33.9   47.1    100    55.2
B91 R3   22   46.4   38.5    48.6   46.9    100    53.2
B91 R4   11   52.0   59.5    52.5   64.5    100    65.7
H43      43   35.4   60.2    38.9   61.6    100    72.9

Set indicates the test set being considered, either one of the BaliBase 91 references (B91 R#) or HOM 43 (H43), a subset of HOMSTRAD. N indicates the number of test alignments in this category. ClustalW indicates a set of measures made on alignments generated with ClustalW. T-Coffee indicates similar measures made on T-Coffee generated alignments. Reference indicates measures made on the reference alignments as provided in BaliBase or in Homstrad. CS columns are the Column Score measures, while APDB columns are similar measures made using APDB.

CONCLUSION

This work introduces APDB, a novel method that makes it possible to evaluate the quality of a sequence alignment when two or more tertiary structures of the sequences it contains are available. This method does not require a reference alignment and it does not depend on any complex procedure such as structure superposition or sequence alignment. We show here that the sensitivity of APDB is comparable with that of CS, a well-established measure that compares a target alignment with a reference alignment. Our results also indicate that APDB can discriminate better than CS between structurally correct sub-optimal sequence alignments and structurally incorrect sequence alignments, even when the structures being considered are distantly related.

Apart from the cost associated with their assembly, a serious problem with reference alignments is that they need to be annotated to remove from the evaluation regions that correspond to non-superposable portions of the structures. This is necessary because otherwise these regions (whose alignment cannot be trusted) will bias a CS evaluation toward rewarding the arbitrary alignment conformation displayed in the reference. Table 2 illustrates well the fact that such an annotation is not necessary in APDB. In our measure, thanks to the combination of local evaluation and the absence of a reference alignment, the only possible effect of non-superposable regions is to decrease the proportion of residues found aligned in a structurally optimal sequence alignment, thus yielding scores lower than 100 in the case of distantly related structures.

A key advantage of APDB is its simplicity. It only requires three parameters and a few PDB files. Most importantly, APDB does not require any arbitrary manual intervention such as the assembly of a reference alignment. In the short term, all the existing collections of reference alignments could easily be integrated and extended with APDB. In the longer term, APDB could also be used to evaluate and compare existing collections of alignments such as profiles, when structures are available.

REFERENCES

Barton,G.J. and Sternberg,M.J.E. (1987) A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol., 198, 327-337.
Cristobal,S., Zemla,A., Fischer,D., Rychlewski,L. and Elofsson,A. (2001) A study of quality measures for protein threading models. BMC Bioinformatics, 2, 5.
Duret,L. and Abdeddaim,S. (2000) Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. In Higgins,D. and Taylor,W. (eds), Bioinformatics, Sequence, Structure and Databanks. Oxford University Press, Oxford.
Karplus,K. and Hu,B. (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 17, 713-720.
Lackner,P., Koppensteiner,W.A., Sippl,M.J. and Domingues,F.S. (2000) ProSup: a refined tool for protein structure alignment. Protein Eng., 13, 745-752.
Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469-2471.
Notredame,C. (2002) Recent progress in multiple sequence alignments. Pharmacogenomics, 3, 131-144.
Notredame,C. and Higgins,D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24, 1515-1524.
Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel algorithm for multiple sequence alignment. J. Mol. Biol., 302, 205-217.
Pascarella,S. and Argos,P. (1992) A data bank merging related protein structures and sequences. Protein Eng., 5, 121-137.
Siew,N., Elofsson,A., Rychlewski,L. and Fischer,D. (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776-785.
Thompson,J., Higgins,D. and Gibson,T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
Thompson,J., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics, 15(1), 87-88.

BIOINFORMATICS Vol. 14 no. 5 1998, pages 407-422

COFFEE: an objective function for multiple sequence alignments

Cédric Notredame 1, Liisa Holm 1 and Desmond G. Higgins 2

1 EMBL Outstation - The European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1SD, UK and 2 Department of Biochemistry, University College, Cork, Ireland

Received on January 19, 1998; revised and accepted on February 24, 1998

Abstract

Motivation: In order to increase the accuracy of multiple sequence alignments, we designed a new strategy for optimizing multiple sequence alignments by genetic algorithm. We named it COFFEE (Consistency based Objective Function For alignmEnt Evaluation). The COFFEE score reflects the level of consistency between a multiple sequence alignment and a library containing pairwise alignments of the same sequences.

Results: We show that multiple sequence alignments can be optimized for their COFFEE score with the genetic algorithm package SAGA. The COFFEE function is tested on 11 test cases made of structural alignments extracted from 3D_ali. These alignments are compared to those produced using five alternative methods. Results indicate that COFFEE outperforms the other methods when the level of identity between the sequences is low. Accuracy is evaluated by comparison with the structural alignments used as references. We also show that the COFFEE score can be used as a reliability index on multiple sequence alignments. Finally, we show that given a library of structure-based pairwise sequence alignments extracted from FSSP, SAGA can produce high-quality multiple sequence alignments. The main advantage of COFFEE is its flexibility.
With COFFEE, any method suitable for making pairwise alignments can be extended to making multiple alignments.
Availability: The package is available along with the test cases through the WWW: http://www.ebi.ac.uk/∼cedric
Contact: cedric.notredame@ebi.ac.uk

Introduction

Multiple alignments are among the most important tools for analysing biological sequences. They can be useful for structure prediction, phylogenetic analysis, function prediction and polymerase chain reaction (PCR) primer design. Unfortunately, accurate multiple alignments may be difficult to build. There are two main reasons for this. First of all, it is difficult to evaluate the quality of a multiple alignment. Secondly, even when a function is available for the evaluation, it is algorithmically very hard to produce the alignment having the best possible score (optimal alignment). Cost functions or scoring functions roughly fall into two categories. First of all, there are those that rely on a substitution matrix. These are the most widely used. They require a substitution matrix (Dayhoff, 1978; Henikoff and Henikoff, 1992) that gives a score to each possible amino acid substitution, a set of gap penalties that gives a cost to deletions/insertions (Altschul, 1989), and a set of sequence weights (Altschul et al., 1989; Thompson et al., 1994b). Under this scheme, an optimal multiple alignment is defined as the one having the lowest cost for substitutions and insertions/deletions. One of the most widely used scoring methods of this type is the 'weighted sums of pairs with affine (or semi-affine) gap penalties' (Altschul and Erickson, 1986). The main limitation of these scoring schemes is that they rely on very general substitution matrices, usually established by statistical analysis of a large number of alignments. These may not necessarily be adapted to the set of sequences one is interested in.
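As an illustration of this first category of scoring scheme, a minimal sum-of-pairs scorer can be sketched as follows. This is an illustrative sketch only: a toy substitution dictionary and a simple linear gap cost stand in for a real substitution matrix and for the affine penalties discussed above.

```python
def sum_of_pairs(msa, subst, gap=-4):
    """Score an MSA column by column: every pair of rows contributes a
    substitution score, or a gap penalty when one residue faces a gap.
    Columns where both residues are gaps contribute nothing."""
    total = 0
    nseq = len(msa)
    for col in zip(*msa):                  # iterate over alignment columns
        for i in range(nseq - 1):
            for j in range(i + 1, nseq):
                a, b = col[i], col[j]
                if a == '-' and b == '-':
                    continue               # gap-gap pairs are not scored
                if '-' in (a, b):
                    total += gap           # residue facing a gap
                else:
                    total += subst[tuple(sorted((a, b)))]
    return total
```

Note that this version charges each gap position independently (a linear, not affine, cost), which is the main simplification relative to the schemes cited in the text.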
To compensate for this drawback, a second type of scoring scheme was designed: profiles (Gribskov et al., 1987) and Hidden Markov Models (HMMs) (Krogh and Mitchison, 1995). Profiles allow the design of a sequence-specific scoring scheme that will take into account patterns of conservation and substitution characteristic of each position in the multiple alignment of a given family. To some extent, HMMs can be regarded as generalized profiles (Bucher and Hofmann, 1996). In HMMs, sequences are used to generate statistical models. The sequences of interest are then aligned to the model one after another to generate the multiple sequence alignment. The main drawback of HMMs is that to be general enough, the models require large numbers of sequences. However, this can be partially overcome by incorporating in the model some extra information such as Dirichlet mixtures (the equivalent of a substitution matrix in an HMM context) (Sjolander et al., 1996). Whatever scoring scheme one wishes to use, the optimization problem may be difficult. There are two types of optimization strategies: the greedy ones that rely on pairwise alignments and those that attempt to align all the sequences simultaneously. The main tool for making pairwise alignments is an algorithm known as dynamic programming (Needleman and Wunsch, 1970), which is often used for optimizing the sums of pairs. The complexity of the algorithm makes it hard to apply to more than two sequences (or two alignments) at a time. Nevertheless, it allows greedy progressive alignments as described by Feng and Doolittle (1987) or Taylor (1988). In such a case, the sequences are aligned in an order imposed by some estimated phylogenetic tree. The alignment is called progressive because it starts by aligning together closely related sequences and continues by aligning these alignments two by two until the complete multiple alignment is built.
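The dynamic programming algorithm at the heart of these pairwise steps can be sketched as follows: a minimal Needleman–Wunsch global aligner with a toy match/mismatch scheme and linear gap penalties (real packages use substitution matrices and affine gaps; the function name and parameters are illustrative).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment by dynamic programming."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ai, bi, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            ai.append(a[i - 1]); bi.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ai.append(a[i - 1]); bi.append('-'); i -= 1
        else:
            ai.append('-'); bi.append(b[j - 1]); j -= 1
    return ''.join(reversed(ai)), ''.join(reversed(bi))
```

A progressive method repeats this step up the guide tree, aligning alignments rather than single sequences as it goes.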
Some of the most widely used multiple sequence alignment packages like ClustalW (Thompson et al., 1994a), Multal (Taylor, 1988) and Pileup (Higgins and Sharp, 1988) are based on this algorithm. They have the advantage of being fast and simple, as well as reasonably sensitive. Their main drawback is that mistakes made at the beginning of the procedure are never corrected and can lead to misalignments due to the greediness of the strategy. It is to avoid this pitfall that the second type of method has been designed. These methods mostly involve aligning all the sequences simultaneously. For the sums of pairs, this is a difficult problem that has been shown to be NP-complete (Wang and Jiang, 1994). However, using the Carrillo and Lipman (1988) algorithm implemented in the Multiple Sequence Alignment program MSA (Lipman et al., 1989), one can simultaneously align up to 10 sequences. Other global alignment techniques using the sums of pairs cost function involve the use of stochastic heuristics such as simulated annealing (Ishikawa et al., 1993a; Godzik and Skolnick, 1994; Kim et al., 1994), genetic algorithms (Ishikawa et al., 1993b; Notredame and Higgins, 1996) or iterative methods (Gotoh, 1996). Simulated annealing can also be used to optimize HMMs (Eddy, 1995). The stochastic methods have two main advantages over the deterministic ones. First of all, they have a lower complexity. This means that they do not have strong limitations on the number of sequences to align or on the length of these sequences. Secondly, these methods are more flexible regarding the objective function they can use. For instance, MSA is restricted to an approximation of the sums of pairs using semi-affine gap penalties (Lipman et al., 1989) instead of the natural ones shown to be biologically more realistic (Altschul, 1989). This is not the case with simulated annealing (Kim et al., 1994). The main drawback of stochastic methods is that they do not guarantee optimality.
However, in some previous work, we showed that with the Sequence Alignment Genetic Algorithm (SAGA), results similar to MSA could be obtained (Notredame and Higgins, 1996). We also showed that the package was able to handle test cases with sizes much beyond the scope of MSA. The robustness of SAGA as an optimizer was confirmed by results obtained on a different objective function for RNA alignment (Notredame et al., 1997) and motivated our choice to use SAGA for optimizing the new objective function described here. The main argument for aligning all the sequences simultaneously instead of making a greedy progressive alignment is that using all the available information should improve the final result. However, one limitation of such methods is that regions of low similarity may induce some noise that will weaken the signal of the correct alignment (Morgenstern et al., 1996). In order to avoid this, one would like a scheme that filters some of the initial information and allows its global use. The approach we propose here is an attempt to do so. The underlying principle is to generate a set of pairwise alignments and look for consistency among these alignments. In this case, we define the optimal multiple alignment as the most consistent one and produce it using the SAGA package. The idea of using the consistency information in a multiple sequence alignment context is not new (Gotoh, 1990; Vingron and Argos, 1991; Kececioglu, 1993). In his scheme, Gotoh (1990) proposed the identification of regions that are fully consistent among all the pairwise alignments. These regions are used as anchor points in the multiple alignment, in order to decrease complexity. A similar strategy was described by Vingron and Argos (1991), allowing the computation of a multiple alignment from a set of dot matrices. 
Although very interesting, these methods had several pitfalls, including a sensitivity to noise (especially when some sequences are highly inconsistent with the rest of the set) and a high computational complexity. The work of Kececioglu (1993) bears a stronger similarity to the method we propose here. Kececioglu directly addressed the problem of finding a multiple alignment that has the highest level of similarity with a collection of pairwise alignments. Such an alignment is named a 'maximum weight trace alignment' (MWT), and its computation was shown to be NP-complete. An optimization method was also described, based on dynamic programming and limited to a small number of sequences (six at most). More recently, a method was described that allows the construction of a multiple alignment using consistent motifs identified over the whole set of sequences by a variation of the dynamic programming algorithm (Morgenstern et al., 1996). This algorithm should be less sensitive to noise than the one described by Vingron and Argos, but its main drawback is that it relies on a greedy strategy for assembling the multiple alignment. An important aspect of multiple sequence alignment that is often overlooked is the estimation of reliability. Since all the alignment scoring functions available are known to be intrinsically inaccurate, identifying the biologically relevant portions of a multiple alignment may be more important than increasing the overall accuracy of this alignment. A few techniques have been proposed to identify accurately aligned positions in pairwise (Vingron and Argos, 1990; Mevissen and Vingron, 1996) and multiple sequence alignments (Gotoh, 1990; Rost, 1997). We show here that our method allows a reasonable estimation of the local reliability of a multiple alignment.
The measure we use for reliability is in fact very simple and could easily be extended much further to incorporate other methods such as the one described by Mevissen and Vingron (1996).

Methods

The overall approach relies on the definition of an objective function (OF) describing the quality of multiple protein sequence alignments. Given a set of sequences and an 'all-against-all' collection of pairwise alignments of these sequences (library), the score of a multiple sequence alignment is defined as the measure of its consistency with the library. This objective function was optimized with the SAGA package. Sets of sequences with a known structure and for which a multiple structural alignment is available were extracted from the 3D_ali database (Pascarella and Argos, 1992) and used in order to validate the biological relevance of the new objective function. Two other test cases were designed using the DALI server (Holm and Sander, 1996a) and aligned using libraries made of structural pairwise alignments extracted from the FSSP database (Holm and Sander, 1993).

Objective function

The OF is a measure of quality for multiple sequence alignments. Ideally, the better its score, the more biologically relevant the multiple alignment. The method proposed here requires two components: (i) a set of pairwise reference alignments (library); (ii) the OF that evaluates the consistency between a multiple alignment and the pairwise alignments contained in the library. We named this objective function COFFEE (Consistency based Objective Function For alignmEnt Evaluation).

Creation of the library

A library is specific for a given set of sequences and is made of pairwise alignments. Taken together, these alignments should contain at least enough information to define a multiple alignment of the sequences in the set. In practice, given a set of N sequences, we included in the library a pairwise alignment for each of the (N² − N)/2 possible pairs of sequences.
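A minimal sketch of this library construction. The names `build_library` and `align_pair` are illustrative, not part of the package: `align_pair` stands in for whatever pairwise method is plugged in (ClustalW runs, structural alignments extracted from FSSP, etc.) and is assumed to return two equal-length gapped strings.

```python
from itertools import combinations

def build_library(seqs, align_pair):
    """One pairwise alignment per pair of sequences: (N^2 - N)/2 entries.

    seqs:       list of N ungapped sequences
    align_pair: callable returning a (gapped_a, gapped_b) tuple"""
    library = {}
    for i, j in combinations(range(len(seqs)), 2):
        library[(i, j)] = align_pair(seqs[i], seqs[j])
    return library
```

As the text notes, the time to build such a library grows quadratically with N, since every pair is aligned once.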
This choice is arbitrary, since in theory there is no limit regarding the amount of redundancy one can incorporate into a library. For instance, instead of each pair of sequences being represented by a single pairwise alignment, one could use several alternative alignments of this pair, obtained by various methods. In fact, the library is mostly an interface between any method one can invent for generating pairwise alignments, and the COFFEE function optimized by SAGA. However, the method follows the rule 'garbage in/garbage out', and the overall properties of the COFFEE function will most likely reflect the properties of the method used to build the library. The amount of time it takes to build the library depends on the alignment method used and increases quadratically with the number of sequences. Inside the evaluation algorithm, the library is stored in a look-up table. If each pair of sequences is represented only once, the amount of memory required for the storage increases quadratically with the number of alignments and linearly with their length. For the analyses presented here, two types of libraries were built. The first one relies on ClustalW. Given a set of N sequences, each possible pair of sequences was aligned using ClustalW with default parameters. The collection of output files obtained that way constitutes the library (ClustalW library). The motivation for using ClustalW as a pairwise method stems from the fact that Clustal uses local gap penalties, even for two sequences. In order to show that COFFEE is not dependent on the method used to construct the library, a second category of library was created using the FSSP database (Holm and Sander, 1996b). FSSP is a database containing all the PDB structures aligned with one another in a pairwise manner. For each test case, a set of sequences was chosen and the (N² − N)/2 pairwise structure alignments involving these sequences were extracted from the FSSP database to construct an FSSP library.
We also used as references the multiple alignments contained in FSSP. An FSSP entry is always based around a guide structure to which all the other structures are aligned in a pairwise manner. This collection of pairwise alignments can be regarded as a pairwise-based multiple alignment. This means that if one is interested in a set of N protein structures, FSSP contains the N corresponding pairwise-based multiple alignments, each using one structure of the set as a guide. Generally speaking, these N multiple alignments do not have to be consistent with one another, but only consistent with the subset of the pairwise alignments that was used to produce them.

Evaluation procedure: the COFFEE function

Let us assume an alignment of N sequences and an appropriate library built for this set. Evaluation is made by comparing each pair of aligned residues (i.e. two residues aligned with each other or a residue aligned with a gap) observed in the multiple alignment to those present in the library (Figure 1). In such a comparison, residues are identified by their position in the sequence (gaps are not distinguished from one another). In the simplest scheme, the overall consistency score is equal to the number of pairs of residues present in the multiple alignment that are also found in the library, divided by the total number of pairs observed in the multiple sequence alignment. This measure gives an overall score between 0 and 1. The maximum value a multiple alignment can have depends on the library. For the optimal score to be 1, all the alignments in the library need to be compatible with one another (e.g. when all the pairwise alignments have been extracted from the same multiple sequence alignment or when the sequences are almost identical). In practice, this scheme needs extra readjustments to incorporate some important properties of the sequence set.
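A minimal sketch of this basic, unweighted consistency count (names and data structures are illustrative, not the SAGA implementation). The library is represented here as a dict mapping each sequence pair to the set of residue-index pairs it aligns; for brevity only residue–residue pairs are counted, whereas the paper's scheme also considers residue–gap pairs.

```python
def consistency_score(msa, library):
    """Fraction of residue pairs in the MSA that are also aligned in the
    library (the simplest scheme, before weighting).

    msa:     list of equal-length gapped strings
    library: {(i, j): set of (x, y) residue-index pairs, i < j}"""
    shared = total = 0
    nseq = len(msa)
    for i in range(nseq - 1):
        for j in range(i + 1, nseq):
            xi = yi = 0               # residue indices in sequences i and j
            for a, b in zip(msa[i], msa[j]):
                if a != '-' and b != '-':
                    total += 1
                    if (xi, yi) in library[(i, j)]:
                        shared += 1
                if a != '-':
                    xi += 1
                if b != '-':
                    yi += 1
    return shared / total if total else 0.0
```

With a fully compatible library this returns 1.0; any disagreement between the MSA and the library pulls the score toward 0.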
For instance, the significance of the information content of each pairwise alignment is not identical. Several schemes have been described in the literature for weighting sequences according to the amount of information they bring to a multiple alignment (Altschul et al., 1989; Sibbald and Argos, 1990; Vingron and Sibbald, 1993; Thompson et al., 1994a). In COFFEE, our main concern was to decrease the amount of noise produced by inaccurate pairwise alignments in the library. To do so, each pairwise alignment in the library is given a weight that is a function of its quality. For this purpose, we used a very simple criterion: the weight of a pairwise alignment is equal to the per cent identity between the two aligned sequences in the library. This may seem counter-intuitive, since weighting schemes are normally used in order to decrease the amount of redundancy in a set of sequences (i.e. down-weighting sequences that have lots of close relatives). Doing so makes sense in the context of profile searches (Gribskov et al., 1987; Thompson et al., 1994b), where it is important to prevent domination of the profile by a given subfamily. However, in the case of multiple sequence alignments made by global optimization, it is more important to make sure that closely related pairs of sequences are correctly aligned, regardless of the background noise introduced by other less related sequences. In such a context, a weight can be regarded as a constraint. The consequence is that the alignment of a given sequence will mostly be influenced by its closest relatives. On the other hand, if a sequence lacks any really close relative, its alignment will mostly be influenced by the consistency of its pairwise alignments with the rest of the library. The COFFEE function can be formalized as follows.
Given N aligned sequences S1 … SN in a multiple alignment, Ai,j is the pairwise projection (obtained from the multiple alignment) of the sequences Si and Sj, LEN(Ai,j) is the length of this alignment, SCORE(Ai,j) is the overall consistency (level of identity) between Ai,j and the corresponding pairwise alignment in the library, and Wi,j is the weight associated with this pairwise alignment. Given these definitions, the COFFEE score is defined as follows:

\[ \mathrm{COFFEE\ score} = \left[ \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j} \times \mathrm{SCORE}(A_{i,j}) \right] \Bigg/ \left[ \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j} \times \mathrm{LEN}(A_{i,j}) \right] \quad (1) \]

with:

\[ \mathrm{SCORE}(A_{i,j}) = \text{number of aligned pairs of residues that are shared between } A_{i,j} \text{ and the library} \quad (2) \]

The COFFEE function presents some similarities with the 'weighted sums of pairs' (Altschul and Erickson, 1986). Here as well, we consider all the pairwise substitutions in the multiple alignment, and weight these in a way that reflects the relationships between the sequences. The library plays the role of the substitution matrix. The main differences between the COFFEE function and the 'weighted sums of pairs' are that (i) no extra gap penalties are applied in our scheme, since this information is already contained in the library, (ii) the COFFEE score is normalized by the value of the maximum score (i.e. its value is between 0 and 1) and (iii) the cost of the substitutions is made position dependent, thanks to the library (i.e. two similar pairs of residues will have potentially different scores if the indices of the residues are different). Under this formulation, an alignment having an optimal COFFEE score will be equivalent to an MWT alignment using a 'pairwise alignment graph' (Kececioglu, 1993). The score defined above is a global measure for an entire alignment. It can also be adapted for local evaluation. We have defined two types of local scores: the residue score and the sequence score. The residue score is given below.
\( S_i^x \) is the residue x in sequence \( S_i \) and \( A_{i,j}^{x,y} \) is the pair of aligned residues \( S_i^x \) and \( S_j^y \) in the pairwise alignment \( A_{i,j} \):

\[ \mathrm{residue\ score}(S_i^x) = \left[ \sum_{j=1,\, j \neq i}^{N} W_{i,j} \times \mathrm{OCCURRENCE}(A_{i,j}^{x,y}) \right] \Bigg/ \left[ \sum_{j=1,\, j \neq i}^{N} W_{i,j} \right] \quad (3) \]

OCCURRENCE(\( A_{i,j}^{x,y} \)) is equal to the number of occurrences of the pair \( A_{i,j}^{x,y} \) in the reference library (0 or 1 when using the libraries described here). The sequence score is the natural extension of the residue score. It is defined as the sum of the score of each residue in a sequence divided by the number of residues in that sequence.

Fig. 1. COFFEE scoring scheme. This figure indicates how a column of an ALIGNMENT is evaluated by the COFFEE function using a REFERENCE LIBRARY. Each pair in the alignment is evaluated (SCORE MATRIX). In the score matrix, a pair receives a score of 0 if it does not appear in the library, or a score equal to the WEIGHT of the pair of sequences in which it occurs in the PAIRWISE LIBRARY. Since the matrix is symmetrical, the column score is equal to the sum of half of the matrix entries, excluding the main diagonal. This value is divided by the maximum score of each entry (i.e. the sum of the weights contained in the library). The residue score is equal to the sum of the entries contained in one line of the matrix, divided by the sum of the maximum score of these entries.

Optimizing an alignment for its COFFEE score: SAGA-COFFEE

The aim is to create an alignment having the best possible COFFEE score (optimal alignment). Doing so is a difficult task. The computational complexity of a dynamic programming solution is known to be NP-complete (Kececioglu, 1993). For reasons discussed in the Introduction, we used SAGA V0.93 (Notredame and Higgins, 1996).

Table 1. Accuracy of the prediction made on the categories of substitution

Test case    Length  Nseq  Proportion    Avg Id.  COFFEE score    Accuracy (H+E) %  Accuracy (ov.) %  CPU time  N.G.
                           (H+E) (%)     (%)      Clustal  SAGA   Clustal  SAGA     Clustal  SAGA     (s)
ac prot      248     14    57            21       0.48     0.56   39.2     50.2     35.2     45.9     21 009    535
binding      500     7     68            31       0.72     0.84   50.0     64.5     50.0     61.7     1003      166
cytc         146     6     43            42       0.84     0.87   89.1     90.7     88.3     86.1     699       259
fniii        136     9     48            17       0.49     0.62   42.0     47.0     35.7     43.6     936       480
gcr          52      8     57            36       0.86     0.89   80.8     83.1     76.7     80.2     91        55
globin       183     17    74            24       0.78     0.80   86.4     85.2     82.1     81.7     28 477    222
igb          194     37    53            24       0.63     0.67   74.8     78.1     65.6     69.4     110 453   132
lzm          213     6     53            39       0.87     0.87   72.2     72.3     72.3     72.4     256       105
phenyldiox   90      8     67            22       0.59     0.65   58.5     64.7     60.4     61.4     388       110
sbt          331     7     57            61       0.96     0.97   96.9     96.9     93.6     93.6     644       127
s prot       229     15    51            27       0.69     0.74   62.5     66.6     57.7     61.2     44 978    744
ceo          882     7     /             14       /        /      /        /        /        /        13 756    882
vjs          1207    8     /             12       /        /      /        /        /        /        43 568    1400

Test case: generic name of the test case, taken from 3D_ali for the first 11 (ac prot: acid proteases; binding: sugar/amino acid binding proteins; cytc: cytochrome c'; fniii: fibronectin type III; gcr: crystallins; globin: globins/phycocyanins/colicins; igb: immunoglobulin fold; lzm: lysozymes/lactalbumin; phenyldiox: dihydroxybiphenyl dioxygenase; sbt: subtilisin; s prot: serine protease fold) and from the DALI server for the last two. ceo includes 1cbg, 1ceo, 1edg, 1byb, 1ghr, 1xyzA; vjs includes 1cnv, 1vjs, 1smd, 2aaa, 1pamA, 2amg, 1ctn, 2ebn. Length: length of the reference alignment. Nseq: number of sequences in the alignment. Proportion (H+E): percentage of the substitutions involving E→E or H→H. Avg Id.: average level of identity between the sequences. COFFEE score: score measured with the COFFEE function using a ClustalW library on the ClustalW or the SAGA alignments. Accuracy (H+E): percentage of the (H+E) substitutions found identical between the SAGA (or ClustalW) alignment and the reference. Accuracy (ov.): percentage of substitutions similar in the SAGA (or ClustalW) alignment and in the reference.
CPU time: CPU time in seconds on an alpha 8300 machine. N.G.: number of generations needed by SAGA to find the solution. The results for the last two test cases are presented in Table 6.

SAGA follows the general principles of genetic algorithms as described by Goldberg (1989) and Davis (1991). In SAGA, a population of alignments evolves through recombination, mutation and selection. This process goes on through a series of cycles known as generations. At every generation, the alignments are evaluated for their score (COFFEE). This score is turned into a fitness measure. In an analogy with natural selection, the fitter an alignment, the more likely it is to survive and produce offspring. From one generation to the next, some alignments will be kept unchanged (statistically the fittest), others will be randomly modified (mutations), combined with another alignment (cross-over) or will simply die (statistically the less fit). The new population generated that way will once again undergo the same chain of events, until the best-scoring alignment cannot be improved for a specified number of generations (typically 100). Operators play a central role in the GA strategy. They can be subdivided into two categories. First the cross-overs, which combine the content of two multiple alignments. Thanks to them, and to the pressure of selection, good blocks tend to be merged into the same alignments. On their own, cross-overs cannot create new blocks; this needs to be done by the second category of operators: the mutations. These are specific programs that take an alignment as input and modify it by inserting or moving patterns of gaps. Mutations can be slightly greedy (attempt to make some local optimization) or completely random. A key concept in the genetic algorithm strategy is that the fitness-based selection is not absolute but statistical. To select an individual, a virtual wheel is created.
On this wheel, each individual is given a number of slots proportional to its fitness. To make a selection, the wheel is spun. Therefore, the best individuals are simply more likely to survive, or to be selected for a cross-over or a mutation. This form of selection protects the GA search from excessive greediness, hence preventing it from converging onto the first local minimum encountered during the search. SAGA V0.93 is mostly similar to Version 0.91 described in Notredame and Higgins (1996). Most of the changes between Versions 0.93 and 0.91 have to do with improvements in the implementation and the user interface, but do not affect the algorithm itself. To optimize the COFFEE scores, SAGA was run using the default parameters described for SAGA 0.91 in Notredame and Higgins (1996). SAGA was also modified so that it could evaluate any alignment (including a ClustalW alignment) using the COFFEE function.

Test cases

To assess the biological accuracy of the COFFEE function and the efficiency of its optimization by SAGA, 13 test cases were designed. They are based on sequences with known three-dimensional structures, for which a structural alignment is available. This choice was guided by the fact that structure-based alignments are usually biologically more correct than any other alternative, especially when they involve proteins with low sequence similarity. For this reason, we used these structure-based alignments as a standard of truth in our analyses. Eleven test cases were extracted from 3D_ali Release 2.0 (Pascarella and Argos, 1992). Alignments were selected according to the following criteria: alignments with more than five structures and with a consensus length larger than 50. In 3D_ali, 18 alignments meet this requirement. Among these, we removed those for which ClustalW produces a multiple alignment >95% identical to its structural counterpart (four alignments).
We also removed three alignments which were impossible to align accurately using ClustalW or SAGA/COFFEE. These consist of sets of very distantly related sequences with unusually long insertions/deletions (barrel, nbd and virus in 3D_ali). These alignments are beyond the scope of conventional global sequence alignment algorithms. This leaves a total of 11 alignments used in our analyses. Their characteristics are shown in Table 1. The last two test cases were extracted from the FSSP database. As opposed to the 11 other test cases, they have been specifically designed for making a multiple sequence alignment using a structure-based reference library. This explains their low level of average sequence identity, as can be seen in Table 1 (the last two entries, vjs and ceo).

Evaluation of the COFFEE function accuracy

When evaluating a new OF with SAGA, two main issues are involved: the quality of the optimization and the biological relevance of the optimal alignment. Another aspect involves the comparison of the new OF with already existing methods. The evaluation of the biological relevance of the COFFEE function required the use of some references. The structural alignments described above were used for this purpose. Comparison between a sample alignment and its reference was made following the protocol described in Notredame and Higgins (1996), inspired by the method used by Vogt et al. (1995) for substitution matrix comparison and by Gotoh (1996). All the pairs (excluding gaps) observed in the sample alignment were compared to those present in the reference. The level of similarity is defined as the number of identical pairs in the two multiple alignments divided by the total number of pairs in the reference. This procedure gives access to an overall comparison.
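A minimal sketch of this comparison (function names are illustrative): collect the residue pairs aligned in each MSA, then score the test alignment as the fraction of reference pairs it reproduces.

```python
def aligned_pairs(msa):
    """Residue pairs (i, x, j, y) aligned in an MSA: sequence indices i < j,
    residue positions x and y. Pairs involving a gap are skipped."""
    pairs = set()
    nseq = len(msa)
    idx = [0] * nseq                       # next residue index per sequence
    for col in zip(*msa):
        resid = [(i, idx[i]) for i in range(nseq) if col[i] != '-']
        for k in range(len(resid) - 1):
            for l in range(k + 1, len(resid)):
                pairs.add(resid[k] + resid[l])   # concatenates to (i, x, j, y)
        for i in range(nseq):
            if col[i] != '-':
                idx[i] += 1
    return pairs

def accuracy(test_msa, ref_msa):
    """Pairs shared with the reference / total pairs in the reference."""
    ref = aligned_pairs(ref_msa)
    return len(aligned_pairs(test_msa) & ref) / len(ref)
```

An identical test and reference give 1.0; shifting a residue against a gap removes its pairs from the intersection and lowers the score accordingly.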
This overall comparison does not reflect the fact that, in a global structural alignment, some positions are not correctly aligned because they cannot be aligned (this is true of any position where the two structures cannot be superimposed). In practice, structural alignment procedures may deal with these situations in different ways, producing sequence alignments that are sometimes locally arbitrary (especially in the loops). While in DALI these regions are explicitly excluded from the alignment, it is not so obvious to identify them in the multiple sequence alignments contained in 3D_ali. To overcome this type of noise, a procedure was designed that should be less affected by misalignments. For this alternative measure of biological relevance, we only take into account substitutions that involve a conservation of secondary structural state in the reference alignment (helix to helix and beta strand to beta strand). In the text and the tables, this category of substitution is referred to as (E+H). In most of the test cases, the (E+H) category makes up the majority of the observed pairs, as can be seen in Table 1.

For each of the first 11 test cases (3D_ali), the evaluation procedure involved making multiple alignments with five different methods (cf. the next section) and a ClustalW library (default parameters). The ClustalW library was used with SAGA to produce a multiple alignment having an optimal COFFEE score. The biological relevance of this alignment was then assessed by comparison with the structural reference, and compared to the accuracy obtained with the other methods on the same sequences. For the last two test cases (vjs and ceo), alignments were made using FSSP libraries. Alignment accuracy was assessed using the DALI scoring measure. Given a pairwise alignment, this is a measure of the quality of the structure superimposition implied by the alignment.
The program used for this purpose (trimdali) returns the DALI score (Holm and Sander, 1993) and two other values: the length of the consensus (the number of residues that could be superimposed) and the average RMS (the average deviation between equivalent Cα atoms). These values were computed for each possible pairwise projection of the multiple alignments and averaged. The scores obtained in this way for the SAGA alignments were compared with similar scores measured on the FSSP pairwise-based multiple alignments.

Comparison of COFFEE with other methods

In total, six alignment methods were used to align the 3D_ali test cases: ClustalW v1.6 (Thompson et al., 1994a); SAGA with the MSA objective function (SAGA-MSA) (Notredame and Higgins, 1996); PRRP (v2.5.1), the iterative alignment method recently described by Gotoh (1996); PILEUP (Higgins and Sharp, 1988) in GCG v9.1; SAM (v2.0), a HMM package (Hughey and Krogh, 1996); and SAGA-COFFEE.

413 C.Notredame, L.Holm and D.G.Higgins

Apart from SAM, all these methods were used with the default parameters provided with the package. In the case of SAM, since it is known that HMMs usually require large sets of sequences in order to estimate a model, we used the Dirichlet mixture regularizer provided in the package, which is meant to compensate for this type of problem. SAGA-MSA is the package previously described (Notredame and Higgins, 1996). Rationale 2 weights (Altschul et al., 1989) were computed using the MSA package (Lipman et al., 1989). When possible, MSA was used on the same sequences as SAGA-MSA in order to control the quality of the optimization. Results were consistent with those previously reported.

Implementation

The COFFEE function and the procedure for building ClustalW pairwise libraries have been implemented in ANSI C. These programs have been integrated in Version 0.93 of the SAGA package, also written in ANSI C. They are available upon request from the authors (http://www.ebi.ac.uk/∼cedric).

Results

Accuracy and complexity of the optimization

Since our approach relies on the ability of SAGA to optimize the COFFEE function, we checked that this optimization was performed correctly. For each test case, a dummy library was created, containing sets of pairwise alignments identical to those observed in the reference multiple structure alignment. In such a case, the structural alignment has a score of 1, since it agrees completely with the library. Therefore, the maximum score that can be reached by SAGA also becomes 1. Since, under these artificial conditions, the score of the optimum is known, we could test the accuracy of SAGA's optimization. Several runs made on the same set reached the optimum value in an average of 5.4 runs out of 10. The lowest reproducibility was found with the largest test cases of Table 1 (igb and s prot, with a score of 1 being reached, respectively, one and two times out of 10 runs). However, even if the optimal score is not reached, we found that it is always possible to produce an alignment with a score better than 0.94. Although they do not constitute a full proof, these results support the assumption that SAGA is a good choice for optimizing the COFFEE function.

An important aspect is the complexity of the program and the factors that influence it. As we previously reported when optimizing the sums of pairs with SAGA (Notredame and Higgins, 1996), establishing the complexity is not straightforward. The evaluation of a COFFEE score is quadratic with the number of sequences and linear with the consensus length. Using a given population size, the time required for one generation will vary accordingly. For instance, on a fast workstation, it takes ∼4 s/generation for the gcr test case and ∼7 min/generation for the igb test case. Unfortunately, establishing the complexity in terms of the number of generations needed to reach a global optimum is much harder. This depends on several factors: the number of sequences, the length of the consensus, the relative similarity of the sequences, the complexity of the pattern of gaps needed for optimality, and the operators used for mutations and cross-overs. Since the pattern of gaps is an unknown factor, it is impossible to predict how many generations will be required for a specific test case. On the other hand, judging from the data in Table 1 (N generations column), it seems that the length of the alignment has a stronger effect than the number of sequences.

Comparison of the COFFEE function with other methods

Multiple alignments were produced with SAGA-COFFEE using ClustalW libraries (best-scoring alignment out of 10 runs). These alignments were compared to the structural references. Multiple alignments of the same sets, generated with five other methods, were also compared to the references in order to evaluate relative performances. Since, in the way it is used here, the COFFEE function depends heavily on ClustalW, special emphasis was given to the comparison of these two methods (Table 1). The results are unambiguous. When considering the overall comparison, nine test cases show an improvement of SAGA over ClustalW (in two of these, the improvement is >10%). The trend is similar when looking only at (E+H) substitutions, where 10 test cases out of 11 present an improvement. In the few cases where it occurs, the degradation caused by SAGA is always <2%. The extent of the observed improvements usually correlates well with the differences in the scores measured with the COFFEE function. Degradation is only observed in the cases where the ClustalW alignment already has a high level of consistency with the reference library (>75%), as can be seen with the globin (COFFEE score of the ClustalW alignment = 0.78) and the cytochrome C (COFFEE score of the ClustalW alignment = 0.84) test cases.

In order to put SAGA-COFFEE in a wider context, comparisons were made using five other methods (Table 2). These results show that in most of the cases SAGA-COFFEE does reasonably well. When its alignment is not the best, it is usually within 3% of the best (except for the binding and gcr test cases, for which the difference is greater). Apart from the HMM method (SAM), which performs poorly, it is relatively hard to rank the existing methods. PRRP is one of the newest methods available. It has been described as being one of the most accurate (Gotoh, 1996) and happens to be the only one that significantly outperforms SAGA-COFFEE on some test cases. It is also interesting to notice that SAGA-COFFEE is always among the best for test cases having a low level of identity. This trend is confirmed by the results shown in Table 3, where the sequences are grouped according to their average similarity with the rest of their family (as measured on the reference structural alignment). In this table, we analysed the overall performance of each method and compared it with SAGA-COFFEE by counting (i) the overall percentage of (E+H) residues correctly aligned and (ii) the number of sequences for which SAGA-COFFEE makes a better (b)/worse (w) alignment than a given method. Overall, the results confirm that SAGA-COFFEE seems to do better than the other methods when the sequences have a low level of identity with the rest of their set. The poor performance of SAM can probably be explained by two factors: the small number of sequences in each test case and perhaps some inadequate default settings in the program (in practice, SAM is often used as an alignment improver rather than on its own).

Table 2. Method comparison on the 3D_ali test cases

Test case    Nseq.  Avg. id. (%)  SAGA-COFFEE (%)  PRRP (%)  ClustalW (%)  PILEUP (%)  SAGA-MSA (%)  SAM (%)
ac prot      14     21            50.2             48.8      39.2          40.9        *51.2         27.9
binding      7      31            64.5             *76.2     50.0          66.6        64.2          36.9
cytc         6      42            90.7             89.4      89.1          *94.6       67.3          67.3
fniii        9      17            *47.0            36.3      42.0          37.8        45.2          16.2
gcr          8      36            83.1             *92.8     80.8          80.8        80.8          85.7
globin       17     24            85.2             *87.0     86.4          72.6        78.0          67.8
igb          37     24            *78.1            74.9      74.8          52.4        70.1          67.2
lzm          6      39            *72.3            71.1      72.2          *72.3       *72.3         55.3
phenyldiox   8      22            *64.7            49.9      58.5          37.4        55.6          45.7
sbt          7      61            96.9             96.7      96.9          *97.4       96.0          90.6
s prot       15     27            66.6             64.3      62.5          57.9        *68.5         61.7

*Indicates the method performing best on a test case.
Test case: generic name of the test case, taken from 3D_ali (see 3D_ali for PDB identifiers); see Table 1 for more details. Nseq.: number of sequences in the alignment. Avg. id.: average level of identity between the sequences. SAGA-COFFEE: accuracy of the alignments obtained with SAGA-COFFEE as judged by comparison with the structural alignment, only considering the (E+H) substitutions. ClustalW: similar but with ClustalW alignments. PRRP: similar but with alignments produced with the Gotoh PRRP algorithm (see the text). PILEUP: pileup method from the GCG package. SAGA-MSA: SAGA using the MSA objective function. SAM: sequence alignment modelling by hidden Markov model.

Table 3. Method comparison on the 3D_ali test cases: global results

Sequence identity  Nseq.  Nres.   SAGA-COFFEE (%)  PRRP (%)  ClustalW (%)  PILEUP (%)  SAGA-MSA (%)  SAM (%)
[00.0–20.0]        28     3424    *63.3            62.2      49.7          42.4        56.9          36.4
[20.0–40.0]        88     12010   *76.2            74.6      66.1          60.2        69.7          59.1
[40.0–100.0]       18     3808    89.3             *90.9     84.6          89.8        87.8          64.3

*Indicates the method performing best on a given range of identity.
Sequence identity: minimum and maximum average identity of the sequences of each category with the rest of their alignment, as measured on the reference structural alignment. Nseq.: number of sequences in a category. Nres.: number of residues. SAGA-COFFEE: percentage of the (E+H) substitutions present in the reference structural alignment observed in the SAGA-COFFEE alignment. ClustalW: (%), similar but using the ClustalW alignment; (b), number of sequences for which SAGA-COFFEE produces a better alignment than ClustalW; (w), number of sequences for which SAGA-COFFEE produces a worse alignment than ClustalW. [Note that the (b) and (w) categories do not necessarily add up to the overall number of sequences because they do not include sequences having the same score with the two methods compared.] PRRP: similar but with alignments produced with the Gotoh PRRP algorithm (see the text). PILEUP: pileup method from the GCG package. SAGA-MSA: SAGA using the MSA objective function. SAM: sequence alignment modelling by hidden Markov model.

Fig. 2. Correlation between sequence score and alignment accuracy. (a) The average level of identity of each sequence with the rest of its alignment was measured on the reference structural alignment. The average level of accuracy of the SAGA-COFFEE alignment of each of these sequences was also estimated on the (E+H) category. The two values are plotted against one another. (b) For each sequence, the sequence score was measured on the SAGA-COFFEE alignment; this value was plotted against the accuracy of the sequence alignment. The coefficient of linear correlation was estimated on these points (r = 0.65).

These results also indicate that there is no such thing as an ideal method. Even if COFFEE seems to do better on average, one can see in Tables 2 and 3 that the alignments it produces are not always the best. In fact, it seems that, depending on the test case, any method can do better than the others. Unfortunately, as discussed by Gotoh (1996), it is hard to discriminate the factors that should guide the choice of a method.
For this reason, being able to identify the correct portions in a multiple sequence alignment may be even more important than being able to produce a very accurate alignment.

Correlation between COFFEE score and alignment accuracy

As mentioned in Methods, the score can be assessed at a local level (sequence score or residue score). One of the benefits of such an evaluation is that local score and accuracy can be correlated, thus allowing the identification of potentially correct portions of an alignment with a known risk of error. The 3D_ali structure-based alignments were used once more to validate this approach. Generally speaking, a high residue score will indicate that the pairs in which a given residue is involved are also found in the pairwise library. On the other hand, if none of the pairings in which a given residue is involved are found in the library, this residue will be considered unaligned. We evaluated the COFFEE score of each sequence in each alignment. In each of these sequences, the (E+H) average accuracy was also measured. The graph in Figure 2b shows the relationship between sequence score and (E+H) average accuracy. The correlation between these two quantities is reasonable (r = 0.65). When considering the values used for this graph, we found that for >85% of the sequences it is possible to predict the actual accuracy of the alignment with a ±10% error rate. In terms of prediction, this is a substantial improvement over what can be obtained when measuring the average level of identity between one sequence and its multiple alignment, as shown in Figure 2a.

Table 4. Average accuracy of the alignment of each sequence as a function of its sequence score (3D_ali test cases)

Sequence score   N. residues (%)        N. sequences (%)       Accuracy (E+H) (%)
                 ClustalW    SAGA       ClustalW    SAGA       ClustalW    SAGA
[0.00–0.33]      5.8         2.6        6.7         3.0        14.3        9.9
[0.33–0.66]      36.8        33.7       40.3        38.1       63.2        67.2
[0.66–1.00]      57.4        63.7       53.0        59.0       82.0        82.5
TOTAL            19 242 residues        134 sequences

Sequence score: minimum and maximum score of the sequences in each category. N. residues: percentage of residues belonging to each category, estimated on the SAGA or ClustalW alignments. N. sequences: percentage of the total sequences belonging to each category of score, as measured on the SAGA and the ClustalW alignments. Accuracy (E+H): accuracy associated with each category of score in the SAGA and ClustalW alignments. TOTAL: total number of residues and sequences in the comparison.

Table 5. Accuracy of the prediction made on the category 5 of substitution

Test case   Accuracy (%)          Correct substitutions (%)
            ClustalW   SAGA       ClustalW   SAGA
ac prot     56.8       68.2       9.6        15.7
binding     64.3       61.4       31.5       40.0
cytc        96.2       93.9       72.1       73.5
fniii       81.5       77.7       13.8       15.6
gcr         75.5       77.4       63.4       74.5
globin      97.2       95.0       63.1       66.5
igb         88.8       85.5       39.0       42.3
lzm         91.5       91.8       61.3       61.5
phenyl      78.0       72.5       34.3       40.0
s prot      82.2       82.4       45.2       50.1
sbt         89.7       89.7       85.2       87.0

Test case: generic name of the test case, taken from 3D_ali. Accuracy: number of category 5 substitutions present in the reference alignment divided by the total number of category 5 substitutions (in the SAGA and ClustalW alignments). Correct substitutions: percentage of the correct substitutions (over the total number, all substitution categories included) identified in the category 5 of substitution.

The correlation between score and accuracy becomes slightly more apparent when looking at the data in a more global way (Table 4). In this case, the sequences have been grouped according to their score, and the accuracy of their alignment was measured. One can see that the higher the score of a sequence, the higher its average alignment accuracy.
We also found that the distribution of the sequences among the three categories changes when using ClustalW instead of SAGA: SAGA produces more sequences with a high score than ClustalW does. This means that not only are SAGA alignments more accurate than ClustalW alignments, it is also possible to identify them as such. In practice, the sequence score, as imperfect as it may seem, provides a fast and simple way to identify sequences that do not really belong to a set, or that are so remotely related to the rest that their alignment should be considered with care.

The sequence score is a global measure. It does not reflect the local variations that occur at the residue level. To analyse these kinds of data and assess their utility for predicting correct portions of an alignment, the score of each residue in each multiple sequence alignment was evaluated using equation (3). These scores were scaled in a range varying from 0 to 9. A residue has a score of 9 if >90% of the pairs in which it is involved are also present in the reference library, and so forth for 8 ([80–90%]), etc. Once residue scores have been evaluated, substitution classes can be defined. For instance, the class 5 of substitutions includes all the residues of a multiple alignment having a residue score greater than or equal to 5 (Figure 3a); the class 0 of substitutions includes all the residues in the alignment. Figure 3a gives an example of such an evaluation. In this alignment, each residue is replaced by its score, and the residues that belong to the category 5 of substitution are boxed. Figure 3b shows the correctly aligned residues in this category. Using our measure, one can thus identify some of the correct portions in the SAGA fniii alignment. As can be gathered from Table 1, fniii is a very demanding test case. Except for the first two sequences, which are almost identical, all the other members of this set have a very low level of identity with one another.
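The scaling and category definitions above can be written down in a few lines. This is a minimal sketch under assumed data shapes (a list of per-residue supported-pair fractions), not the SAGA implementation:

```python
def scaled_residue_score(fraction_supported):
    """Map the fraction of a residue's pairs found in the library
    (0.0-1.0) onto the 0-9 scale used in the text: 9 means >=90%
    of the pairs are present, 8 means 80-90%, and so on."""
    return min(int(fraction_supported * 10), 9)

def substitution_category(fractions, k):
    """Category k of substitution: the positions whose scaled residue
    score is >= k. Category 0 therefore contains every residue."""
    return [pos for pos, f in enumerate(fractions)
            if scaled_residue_score(f) >= k]

# Example: three residues with 95%, 30% and 55% of their pairs supported
# by the library; only the first and third reach category 5.
print(substitution_category([0.95, 0.30, 0.55], 5))  # -> [0, 2]
```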
The sequence 2hft_1 illustrates particularly well both the advantages and the limits of our method. This sequence is not the most remotely related of the set: it has an average identity of 14%, whereas two other members (3hhr_c and 2hft) are more distantly related, with 12% average identity. Despite this fact, 2hft_1 gets the lowest sequence score in the multiple alignment (0.29). This correlates well with the fact that it also has the lowest alignment accuracy of the set [18% overall, 20% for the (E+H) category]. Similarly, the only non-terminal stretch of this sequence that belongs to the category 5 is also one of the few portions to be correctly aligned (Figure 3a and b). The same type of analysis was carried out on the 10 other test cases (Table 5). Our measures indicate that, using the category 5 of substitution, a substantial portion of correctly aligned residues can be identified. When comparing ClustalW and SAGA, we found that more correct residues can be identified with SAGA. This improvement is sometimes achieved at the cost of a slightly lower accuracy (more false positives) in the SAGA alignments.

Fig. 3. Evaluation of the accuracy of the fniii test case (fibronectin type III family). (a) Sequences in the fniii test case were aligned by SAGA-COFFEE using a ClustalW library. The alignment obtained that way was evaluated locally. The sequence names are the PDB identifiers. The suffix _1, _2, etc. indicates that several portions of the same sequence have been used (cf. 3D_ali for further details). In this alignment, each residue has been replaced by its score. The gray boxes indicate all the residues that belong to category 5 of substitution (i.e. having a score ≥5). The sequence score box lists the values measured on each sequence. (b) The accuracy in the category 5 of substitution (boxes) was evaluated by comparison with the reference alignment. Residues shadowed in gray in the boxes are correctly aligned to one another. Boxed residues not shadowed are not correctly aligned with each other or with the rest of the category 5 residues. Residues not contained in the boxes are not taken into account for this evaluation.

A global estimation was made to evaluate the accuracy that can be expected when using any of the 10 substitution categories on a SAGA alignment. The proportion of correct substitutions predicted that way was also measured. These results are presented in Figure 4a and b, respectively. Residues are grouped in three classes, depending on the score of the sequences in which they occur. Figure 4a confirms that high-scoring residues are usually correctly aligned (high accuracy). However, the higher the substitution category, the smaller the number of residues on which a prediction can be made, as shown in Figure 4b. These graphs confirm that the residue score can be used as a measure for predicting accuracy; they also indicate that the sequence score is informative when making a prediction on a residue.

Making a multiple structural alignment

The analysis carried out with the ClustalW libraries represents only one possible application of the COFFEE function. Generally speaking, the COFFEE scheme allows the combination of the information contained in any reference library, regardless of the method used for its construction. To illustrate this fact, we show that it is possible to build a structure-based multiple sequence alignment when a library of high-quality pairwise structural alignments is available. We used COFFEE on two sets of proteins (vjs and ceo) using appropriate FSSP libraries. It was impossible to improve significantly over FSSP for the ceo test case, made of endoglucanases and other related carbohydrate degradation enzymes.
This can be explained by the fact that the FSSP alignment with the best DALI score (the one using 1ceo as a guide) already has a high level of consistency with the library (COFFEE score = 0.82). This shows quite clearly in the fact that this alignment is 88% similar to the SAGA-COFFEE one. The second set (vjs) is made of amylases and other carbohydrate degradation enzymes. Table 6 compares the SAGA-COFFEE alignment of these sequences with the corresponding FSSP pairwise-based multiple alignments. These results clearly indicate that the alignment produced by SAGA is better than any of the FSSP multiple alignments, regardless of the criterion used to evaluate this improvement (DALI score, consensus length or RMS). This result was to be expected, since SAGA makes use of much more information than any of the FSSP alignments. In Table 6, entries are sorted according to the DALI score. This allows one to see that the DALI and COFFEE scores correlate well for the FSSP alignments, and supports the idea that the COFFEE score is also a good indicator of alignment quality when the library is based on structural alignments.

Fig. 4. Prediction of correctly aligned residues using the residue COFFEE score. (a) The accuracy of the alignments (number of correct substitutions in one of the categories divided by the total number of substitutions in this category) of each sequence was measured. To do so, sequences were divided into three groups, depending on their sequence score. The graph was made for each of the three groups. (b) For each sequence, the number of correct substitutions contained in each category was evaluated and divided by the overall number of substitutions involving that sequence. This value was plotted against the category of substitution.

Table 6. Comparison of FSSP and SAGA multiple alignments

Guide sequence  Average DALI score  Average consensus length  Average RMS (Å)  COFFEE score
2ebn            1152.6              186.5                     3.73             0.53
1cnv            1205.2              196.4                     3.63             0.59
1vjs            1258.4              198.8                     3.62             0.50
1ctn            1331.2              196.9                     3.53             0.60
1smd            1667.1              219.4                     3.40             0.65
2amg            1672.9              217.7                     3.42             0.67
2aaa            1766.8              224.9                     3.45             0.69
1pamA           1786.3              225.8                     3.30             0.70
SAGA-COFFEE     1860.0              230.9                     3.20             0.79

Guide sequence: sequence used as a guide in the FSSP multiple alignment (SAGA-COFFEE indicates the alignment obtained with SAGA-COFFEE). Average DALI score: average DALI score of each pair of sequences in the alignment. The table is sorted according to the values in this column. Average consensus length: average number of residues superimposable by DALI in each pair of sequences. Average RMS: average of the RMS values measured by DALI on each pair of the alignment, in Ångströms. COFFEE score: score given by SAGA to the multiple alignment using the same library.

In theory, we could have used the DALI score as an objective function and optimized it with SAGA. In such a context, DALI would be used to evaluate all the pairwise projections in order to give a score similar to the one shown in the 'DALI score' column of Table 6. However, this is not possible in practice, because the computation of a DALI score is much more expensive than the evaluation of a COFFEE score. The DALI score of a multiple alignment is quadratic with the number of sequences and quadratic with the length of the alignment. The COFFEE score is also quadratic with the number of sequences, but only linear with the length of the alignment.
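To make the trade-off concrete, here is a back-of-the-envelope sketch. The operation counts are illustrative orders of magnitude only, not actual DALI or COFFEE code, and the example sizes are arbitrary:

```python
def pairwise_projections(nseq):
    # Both scores are evaluated over every pairwise projection of the
    # multiple alignment, hence the quadratic dependence on nseq.
    return nseq * (nseq - 1) // 2

def dali_operations(nseq, length):
    # DALI compares intra-molecular distance matrices, so each pairwise
    # projection costs on the order of length**2 operations.
    return pairwise_projections(nseq) * length ** 2

def coffee_operations(nseq, length):
    # COFFEE only checks each aligned pair of columns against the
    # library, so each projection costs on the order of length operations.
    return pairwise_projections(nseq) * length

# For an alignment of 8 sequences and 230 columns (roughly the size of
# the vjs consensus), the gap is a factor of the alignment length itself:
print(dali_operations(8, 230) // coffee_operations(8, 230))  # -> 230
```

Inside a genetic algorithm that re-scores thousands of candidate alignments per generation, this per-evaluation factor is what makes a DALI-based objective function impractical.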
In consequence, even if we were to assume a global DALI score to be biologically more realistic than the FSSP library-based COFFEE score, COFFEE still appears to be a good trade-off between approximating DALI and saving on computational cost.

Discussion

In this work, we show that alignments can be evaluated for their maximum weight trace (MWT) score using the COFFEE function and subsequently optimized with the genetic algorithm package SAGA. Given a heterogeneous, non-consistent collection of pairwise alignments, one can extract the corresponding multiple alignment with COFFEE and SAGA. We have shown here that the SAGA-COFFEE scheme outperforms most of the commonly used alternative packages when applied to sequences having low levels of identity. The comparison made with other global optimization techniques, such as SAGA-MSA and PRRP, indicates that the method is better not only because it performs a global optimization, but also probably because of the way it uses information, filtering some of the noise through the library of pairwise alignments. The weighting scheme also plays a role in this improvement. It helps turn the relationships between the sequences into some of the constraints that define the optimal alignment. It is because all these constraints (library and weights) are unlikely to be consistent that the genetic algorithm strategy proves to be a very appropriate means of optimization. There is little doubt that the performance of our method will also depend on the relationships between the sequences. Sets with many intermediate sequences (i.e. a dense phylogenetic tree) are likely to lead to more accurate alignments. However, the fact that COFFEE proves able to deal with sequences having a very low level of identity is quite encouraging regarding the robustness of the method. One of the main advantages of the COFFEE strategy is the flexibility given to the user for defining the library.
Here, by using two completely different pairwise alignment methods, we managed to produce high-quality multiple alignments in both cases. This is interesting, but constitutes only a first step. The structure of the libraries we have been using is very simple. They only rely on an 'all-against-all' comparison using one type of pairwise alignment algorithm per library. In practice, this scheme can easily be extended to much more complex library structures. It is common sense to have a higher confidence in results that can be reproduced using independent methods. Some prediction methods rely on this type of assumption, such as the block definition strategy described by Henikoff et al. (1995). These methods usually limit themselves to identifying highly conserved patterns among a set of solutions. With the COFFEE strategy, we go much further and make it possible to find a consensus solution whatever the number of constraints and whatever their relative compatibility. Of course, it is not enough for a solution to exist; one also needs to know how accurate this solution is. In this work, we have shown that the level of consistency of a solution is a good indicator of such accuracy. This accuracy prediction constitutes the other main aspect of the COFFEE function. Several methods have been proposed that attempt to predict the correct portions of a pairwise alignment given a set of sub-optimal alignments (Gotoh, 1990; Vingron and Argos, 1990; Mevissen and Vingron, 1996). Using these methods, libraries could be designed with large numbers of sub-optimal alignments. Here again, the difference between the COFFEE method and previously proposed approaches is that not only is it possible to predict the correct portions of an alignment, it is also possible to optimize a multiple alignment to contain as many reliable regions as possible. SAGA-COFFEE still needs to be improved on several accounts.
For instance, further approaches will involve the definition of more complex libraries that will hopefully lead to more meaningful consistency indices. The main source of inspiration when doing so will be the work done on pairwise alignment stability (Mevissen and Vingron, 1996). The other direction we plan to take has to do with the combination of scoring schemes. We have seen here that there is no uniform solution to the multiple sequence alignment problem. For this reason, it would make sense to generate libraries containing alternative alignments made by all the available methods (PRRP, ClustalW, HMM, etc.). COFFEE could then be used to merge this information and hopefully extract the best of each alignment. This will require some improvement of the COFFEE function and its adaptation to highly redundant libraries. Another crucial aspect will be increasing the efficiency of the algorithm. At present, SAGA-COFFEE is an extremely slow method; however, we hope to improve on this by using a more appropriate type of seeding. Finally, another important aspect of our approach will involve the refinement of the method used here for building multiple structural alignments. The project will be based on a procedure similar to the one described above: the design of more efficient weights and an attempt to use the alternative structural alignments that can be produced by the DALI method, using a wider range of DALI test cases.

Acknowledgements

The authors wish to thank Miguel Andrade and Thure Etzold for very useful comments and corrections. They also wish to thank the referees for their useful remarks and interesting suggestions.

References

Altschul,S.F. (1989) Gap costs for multiple sequence alignment. J. Theor. Biol., 138, 297–309.
Altschul,S.F. and Erickson,B.W. (1986) Optimal sequence alignment using affine gap costs. Bull. Math. Biol., 48, 603–616.
Altschul,S.F., Carroll,R.J. and Lipman,D.J.
(1989) Weights for data related by a tree. J. Mol. Biol., 207, 647–653.
Bucher,P. and Hofmann,K. (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, St Louis, MO.
Carrillo,H. and Lipman,D.J. (1988) The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48, 1073–1082.
Davis,L. (1991) The Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York.
Dayhoff,M.O. (1978) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC.
Eddy,S.R. (1995) Multiple alignment using hidden Markov models. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England. AAAI Press, Menlo Park, CA.
Feng,D.-F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360.
Godzik,A. and Skolnick,J. (1994) Flexible algorithm for direct multiple alignment of protein structures and sequences. Comput. Applic. Biosci., 10, 587–596.
Goldberg,D.E. (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, New York.
Gotoh,O. (1990) Consistency of optimal sequence alignments. Bull. Math. Biol., 52, 509–525.
Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838.
Gribskov,M., McLachlan,M. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.
Henikoff,S., Henikoff,J., Alford,W. and Pietrokovsky,S. (1995) Automated construction and graphical representation of blocks from unaligned sequences.
Gene, 163, GC17–26.
Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.
Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138.
Holm,L. and Sander,C. (1996a) Alignment of three-dimensional protein structures: network server for database searching. Methods Enzymol., 266, 653–662.
Holm,L. and Sander,C. (1996b) The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res., 24, 206–210.
Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Applic. Biosci., 12, 95–107.
Ishikawa,M., Toya,T., Hoshida,M., Nitta,K., Ogiwara,A. and Kanehisa,M. (1993a) Multiple sequence alignment by parallel simulated annealing. Comput. Applic. Biosci., 9, 267–273.
Ishikawa,M., Toya,T. and Tokoti,Y. (1993b) Parallel iterative aligner with genetic algorithm. In Artificial Intelligence and Genome Workshop, 13th International Conference on Artificial Intelligence, Chambery, France.
Kececioglu,J.D. (1993) The maximum weight trace problem in multiple sequence alignment. Lecture Notes Comput. Sci., 684, 106–119.
Kim,J., Pramanik,S. and Chung,M.J. (1994) Multiple sequence alignment using simulated annealing. Comput. Applic. Biosci., 10, 419–426.
Krogh,A. and Mitchison,G. (1995) Maximum entropy weighting of aligned sequences of proteins or DNA. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England. AAAI Press, Menlo Park, CA.
Lipman,D.J., Altschul,S.F. and Kececioglu,J.D. (1989) A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA, 86, 4412–4415.
Mevissen,H.T. and Vingron,M. (1996) Quantifying the local reliability of a sequence alignment. Protein Eng., 9, 127–132.
Morgenstern,B., Dress,A. and Werner,T.
(1996) Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl Acad. Sci. USA, 93, 12098–12103.
Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Notredame,C. and Higgins,D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24, 1515–1524.
Notredame,C., O'Brien,E.A. and Higgins,D.G. (1997) RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res., 25, 4570–4580.
Pascarella,S. and Argos,P. (1992) A data bank merging related protein structures and sequences. Protein Eng., 5, 121–137.
Rost,B. (1997) AQUA Server. http://www.ebi.ac.uk/∼rost/Aqua/aqua.html
Sibbald,P.R. and Argos,P. (1990) Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J. Mol. Biol., 216, 813–818.
C.Notredame, L.Holm and D.G.Higgins
Sjolander,K., Karplus,K., Brown,M., Hughey,R., Krogh,A., Saira,M. and Haussler,D. (1996) Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Applic. Biosci., 12, 327–345.
Taylor,W.R. (1988) A flexible method to align large numbers of biological sequences. J. Mol. Evol., 28, 161–169.
Thompson,J., Higgins,D. and Gibson,T. (1994a) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994b) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Applic. Biosci., 10, 19–29.
Vingron,M. and Argos,P. (1990) Determination of reliable regions in protein sequence alignment. Protein Eng., 3, 565–569.
Vingron,M. and Argos,P. (1991) Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol., 218, 33–43.
Vingron,M. and Sibbald,P.
(1993) Weighting in sequence space: a comparison of methods in terms of generalized sequences. Proc. Natl Acad. Sci. USA, 90, 8777–8781.
Vogt,G., Etzold,T. and Argos,P. (1995) An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol., 249, 816–831.
Wang,L. and Jiang,T. (1994) On the complexity of multiple sequence alignment. J. Comput. Biol., 1, 337–348.

Using multiple alignment methods to assess the quality of genomic data analysis

Cédric Notredame and Chantal Abergel
Information Génétique et Structurale UMR 1889, 31 Chemin Joseph Aiguier, 13 006 Marseille
Email: cedric.notredame@igs.cnrs-mrs.fr, chantal.abergel@igs.cnrs-mrs.fr

ABSTRACT

The analysis of multiple sequence alignments can generate essential clues in genomic data analysis. Yet, to be informative, such analyses require some means of estimating the reliability of a multiple alignment. In this chapter we describe a novel method allowing the unambiguous identification of the residues correctly aligned within a multiple alignment. This method uses an index named CORE (Consistency of the Overall Residue Evaluation) based on the T-Coffee multiple sequence alignment algorithm. We provide two examples of application: one where the CORE index is used to identify correct blocks within a difficult multiple alignment, and another where the CORE index is used on genomic data to identify the proper start codon and a frameshift within one of the sequences.

Introduction

Biological analysis largely relies on the assembly of elaborate models meant to summarize our knowledge of life's complex mechanisms. For that purpose, vast amounts of data are collected, analyzed, validated and then integrated within a model. In an ideal world, an existing model would be available to explain every bit of experimental data. In the real world, this is rarely the case, and every day existing models need to be modified to accommodate new findings.
Sometimes, data that cannot be explained are kept at bay until the accumulation of new evidence prompts the design of an entirely new model. Unexplained data can be viewed as the stuff inflating an inconsistency bubble: eventually, the bubble bursts and a new model is designed. A multiple alignment is nothing less than such a model. Given a series of sequences and an alignment criterion (structural similarity, common phylogenetic origin), the multiple alignment contains a series of hypotheses regarding the relationships between the sequences it is made of. This alignment can accommodate data generated experimentally (e.g. the alignment of two homologous catalytic residues) or combine the results of various sequence analysis methods. The importance of multiple sequence alignments in the context of sequence analysis has been recognized for a long time, and is so well established that most bioinformatics protocols make use of them. Multiple alignments have been turned into profiles (Gribskov et al., 1987) and hidden Markov models (Krogh et al., 1994) to enhance the sensitivity and specificity of database searches (Altschul et al., 1997). State-of-the-art methods for protein structure prediction depend on the proper assembly of a multiple sequence alignment (Jones, 1999), as do phylogenetic analyses (Duret et al., 1994). Over recent years, multiple sequence alignment techniques have been instrumental to improvements made in almost every key area of sequence analysis. Yet, despite its importance, the accurate assembly of a multiple sequence alignment remains a complex process: the biological knowledge and the computational power it requires are far beyond our current capacities. As a consequence, biologists are left to use approximate programs that attempt to assemble proper alignments without providing any guarantee that they may do so. The lack of a 'perfect', or at least reasonably robust, method explains why so many multiple sequence alignment packages exist.
The variations among these packages are not only cosmetic; they include the use of very different algorithms, different parameters and, generally speaking, different paradigms. For a recent review of state-of-the-art techniques, see (Duret and Abdeddaim, 2000). Database searches, structure prediction and phylogenetic analysis are enough on their own to make multiple alignment compulsory in a genome analysis task. Yet, thanks to the sanity checks they provide, multiple alignments can also be instrumental in tackling the plague of genomic analysis: faulty data. When dealing with genomes, faulty data arise from two major sources: sequencing errors and wrong predictions. The consequence is that a predicted protein sequence may have accumulated errors both at the DNA level and when its frame was predicted (this will be especially true in eukaryotic genes, where exons may be missed, added or improperly predicted). In the worst cases, the effect of such errors will be amplified in the higher-level analysis, leading to an improper analysis of the available data. On the other hand, once they have been identified, these errors are usually easily corrected, either by extra sequencing or by data extrapolation. Therefore, any method providing a reasonable sanity check that earmarks areas of a genome likely to be problematic would be a major improvement. In this chapter we will show how multiple sequence alignments can be used to carry out part of this task. For that purpose we will focus on the applications of T-Coffee, a recently described method (Notredame et al., 2000).

Generating Multiple Alignments With T-Coffee

Despite the large variety of multiple sequence alignment methods publicly available, the number of packages effectively used for data analysis is surprisingly small, and a vast majority of the alignments found in the literature are produced using only two programs: ClustalW (Thompson et al., 1994) and its X-Window implementation, ClustalX.
ClustalW uses the progressive alignment strategy described by Taylor (Taylor, 1988) and by Feng and Doolittle (Feng and Doolittle, 1987), refined in order to incorporate sequence weights and a local gap penalty scheme. Recently, the ClustalW algorithm was further modified in order to improve the accuracy of the produced alignments by making the evaluation of the substitution costs position-dependent. This improved algorithm is implemented in the T-Coffee package (Notredame et al., 2000). The aim of T-Coffee is to build a multiple alignment that has a high level of consistency with a library of pre-computed pairwise alignments. This library may contain as many alignments as one wishes, and it may also be redundant and inconsistent with itself. For instance, it may contain several alternative alignments of the same sequences aligned using various gap penalties. It may also contain alternative alignments obtained by applying different methods to the sequences. Overall, the library is a collection of alignments believed to be correct. Within this library, each alignment receives a weight that is an estimation of its biological likelihood (i.e. how much one trusts this alignment to be correct). For that purpose, one may use any suitable criterion, such as percent identity, a P-value estimation or any other appropriate measure. The T-Coffee algorithm uses this library in order to compute the score for aligning two residues with one another in the multiple alignment. This score is named the extended weight because it requires an extension of the library. The extended weight takes into account the compatibility of the alignment of two residues with the rest of the alignments observed within the library; its derivation is extensively described in (Notredame et al., 2000).
The principle is straightforward: in order to compute the extended weight associated with two residues R and S of two different sequences, one considers whether, when R is found aligned in the library with some residue X of a third sequence, S is also found aligned with that same residue X in another entry of the library. If that is the case, then the weight associated with R and S is increased by the minimum of the two weights RX and SX. The final extended weight is obtained when every possible X has been considered and the resulting contributions summed up. Although this operation seems very expensive from a computational point of view, its effective cost is kept low thanks to the sparseness of the primary library (i.e. for most pairs of residues RS, very few Xs need to be considered). In the end, a pair of residues is highly consistent (and has a high extended weight) if most of the other sequences contain at least one residue that is found aligned both to R and to S in two different pairwise alignments. A key property of this weight extension procedure is to concentrate information: the extended score of RS incorporates information coming from all the sequences in the set, and not only from the two sequences contributing R and S. The main advantage of the extended weights is that they can be used in place of a substitution matrix. While standard substitution matrices do not discriminate between two identical residues (e.g. all cysteines are the same for a PAM (Dayhoff et al., 1979) or a BLOSUM (Henikoff and Henikoff, 1992) matrix), the extended weights are truly position-specific and make it possible to discriminate between two identical residues that differ only by their positions. Once the library has been assembled (potential ways of assembling that library are described later) and the extended weights computed, T-Coffee closely follows the ClustalW procedure, using the extended weights instead of a substitution matrix.
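To make the triplet extension concrete, here is a minimal Python sketch of it. This is an illustration, not T-Coffee's actual implementation: the library is modeled as a plain dictionary mapping residue pairs (each residue being an arbitrary hashable label, a hypothetical representation) to their primary weights, and the direct weight is assumed to contribute first to the extended weight.

```python
from collections import defaultdict

def extend_library(library):
    """Compute extended weights from a primary library.

    `library` maps (residue1, residue2) pairs to primary weights.
    For each pair (R, S), every third residue X found aligned to
    both R and S contributes min(w(R,X), w(S,X)) to the extended
    weight of (R, S), as described in the text.
    """
    # Index: for each residue, the residues it is aligned to.
    partners = defaultdict(dict)
    for (r, s), w in library.items():
        partners[r][s] = w
        partners[s][r] = w

    extended = {}
    for (r, s), w in library.items():
        ext = w  # the direct weight contributes first (an assumption)
        for x, w_rx in partners[r].items():
            if x in (r, s):
                continue
            w_sx = partners[s].get(x)
            if w_sx is not None:
                ext += min(w_rx, w_sx)  # consistent triplet R-X-S
        extended[(r, s)] = ext
    return extended
```

With the toy library `{("a","b"): 10, ("a","c"): 8, ("b","c"): 6}`, the pair ("a","b") keeps its direct weight 10 and gains min(8, 6) = 6 from the consistent triplet through "c", for an extended weight of 16. The sparseness mentioned in the text corresponds to `partners[r]` being short for most residues.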
The overall T-Coffee strategy is outlined in Figure 1. All the sequences are first aligned two by two, using dynamic programming (Needleman and Wunsch, 1970) with the extended library in place of a substitution matrix. The distance matrix thus obtained is then used to compute a neighbor-joining tree (Saitou and Nei, 1987). This tree guides the progressive assembly of a multiple sequence alignment: the two closest sequences are first aligned by normal dynamic programming, using the extended weights to align the residues in the two sequences; no gap penalty is applied (because it has already been applied to generate the alignments contained in the library). This pair of sequences is then fixed, and any gaps that have been introduced cannot be shifted later. The program then aligns the next closest two sequences, or adds a sequence to the existing alignment of the first two, depending on which is suggested by the guide tree. The procedure always joins the next two closest sequences or pre-aligned groups of sequences, and continues until all the sequences have been aligned. To align two groups of pre-aligned sequences, one uses the extended weights as before, but taking the average library scores in each column of the existing alignments. The key feature of T-Coffee is the freedom given to the user to build a custom library following whatever protocol may seem appropriate. For this purpose, one may mix structural information with database results, knowledge-based information or pre-established collections of multiple alignments. It may also be necessary to explore a wide range of parameters of a given computer package. A simple library format was designed to fit that purpose; it is shown in Figure 2. A library is a straightforward ASCII file that contains a listing of every pair of aligned residues that needs to be described. Any knowledge-based information can easily be added manually to an automatically generated library, or the other way round.
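Figure 2 is not reproduced here, but the flavour of such a library file can be conveyed with a small hand-written fragment. The fragment below follows the general layout of the T-Coffee library format (a header, one line per sequence giving name, length and residues, then one block per sequence pair listing `residue1 residue2 weight` triples); the sequence names, residues and weights are invented for illustration, and the exact header keywords may differ between T-Coffee versions and from the figure.

```
! TC_LIB_FORMAT_01
2
seqA 5 MKVLA
seqB 5 MRVLA
#1 2
1 1 80
2 2 80
3 3 80
! SEQ_1_TO_N
```

Adding a knowledge-based constraint by hand amounts to appending one more `residue1 residue2 weight` line to the relevant pair block.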
This figure also shows clearly that the library can contain ambiguities and inconsistencies (i.e. two possible alignments for the first residue of Seq1 with Seq2). These ambiguities are resolved while the alignment is being assembled, on the basis of the score given by the extended weights. The library does not need to contain a weight for each possible pair of residues. On the contrary, an ideal library contains only pairs that will effectively occur in the correct multiple alignment (i.e. N²L pairs rather than N²L² pairs). While this flexibility to design and assemble one's own library is a very desirable property, in practice it is also convenient to have a standard automatic protocol available. Such a protocol exists and is fully integrated within the T-Coffee package. It is run in the default mode and does not require the user to be aware of T-Coffee's underlying concepts (library, extension, progressive alignment). This default protocol, extensively described and validated in (Notredame et al., 2000), requires two distinct libraries to be compiled and combined within the primary library before the extension. The first one contains a ClustalW pairwise alignment of each possible pair of sequences within the dataset. For that purpose, ClustalW (Thompson et al., 1994) is run using default parameters. This library is global because it is generated by aligning the sequences over their whole length (global alignments) using a linear-space version of the Needleman and Wunsch algorithm (Needleman and Wunsch, 1970). The second library is local: for each possible pair of sequences, it contains the ten best non-overlapping local alignments, as reported by the Lalign program (Huang and Miller, 1991) run with default parameters. In both the local and the global libraries, each pair of residues found aligned is associated with a weight equal to the average level of identity within the alignment it came from.
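This weighting scheme is easy to restate in code. The sketch below is illustrative Python, not the T-Coffee source; the gap character, the residue addressing and the percent-identity definition are assumptions. It turns one gapped pairwise alignment into library entries weighted by that alignment's percent identity, and a merge helper sums the weights of pairs that recur across libraries, mirroring the rule stated in the text that the weights of repeated pairs are added.

```python
def alignment_to_pairs(aligned1, aligned2, id1, id2, gap="-"):
    """Convert one gapped pairwise alignment into library entries.

    Every pair of aligned residues is weighted by the percent
    identity of the alignment it comes from.
    """
    cols = list(zip(aligned1, aligned2))
    aligned_cols = [(a, b) for a, b in cols if a != gap and b != gap]
    matches = sum(a == b for a, b in aligned_cols)
    pct_id = 100.0 * matches / len(aligned_cols) if aligned_cols else 0.0

    pairs = {}
    i = j = 0  # 1-based residue positions in the ungapped sequences
    for a, b in cols:
        if a != gap:
            i += 1
        if b != gap:
            j += 1
        if a != gap and b != gap:
            pairs[((id1, i), (id2, j))] = pct_id
    return pairs


def merge_libraries(*libraries):
    """Combine libraries, adding the weights of recurring pairs."""
    merged = {}
    for lib in libraries:
        for pair, w in lib.items():
            merged[pair] = merged.get(pair, 0.0) + w
    return merged
```

Running `alignment_to_pairs` once per ClustalW global alignment and once per Lalign local alignment, then merging the results, gives a toy version of the default primary library described above.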
When a specific pair is found more than once, the weights associated with each occurrence are added. The main strength of this protocol is to combine local and global information within a multiple alignment. The level of consistency within the library will depend on the nature of the sequences. For instance, if the sequences are very diverse, the requirement for long insertions/deletions will often cause the global alignments to be incorrect and inconsistent, while the local alignments will be less sensitive to this type of problem. In such a situation, the measure of consistency will enhance the local alignment signal and let it drive the multiple alignment assembly. Conversely, if the global alignments are good enough, they will help remove the noise associated with the collection of local alignments (local alignments do not have any positional constraints). Overall, the current default T-Coffee protocol contains three distinct elements that lead to the collection of extended weights: the global library, the local library and the library extension that turns the sum of the two libraries into an extended library. Earlier work demonstrated that each of these components plays a significant part in improving the overall accuracy of the program. Table 1 shows that the current version of T-Coffee (version 1.29) outperforms other popular multiple sequence alignment methods, as judged by comparison on BaliBase (Thompson et al., 1999), a database of hand-made reference alignments based on structural comparison (see the Table 1 legend for a description of BaliBase and the comparison protocol). These results illustrate well the good performance of T-Coffee over the wide range of situations that occur in BaliBase.
It is especially interesting to point out that T-Coffee is the only method equally well suited to situations that require a global alignment strategy (categories 1, 2 and 3) and to situations that are better served by a local alignment strategy (categories 4 and 5, with long internal and terminal insertions/deletions). The other methods are good either for global alignments (like ClustalW) or for local alignments (like Dialign2 (Morgenstern et al., 1998)). It should be noted that T-Coffee still uses ClustalW 1.69 to construct the primary global library, because this was the last 'naïve' version of ClustalW, not tuned on BaliBase. The latest version (1.81) has been tuned on the BaliBase references (hence its improved performance over the results originally reported for ClustalW). Using this ClustalW 1.81 version when benchmarking T-Coffee would make the process circular. Nonetheless, as good as it may seem, T-Coffee still suffers from the same shortcoming as any other package available today: it is not always the best method. Even if on average it does better than any of its counterparts, one cannot guarantee that T-Coffee will always generate the best alignment. For instance, although Dialign2 is significantly less accurate overall, it outperforms T-Coffee on 17 test sets (11%). ClustalW is better than T-Coffee in 24% of the cases. We may conclude from this that, in practice, there will always be situations where some alternative method beats T-Coffee. Furthermore, even in cases where the T-Coffee improvement over any alternative method is very significant, it may still produce an alignment much less than 100% correct. For practical usage, it would be much more helpful to know where the correctly aligned portions lie: a method that is only 20% correct but comes with a proper estimation of its reliability would be much more useful than a method that is merely more accurate 'on average'.
Several situations exist in which a biologist can make use of this reliability information. For instance, if the purpose of the alignment is to extrapolate some experimental data onto an otherwise uncharacterized genomic sequence, one will need to be very careful not to deduce anything from an unreliable portion of the alignment. More generally, unreliable positions within a multiple sequence alignment should not be used for predictive purposes. For instance, when turning a multiple alignment into a profile in order to scan databases for remote homologues, it is essential to exclude regions whose alignment cannot be trusted and that may obscure some otherwise highly conserved position. Used in this fashion, reliability information allows a significant decrease of the noise induced by locally spurious alignments. The other important application of a reliability measure is the identification of regions within a multiple alignment that are properly aligned without being highly conserved. These regions are extremely important when the alignment is used in conjunction with a predictive method that bases its analysis on mutation patterns. For instance, structure and phylogeny prediction methods require the presence of non-conserved positions to yield informative results. Any scheme that allows discrimination between positions that are degenerate but correctly aligned and positions that are simply misaligned may induce a dramatic improvement in the accuracy of these prediction methods. Furthermore, a reliability measure will help identify faulty data and provide some clues on how to correct it. In the next section, we show how consistency can be measured on a T-Coffee alignment and how that measure provides a fairly accurate reliability estimator.
Measuring The Consistency Of A Multiple Sequence Alignment

T-Coffee is a heuristic algorithm that attempts to optimize the consistency between a multiple alignment and a list of pre-computed pairwise alignments known as a library (Figure 2). By consistency we mean that a pair of residues described as aligned in the library will also be found aligned in the multiple alignment. While the theoretical maximum for the consistency is 100%, the score of an optimal alignment will only be equal to the level of self-consistency within the library. Figure 2 shows the example of a library that is not self-consistent because it is ambiguous regarding the alignment of some of the residues it contains. Of course, the more ambiguous the library, the less consistency it will yield. For instance, given two residues R1 and R2 taken from two different sequences S1 and S2, one can easily measure the consistency CS(R1,R2) between the alignment of these two residues and all the other alignments contained in the library, by comparing ES(R1,R2), the extended score of the pair R1 and R2, with the sum of the extended scores of all the other potential pairs that involve S1 and S2 and either R1 or R2. If we want to use it as a quality factor, this measure suffers from two major drawbacks. Firstly, it is expensive to compute: given a multiple alignment of N sequences and of length L, each pair of residues found in the multiple alignment needs O(L) extension operations, each of which requires a minimum of O(N) operations. "O(L)" is the standard big-O notation, meaning that the computation time is proportional to L, up to a constant factor. Since there are of the order of L·N² pairs of residues in a multiple alignment, this leads to O(L²N³) operations for an estimate of the CS of every pair. This cubic complexity in the number of sequences becomes problematic with large datasets.
The second limitation of this measure is that with sequences rich in repeats, the summation factor can become artificially high and cause a dramatic decrease of the consistency score. Formally, the consistency measure is defined as

CS(R1,R2) = ES(R1,R2) / [ Σx ES(R1,x) + Σx ES(x,R2) ]   (1)

where the sums run over the other potential pairs involving R1 or R2. In practice, we found it much more effective to use the extended score of the best-scoring pair contained in the alignment as a normalization factor. This defines the aCS (approximate Consistency measure):

aCS(R1,R2) = ES(R1,R2) / Max{ES(Ri,Rj)}   (2)

with R1, R2 any two residues found aligned in the multiple alignment. Our measurements on the BaliBase dataset indicate that the CS and the aCS are well correlated. An important criterion, when using the aCS as a reliability measure, is its ability to discriminate between correct and incorrect alignments within the so-called twilight zone (Sander and Schneider, 1991). Given two sequences, the twilight zone is a range of percent identity (between 0 and 30%) that has been shown to be non-informative regarding the relationship that exists between two sequences: two sequences whose alignment yields less than 30% identity can be unrelated, related and incorrectly aligned, or related and perfectly aligned. To check how good the aCS is when used as an accuracy measure, every one of the 142 BaliBase datasets was aligned using T-Coffee 1.29, and the similarity of each pair of sequences was measured within the obtained alignments. Pairs of sequences with less than 30% identity (5088 pairs) were extracted, and the accuracy of their alignment was assessed by comparison with their counterparts in the reference BaliBase alignment; the aCS score was also computed on each pair of aligned residues and averaged along the sequences. Figure 3a shows the scatter graph of identity vs. accuracy (see the Figure legend for the definitions). Although there is a weak correlation between these two measurements, the percent identity is a poor predictor of the alignment accuracy.
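Equations (1) and (2) translate directly into code. This is an illustrative Python sketch, not part of T-Coffee; the extended scores are assumed to be given as a dictionary keyed by residue pairs, and the CS denominator is read as the sum over the competing pairs involving either residue, excluding the pair itself (the text leaves this detail ambiguous).

```python
def cs(extended, r1, r2):
    """CS(R1,R2) as in equation (1): ES(R1,R2) divided by the summed
    extended scores of the other pairs involving R1 or R2. Excluding
    the pair itself from the denominator is an assumption."""
    competing = sum(w for (a, b), w in extended.items()
                    if (a, b) != (r1, r2) and (r1 in (a, b) or r2 in (a, b)))
    return extended[(r1, r2)] / competing if competing else 1.0

def acs(extended, r1, r2):
    """aCS(R1,R2) as in equation (2): ES(R1,R2) normalised by the best
    extended score in the alignment, scaled here to a 0-100 range."""
    return 100.0 * extended[(r1, r2)] / max(extended.values())
```

The aCS normalisation is what keeps the measure cheap: a single pass finds the best extended score, after which each pair is scored in constant time, avoiding the per-pair summations that make the full CS expensive.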
For 75% of the sequence pairs (identity lower than 25%), the accuracy indication given by the percent identity falls in a 40% range (i.e. the average identity indicates the average accuracy +/- 20%). On the other hand, when the accuracy is plotted against the aCS score (Figure 3b), the correlation is improved, and for pairs of sequences having an aCS higher than 20 (true for 60% of the 5088 pairs) this measure is a much better predictor of alignment accuracy than the percent identity. While they do not solve the twilight zone problem, these results indicate that the aCS measure provides us with a powerful means of assessing an alignment's reliability within the twilight zone. Nonetheless, from a practical point of view, the aCS measure alone is of limited use, since one is often more concerned with the overall quality (i.e. is residue r of sequence S correctly aligned with the rest of the sequences?) than with pairwise relationships. In order to answer this type of question, the aCS measure was used to derive three very useful non-pairwise indexes. The Consistency of the Overall Residue Evaluation (CORE) is obtained by averaging the scores of the aligned pairs that a residue forms with the other residues found in the same column. The CORE index and equivalent approaches have been shown in the literature to be good indicators of the local quality of a multiple sequence alignment (Heringa, 1999; Notredame et al., 1998), as judged by comparison with reference biological alignments. In the T-Coffee package, an option makes it possible to output multiple alignments with the CORE index (a rounded value between 0 and 9) replacing each residue.
It is also possible to produce a colorized version (PDF, PostScript or HTML) of that same multiple alignment, where residues receive a background coloration proportional to their CORE index (blue/green for low-scoring residues, orange/red for high-scoring ones). Such an output is shown in Figures 5 and 6. The CORE index is merely an average aCS measure:

CORE(Rx) = Σy≠x aCS(Rx,Ry) / (N − 1)   (3)

where Rx and Ry are residues found aligned in the same column of an alignment of N sequences. Whether that measure provides some indication of the multiple alignment quality is a key question. We tested that hypothesis on the complete BaliBase dataset. Given each T-Coffee alignment, residues were divided into four categories: (i) true positives (TP) are correctly aligned residues rightfully predicted to be so; (ii) true negatives (TN) are incorrectly aligned residues rightfully predicted to be so; (iii) false positives (FP) are residues predicted to be well aligned when they are not; (iv) false negatives (FN) are residues wrongly predicted to be misaligned. Following previously described definitions (Notredame et al., 1998), a residue is said to be correctly aligned if at least 50% of the residues to which it was aligned in the reference alignment are found in the same column in the T-Coffee alignment. Each of the 10 CORE indexes (0 to 9) was used in turn as a threshold to discriminate correctly and incorrectly aligned residues in the T-Coffee alignments. The BaliBase reference alignments were then used to evaluate the TP, TN, FP and FN. The sensitivity and the specificity were then computed according to Sneath and Sokal (Sneath and Sokal, 1973) and plotted on a graph (Figure 4). Our results indicate that the best trade-off between sensitivity and specificity is obtained when CORE=3 is used as a threshold (i.e. every residue with a score greater than or equal to 3 is considered properly aligned). In that case the specificity is 84% and the sensitivity is 82%.
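The threshold evaluation just described can be sketched in a few lines of illustrative Python. The data layout (one CORE score and one boolean 'correctly aligned' flag per residue) is an assumption; so are the formulas sensitivity = TP/(TP+FN) and specificity = TP/(TP+FP), since the original follows Sneath and Sokal (1973) and the exact definitions used there are not reproduced here.

```python
def evaluate_threshold(core_scores, correct, threshold):
    """Classify each residue as 'predicted aligned' when its CORE
    index reaches the threshold, count TP/TN/FP/FN against the
    reference truth, and derive sensitivity and specificity."""
    tp = tn = fp = fn = 0
    for score, is_correct in zip(core_scores, correct):
        predicted = score >= threshold
        if predicted and is_correct:
            tp += 1
        elif not predicted and not is_correct:
            tn += 1
        elif predicted:
            fp += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity
```

Sweeping `threshold` over the ten CORE values (0 to 9) and plotting the resulting pairs reproduces the kind of trade-off curve shown in Figure 4.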
These high figures partly reflect the overall quality of the T-Coffee alignments, in which 80.5% of the residues are correctly aligned according to the criteria used here. It is therefore more interesting to note that when the CORE index reaches 7, the specificity is 97.7% and the sensitivity is close to 50%. This means that, thanks to the CORE index, half of the residues properly aligned in a multiple alignment can be unambiguously identified (i.e. more than 40% of all the residues contained in BaliBase). In the next section we will see that this identification sometimes occurs in cases that are far from trivial, even for an expert eye. Similar results were observed when applying the CORE index to multiple alignments obtained with another method (i.e. ClustalW alignments evaluated with a standard T-Coffee library). This suggests that the CORE measure may be used to evaluate the local quality of a multiple alignment produced by any method. However, one should be well aware that the relevance of the CORE measure regarding the reliability of an alignment is entirely dependent on the way in which the library was derived. All the conclusions drawn here only apply to libraries derived using the standard T-Coffee protocol. The sequence CORE (sCORE) is obtained by averaging the CORE scores over all the residues contained within one sequence. That measure can be helpful for identifying an outlier among the sequences, i.e. a sequence that should not be part of the set, either because it is not homologous or because it is too distantly related to the other members to yield an informative alignment. The alignment CORE (alCORE) may be obtained by averaging the sCOREs over all the sequences. Our analysis suggests that this index may be a reasonable indicator of the overall accuracy of the alignment. Yet, to be fully informative, it requires the sequence set to be homogeneous (i.e. the standard deviation of the sCOREs should be as low as possible).
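The two derived indexes can be sketched in a few lines. This is illustrative only; gap handling and the exact averaging used by the T-Coffee implementation are not specified in the text:

```python
from statistics import mean, pstdev

def score_summaries(core_per_sequence):
    """Derive sCORE (one value per sequence) and alCORE (one value for
    the whole alignment) by averaging CORE indexes, as described above.

    core_per_sequence: dict mapping a sequence name to the list of
    per-residue CORE indexes (0-9) of that sequence, gaps excluded.
    """
    scores = {name: mean(vals) for name, vals in core_per_sequence.items()}
    alcore = mean(scores.values())
    # A low spread of the sCOREs is what makes alCORE informative.
    spread = pstdev(scores.values())
    return scores, alcore, spread
```

A sequence whose sCORE sits far below the others is a candidate outlier; a large spread warns that the alCORE average should not be trusted.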
    sCORE(S_x) = \sum_{i=1}^{L_x} CORE(R_i^x) / L_x    (4)

where L_x is the number of residues in sequence S_x.

Using the CORE Measure To Assess Local Alignment Quality

The driving force behind the development of the CORE index is the identification of correctly aligned blocks of residues within a multiple sequence alignment. It is common practice to identify these blocks by scanning the multiple alignment and marking highly conserved regions as potentially meaningful. ClustalW and ClustalX provide a measure of conservation that may help the user carry out this task. Unfortunately, situations exist where it is difficult to make a decision regarding the correct alignment of some residues within an alignment. Such an example is provided in Figure 5 with the BaliBase alignment known as 1pamA_ref1, made of six alpha-amylases. This set is difficult to align because it contains highly divergent sequences. Not only have these sequences accumulated mutations while diverging, but they have also undergone many insertions/deletions that make it difficult to reconstruct their relationships with accuracy. The average level of identity measured on the BaliBase reference is 18%, the two closest sequences being less than 20% identical. As such, 1pamA_ref1 constitutes a classic example of a test set deceptive for most multiple sequence alignment methods. The fact that less than one third of the 1pamA_ref1 reference alignment is annotated as trustworthy in BaliBase confirms that suspicion. When run on this test set, existing alignment programs generate different results: Prrp finds 37% of the columns annotated as trustworthy in BaliBase, ClustalW (1.81) 40%, T-Coffee 54% and Dialign2 56%. Regardless of the method used, such an alignment is completely useless unless correctly aligned portions can be identified. This is exactly the information that the CORE index provides. An alignment colorized according to its CORE indexes is shown in Figure 5. The results are in good agreement with those reported in Figure 4.
Out of the 905 correctly aligned residues (42% of the total), 267 have a score higher than 7. No incorrectly aligned residue has a score higher than or equal to 7. Using 7 as a prediction threshold gives a sensitivity of 29% and a specificity of 100%. Residues with a CORE index of 3 or higher (pale yellow) yield a sensitivity of 65% and a specificity of 79%. In this alignment, the main features are the red/dark-orange blocks: they are 100% correct. These blocks could be fed as they are to any suitable method (structure prediction, phylogeny, etc.). They are not very well conserved at the sequence level and are therefore very informative for structural and phylogenetic analysis. For instance, block II in Figure 5 is perfectly aligned although, within that block, the average pairwise identity is lower than 30% (41% for the two most closely related sequences). The measure of consistency can also help question positions that may seem unambiguous from a sequence point of view. In the column annotated as I, the position marked with a "*" could easily be mistaken for a correct one: it lies within a block, and aromatic positions are usually fairly well conserved and, owing to their relative rarity, unlikely to occur by chance. Yet the green color code indicates that this position may be incorrectly aligned (the green tyrosine has a CORE index of 1). This is confirmed by comparison with the reference, which shows that the correct alignment incorporates another tyrosine at this position. When analyzing these patterns, one should always keep in mind that the consistency information only has a positive value. In other words, inconsistent regions are those where the library does not support the alignment. This does not mean they are incorrectly aligned, but rather that no information is at hand to support or disprove the observed alignment.
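The block extraction this implies can be sketched as follows. This is a hypothetical helper, not part of T-Coffee; using the per-column minimum CORE and the particular threshold are assumptions:

```python
def reliable_blocks(column_min_core, threshold=7):
    """Return (start, end) column ranges (inclusive, 0-based) in which
    every column's minimum CORE index reaches the threshold -- the
    candidate blocks to feed to downstream methods such as structure
    prediction or phylogeny.
    """
    blocks, start = [], None
    for i, core in enumerate(column_min_core):
        if core >= threshold and start is None:
            start = i                      # a reliable block opens here
        elif core < threshold and start is not None:
            blocks.append((start, i - 1))  # the block closes at i-1
            start = None
    if start is not None:
        blocks.append((start, len(column_min_core) - 1))
    return blocks
```

Lowering the threshold trades specificity for sensitivity, exactly as in the threshold scan of Figure 4.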
Identifying Faulty Gene Predictions

Another possible application of the T-Coffee CORE index is to reveal, and help resolve, sequence ambiguities in predicted genes. In the structural genomics era, many projects involve hypothetical proteins for which an accurate prediction of the start and stop codons is needed to properly express the gene product. Since over-predicted N- or C-terminal extensions are rarely conserved at the amino acid level, sequence comparison provides a very powerful means of identifying this type of problem. A simple procedure consists of multiply aligning the most conserved members of a protein family and then measuring the T-Coffee CORE index on the resulting alignment. Inspection of the CORE patterns offers a diagnosis regarding the correctness of the data. This approach can also be applied to frame-shift detection, where the identification of abnormally low-scoring segments may lead to their correction. Such an alignment will make it possible to decide whether the abnormal length of a coding region could be due to a sequencing error (and the resulting frame-shift). At the very least, the CORE measure will indicate that a thorough examination is needed. Of course, one could also detect these frame-shifts using standard pairwise comparison methods such as GeneWise (Birney and Durbin, 2000), but the advantage of using a multiple sequence alignment is that the simultaneous comparison of several sequences can strengthen the evidence that the frame-shift is real. Furthermore, thanks to the multiple alignment, one may be able to detect mistakes in sequences that lack a very close homologue. To illustrate this potential usage of T-Coffee, we chose the example of an Escherichia coli K-12 gene (accession # U00096) predicted to encode a protein of unknown function, yifB. Orthologous genes were found in complete genomes using BLAST (Altschul et al., 1997) and the four most conserved sequences (identity >70% relative to the Escherichia coli K-12
gene; see figure for ID numbers) were retrieved along with their flanking regions (80 nucleotides on the N-terminal side), in order to check whether these supposedly non-coding regions contained any coding information. The 'elongated' sequences were translated in the same frame as their core coding region, their multiple alignment was carried out using T-Coffee, and the CORE indexes were measured. The resulting alignment is displayed in Figure 6, with the CORE indexes color-coded (low CORE in blue and green, high CORE in orange and red). The main feature on the N-terminal side is an abrupt transition (II) from low to high CORE indexes. This position is also a conserved methionine. The combination of these two observations suggests that the starting point of these five sequences is probably where the transition occurs, ruling out other methionines as potential starting points in the first sequence (I). Another discrepancy in this alignment is also emphasized by the CORE analysis: the sequence yifB_SALTY_1 yields a very low N-terminal CORE index relative to the other family members. The CORE score of this sequence abruptly comes back into phase with the other sequences at the position marked III. This pattern is a clear indication of a frame-shift: a protein highly similar to the other members of its family but locally unrelated to them. To verify that hypothesis, we used data provided by SwissProt (Bairoch and Boeckmann, 1992) and found that in the corresponding entry, the nucleotide sequence had been corrected to remove the frame-shift we observed (entry P57015). The corrected sequence has been added at the bottom of the alignment in Figure 6 (non-colored sequence). The position where yifB_SALTY_1 and its corrected version start agreeing is also the position where the CORE score changes abruptly from a value of 2 (yellow) to a value of 7 (orange). That position also turns out to be the one where the frame-shift occurs in the genomic sequence.
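The search for abnormally low-scoring segments described above can be sketched as a sliding-window scan over the per-residue CORE indexes of one sequence. This is illustrative only; the window length and the cutoff are arbitrary assumptions:

```python
def low_scoring_segments(core_values, window=10, cutoff=3.0):
    """Return the start positions of all windows whose mean CORE index
    falls below the cutoff. In the frame-shift scenario above, a long
    run of flagged windows inside an otherwise well-conserved sequence
    is what prompts a closer look at the underlying nucleotides.
    """
    flagged = []
    for i in range(len(core_values) - window + 1):
        segment = core_values[i:i + window]
        if sum(segment) / window < cutoff:
            flagged.append(i)
    return flagged
```

On a toy profile with a stretch of low CORE values flanked by high ones (the yifB_SALTY_1 pattern), only the windows dominated by the low stretch are flagged.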
Conclusion

In this chapter, we introduced an extension of the T-Coffee multiple sequence alignment method: the CORE index. The CORE index is a means of assessing the local reliability of a multiple sequence alignment. Using the CORE index, correct blocks within a multiple sequence alignment can be identified. This measure also makes it possible to detect potential errors in genomic data, and to correct them. The CORE index is a relatively ad hoc measure and, even if it may prove extremely useful from a practical point of view, it still needs to be attached to a more theoretical framework. One would really need to be able to turn the consistency estimation into some sort of P-value. For instance, to assess efficiently the local value of an alignment, one would like to ask questions of the following kind: what is the probability that library X was generated using dataset Y? What is the probability that alignment A yields p% consistency with library X? Altogether these questions may open more avenues for the automatic processing of multiple alignments. That issue may prove crucial for the maintenance of resources that rely on a large-scale usage of multiple sequence alignments, such as HOBACGEN (Perriere et al., 2000), HOVERGEN (Duret et al., 1994) or ProDom (Corpet et al., 2000).

Figure Legends

Figure 1: Layout of the T-Coffee algorithm. This figure indicates the chain of events that leads from unaligned sequences to a multiple sequence alignment using the T-Coffee algorithm. Data processing steps are boxed while data structures are indicated by rounded boxes.

Figure 2: Library Format. An example of a library used by T-Coffee. The header contains the sequences and their names. '# 1 2' indicates that the following pairs of residues come from sequences 1 and 2. Each pair of aligned residues is described by three values: the index of residue 1, the index of residue 2 and the weight associated with the alignment of these two residues.
No order or consistency is expected within the library.

Figure 3: a) Percentage identity vs accuracy in the twilight zone: the 5088 pairs of sequences that have less than 30% identity in the BaliBase reference alignments were extracted. The accuracy of their alignment was measured by comparison with the reference, and the resulting graph was plotted. b) Approximate Consistency Score (aCS) vs accuracy in the twilight zone: the aCS was measured on the 5088 pairs of sequences previously considered and was plotted against the average accuracy previously reported. The vertical line indicates aCS=25 and separates the pairs for which the aCS is informative from those whose aCS seems to be non-informative.

Figure 4: Specificity and sensitivity of the CORE measure. The sensitivity and the specificity of the CORE index used as an alignment quality predictor were evaluated on the entire BaliBase dataset. They were measured on the T-Coffee alignments after considering that every residue with a CORE index higher than x was properly aligned (see text for details).

Figure 5: Identifying correct blocks with the CORE measure. An example of the T-Coffee output on a BaliBase test set (1pamA_ref1) that contains five alpha-amylases. This alignment was produced using T-Coffee 1.29 with default parameters, requesting the score_pdf output option. The color scale goes from blue (CORE=0, bad) to red (CORE=9, good). The residues in upper case are correctly aligned (as judged by comparison with the BaliBase reference); those in lower case are improperly aligned. Box I indicates a conserved position that is not properly aligned; box II indicates a block of distantly related segments that is correctly aligned by T-Coffee.
Figure 6: Identifying frame-shifts and start codons. The chosen sequences are YifB_ECOLIA (Escherichia coli, accession # AE005174), YifB_SALTY_1 (Salmonella typhi, C18 chromosome, Sanger Center), YifB_HAIN (Haemophilus influenzae, accession # L42023), YifB_PASMU (Pasteurella multocida, accession # AE004439) and YifB_PSEAE (Pseudomonas aeruginosa, accession # AE004091). They were aligned using the standard T-Coffee alignment procedure, requesting the score_pdf output option. The corrected sequence of the Salmonella typhi YifB protein was later added for further reference (YifB_SALTY, SP: P57015) but was used neither for coloring the residues (non-colored sequence) nor for improving the multiple alignment. The figure only shows the N-terminal portion of the alignment, and the arrows indicate the positions annotated as start codons in SwissProt (except for Salmonella typhi). Box I indicates a putative start codon in YifB_ECOLIA, Box II indicates the true start codon in most sequences, and Box III indicates the position where the frame-shift occurs in YifB_SALTY_1.

Table 1

              cat 1   cat 2   cat 3   cat 4   cat 5   avg 1   avg 2
    ---------------------------------------------------------------
    cw        79.53   32.91   48.72   74.02   67.84   67.89   61.82
    prrp      78.62   32.45   50.14   51.12   82.72   66.45   60.25
    dialign2  70.99   25.21   35.12   74.66   80.38   61.54   57.99
    T-Coffee

To produce this table, each dataset contained in BaliBase was aligned using one of the following methods: cw (ClustalW 1.81; Thompson et al., 1994), Prrp (Gotoh, 1996), dialign2 (Morgenstern et al., 1998) and T-Coffee 1.29 (Notredame et al., 2000). In BaliBase, reference alignments are classified in five categories: category 1 contains closely related sequences, category 2 contains a group of closely related sequences and an outlier, category 3 contains two groups of sequences that are distantly related to each other, category 4 contains families with long internal indels, and category 5 contains sequences with long terminal indels.
The resulting alignments were then compared to their reference counterparts in BaliBase, the comparison being restricted to the regions annotated as trustworthy in the reference alignment. Under this scheme, we define the accuracy of an alignment as the number of columns found totally conserved in the reference divided by the total number of columns within that reference, expressed as a percentage. avg 1 is the average of the results obtained on each of the 142 test cases; avg 2 is the average of the results obtained in each category. Bold numbers indicate the best performing method.

Bibliography

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Bairoch, A. and Boeckmann, B., 1992. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 20: 2019-2022.
Birney, E. and Durbin, R., 2000. Using GeneWise in the Drosophila annotation experiment. Genome Res. 10: 547-548.
Corpet, F., Servant, F., Gouzy, J. and Kahn, D., 2000. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28: 267-269.
Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C., 1979. A model of evolutionary change in proteins. Detecting distant relationships: computer methods and results. In: M.O. Dayhoff (Editor), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, D.C., pp. 353-358.
Duret, L. and Abdeddaim, S., 2000. Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. In: D. Higgins and W. Taylor (Editors), Bioinformatics: Sequence, Structure and Databanks. Oxford University Press, Oxford.
Duret, L., Mouchiroud, D. and Gouy, M., 1994.
HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res. 22: 2360-2365.
Feng, D.-F. and Doolittle, R.F., 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25: 351-360.
Gotoh, O., 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264: 823-838.
Gribskov, M., McLachlan, A.D. and Eisenberg, D., 1987. Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 84: 4355-4358.
Henikoff, S. and Henikoff, J.G., 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.
Heringa, J., 1999. Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Computers and Chemistry 23: 341-364.
Huang, X. and Miller, W., 1991. A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. 12: 337-357.
Jones, D.T., 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195-202.
Krogh, A., Brown, M., Mian, I.S., Sjölander, K. and Haussler, D., 1994. Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235: 1501-1531.
Morgenstern, B., Frech, K., Dress, A. and Werner, T., 1998. DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14: 290-294.
Needleman, S.B. and Wunsch, C.D., 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-453.
Notredame, C., Higgins, D.G. and Heringa, J., 2000. T-Coffee: a novel algorithm for multiple sequence alignment. J. Mol. Biol. 302: 205-217.
Notredame, C., Holm, L. and Higgins, D.G., 1998. COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
Perriere, G., Duret, L.
and Gouy, M., 2000. HOBACGEN: database system for comparative genomics in bacteria. Genome Research 10: 379-385.
Saitou, N. and Nei, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406-425.
Sander, C. and Schneider, R., 1991. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics 9: 56-68.
Sneath, P.H.A. and Sokal, R.R., 1973. Numerical Taxonomy. W.H. Freeman, San Francisco.
Taylor, W.R., 1988. A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28: 161-169.
Thompson, J., Higgins, D. and Gibson, T., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4690.
Thompson, J.D., Plewniak, F. and Poch, O., 1999. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27: 2682-2690.

A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary v.3
(for Nucleic Acids Research Database Issue)

Sabine Dietmann1, Jong Park1, Cedric Notredame2, Andreas Heger1, Michael Lappe1 and Liisa Holm1
1 Structural Genomics Group, EMBL-EBI, CB10 1SD Cambridge, UK
2 Structural and Genetic Information, C.N.R.S. UMR 1889, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France

Abstract

The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank. The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities.
Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to (1) supersecondary structural motifs (attractors in fold space), (2) the topology of globular domains (fold types), (3) remote homologues (functional families), and (4) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new. In September 2000, the Dali classification contained 10531 PDB entries comprising 17101 chains, which were partitioned into 5 attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures severalfold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores, and a comprehensive library of explicit multiple alignments of distantly related protein families.

Introduction

Improved methods of protein engineering, crystallography and NMR spectroscopy have led to a surge of new protein structures deposited in the Protein Data Bank (PDB), and a number of derived databases that organize these data into hierarchical classification schemes or in terms of structural neighbourhoods have appeared on the World Wide Web [1-4]. We maintain the Dali Domain Dictionary and FSSP database with continuous weekly updates. Because many structural similarities are between substructures (domains), i.e. parts of structures, protein chains are decomposed into domains using the criteria of recurrence and compactness [5]. Each domain is assigned a Domain Classification number D.C.l.m.n.p representing fold space attractor region (l), globular folding topology (m), functional family (n) and sequence family (p).
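As a small illustration of this numbering scheme, a D.C. number can be split into its four hierarchical levels. This is a hypothetical parser, not part of the Dali software, and the field names are taken from the description above:

```python
def parse_dc_number(dc):
    """Split a Domain Classification number 'D.C.l.m.n.p' into its four
    levels: attractor region (l), fold type (m), functional family (n)
    and sequence family (p). Values are kept as strings, since attractor
    labels need not be purely numeric.
    """
    parts = dc.split('.')
    if parts[:2] != ['D', 'C'] or len(parts) != 6:
        raise ValueError('expected a number of the form D.C.l.m.n.p')
    keys = ('attractor_region', 'fold_type',
            'functional_family', 'sequence_family')
    return dict(zip(keys, parts[2:]))
```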
The discrete classification presents views which are free of redundancy and simplify navigation in protein space. The structural classification is explicitly linked to sequence families with associated functional annotation, resulting in a rich network of biologically interesting relationships that can be browsed online. In particular, structure-based alignments increase our understanding of the more distant evolutionary relationships (Figure 1). A map of fold space The central concept underlying the classification is a ’map of fold space’. This map is based on exhaustive neighbouring of all protein structures in the PDB. The all-against-all structure comparison is carried out using the Dali program. As a result of the exhaustive comparisons, each structure in the PDB is positioned in an abstract, high-dimensional space according to its structural similarity score to all other structures. The graph of structural similarities (between domains) is partitioned into clusters at four different levels of granularity. Coarse-grained overviews yield few clusters with many members that share broad architectural similarities, while fine-grained clustering yields many clusters within which structural similarities between members can extend to atomic detail due to functional constraints, for example, in binding sites. Continuing the practice from the FSSP database, fold types are defined by agglomerative clustering so that the members of a fold type have average pairwise Z-scores above 2. The threshold has been chosen empirically to group together structures with topological similarity. Dali Domain Dictionary v.3 introduces two new levels to the fold classification, one above and one below the fold type abstraction. The top level of the fold classification corresponds to secondary structure composition and supersecondary structural motifs. We have previously identified five attractor regions in fold space [1]. 
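A minimal sketch of this average-linkage scheme on a toy Z-score matrix follows. This is illustrative only, not the Dali implementation, which operates on the full all-against-all structure comparison:

```python
def cluster_folds(z, threshold=2.0):
    """Agglomerative average-linkage clustering of domains: repeatedly
    merge the pair of clusters with the highest average pairwise
    Z-score, stopping when no pair exceeds the threshold (Z > 2 groups
    structures with topological similarity, as in the text).

    z: symmetric dict-of-dicts of pairwise Z-scores between domains.
    """
    clusters = [[d] for d in z]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                avg = sum(z[a][b] for a in clusters[i]
                          for b in clusters[j]) / (
                              len(clusters[i]) * len(clusters[j]))
                if avg > best:
                    best, pair = avg, (i, j)
        if pair is None:
            return [sorted(c) for c in clusters]
        i, j = pair
        clusters[i] += clusters.pop(j)  # merge the best-scoring pair
```

Two domains with a strong mutual Z-score end up in one fold type, while a weakly related third domain remains a singleton.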
We partition fold space so that each domain is assigned to one of attractors I-V, which are represented by archetype structures, using a shortest-path criterion. Structures which are disconnected from other structures are assigned to class X. Domains which are not clearly closer to one attractor than another are assigned to the mixed class Y. Currently, class Y comprises about one sixth of the representative domain set. In the future, some of these may be assigned to emerging new attractors.

An evolutionary classification

The other new level of the classification infers plausible evolutionary relationships from strong structural similarities which are accompanied by functional or sequence similarities. Conceptually, this functional family level is equivalent to the 'superfamily' level of SCOP [2]. The computational discrimination between physically convergent (analogous) and evolutionarily related, divergent (homologous) proteins has received much attention recently [6-8]. Structural similarity alone is insufficient to draw a line between the two classes. For example, lysozymes exhibit extreme structural divergence in regions supporting the active site, while coiled coils and beta-barrels are simple, geometrically constrained topologies which are believed to have emerged several times in protein evolution. To address the evolutionary classification problem, we have chosen to analyse functional and sequence-motif attributes on top of structural similarity in a numerical taxonomy. The more functional features two proteins have in common, the more likely it is that they share them through common descent rather than by chance. Currently, our feature set includes common sequence neighbours (overlap of PSI-Blast families), analysis of 3D clusters of identically conserved residues, enzyme classification (E.C. numbers) and keyword analysis of biological function. A neural network assigns weights to these qualitatively different features.
The neural network was trained against the superfamily-to-fold transition in a manual fold classification [2]. To unify families, we exploit the empirical observation that Dali's intramolecular distance comparison measure gives higher scores to pairs of homologues than to analogues. In practice, we require that functional families are nested within fold families in the fold dendrogram: functional families are branches of the fold dendrogram where all pairs have a high average neural network prediction for being homologous. The threshold for unification was chosen empirically and is conservative. 504 functional families unify two or more sequence families. Unified families have functional residues or sequence motifs that map to common sites in the 3D context of a fold. The strongest evidence is usually obtained for unifying enzyme catalytic domains. In some cases the expert system fails to capture enough evidence for the unification of domains which are believed to be homologous, such as within the varied set of helix-turn-helix motif-containing DNA binding domains, where several functional families are defined at the same fold type level.

A library of structure-based multiple alignments of remote homologues

The Dali Domain Classification can be browsed interactively at http://www.ebi.ac.uk/dali/domain. The server is implemented on top of a MySQL database. The classification may be entered from the top of the hierarchy, or the user may make a query about a protein identifier or a node in the classification hierarchy. Multiple structural alignments including attributes of the proteins are generated on the fly for any user selection of structural neighbours. Precomputed alignments are available for each functional family. The T-Coffee program [9] is used to generate genuine consensus alignments of multiple structures from the library of pairwise Dali alignments.
A reliability score is computed to indicate well-defined regions (the structural core) and regions where structural equivalences are ambiguous. Technically, T-Coffee improves alignment quality in a few known cases of functional families where active site residues were inconsistently aligned in some of the pairwise Dali comparisons. Scientifically, the definition of functional families, together with a reliable multiple structure alignment for each of them, opens the door to sensitive sequence database searches using position-specific profiles, and to benchmarking the alignment accuracy of threading predictions.

Acknowledgement

S.D. and J.P. were supported by EU contract BIO4-CT96-0166.

References

1 Holm, L. and Sander, C. (1996) Mapping the protein universe. Science, 273, 595-603.
2 Hubbard, T.J., Ailey, B., Brenner, S.E., Murzin, A.G. and Chothia, C. (1999) SCOP: a Structural Classification of Proteins database. Nucleic Acids Res., 27, 254-256.
3 Orengo, C.A., Pearl, F.M., Bray, J.E., Todd, A.E., Martin, A.C., Lo Conte, L. and Thornton, J.M. (1999) The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res., 27, 275-279.
4 Marchler-Bauer, A., Addess, K.J., Chappey, C., Geer, L., Madej, T., Matsuo, Y., Wang, Y. and Bryant, S.H. (1999) MMDB: Entrez's 3D structure database. Nucleic Acids Res., 27, 240-243.
5 Holm, L. and Sander, C. (1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88-96.
6 Russell, R.B., Saqi, M.A., Bates, P.A., Sayle, R.A. and Sternberg, M.J. (1998) Recognition of analogous and homologous protein folds: assessment of prediction success and associated alignment accuracy using empirical substitution matrices. Protein Eng., 11, 1-9.
7 Kawabata, T. and Nishikawa, K. (2000) Protein structure comparison using the Markov transition model of evolution. Proteins, 41, 108-122.
8 Wood, T.C. and Pearson, W.R. (1999) Evolution of protein sequences and structures. J. Mol. Biol., 291, 977-995.
9 Notredame, C., Higgins, D.G. and Heringa, J.
(2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.
10 Bewley, M.C., Jeffrey, P.D., Patchett, M.L., Kanyo, Z.F. and Baker, E.N. (1999) Crystal structures of Bacillus caldovelox arginase in complex with substrate and inhibitors reveal new insights into activation, inhibition and catalysis in the arginase superfamily. Structure, 7, 435-438.
11 Finnin, M.S., Donigian, J.R., Cohen, A., Richon, V.M., Rifkind, R.A., Marks, P.A., Breslow, R. and Pavletich, N.P. (1999) Structure of a histone deacetylase homologue bound to the TSA and SAHA inhibitors. Nature, 401, 188-193.
12 Kraulis, P. (1991) MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr., 24, 946-950.

Figure 1: Unification of the histone deacetylase and arginase families. Reuse and adaptation of existing structural frameworks for new cellular functions is widespread in protein evolution. Histone deacetylase and arginase are unified at the functional family level of the classification despite very little overall sequence similarity. The supporting evidence comes from structural and functional similarity. (A) Structure comparison of arginase (left: 1rlaA [10]) and histone deacetylase (right: 1c3pA [11]) yields a high Z-score of 12. Superimposition by Dali, drawing by Molscript [12]. (B) Joint structural, evolutionary and functional information for two segments around the active site. Structurally aligned positions are shaded. Arginase has a binuclear metal centre where residues D124, H126 and D234 bind one manganese ion and residues H101, H128 and H232 the other. The former site is structurally equivalent to the zinc binding site of histone deacetylase, made up of residues D168, H170 and D258. Sequence variability from multiply-aligned sequence neighbours in HSSP (* means values of 10 or larger; 0 means invariant) is shown above, and the secondary structure summary from DSSP (E, B: beta-sheet; S: bend; T, G: hydrogen-bonded turns) is shown below the amino acid sequences.
Using Genetic Algorithms for Pairwise and Multiple Sequence Alignments

Cédric Notredame
Information Génétique et Structurale, CNRS-UMR 1889, 31 Chemin Joseph Aiguier, 13006 Marseille
Email: cedric.notredame@igs.cnrs-mrs.fr

1 Introduction

1.1 Importance of Multiple Sequence Alignment

The simultaneous alignment of many nucleic acid or amino acid sequences is one of the most commonly used techniques in sequence analysis. Given a set of homologous sequences, multiple alignments are used to help predict the secondary or tertiary structure of new sequences [51]; to help demonstrate homology between new sequences and existing families; to help find diagnostic patterns for families [6]; to suggest primers for the polymerase chain reaction (PCR); and as an essential prelude to phylogenetic reconstruction [19]. These alignments may be turned into profiles [25] or hidden Markov models (HMMs) [27, 9] that can be used to scour databases for distantly related members of the family.

Multiple alignment techniques can be divided into two categories: global and local. When making a global alignment, the algorithm attempts to align the sequences chosen by the user over their entire length. Local alignment algorithms automatically discard portions of sequences that do not share any homology with the rest of the set. They constitute a greater challenge, since they increase the number of decisions made by the algorithm. Most multiple alignment methods are global, leaving it to the user to decide on the portion of each sequence to be incorporated in the multiple alignment. To aid that decision, one often uses local pairwise alignment programs such as Blast [3] or Smith and Waterman [56]. In this chapter, we will focus on global alignment methods, with a special emphasis on the alignment of protein and RNA sequences.

Despite its importance, the automatic generation of an accurate multiple sequence alignment remains one of the most challenging problems in bioinformatics.
The reasons behind this complexity are easily explained. A multiple alignment is meant to reconstitute the relationships (evolutionary, structural and functional) within a set of sequences that may have been diverging for millions and sometimes billions of years. To be accurate, this reconstitution would require an in-depth knowledge of the evolutionary history and structural properties of these sequences. For obvious reasons, this information is rarely available, and generic empirical models of protein evolution [18, 28, 8], based on sequence similarity, must be used instead. Unfortunately, these can prove difficult to apply when the sequences are less than 30% identical and lie within the so-called "twilight zone" [52]. Further, accurate optimization methods that use these models can be extremely demanding of computer resources for more than a handful of sequences [12, 62]. This is why most multiple alignment methods rely on approximate heuristic algorithms. These heuristics are usually a complex combination of ad hoc procedures mixed with some elements of dynamic programming. Overall, two key properties characterize them: the optimization algorithm and the criterion (objective function) this algorithm attempts to optimize.

1.2 Standard Optimization Algorithms

Optimization algorithms roughly fall into three categories: exact, progressive and iterative. Exact algorithms attempt to deliver an optimal or a sub-optimal alignment within some well-defined bounds [40, 57]. Unfortunately, these algorithms have very serious limitations with regard to the number of sequences they can handle and the type of objective function they can optimize. Progressive alignments are by far the most widely used [30, 14, 45].
They depend on a progressive assembly of the multiple alignment [31, 20, 58], in which sequences or alignments are added one by one, so that never more than two sequences (or multiple alignments) are simultaneously aligned using dynamic programming [43]. This approach has the great advantage of speed and simplicity, combined with reasonable sensitivity, even if it is by nature a greedy heuristic that does not guarantee any level of optimization.

Iterative alignment methods depend on algorithms able to produce an alignment and to refine it through a series of cycles (iterations) until no further improvement can be made. Iterative methods can be deterministic or stochastic, depending on the strategy used to improve the alignment. The simplest iterative strategies are deterministic. They involve extracting sequences one by one from a multiple alignment and realigning them to the remaining sequences [7, 24, 29]. The procedure is terminated when no further improvement can be made (convergence). Stochastic iterative methods include HMM training [39], simulated annealing (SA) [37, 36, 33] and evolutionary computation such as genetic algorithms (GAs) [44, 47, 34, 64, 5, 23] and evolutionary programming [11, 13]. Their main advantage is to allow a clean separation between the optimization process and the evaluation criterion (objective function). It is the objective function that defines the aim of any optimization procedure and, in our case, it is also the objective function that contains the biological knowledge one tries to project into the alignment.

1.3 The Objective Function

In an evolutionary algorithm, the objective function is the criterion used to evaluate the quality (fitness) of a solution (individual). To be of any use, the value that this function associates with an alignment must reflect its biological relevance and indicate the structural or evolutionary relations that exist among the aligned sequences.
In theory, a multiple alignment is correct if, in each column, the aligned residues have the same evolutionary history or play similar roles in the three-dimensional fold of the RNA or protein. Since evolutionary or structural information is rarely at hand, it is common practice to replace it with a measure of sequence similarity. The rationale is that similar sequences can be assumed to share the same fold and the same evolutionary origin [52], as long as their level of identity is above the so-called "twilight zone" (more than 30% identity over more than 100 residues). Accurate measures of similarity are obtained using substitution matrices [18, 28]. A substitution matrix is a pre-computed table of numbers (for proteins, this matrix is 20 x 20, covering all possible substitutions between the 20 naturally occurring amino acids) where each possible substitution or conservation receives a weight indicative of its likelihood, as estimated from data analysis. In these matrices, substitutions (conservations) observed more often than one would expect by chance receive positive values, while under-represented mutations are associated with negative values. Given such a matrix, the optimal alignment is defined as the one that maximizes the sum of the substitution (conservation) scores. An extra factor is also applied to penalize insertions and deletions (the gap penalty). The most commonly used model for that purpose is known as 'affine gap penalties'. It penalizes an insertion/deletion once for its opening (gap opening penalty, abbreviated GOP) and then with a factor proportional to its length (gap extension penalty, abbreviated GEP). Since any gap can be explained by a single mutational event, the aim of this scheme is to make sure that the best-scoring evolutionary scenario involves only a small number of insertions or deletions (indels). This results in an alignment with few long gaps rather than many short ones.
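The preference of the affine model for one long gap over several short ones of the same total length can be illustrated with a short sketch. The GOP and GEP values below are arbitrary assumptions chosen for the example, not the defaults of any particular program:

```python
def gap_penalty(row, gop=-10, gep=-1):
    """Affine gap penalty of one aligned sequence: each run of gaps is
    charged GOP once at its opening, plus GEP per gap position."""
    penalty, in_gap = 0, False
    for ch in row:
        if ch == '-':
            penalty += gep + (gop if not in_gap else 0)
            in_gap = True
        else:
            in_gap = False
    return penalty

long_gap = gap_penalty("AC---GT")   # one opening: GOP + 3 * GEP = -13
scattered = gap_penalty("A-C-G-T")  # three openings: 3 * (GOP + GEP) = -33
```

Both rows contain three gap positions, yet the single long gap costs far less, so the best-scoring alignment tends to group indels together.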
The resulting score can be viewed as a measure of similarity between two sequences (pairwise). This measure can be extended to the alignment of multiple sequences in many ways. For instance, it is common practice to set the score of the multiple alignment to be the sum of the scores of every pairwise alignment it contains (sums of pairs) [1]. While that scoring scheme is the most widely used, its main drawback stems from the lack of an underlying evolutionary scenario. It assumes that every sequence is independent, and this results in an overestimation of the number of substitutions. It is to counterbalance that effect that probability-based schemes were introduced in the context of HMMs. Their purpose is to associate each column of an alignment with a generation probability [39]. Estimations are carried out in a Bayesian context where the probability of the model (the alignment) is evaluated simultaneously with the probability of the data (the aligned sequences). In the end, the score of the complete alignment is the probability that the aligned sequences were generated by the trained HMM. The major drawbacks of this model are its strong dependency on the number of sequences being aligned (i.e. many sequences are needed to generate an accurate model) and the difficulty of the training. More recently, new methods based on consistency were described for the evaluation of multiple sequence alignments. Under these schemes, the score of a multiple alignment is a measure of its consistency with a list of pre-defined constraints [42, 46, 10, 45]. It is common practice for these pre-defined constraints to be sets of pairwise, multiple or local alignments. Quite naturally, the main limitation of consistency-based schemes is that they make the quality of the alignment greatly dependent on the quality of the constraints against which it is evaluated.
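A consistency-based evaluation can be sketched as the fraction of aligned residue pairs in the multiple alignment that also appear in a library of pre-defined pairwise constraints. This is a simplified, hypothetical scheme in the spirit of consistency-based functions, not the exact formulation of any published method:

```python
def residue_pairs(msa):
    """Yield (i, ri, j, rj) for every pair of sequences i, j whose
    residues ri, rj are placed in the same column of the alignment."""
    nseq, ncol = len(msa), len(msa[0])
    idx = [[None] * ncol for _ in range(nseq)]  # residue index at each column
    for s in range(nseq):
        r = 0
        for c in range(ncol):
            if msa[s][c] != '-':
                idx[s][c] = r
                r += 1
    for c in range(ncol):
        for i in range(nseq):
            for j in range(i + 1, nseq):
                if idx[i][c] is not None and idx[j][c] is not None:
                    yield (i, idx[i][c], j, idx[j][c])

def consistency_score(msa, library):
    """Fraction of aligned residue pairs supported by the constraint library."""
    pairs = list(residue_pairs(msa))
    if not pairs:
        return 0.0
    return sum(p in library for p in pairs) / len(pairs)
```

A perfect score of 1.0 means every aligned pair of residues is supported by at least one constraint; the alignment quality therefore depends entirely on the constraint list, as noted above.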
An objective function always defines a mathematical optimum, that is to say an alignment in which the sequences are arranged in such a manner that they yield a score that cannot be improved. The mathematically optimal alignment should never be confused with the correct alignment, the biological optimum. While the biological optimum is by definition correct, a mathematically optimal alignment is biologically only as good as it is similar to the biological optimum. This depends entirely on the quality of the objective function used to generate it. There is no limit to the complexity of the objective functions one may design, even if in practice the lack of appropriate optimization engines constitutes a major limitation. What is the use of an objective function if one cannot optimize it, and how is it possible to tell whether an objective function is biologically relevant or not? Evolutionary algorithms come in very handy to answer these questions. They make it possible to design new scoring schemes without having to worry, at least in the first stage, about optimization issues. In the next section, we introduce one of these evolutionary techniques, known as genetic algorithms (GAs). GAs are described along with another closely related stochastic optimization algorithm: simulated annealing. Three examples are reviewed in detail, in which GAs were successfully applied to sequence alignment problems.

2 Evolutionary Algorithms and Simulated Annealing

An evolutionary algorithm is a way of finding a solution to a problem by forcing sub-optimal solutions to evolve through perturbations (mutations and recombination). Most evolutionary algorithms are stochastic in the sense that the solution space is explored in a random rather than ordered manner.
In this context, randomness provides a non-null probability of sampling any potential solution, regardless of the size of the solution space, provided that the mutations allow such an exploration. The drawback of randomness is that all potential solutions may not be visited during the search (including the global optimum). To correct for this problem, a large number of heuristics have been designed that attempt to bias the way in which the solution space is sampled. They aim at improving the chances of sampling an optimal solution. For that reason, most stochastic strategies (including evolutionary computation) can be regarded as a trade-off between greediness and randomness. Two stochastic strategies have been widely used for sequence analysis: simulated annealing and genetic algorithms. Strictly speaking, SA does not belong to the field of evolutionary computation; yet, in practice, it has been one of the major sources of inspiration for the elaboration of the genetic algorithms used in sequence analysis.

2.1 Simulated Annealing

Simulated annealing (SA) [38] was the first stochastic algorithm used to attempt to solve the multiple sequence alignment problem [33, 37]. SA relies on an analogy with physics: the idea is to compare the solving of an optimization problem to a crystallization process (the cooling of a metal). In practice, given a set of sequences, a first alignment is randomly generated. A perturbation is then applied (shifting of an existing gap or introduction of a new one), and the resulting alignment is evaluated with the objective function. If the new alignment is better than the previous one, it replaces it; otherwise, it replaces it only with a probability that depends on the score difference and on the current temperature. The higher the temperature, the more likely a large score deterioration will be accepted. Every cycle, the temperature decreases slightly until it reaches 0.
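The acceptance rule just described can be written generically. The sketch below uses an illustrative geometric cooling schedule and a toy integer search problem standing in for an alignment; the parameter values are assumptions for the example, not those of any published SA aligner:

```python
import math
import random

def simulated_annealing(initial, score, perturb,
                        t0=10.0, cooling=0.99, steps=2000, seed=0):
    """Generic SA loop: always accept improvements, and accept a worse
    candidate with probability exp(delta / T), where delta < 0."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    t = t0
    for _ in range(steps):
        candidate = perturb(current, rng)
        delta = score(candidate) - current_score
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, current_score + delta
        t *= cooling  # temperature slowly decreases toward 0
    return current, current_score

# Toy objective with its maximum at x = 3; a random +/-1 shift plays
# the role of the gap-shifting perturbation.
best, best_score = simulated_annealing(
    initial=20,
    score=lambda x: -(x - 3) ** 2,
    perturb=lambda x, rng: x + rng.choice([-1, 1]))
```

Early on, the high temperature lets the search escape poor regions; as the temperature drops, only improvements are retained, which is the behaviour exploited when SA is used as an alignment improver.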
From the perspective of an evolutionary algorithm, SA can be regarded as a population with only one individual. Perturbations are similar to the mutations used in evolutionary algorithms. Apart from the population size of one, the main difference between SA and any true evolutionary algorithm is the extrinsic annealing schedule. While the principle is very sound, and its suitability for multiple alignment optimization and objective function evaluation is obvious, SA suffers from a very serious drawback: it is extremely slow. Most of the studies conducted on simulated annealing and multiple alignments concluded that although it does reasonably well, SA is too slow to be used for ab initio multiple alignment and must be restricted to use as an alignment improver (i.e. the improvement of a seed alignment). This serious limitation makes it much harder to use as the black box one needs to evaluate the design of new objective functions.

2.2 Genetic Algorithms

It is in an attempt to overcome the limits of SA that evolutionary algorithms were adapted to the multiple sequence alignment problem. Evolutionary algorithms are parallel stochastic search tools. Unlike SA, which maintains a single line of descent from parent to offspring, evolutionary algorithms maintain a population of trials for a given objective function. Evolutionary algorithms are among the most interesting stochastic optimization tools available today. One of the reasons why these algorithms have received so little attention in the context of multiple sequence alignment is probably that the implementation of an evolutionary algorithm dedicated to multiple alignment is much less straightforward than with simulated annealing. In other areas of computational biology, evolutionary algorithms have already been established as powerful tools. These include RNA [26, 55, 48] and protein structure analysis [53, 60, 41].
Among all the existing evolutionary algorithms (genetic algorithms, genetic programming, evolution strategies and evolutionary programming), genetic algorithms have been by far the most popular in the field of computational biology. Although one could argue about who exactly invented GAs, the algorithms we use today were formally introduced by Holland in 1975 [32] and later refined by Goldberg to give the Simple Genetic Algorithm [22]. GAs are based on a loose analogy with the phenomenon of natural selection. Their principle is relatively straightforward. Given a problem, potential solutions (individuals within a population) compete with one another (selection) for survival. These solutions can also evolve: they can be modified (mutations) or combined with one another (crossovers). The idea is that, acting together, variation and selection will lead to an overall improvement of the population via evolution. Most of the concepts developed here about GAs are taken from [22, 16].

Two ingredients are essential to the GA strategy: the selection method and the operators. Selection is established in order to lead the search toward improvement: the best individuals (as judged using the objective function) must be the most likely to survive. To serve the GA's purpose, this selection strategy cannot be too strict. It must allow some variety to be maintained throughout the search, in order to prevent the GA population from converging toward the first local minimum it encounters. Evolution toward the optimal solution also requires operators that modify existing solutions and create diversity (mutations), or exploit the existing diversity (crossovers) by combining existing motifs into an optimal solution. Given such a crude layout, the potential for variation is infinite, and the study of new GA models is a very active field in its own right.
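The selection/mutation/crossover loop can be made concrete on a toy problem. The sketch below is a generic simple GA in the spirit of [22], maximizing the number of 1-bits in a bit string ("one-max"); the population size, rates and fitness function are illustrative assumptions, not SAGA's actual settings:

```python
import random

def simple_ga(fitness, length=20, pop_size=30, generations=60, pmut=0.05, seed=0):
    """Minimal simple-GA sketch: fitness-proportional (roulette-wheel)
    selection, one-point crossover and point mutation on bit strings."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        total = sum(scores) or 1
        def pick():  # roulette-wheel selection
            r = rng.uniform(0, total)
            acc = 0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return ind
            return pop[-1]
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, length)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # point mutation: flip each bit with probability pmut
            child = [b ^ 1 if rng.random() < pmut else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = simple_ga(sum)  # "one-max": fitness is the number of 1-bits
```

Even this crude version illustrates the two essential ingredients discussed above: selection pressure (fitter parents are drawn more often) and operators that both exploit (crossover) and create (mutation) diversity.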
This being said, the main difficulty to overcome when adapting a GA to a problem like multiple sequence alignment is not the choice of a proper model, but rather the conception of a well-suited series of operators. This is a well-known problem that has also received some attention in the field of structure prediction, both for proteins [50] and RNA [54]. A simple justification is that the operators (and the problem representation) largely control the manner in which a solution landscape is explored. For instance, the neighborhood of a solution is mostly defined by the exploration capabilities of the operators. Well chosen, they can smooth very rugged landscapes. On the other hand, if they are too 'smart' and too greedy, they may prevent a proper exploration from being carried out. Finding the right trade-off can prove a rather complex task. When applying GAs to sequence alignments, previous work on SA proved instrumental: it provided researchers with a set of well-tested operators perfectly suitable for integration within most evolutionary algorithms.

Attempts to apply evolutionary algorithms to the multiple sequence alignment problem started in 1993, when Ishikawa et al. published a hybrid GA [34] that does not try to optimize the alignment directly, but rather the order in which the sequences should be aligned using dynamic programming. Of course, this limits the algorithm to objective functions that can be used with dynamic programming. Even so, the results obtained this way were convincing enough to prompt further development of GAs in sequence analysis. The first GA able to deal with sequences in a more general manner was described a few years later by Notredame and Higgins [44], shortly before a similar work by Zhang [64]. In these two GAs, the population is made of complete multiple sequence alignments, and the operators have direct access to the aligned sequences: they insert and shift gaps in a random or semi-random manner.
In 1997, SAGA was applied to RNA analysis [47] and parallelized for that purpose using an island model. This work was later reproduced by Anbarasu et al. [5], who carried out an extensive evaluation of this model, using ClustalW as a reference. Over the following years, at least three new multiple sequence alignment strategies based on evolutionary algorithms have been introduced [23, 13, 11]. Each of these relies on a principle similar to SAGA's: a population of multiple alignments evolves by selection, combination and mutation. The population is made of alignments, and the mutations are string-processing programs that shuffle the gaps using complex models. The main difference between SAGA and these recent algorithms is the design of better mutation operators that improve the efficiency and the accuracy of the algorithms. These new results have strengthened the idea that the essence of the adaptation of GAs to multiple sequence alignments is the design of proper operators, reflecting as well as possible the true mechanisms of molecular evolution. In order to expose each of the many ingredients that constitute a GA specialized in sequence alignments, the example of SAGA will now be reviewed in detail.

3 SAGA: a GA Dedicated to Sequence Alignment

3.1 The Algorithm

SAGA is a genetic algorithm dedicated to multiple sequence alignment [44]. It follows the general principles of the simple genetic algorithm (sGA) described by Goldberg [22] and later refined by Davis [17]. In SAGA, each individual is a multiple alignment. The data structure chosen for the internal representation of an individual is a straightforward two-dimensional array where each line represents an aligned sequence and each cell is either a residue or a gap. The population has a constant size and does not contain any duplicates (i.e. identical individuals). The pseudo-code of the algorithm is given in Figure 1. Each of these steps is developed over the next sections.
Initialization

The challenge of the initialization (also known as seeding) is to generate a population as diverse as possible in terms of 'genotype' and as uniform as possible in terms of score. In SAGA, generation 0 consists of 100 randomly generated multiple alignments that contain only terminal gaps. These initial alignments are less than twice the length of the longest sequence of the set (longer alignments can be generated later). To create one of these individuals, a random offset is chosen for each sequence (between 0 and the length of the longest sequence); each sequence is shifted to the right according to its offset, and empty spaces are padded with null signs in order to give the same length L to all the sequences. Seeding can also be carried out by generating sub-optimal alignments using an implementation of dynamic programming that incorporates some randomness. This is the case in RAGA [47], an implementation of SAGA specialized in RNA alignment.

Evaluation

Fitness is measured by scoring each alignment according to the chosen objective function. The better the alignment, the better its score and the higher its fitness (scores are inverted if the OF is meant to be minimized). To minimize sampling errors, raw scores are turned into a normalized value known as the expected offspring (EO). The EO indicates how many children an alignment is likely to have. In SAGA, EOs are stochastically derived using a predefined recipe, 'remainder stochastic sampling without replacement' [22]. This gives values that are typically between 0 and 2. Only the weakest half of the population is replaced with the new offspring, while the other half is carried over unchanged to the next generation. This practice is known as overlapping generations [16].

Breeding

It is during breeding that new individuals (children) are generated. The EO is used as a probability for each individual to be chosen as a parent.
This selection is carried out by weighted-wheel selection without replacement [22], and an individual's EO is decreased by one unit each time it is chosen to be a parent. An operator is also chosen and applied to the parent(s) to create the newborn child. Twenty-two operators are available in SAGA. They all have their own usage probability and can be divided into two categories: mutations, which require one parent, and crossovers, which require two parents. Since no duplicates are allowed in the population, a newborn child is only accepted if it differs from all the other members of the generation already created. When a duplicate arises, the whole series of operations that led to its creation is canceled. Breeding is over when the new generation is complete, and SAGA proceeds toward producing the next generation unless the finishing criterion is met.

Termination

Conditions that could guarantee optimality are not met in SAGA, and there is no valid proof that it may reach a global optimum, even in an infinite amount of time (as opposed to SA). For that reason, an empirical criterion is used for termination: the algorithm terminates when the search has been unable to improve for more than 100 generations. That type of stabilization is one of the most commonly used conditions for stopping a GA when working on a population with no duplicates (i.e. a population where all the individuals are different from one another) [17].

3.2 Designing the Operators

As mentioned earlier, the design of an adequate set of operators has been the main point of focus in the work that led to SAGA. According to the traditional nomenclature of genetic algorithms [22], two types of operators coexist in SAGA: crossovers and mutations. An operator is designed as an independent program that inputs one or two alignments (the parents) and outputs one alignment (the child). Each operator requires one or more parameters that specify how the operation is to be carried out.
For instance, an operator that inserts a new gap requires three parameters: the position of the insertion, the index of the sequence to modify and the length of the insertion. These parameters may be chosen completely at random (within some pre-defined range); in that case, the operator is said to be used in a stochastic manner [44]. Alternatively, all but one of the parameters may be chosen randomly, leaving the value of the remaining parameter to be fixed by exhaustive examination of all possible values; the value that yields the best fitness is kept. An operator applied this way is said to be used in semi-hill-climbing mode. Most operators may be used either way (stochastic or semi-hill climbing). For the robustness of the GA, it is also important to make sure that the operators are completely independent of any characteristic of the objective function, unless one is interested in creating a very specific operator for the sake of efficiency.

The Crossovers

Crossovers are meant to generate a new alignment by combining two existing ones. Two types of crossover coexist in SAGA: the one-point crossover, which combines two parents through a single point of exchange (Figure 2a), and the uniform crossover, which promotes multiple exchanges between two parents by swapping blocks between consistent positions (Figure 2b). The uniform crossover is much less disruptive than its one-point counterpart, but it can only be applied if the two parents share some consistency, a condition rarely met in the early stages of the search. Of the two children produced by a crossover, only the fittest is kept and inserted into the new population (if it is not a duplicate). Crossovers are essential for promoting the exchange of high-quality blocks within the population. They make it possible to use the existing diversity efficiently. However, the blocks present in the original population represent only a tiny proportion of all the possibilities.
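The one-point crossover can be sketched as follows: columns to the left of a cut point come from one parent, and, for each sequence, the remainder comes from the other parent, cut after the same number of residues, with the junction padded by gaps so the rows stay flush. This is a minimal sketch that builds only one of the two possible children (SAGA keeps the fitter of the two), and it assumes both parents align the same set of sequences:

```python
def one_point_crossover(parent1, parent2, cut):
    """Combine two alignments of the same sequences through one exchange point."""
    left, right = [], []
    for row1, row2 in zip(parent1, parent2):
        l = row1[:cut]
        n_res = sum(ch != '-' for ch in l)  # residues consumed on the left
        seen, pos = 0, 0
        while seen < n_res:                 # matching cut point in parent2
            if row2[pos] != '-':
                seen += 1
            pos += 1
        left.append(l)
        right.append(row2[pos:])
    # pad the junction with gaps so every row has the same length
    width = max(len(r) for r in right)
    return [l + '-' * (width - len(r)) + r for l, r in zip(left, right)]

child = one_point_crossover(["ACG-T", "-ACGT"], ["AC-GT", "ACGT-"], 2)
```

Note that each child row still spells out its original sequence once the gaps are removed; only the gap arrangement is recombined, which is exactly what makes the operator safe to apply blindly.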
These blocks may not be sufficient to reconstruct an optimal alignment, and since crossovers cannot create new blocks, another class of operators is needed: mutations.

Mutations: Example of the Gap Insertion Operator

SAGA's mutation operators are extensively described in [44]. We will only review here the gap insertion operator, a crude attempt to reconstitute, backward, some of the insertion/deletion events through which a set of sequences might have evolved. When this operator is applied, alignments are modified following the mechanism shown in Figure 3. The aligned sequences are split into two groups. Within each group, every sequence receives a gap insertion at the same position. The groups are chosen by randomly splitting an estimated phylogenetic tree (as given by ClustalW [59]). Both the stochastic and the semi-hill-climbing versions of this operator are implemented. In the stochastic version, the length of the inserted gaps and the two insertion positions are chosen randomly, while in semi-hill-climbing mode the second insertion position is chosen by exhaustively trying all the possible positions and comparing the scores of the resulting alignments.

Dynamic Scheduling of the Operators

When creating a child, the choice of the operator is just as important as the choice of the parents. Therefore, it makes sense to allow operators to compete for usage, just as the parents do for survival, in order to make sure that useful operators are more likely to be used. Since one cannot tell in advance the good operators from the bad ones, they all initially receive the same usage probability. Later during the run, these probabilities are dynamically reassessed to reflect each operator's individual performance. The recipe used in SAGA is the dynamic scheduling method described by Davis [16]. It easily allows the addition and removal of operators without any need for retuning.
Under Davis's model, an operator has a probability of being used that is a function of its recent efficiency (i.e. the improvement generated over the last 10 generations). The credit an operator gets when performing an improvement is also shared with the operators that came before and may have played a role in this improvement. Thus, each time a new individual is generated, if it yields some improvement over its parents, the operator that was directly responsible for its creation gets the largest part of the credit (e.g. 50%); then the operator(s) responsible for the creation of the parents also get their share of the remaining credit (50% of the remaining credit); and so on. This credit report goes on for some specified number of generations (e.g. 4). Every 10 generations, results are summarized for each operator, and the usage probabilities are reassessed based on the accumulated credit. To avoid the early loss of some operators, each of them has a minimum usage probability higher than 0. It is common practice to set these minimal usage probabilities so that they sum to 0.5; to that effect, one can use for each operator a minimum probability of 1/(2 x number of operators). A very interesting property of this scheme is that operators end up being used only when they are needed. For instance, uniform crossovers are generally more efficient than their one-point counterparts. Unfortunately, they cannot be properly used in the early stages of the optimization, because at that point there is not enough consistency within the population. Dynamic scheduling adapts very well to this situation by initially giving a high usage probability to the one-point crossover, and by shifting that credit to the uniform crossover once the population has become consistent enough to support its usage. It is interesting to notice that these two operators compete with one another even though the GA does not explicitly know that they both belong to the crossover category.
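The periodic reassessment step can be sketched as follows: every operator keeps its minimum usage probability as a floor, and the remaining probability mass (0.5 when the floors are set to 1/(2 x number of operators)) is shared in proportion to the accumulated credit. The operator names and credit values below are made up for illustration:

```python
def reassess_probabilities(credit, p_min):
    """Recompute operator usage probabilities from accumulated credit.
    Each operator keeps at least p_min; the leftover probability mass
    is distributed in proportion to each operator's credit."""
    n = len(credit)
    total = sum(credit.values())
    if total == 0:
        return {op: 1.0 / n for op in credit}  # no evidence yet: uniform
    free = 1.0 - n * p_min  # mass shared according to merit
    return {op: p_min + free * c / total for op, c in credit.items()}

# Hypothetical credit accumulated over the last scheduling window.
probs = reassess_probabilities(
    {'one_point': 8.0, 'uniform': 2.0, 'gap_insert': 0.0},
    p_min=1.0 / (2 * 3))
```

With these numbers, the idle operator retains exactly its floor of 1/6, so it can still be tried later in the run if the population changes, which is the behaviour described above for the uniform crossover.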
3.3 Parallelisation of SAGA

Long running times were SAGA's main limitation. This became especially acute when aligning very long sequences such as ribosomal RNAs (>1000 nucleotides). It is common practice to use parallelisation to alleviate such problems. The technique applied to SAGA is specific to GAs and is known as the island parallelisation model [21]. Instead of a single GA running, several identically configured GAs run in parallel on separate processors. Every 5 generations, they exchange some of their individuals. The GAs are arranged on the leaves and nodes of an N-branched tree, and the population exchange is unidirectional, from the leaves to the root of the tree (Figure 4). By default, the individuals migrating from one GA to another are those having the best scores. The source GA keeps a copy of them, while they replace low-scoring individuals in the accepting GA [44]. Initially implemented in RAGA, the RNA version of SAGA, this model was later extended to SAGA, using a 3-branched tree with a depth of 3 that requires 13 GAs. These processes are synchronous and wait for each other to reach the same generation number before exchanging populations. This distributed model benefits from the explicit parallelisation and is about 10 times faster than a non-parallel version (i.e. about 80% of the maximum speedup expected when distributing the computation over 13 processors). It also benefits from the new constraints imposed by the tree topology on the structure of the population. It seems to be the lack of feedback that makes it possible to retain within the population a much higher degree of diversity than a single unified population could afford. It is the terminal leaves that behave as a diversity reservoir and give the parallel GA a much higher accuracy than a non-parallel version with the same overall population.
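The unidirectional migration step can be sketched sequentially as follows (a toy, single-process sketch: the island contents, edge list and fitness function are illustrative stand-ins, with integers playing the role of scored alignments):

```python
def migrate(islands, edges, fitness, k=2):
    """Island-model migration: along each (source -> destination) edge,
    copy the k best individuals of the source GA into the destination GA,
    where they replace the k worst; the source keeps its copies."""
    for src, dst in edges:
        best = sorted(islands[src], key=fitness, reverse=True)[:k]
        keep = sorted(islands[dst], key=fitness, reverse=True)[:len(islands[dst]) - k]
        islands[dst] = keep + list(best)
    return islands

# Hypothetical two-island exchange, leaf to root, with fitness = value.
islands = {'leaf': [5, 9, 1], 'root': [2, 3, 4]}
migrate(islands, [('leaf', 'root')], fitness=lambda x: x, k=2)
```

Because migration only flows toward the root, the leaves never receive outside individuals, which is what lets them act as the diversity reservoir mentioned above.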
Nonetheless, these preliminary observations remain to be firmly established through more thorough benchmarking. 4 Applications: Choosing an Objective Function The main motivation behind SAGA's design was the creation of a robust platform, a black box on which any OF one can think of can be tested in a seamless manner. Such a black box makes it possible to discriminate between the functions that are biologically relevant and those that are not. For instance, let us consider the weighted sums of pairs. This function is one of the most widely used. It owes its popularity to the fact that algorithmic methods exist that allow its approximate optimization [43, 40]. Yet we know this function is not very meaningful from a biological point of view [4]. The three main limitations that have caught biologists' attention are the crude modeling of insertions/deletions (gaps), the assumed independence of each position, and the fact that the evaluation cannot be made position dependent. Thanks to SAGA, it was possible to design new objective functions that make use of more complex gap penalties, take into account non-local dependencies or use position-specific scoring schemes, and to ask whether this increased sophistication results in an improvement of the alignments' biological quality. The following section reviews three classes of objective functions that were successfully optimized using SAGA [44, 47, 46]. 4.1 The Weighted Sums of Pairs MSA [40] is a program that makes it possible to deliver an optimal (or a very close suboptimal) multiple sequence alignment under the sums-of-pairs measure. This sophisticated heuristic performs multi-dimensional dynamic programming in a bounded hyper-space. It is possible to assess the level of optimization reached by SAGA by comparing it to MSA while using exactly the same objective function.
The sums-of-pairs principle is to associate a cost with each pair of aligned residues in each column of an alignment (substitution cost), and another similar cost with the gaps (gap cost). The sum of these costs yields the global cost of the alignment. Major variations involve: i) using different sets of costs for the substitutions (PAM matrices [18] or BLOSUM tables [28]); ii) different schemes for the scoring of gaps [1]; iii) different sets of weights associated with each pair of sequences [2]. Formally, one can define the cost of a multiple alignment A as:

ALIGNMENT COST(A) = Σ(i=1 to N-1) Σ(j=i+1 to N) W(i,j) × COST(Ai, Aj)   (1)

where N is the number of sequences, Ai the aligned sequence i, COST(Ai, Aj) the alignment score between the two aligned sequences Ai and Aj, and W(i,j) the weight associated with that pair of sequences. The COST includes the sum of the substitution costs as given by a substitution matrix and the cost of the insertions/deletions under a model with affine gap penalties (a gap opening penalty and a gap extension penalty). Two schemes exist for scoring gaps: natural affine gap penalties and quasi-natural affine gap penalties [1]. Quasi-natural gap penalties are the only scheme that the MSA program can efficiently optimize. This is unfortunate, since these penalties are known to be biologically less accurate than their natural counterparts [1] because of a tendency to over-estimate the number of gaps. Under both schemes, terminal gaps are penalized for extension but not for opening. It is common practice to validate a new method by comparing the alignments it produces with references assembled by experts. In the case of multiple alignments, one often uses structure-based sequence alignments, which are regarded as the best standard of truth available [24]. For SAGA, validation was carried out using 3Dali [49]. Biological validation should not be confused with the mathematical validation also required for an optimization method.
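The weighted sums-of-pairs cost of equation (1) can be sketched as follows. This is a simplified illustration: the substitution table is passed as a function, and a gap run that crosses from one sequence to the other is counted as a single run, a detail that real implementations treat more carefully:

```python
from itertools import combinations

def pair_cost(a, b, subst, gap_open, gap_ext):
    """Cost of one pairwise projection: substitution costs for aligned
    residues plus affine penalties for each gap run. Columns gapped in
    both sequences are absent from the projection and skipped."""
    cost, in_gap = 0, False
    for x, y in zip(a, b):
        if x == '-' and y == '-':
            continue                       # column absent from the projection
        if x == '-' or y == '-':
            cost += gap_ext if in_gap else gap_open + gap_ext
            in_gap = True
        else:
            cost += subst(x, y)
            in_gap = False
    return cost

def sp_cost(alignment, weights, subst, gap_open=10, gap_ext=1):
    """Weighted sums-of-pairs cost: the weighted sum, over every pair of
    sequences, of the cost of their pairwise projection (equation 1)."""
    total = 0
    for i, j in combinations(range(len(alignment)), 2):
        total += weights[i][j] * pair_cost(alignment[i], alignment[j],
                                           subst, gap_open, gap_ext)
    return total
```

For example, with a 0/1 substitution function, a gap opening penalty of 2 and an extension penalty of 1, the alignment `AC-A` / `A-CA` costs 4.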
In the case of SAGA, both validations were carried out simultaneously, and a summary of the results obtained when optimizing the sums of pairs is shown in Table 1. First, SAGA was used to optimize the sums of pairs with quasi-natural gap penalties, using MSA-derived alignments as a reference. In two thirds of the cases, SAGA reached the same level of optimization as MSA. In the remaining test sets, SAGA outperformed MSA, and in every case that improvement correlated with an improvement of the alignment's biological quality, as judged by comparison with a reference alignment. Although they fall short of a demonstration, these figures suggest that SAGA is an adequate optimization tool that competes well with the most sophisticated heuristics. In a second aspect of that validation, SAGA was used to align test cases too large to be handled by MSA, using as an objective function the weighted sums of pairs with natural gap penalties. ClustalW was the non-stochastic heuristic used as a reference. As expected, the use of natural penalties led to some improvement over the optimization reached by ClustalW, and that mathematical improvement was also correlated with a biological improvement. Altogether, these results are indicative of the versatility of SAGA as an optimizer and of its ability to optimize functions that are beyond the scope of standard dynamic-programming-based algorithmic methods. 4.2 Consistency-Based Objective Functions: The COFFEE Score Ultimately, a multiple alignment aims at combining within a single unifying model every piece of information known about the sequences it contains. However, it may be the case that part of this information is not as reliable as one may expect. It may also be the case that some elements of information are not compatible with one another. The model will reveal these inconsistencies and require decisions to be made in a way that takes into account the overall quality of the alignment.
A new objective function can be defined that measures the fit between a multiple alignment and a list of weighted elements of information. Of course, the relevance of that objective function will depend greatly on the quality of the pre-defined list. This list can take whatever form one wishes. For instance, a convenient source is a list of pairwise alignments [46, 45] that, given a set of N sequences, will contain all the N(N-1)/2 possible pairwise alignments. In the context of COFFEE (Consistency Based Objective Function For alignmEnt Evaluation), that list of alignments is named a library, and the COFFEE function measures the level of consistency between a multiple alignment and its corresponding library. Evaluation is made by comparing each pair of aligned residues observed in the multiple alignment to the list of residue pairs that constitute the library. During the comparison, residues are identified only by their index within the sequences. The consistency score is equal to the number of pairs of residues that are simultaneously found in the multiple alignment and in the library, divided by the total number of pairs observed in the multiple sequence alignment. The maximum is 1, but the real optimum depends on the level of consistency found within the library. To increase the biological relevance of this function, each pair of residues is associated with a weight indicative of the quality of the pairwise alignment it comes from (a measure of the percentage of identity between the two sequences). The COFFEE function can be formalized as follows. Given N aligned sequences S1...SN in a multiple alignment, Ai,j is the pairwise projection (obtained from the multiple alignment) of the sequences Si and Sj, LEN(Ai,j) is the number of ungapped columns in this alignment, SCORE(Ai,j) is the overall consistency between Ai,j and the corresponding pairwise alignment in the library, and Wi,j is the weight associated with this pairwise alignment.
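The consistency evaluation just defined can be sketched in Python. The library is assumed here to be a dictionary mapping a sequence pair (i, j) to a set of aligned residue-index pairs, an illustrative simplification of the real data structure:

```python
from itertools import combinations

def projected_pairs(a, b):
    """Residue index pairs aligned in the pairwise projection of two
    gapped sequences (columns containing a gap are skipped)."""
    pairs, ri, rj = [], 0, 0
    for x, y in zip(a, b):
        if x != '-' and y != '-':
            pairs.append((ri, rj))
        ri += x != '-'
        rj += y != '-'
    return pairs

def coffee_score(alignment, library, weights):
    """Sketch of the COFFEE consistency score: the weighted fraction of
    residue pairs in the multiple alignment that also appear in the
    library of pairwise alignments. Residues are identified purely by
    their index within each (ungapped) sequence."""
    num = den = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        pairs = projected_pairs(alignment[i], alignment[j])
        num += weights[i][j] * sum(p in library[i, j] for p in pairs)
        den += weights[i][j] * len(pairs)
    return num / den if den else 0.0
```

A multiple alignment that agrees with every library pair scores 1.0; disagreements lower the score in proportion to the weights of the pairwise alignments involved.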
COFFEE score = [Σ(i=1 to N-1) Σ(j=i+1 to N) Wi,j × SCORE(Ai,j)] / [Σ(i=1 to N-1) Σ(j=i+1 to N) Wi,j × LEN(Ai,j)]

If we compare this function to the weighted sums of pairs described earlier, we find that the main difference is the library, which replaces the substitution matrix and provides a position-dependent means of evaluation. It is also interesting to note that under this formulation an alignment having an optimal COFFEE score will be equivalent to a Maximum Weight Trace alignment using a 'pairwise alignment graph' [35]. Table 2 shows some of the results obtained using SAGA/COFFEE on 3Dali. For that experiment, the library of pairwise alignments had been generated using ClustalW alignments, and the resulting alignments proved to be of a higher biological quality than those obtained with the alternative methods available at the time. Eventually, these results were convincing enough to prompt the development of a fast non-GA-based method for the optimization of the COFFEE function. That new algorithm, named T-Coffee, was recently made available to the public [45]. 4.3 Taking Non-Local Interactions Into Account: RAGA So far, we have reviewed the use of SAGA for sequence analysis problems that consider every position as independent from the others. While that approximation is acceptable when the sequence signal is strong enough to drive the alignment, this is not always the case when dealing with sequences that have a lower information content than proteins but carry explicit structural information, such as RNA or DNA. To illustrate one more usage of GAs, it is interesting to examine a case where SAGA was used to optimize an RNA structure superimposition in which the OF takes both local and non-local interactions into account. RNA was chosen because its fold, largely based on Watson and Crick base pairings [63], generates characteristic structures (stem-loops) that are easy to predict and analyze [65].
Since the pairing potential of two RNA bases can be predicted with reasonable accuracy, the evaluation of an alignment can easily take into account both structure (Se) and sequence (Pr) similarities. The version of SAGA in which that new function is implemented is named RAGA (RNA Alignment by Genetic Algorithm) [47]. In RAGA, the OF evaluates the alignment of two RNA sequences, one with a known secondary structure (master) and one that is homologous to the master but whose exact secondary structure is unknown (slave). It can be formalized as follows:

Alignment score = Pr + (λ × Se) - Gap penalty   (2)

where λ is a constant (1-3) and Gap penalty is the sum of the affine gap penalties within the alignment. Pr is simply the number of identities. Se uses the secondary structure of the master sequence and evaluates the stability of the folding it induces onto the slave sequence. If two bases form a base pair (part of a stem) in the master, then the two 'slave' bases they are aligned to should be able to form a Watson and Crick base pair as well. Se is the sum of the scores of these induced pairs. The energetic model used in RAGA is very simplified and assigns 3 to GC pairs and 2 to UA and UG pairs. Assessing the accuracy and the efficiency of RAGA is a problem very similar to the one encountered when analyzing SAGA. In this case, the reference alignments were chosen from mitochondrial ribosomal small subunit RNA sequence alignments established by experts [61]. The human sequence was used as a master and realigned by RAGA to seven other homologous mitochondrial sequences used as slaves. Evaluation was made by comparing the optimized pairwise alignments to those contained in the reference alignment. The results in Table 3 indicate very clearly that a proper optimization took place and that the secondary structure information was efficiently used to enhance the alignment quality.
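Equation (2) can be sketched as follows. The base-pair scores (3 for GC, 2 for AU and GU) follow the text; representing the master's base pairs as alignment-column indices and the exact gap model are simplifying assumptions of this sketch, not RAGA's actual implementation:

```python
# Simplified energetic model from the text: 3 for GC, 2 for AU and GU (wobble).
PAIR_SCORE = {("G", "C"): 3, ("C", "G"): 3,
              ("A", "U"): 2, ("U", "A"): 2,
              ("G", "U"): 2, ("U", "G"): 2}

def affine_gap_cost(seq, gap_open, gap_ext):
    """Affine gap cost of one gapped sequence: opening plus extension
    for the first '-' of a run, extension only afterwards."""
    cost, in_gap = 0.0, False
    for c in seq:
        if c == '-':
            cost += gap_ext if in_gap else gap_open + gap_ext
        in_gap = (c == '-')
    return cost

def raga_score(master, slave, master_pairs, lam=2.0,
               gap_open=5.0, gap_ext=1.0):
    """Sketch of the RAGA objective (equation 2): identities (Pr) plus
    lambda times the stability of the structure the master induces on
    the slave (Se), minus the affine gap penalties."""
    pr = sum(x == y and x != '-' for x, y in zip(master, slave))
    # Se: for each base pair (i, j) of the master, given here as
    # alignment-column indices, score the pairing potential of the
    # slave bases aligned to those columns.
    se = sum(PAIR_SCORE.get((slave[i], slave[j]), 0) for i, j in master_pairs)
    gaps = affine_gap_cost(master, gap_open, gap_ext) \
         + affine_gap_cost(slave, gap_open, gap_ext)
    return pr + lam * se - gaps
```

A slave sequence whose bases can still pair wherever the master's stems demand it keeps a high Se term even when its primary sequence has drifted, which is exactly why the structural term helps with divergent sequences.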
This is especially noticeable for very divergent sequences that do not contain enough information at the primary level for an accurate alignment to be determined on that basis alone. It is also worth pointing out that RAGA could take into account some elements of the tertiary structure known as pseudoknots, which were successfully added to the objective function. These elements, which are beyond the scope of most dynamic-programming-based methods, led to even more accurate alignment optimization [47]. 5 Conclusion: GAs versus Heuristic Methods Section 4 of this chapter illustrates three situations in which GAs proved able to solve very complex optimization problems with a reasonable level of accuracy. On its own, this clearly indicates the importance and the interest of these methods in the field of sequence analysis. Yet, when applied to this type of problem, GAs suffer from two major drawbacks: they are very slow and unreliable. By unreliable, we mean that, given a set of sequences, a GA may not deliver the same answer twice, owing to the stochastic nature of the optimization process and to the difficulty of the optimization. This may be a great cause of concern to the average biologist, who expects to use his multiple alignment as a prediction tool and possibly as a decision aid for the design of expensive wet-lab experiments. How severe is this problem? If we consider the protein test cases analyzed here, SAGA reaches its best score in half of the runs on average. For RAGA, maybe because the solution space is more complex, this proportion goes down to 20%. If one is only interested in validating a new objective function, this is not a major source of concern, since even in the worst cases the sub-optimal solutions are within a few percent of the best solution found. However, this instability is not unique to GAs and is not as severe as the second major drawback: the efficiency.
Although GAs are much more practical than SA, their slowness means that they cannot really be expected to become part of any of the very large projects that require millions of alignments to be made routinely over a few days [15]. More robust, if less accurate, techniques are required for that purpose. Is the situation hopeless, then? The answer is definitely no, since two important fields of application exist for which GAs are uniquely suited. The first one is the analysis of rare and very complex problems for which no other alternative is available, such as the folding of very long RNAs. The second aspect is more general. GAs provide us with a unique way of probing very complex problems with little concern, at least in the first stages, for the algorithmic issues involved. It is quite remarkable that even with a very simple GA one can quickly ask very important questions and decide whether a thread of investigation is worth pursuing or should simply be abandoned. The COFFEE project is a good example of such a cycle of analysis. It followed a three-step process: (i) an objective function was first designed without any concern for the complexity of its optimization or the algorithmic issues; (ii) SAGA was used to evaluate the biological relevance of that function; (iii) this validation was convincing enough to prompt the conception of a new dynamic programming algorithm, much faster and appropriate for this function [45]. This non-GA-based algorithm was named T-Coffee (Tree based COFFEE). The mere comparison of the two projects' respective development times makes a good case for the use of SAGA: the COFFEE project took four months to carry out, while completion of the T-Coffee project required more than a year and a half of algorithm development and software engineering.
Availability SAGA, RAGA, COFFEE and T-Coffee are all available free of charge from the author, either via e-mail (cedric.notredame@igs.cnrs-mrs.fr) or via the WWW (http://igs-server.cnrsmrs.fr/~cnotred).

Acknowledgements The author wishes to thank Dr Hiroyuki Ogata and Dr Gary Fogel for very helpful comments and for an in-depth review of the manuscript.

References
[1] S. F. Altschul, Gap costs for multiple sequence alignment, J. Theor. Biol., 138 (1989), pp. 297-309.
[2] S. F. Altschul, R. J. Carroll and D. J. Lipman, Weights for data related by a tree, J. Mol. Biol., 207 (1989), pp. 647-653.
[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, Basic local alignment search tool, J. Mol. Biol., 215 (1990), pp. 403-410.
[4] S. F. Altschul and D. J. Lipman, Trees, stars and multiple biological sequence alignment, SIAM J. Appl. Math., 49 (1989), pp. 197-209.
[5] L. A. Anabarasu, Multiple sequence alignment using parallel genetic algorithms, The Second Asia-Pacific Conference on Simulated Evolution (SEAL), Canberra, Australia, 1998.
[6] A. Bairoch, P. Bucher and K. Hofmann, The PROSITE database, its status in 1997, Nucleic Acids Res., 25 (1997), pp. 217-221.
[7] G. J. Barton and M. J. E. Sternberg, A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons, J. Mol. Biol., 198 (1987), pp. 327-337.
[8] S. A. Benner, M. A. Cohen and G. H. Gonnet, Response to Barton's letter: computer speed and sequence comparison, Science, 257 (1992), pp. 1609-1610.
[9] P. Bucher, K. Karplus, N. Moeri and K. Hofmann, A flexible motif search technique based on generalized profiles, Comput. Chem., 20 (1996), pp. 3-23.
[10] K. Bucka-Lassen, O. Caprani and J. Hein, Combining many multiple alignments in one improved alignment, Bioinformatics, 15 (1999), pp. 122-130.
[11] L. Cai, D. Juedes and E. Liakhovitch, Evolutionary computation techniques for multiple sequence alignment, Congress on Evolutionary Computation, 2000, pp. 829-835.
[12] H. Carrillo and D. J. Lipman, The multiple sequence alignment problem in biology, SIAM J. Appl. Math., 48 (1988), pp. 1073-1082.
[13] K. Chellapilla and G. B. Fogel, Multiple sequence alignment using evolutionary programming, Congress on Evolutionary Computation, 1999, pp. 445-452.
[14] F. Corpet, Multiple sequence alignment with hierarchical clustering, Nucleic Acids Res., 16 (1988), pp. 10881-10890.
[15] F. Corpet, F. Servant, J. Gouzy and D. Kahn, ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons, Nucleic Acids Res., 28 (2000), pp. 267-269.
[16] L. Davis, The Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, 1991.
[17] P. J. Davis and R. Hersh, The Mathematical Experience, Birkhäuser, Boston, 1980.
[18] M. O. Dayhoff, Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, DC, 1978.
[19] J. Felsenstein, PH

[Figure 4: Seq1b: MNBDEEBNDBEDM; Seq2b: BDNEMEMNDBNDM (two sequences in which one symbol has been randomly replaced by the new symbols M or N)]

well-known collection of 500 related human sequences known as the kinome (Manning et al. 2002). The procedure delivered a substitution matrix highly correlated with a standard point accepted mutation (PAM) matrix, in which all the known mutational preferences between amino acids could easily be recognized. We then do the same by comparing three distinct sets of social sciences data representing the same sequential reality. The training procedure is then evaluated by testing its ability to correctly identify the closeness of two different symbols, using solely the information contained in the data.
To do so, we use a set of sequences to compare a reference cost of substitution between two given symbols produced by the training procedure (e.g., AB) with the cost produced by the training procedure for the same substitution in the case where one of the symbols (e.g., A) has been randomly split into two new symbols (M and N) not belonging to the alphabet. As symbols M and N are actually ''hidden A's,'' we expect the training procedure to determine the substitution costs AB, MB, and NB as equivalent. Figure 4 shows, for two given sequences, how a symbol is randomly split into two new symbols not belonging to the original alphabet. Testing the Quality of the Clustering A third set of criteria pertains to testing the quality of the cluster analysis. One of the main difficulties with clustering methods lies in determining the number of clusters really present in the data (Milligan and Cooper 1985, 1987). There is no perfect method for establishing this number, but several indicators may be used to help decide (Everitt 1979; Bock 1985; Hartigan 1985; Milligan and Cooper 1985; SAS Institute 2004).
Downloaded from http://smr.sagepub.com at Unithéque cantonale et universitaire de Lausanne on September 7, 2009. 212 Sociological Methods & Research
For Milligan and Cooper (1987), there are two categories of tests concerning the quality of a cluster analysis. The first considers that internal criteria are able to validate the results of the clustering, that is, to justify the number of clusters chosen. The second uses external criteria. Such criteria represent information that is external to the cluster analysis and was not used at any other point in the cluster analysis (Milligan and Cooper 1987). In terms of internal criteria, Milligan and Cooper (1985) evaluated and compared 30 statistics known as stopping rules that help in deciding how many ''real'' clusters are present in the data.
The availability of such indices in the main statistical software packages (such as SAS or SPSS) is of course a non-negligible element of choice concerning which criteria to use. Two of the most efficient indices among the 30 that Milligan and Cooper (1985, 1987) have evaluated are part of the SAS software. The first one is a pseudo-F developed by Calinski and Harabasz (1974); it represents an approximation of the ratio between intercluster and intracluster variance. The second index is expressed as Je(2)/Je(1) (Duda and Hart 1973) and may be transformed into a pseudo-t2. The third criterion we used is R2, which expresses the size of the experimental effect. It is reasonable to look for a consensus among the three criteria (Nargundkar and Olzer 1998; SAS Institute 2004). We can then define the stopping rule for a statistically optimal cluster solution as a local peak of the pseudo-F (high ratio between inter- and intracluster variance), associated with a low value of the pseudo-t2 that increases at the next fusion, and a marked drop of the overall R2. Generally, a cluster solution is said to be statistically optimal when the number of classes is kept constant across strategies, when the intercluster variance is highest, and when the intracluster variance is lowest. Put another way, clusters should exhibit two properties: external isolation and internal cohesion (Punj and Stewart 1983). Therefore, using comparative scree plots is a straightforward way of dealing with the issue of testing cluster solutions drawn from distances based on various cost schemes, including the computationally derived one. A given cluster solution is retained for analysis only if at least two among those three criteria (pseudo-F, pseudo-t2, and R2) support its validity. External criteria refer to the extent to which clusters drawn from the data correlate with either independent variables or outcomes (Milligan and Cooper 1987).
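As an illustration, the Calinski and Harabasz pseudo-F can be computed from scratch as follows; this is a plain-Python sketch of the index, whereas in practice it is obtained directly from packages such as SAS as part of the clustering procedure:

```python
def pseudo_f(data, labels):
    """Sketch of the Calinski-Harabasz pseudo-F: the ratio of
    between-cluster to within-cluster sums of squares, each scaled
    by its degrees of freedom (k-1 and n-k respectively)."""
    n, k = len(data), len(set(labels))
    grand = [sum(col) / n for col in zip(*data)]        # overall centroid
    centroids = {}
    for c in set(labels):
        members = [x for x, l in zip(data, labels) if l == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sq = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    between = sum(sq(centroids[l], grand) for l in labels)
    within = sum(sq(x, centroids[l]) for x, l in zip(data, labels))
    return (between / (k - 1)) / (within / (n - k))
```

A high pseudo-F indicates compact, well-separated clusters; a local peak across candidate numbers of clusters is what the stopping rule described above looks for.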
Clusters that do not associate with these variables are of little help in social research, as the ultimate goal of the social sciences is explanation rather than description. A third criterion is more intuitive: To what extent are empirical clusters easily comprehended, based on prior knowledge of the phenomenon and the central hypothesis of the research?
Gauthier et al. / How Much Does It Cost? 213
This criterion can be approached by using experts and computing interreliability estimates. The procedure in that case is as follows: Provide cluster solutions based on the various cost schemes, and have a set of raters who decide independently which is their favorite solution. Then one may compute interrater reliability and see which coding scheme comes up first in the list. Given the importance of the debate concerning the influence of sociostructural factors on the occupational trajectories of women in the sociological field, and the availability of high-quality data on occupational status during entire life courses, we test these methods on data sets addressing this topic. Description of the Test Samples Considering the fact that women's labor market participation is more diverse than that of men (Myrdal and Klein 1956; Levy 1977; Mott 1978; Elder 1985; Moen 1985; Höpflinger, Charles, and Debrunner 1991; Moen and Yu 2000; Blossfeld and Drobnic 2001; Krüger and Levy 2001; Levy, Widmer, and Kellerhals 2002; Moen 2003; Widmer, Kellerhals, and Levy 2003; Bird and Krüger 2005; Levy, Gauthier, and Widmer 2006), and in order to facilitate the comparisons between the data sets, for each database we selected only women who were married or living with a partner at the time of the interview. Moreover, in order to maximize the quality of the data, we retained only the trajectories that had less than 10 percent missing values.
Sample Test 1: Social Stratification, Cohesion, and Conflict in Contemporary Families The first sample of occupational trajectories is drawn from a retrospective questionnaire of the study Social Stratification, Cohesion, and Conflict in Contemporary Families (SCF), conducted in 1998 with 1,400 individuals living as couples in Switzerland (Widmer, Kellerhals, Levy, et al. 2003; Widmer, Kellerhals, and Levy 2004). Respondents were asked to provide information about every year of their occupational trajectory from age 16 onward to 64. Every year of the trajectories was coded using a seven-category coding scheme: full-time employment, part-time employment, positive interruption (sabbatical, trip abroad, etc.), negative interruption (unemployment, illness, etc.), housework, retirement, and full-time education. Data were right-truncated, as most individuals had not yet reached the age of 64 at the time of the interview. Sociostructural indicators (socioeconomic status of the orientation family, educational level, number of children, and income) were measured for the time of the interviews only. The final sample size was 564 women. Sample Test 2: The Swiss Household Panel Since 1999, the Swiss Household Panel (SHP) has collected data on a representative sample of private households in Switzerland on a yearly basis. In its third wave, the SHP included a retrospective questionnaire sent to 4,217 households (representing 8,913 individuals). For reasons of validity, the analysis of the subsample of individuals who answered the retrospective questionnaire was restricted to those aged 30 and older, decreasing the sampled female population to 1,935. The SHP asked respondents to provide information on their educational and occupational status from birth to the present.
Each change in status is associated with a starting year and an ending year. We recoded these the same way as for Sample Test 1. Sociostructural indicators comparable to those in Sample 1 were also obtained. This sample included 1,107 women. Sample Test 3: Female Job Histories From the Wisconsin Longitudinal Study The Wisconsin Longitudinal Study (WLS) is a long-term study of a random sample of 10,317 men and women who graduated from Wisconsin high schools in 1957. This data set is available for public use at the University of Wisconsin-Madison Web site (http://www.ssc.wisc.edu/wlsresearch). The female job histories for 1957-1992 were constructed by Sheridan (1997) from the 1957, 1964, 1975, and 1992 WLS data collections. The data also include social background, youthful aspirations, schooling, military service, family formation, labor market experiences, and social participation. The ''female job histories'' data concern 5,042 women born in 1938 and 1939. We could retain only three main occupational statuses, namely, full-time paid work, part-time paid work, and full-time housewife. There were 2,243 women in this sample. Results Production of Data-Driven Costs of Substitution From a sociological point of view, we could expect a relative stability of the costs of substitution from one set of sequences to another, the occupational trajectories of contemporary Swiss and North American women being to a certain extent comparable, at least with regard to the influence of the birth of children on the reduction or cessation of paid work.
The individual sequences of occupational statuses are built by attributing a single symbol (a code corresponding to a given occupational status) to each year of life of the respondents. Table 1 compares the different costs of substitution: set arbitrarily to identity, set following theoretical arguments concerning differences among types and rates of occupational activities (for details, see Widmer, Levy, et al. 2003), or derived by means of a training procedure in the different databases. Table 1 shows that the training procedure produces costs that are more differentiated than identity costs. The range of costs is also broader, partly because the procedure is sensitive to very rare substitutions. The stability of the trained costs of substitution from one database to another confirms the ability of the training to produce meaningful cost schemes. The training procedure reflects some relations between the different statuses that are sociologically relevant. Compared to identity costs, which cannot be differentiated between men and women, the trained costs reveal, for example, the relative ease (the low costs) with which women in the samples go from paid work to housework. The comparison of knowledge-based and trained costs of substitution shows a high similarity between the two sets of values: knowledge-based costs are correlated with trained costs at .68 (p < .01) for SCF data, at .63 (p < .01) for SHP data, and at .73 (p < .05) for WLS data. Table 2 shows Pearson's coefficient of correlation between the costs, by method of cost setting and database. Table 2 shows that the trained costs of substitution are more strongly associated with each other from one data set to another than they are with costs set either to identity or to knowledge-based values. On the other hand, even if they remain relatively high, the associations between trained, knowledge-based, and identity costs are systematically weaker than those between trained costs.
This confirms the stability of the results stemming from the training procedure and explains, at least in part, the slightly but systematically different (and more highly correlated) results it provides compared with the two other strategies (identity and knowledge based).

Validation of the Training Procedure

An important issue in the use of a computerized, data-based determination of substitution costs is to assess the extent to which this procedure is able to process information in a sociologically relevant way. Three different tests were used. The first one referred to the ability of the procedure to evaluate the closeness of a symbol belonging to the alphabet with an unknown symbol not belonging to it. The second one focused on the degree of agreement between classifications of social trajectories made by specialists in the field compared with classifications of the same data based on identity, knowledge-based, and trained costs of substitution. The third one consisted of measuring the extent to which clusters drawn from the data correlate with some independent sociostructural variables or outcomes.

Downloaded from http://smr.sagepub.com at Unithèque cantonale et universitaire de Lausanne on September 7, 2009 (Sociological Methods & Research; Gauthier et al. / How Much Does It Cost?)

Table 1
Comparisons of Identity, Knowledge-Based, and Trained Costs of Substitution for Three Data Sets: SCF, SHP, and WLS
[Substitution costs for each pair of occupational statuses (Full-Time, Part-Time, Negative Interruption, Positive Interruption, At Home, Retirement, Education, Missing), plus insertion or deletion, under five cost-setting schemes: identity, knowledge based, and trained on the SCF, SHP, and WLS data; the individual cell values are not reliably recoverable from this extraction.]
Note: SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.

Table 2
Pearson's Correlation Between Cost Matrices, by Method of Cost Setting and (Full) Data Sets

                   Identity   Knowledge   SCF Trained   SHP Trained   WLS Trained
Identity             1.00       .98***      .66***        .61***        .71*
Knowledge based      .98***    1.00         .68***        .63***        .73*
SCF trained          .66***     .68***     1.00           .96***        .97***
SHP trained          .61***     .63***      .96***       1.00           .94***
WLS trained          .71*       .73*        .97***        .94***       1.00

Note: UNIX command line to produce the trained matrix: saltt -e '-in dataset.dat -action + pavie_seq2pavie_mat _TGEPF50_THR60_TWE04_SAMPLE50000_'. SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.
*p < .05. ***p < .001.
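The matrix-to-matrix correlations reported in Table 2 can be sketched in a few lines. This is an illustrative computation with toy 3-symbol matrices, not the SALTT implementation: Pearson's r is taken over the upper triangle, so each unordered pair of symbols is counted once.

```python
from math import sqrt

def matrix_correlation(m1, m2):
    """Pearson correlation between two substitution-cost matrices,
    computed over the upper triangle (each unordered pair counted once)."""
    n = len(m1)
    xs = [m1[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [m2[i][j] for i in range(n) for j in range(i + 1, n)]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Two toy 3-symbol cost matrices that rank the pairs almost identically,
# mimicking two cost-setting strategies that agree on the data.
a = [[0.0, 0.3, 1.0],
     [0.3, 0.0, 0.8],
     [1.0, 0.8, 0.0]]
b = [[0.0, 0.4, 1.2],
     [0.4, 0.0, 1.0],
     [1.2, 1.0, 0.0]]
r = matrix_correlation(a, b)  # close to 1: the two schemes agree
```

A correlation near 1, as between the three trained matrices in Table 2, means the two schemes order and scale the substitution pairs in nearly the same way.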
Identifying the Proximity of Unknown Symbols

A first way of validating the training procedure consists of measuring the extent to which it is able to unravel the proximity of two given symbols, based on no information other than the data itself. We tested this for the SCF set of sequences by randomly replacing a given symbol of the sequence alphabet A = {A, B, C, D, E, F, G, X}, which corresponds in this case to an occupational status, with two symbols that did not belong to the original alphabet of that set of sequences, that is, symbols whose actual identity was hidden. Using the training procedure, we then compared the original costs for substituting, for example, Symbol A with Symbol B, with the costs we obtained after having randomly replaced every A with either the hidden symbol M or N (cf. Figure 4). In a second run, we did the same by replacing each B with the hidden symbol O or P. We thus obtained five different expressions of the same initial substitution (in this example, AB = NB = MB = AO = AP), each associated with a specific cost. This procedure was applied in turn to all pairs of symbols of the data set. If we consider Ei and Ej to be, respectively, the ith and jth elements of the original alphabet, and their two random substitutes to be, respectively, S1(Ei), S2(Ei) and S1(Ej), S2(Ej), there are five costs of substitution to take into account if we consider only the substitutions involving at least one symbol belonging to the original alphabet. Under these conditions, as they are actually different expressions of the same initial substitution, we should expect those five trained costs to be identical, or at least close to each other.
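The masking step described above can be sketched as follows; the function name and toy sequences are illustrative and do not reproduce SALTT's actual interface.

```python
import random

def hide_symbol(sequences, target, substitutes, seed=0):
    """Replace every occurrence of `target` with a randomly chosen
    substitute drawn from outside the original alphabet, mimicking the
    masking step used to test whether training rediscovers identity."""
    rng = random.Random(seed)
    return ["".join(rng.choice(substitutes) if s == target else s for s in seq)
            for seq in sequences]

# Every A is hidden behind M or N; the other statuses are untouched.
seqs = ["AABEA", "ABBEE", "AAAEB"]
masked = hide_symbol(seqs, "A", ["M", "N"])
```

Retraining on the masked sequences then yields costs for MB, NB, and so on, which can be compared with the original AB cost.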
To compare all those values in a synthetic way for the entire alphabet, we computed a standardized difference between the trained cost of substitution associated with a given pair of symbols belonging to the original alphabet and the trained costs of substitution between one of those original symbols and the substitute of the other one, as shown in equation (8):

Std Difference = [ (cost(Ei[S1(Ej)]) − cost(EiEj)) + (cost([S1(Ei)]Ej) − cost(EiEj)) ] / (2 · cost(EiEj))   (8)

The proximity of the five substitution costs associated with a given original pair of symbols and their substitutes was compared in two ways, using either the first substitute of that pair of symbols (as shown in equation [8]) or the second one (where S2 replaces S1 in equation [8]). All those values are tabulated in Table 3. Its lower part contains the standardized differences between the substitutions of Ei, Ej, and their first substitute (cost EiEj compared to Ei[S1(Ej)] and [S1(Ei)]Ej), whereas the upper part contains the values associated with their second substitute (cost EiEj compared to Ei[S2(Ej)] and [S2(Ei)]Ej). Table 3 shows clearly that the training procedure identifies very precisely the closeness of two distinct, but actually identical, symbols (see note 10). Among the 56 different costs of substitution in Table 3, 49 (87 percent) show a difference of no more than 10 percent compared with the original cost. The larger differences may be attributed to the fact that the training procedure is relatively sensitive to rare symbols. For example, Symbols C, D, F, and X together represent only about 2 percent of the total symbols used in the sequences. The great majority of the hidden costs differing notably from their original costs involve such rare symbols. The difference is maximal when it concerns two rare symbols.
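Equation (8) translates directly into code; the dictionary of trained costs below is a toy stand-in for an actual trained matrix.

```python
def std_difference(cost, ei, ej, s_ei, s_ej):
    """Standardized difference of equation (8): how far the trained costs
    involving a hidden substitute stray from the original cost(Ei, Ej)."""
    base = cost[(ei, ej)]
    return ((cost[(ei, s_ej)] - base) + (cost[(s_ei, ej)] - base)) / (2 * base)

# Toy trained costs: M and O are hidden copies of A and B, respectively.
# Training recovered nearly identical costs for the masked pairs.
costs = {("A", "B"): 1.0, ("A", "O"): 1.05, ("M", "B"): 0.95}
d = std_difference(costs, "A", "B", "M", "O")  # ~ 0: deviations cancel
```

Values near zero, as on most of the diagonal neighborhood of Table 3, indicate that the procedure treats a hidden substitute just like the symbol it replaces.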
The ability of the training procedure to identify the similarity of two unknown symbols based on the data set at hand is one of the main strengths of this way of setting costs of substitution. Even if it stays relatively close to identity costs of substitution, this procedure takes into account the real relations of the different symbols present in the sequences and is therefore highly informative. On one hand, it prevents particular relationships from remaining undetermined; on the other hand, it works as a predictive tool, in the sense that two different symbols with low substitution costs can be predicted to substitute easily for one another in real life.

Table 3
Standardized Difference Between the Original Trained Costs of Substitution and Their Substitutes

     A (%)   B (%)   C (%)   D (%)   E (%)   F (%)   G (%)   X (%)   Relative Frequency (%)
A      —       0       7       6       0       6      10       0            33.5
B      8       —       0       6     –10      10       0       0            19.5
C      7       0       —       5      –7       0       0     –15             0.5
D      6       6       0       —      –6       0       6       6             1.0
E      0       0      –7       0       —      13       8       0            31.0
F      0      25       9      11      13       —       0       0             0.1
G     10       7       0       0       8       0       —      –7            14.0
X      0       7     –15     –28       0       0      –7       —             0.4

Note: Rows and columns are given the name of a symbol belonging to the alphabet, although each cell of the table compares the substitution costs of three pairs of symbols (the original one and two substitutes) according to equation (8).

Automatic Versus Classification by Judges

Another way to validate the training procedure is to test the extent to which automatic classification succeeds in replicating a classification of sequences made by experts on a small subset of well-identified sequences. To do so, we extracted a sample of 100 occupational trajectories of women from each data set. Four judges were asked to classify them into a number of clusters that corresponded to previous empirical findings (Widmer, Levy, et al.
2003; Levy, Gauthier, and Widmer 2006) and to theoretical schemes (i.e., Kohli 1986). In each case, we retained only the sequences that were classified the same way by at least three (out of four) judges. The interrater agreement lay between 83 percent and 88 percent. To keep the computation procedures as parsimonious as possible, we first exactly replicated with SALTT the results we had obtained with TDA using two different cost settings (identity and knowledge based). That allowed us to produce optimal alignments and to compare the distance matrices for the three strategies (identity, knowledge based, and training) from within SALTT. For each set of sequences, we ran three optimal matching analyses, the first one using identity costs of substitution (for details, see above), the second one using knowledge-based costs, and the third one using costs stemming from the training procedure. A distance matrix was computed for each set of sequences and for each strategy and then entered into a cluster analysis.

Table 4
Association (khi2 and Symmetric) Between Judges and Automatic Classification, by Method of Cost Setting

Database   Method                     khi2 (df)        Symmetric (Value)   ASE
SCF        Identity * Judges          213.2454* (16)   0.8034              0.0458
SCF        Knowledge Based * Judges   206.1951* (16)   0.8120              0.0440
SCF        Trained * Judges           224.5436* (16)   0.8291              0.0434
SHP        Identity * Judges          213.4108* (16)   0.7500              0.0582
SHP        Knowledge Based * Judges   228.4631* (16)   0.7705              0.0623
SHP        Trained * Judges           235.1387* (16)   0.7797              0.0602
WLS        Identity * Judges          143.9678* (9)    0.7037              0.0684
WLS        Knowledge Based * Judges   148.6864* (9)    0.7196              0.0677
WLS        Trained * Judges           143.2652* (9)    0.7037              0.0677

Note: ASE = asymptotic standard error; SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.
*p < .001.

Table 4 shows the degree of association (khi2 and symmetric; Goodman and Kruskal 1979; Olszak and Ritschard 1995) between the clusters made by the judges and those stemming from automatic classification. The results provided by a trained matrix lead to significant associations with the classification by judges for the three data sets considered. For the Wisconsin study, results are about the same when using either identity or trained costs of substitution. Trained costs never lead to a weaker association (symmetric) with judges' classifications than identity costs or knowledge-based costs for the SCF and SHP data sets. Results are less straightforward concerning the WLS data, with knowledge-based costs performing slightly better than trained costs. The fact that the Wisconsin data are less differentiated (sequences with only three different statuses, as opposed to seven in the other databases, and respondents all about the same age) may explain why trained costs do not lead to a markedly different solution than the two alternative strategies. In all cases, the associations are quite high and significant, suggesting the ability of the method to provide meaningful cost schemes.
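The distance matrices entered into the cluster analysis come from optimal matching. A minimal sketch of that distance, assuming a Needleman-Wunsch-style dynamic program with a user-supplied substitution-cost function and a flat insertion/deletion cost of 0.5 as in the tables (names and the toy call are illustrative, not the SALTT implementation):

```python
def om_distance(s1, s2, sub_cost, indel=0.5):
    """Optimal matching distance: minimal total cost of substitutions,
    insertions, and deletions turning sequence s1 into s2."""
    n, m = len(s1), len(s2)
    # d[i][j] = cost of aligning the first i symbols of s1 with the first j of s2
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if s1[i - 1] == s2[j - 1] else sub_cost(s1[i - 1], s2[j - 1])
            d[i][j] = min(d[i - 1][j - 1] + match,  # substitution (or match)
                          d[i - 1][j] + indel,      # deletion
                          d[i][j - 1] + indel)      # insertion
    return d[n][m]

# Flat substitution cost of 1.0, as in the identity scheme.
dist = om_distance("AABB", "ABBB", lambda a, b: 1.0)  # -> 1.0
```

Running this over all pairs of sequences, once per cost scheme, yields the distance matrices that are then clustered and compared with the judges' classification.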
Given that the reference classification based on the judges' responses was very consensual and based on predefined categories, the results of this test express the ability of the procedure to differentiate clear-cut, sociologically relevant categories out of the data, rather than the extent to which those results and the underlying costs of substitution reflect the theoretical and subjective conceptual frame of an expert.

Association With External Criteria

A third validation procedure consisted of measuring the extent to which clusters drawn from the data correlate with either independent sociostructural variables or outcomes (Milligan and Cooper 1987; Rapkin and Luke 1993), an approach that seemingly few studies have used so far (Milligan and Cooper 1987). Clusters that do not associate with such variables are of little help, as the ultimate goal of the social sciences is explanation rather than description. For each strategy, the three stopping-rule criteria aimed at determining the number of clusters in the data (pseudo-t2, pseudo-F, and R2) suggested the presence of three clusters in the SCF and SHP data and of four clusters in the WLS data. A closer look at the data reveals that those clusters correspond precisely to typical female trajectories, as described elsewhere (Moen 1985; Höpflinger et al. 1991; Erzberger and Prein 1997; Widmer, Levy, et al. 2003; Levy et al. 2006), namely, trajectories characterized by full-time employment, part-time employment, and full-time presence at home as a housewife. In the Wisconsin data, the clusters are the same, but with a fourth one representing a return to the labor market after a period at home. Such a cluster also appears when the clusters of the SCF and SHP data are further subdivided.
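Of the stopping-rule criteria just mentioned, the pseudo-F (Calinski and Harabasz 1974) is the simplest to illustrate. This one-dimensional toy version (not the actual SAS computation used in the study) shows the ratio of between- to within-cluster variance that the criterion maximizes:

```python
def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo-F for one-dimensional data:
    (B / (k - 1)) / (W / (n - k)), where B and W are the between- and
    within-cluster sums of squares."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    grand = sum(points) / n
    between = within = 0.0
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        mean = sum(members) / len(members)
        between += len(members) * (mean - grand) ** 2
        within += sum((p - mean) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Two well-separated groups give a very large pseudo-F;
# a scrambled partition of the same points gives a small one.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
f_good = pseudo_f(pts, [1, 1, 1, 2, 2, 2])
f_bad = pseudo_f(pts, [1, 2, 1, 2, 1, 2])
```

In practice the number of clusters retained is the one on which pseudo-F agrees with the pseudo-t2 and R2 criteria.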
The greater homogeneity of the WLS data in terms of age of respondents and completeness of the sequences (no right truncation) may explain the better visibility (consensus between stopping-rule criteria) of that fourth cluster, which is also documented in the literature (Widmer, Levy, et al. 2003; Levy et al. 2006). We first ran a multinomial logistic regression (see note 11) on each data set (SCF, SHP, and WLS), using cluster membership (which represents in this case types of occupational trajectories) as the response variable and a set of indicators of social positioning (socioeconomic position of the family of orientation, including level of education, number of children, age, and household income) generally considered (cf. description of the sample) to be intervening variables in shaping female occupational trajectories. To be consistent with the stopping-rule criteria (that is, a consensus between pseudo-t2, pseudo-F, and R2), we retained in this first step the three-cluster solutions that those criteria pointed out for each data set.

Table 5
khi2 Values of the Likelihood Ratio Test by Database and Cost-Setting Method

                                    Identity            Knowledge Based     Trained
Data Sets                    df     khi2     p > khi2   khi2     p > khi2   khi2     p > khi2
Set 1: SCF data, 3 clusters  272    290.19   .2143      280.87   .3428      288.60   .2339
Set 1: SCF data, 5 clusters  596    553.02   .8956      547.03   .9250      522.11   .9867
Set 2: SHP data, 3 clusters  404    568.36   < .0001    562.04   < .0001    512.34   < .0002
Set 2: SHP data, 5 clusters  808    897.67   .0150      863.32   .0865      740.12   .9574
Set 3: WLS data, 4 clusters  258    307.35   .0189      323.81   .0034      288.37   .0939

Note: SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.
As they are more homogeneous, they represent about the same social reality in each data set and therefore remain sociologically relevant. We then performed the tests on the five-cluster solutions for the SCF and SHP data to check the efficiency of the different cost-setting methods on other empirically founded classifications (Widmer, Levy, et al. 2003; Levy et al. 2006). We felt justified in doing this because two new clusters emerged from further subdivision of the first three clusters defined by the proposed criteria (R2, pseudo-F, and pseudo-t2). Table 5 shows the likelihood ratio tests applied to those multinomial regressions. The likelihood tests compare a given model with the saturated one (a model that exactly replicates the data), meaning in this case that the smaller the value of khi2 (i.e., the larger the p value), the better the model fits the data (see note 12). One can read from Table 5 that the trained costs of substitution allow building a model that fits the data better in all cases compared with identity costs and in four out of five cases compared with knowledge-based costs. Put another way, clusters produced by trained costs of substitution are more sensitive to predictors than clusters produced by either identity costs or knowledge-based costs. This is true, although not with the same strength, for the three sets of sequences. The fit is significantly better (i.e., the model stemming from trained costs does not differ significantly from the saturated model, whereas the two others do) in two cases and with two data samples.

Discussion

Setting costs of substitution in the process of aligning sequences of social statuses is controversial because it may significantly influence the results of the analysis.
We propose a method to determine costs of substitution empirically, which we tested using three distinct sets of social science data. The training procedure that we present appears to be, to our knowledge, the only one that is exclusively empirically grounded and optimized. First, we considered the correlation between the substitution matrices for a given alphabet across three social science data sets representing the same social realities (sequences of occupational statuses along the life course) and three cost-setting strategies. The training procedure leads to results that are very similar to those stemming from the two other methods (substitution costs represented as an identity matrix or following some knowledge-based weighting). In this sense, cost variability did not appear to modify the general results of the analysis. Nevertheless, the costs stemming from the training procedure may claim a greater legitimacy, as they reflect the actual relationships of the symbols considered. That legitimacy is reinforced by the very high correlation between the substitution matrices stemming from the training procedures applied to the three data sets at hand. In this sense, the values of the trained cost matrices may even be considered an a posteriori validation of the alternative costs of substitution (knowledge based or identity) found in the literature for the specific case of occupational trajectories. Moreover, the training procedure shows some interesting features that should be further explored, such as the possibility of differentiating specific substitution costs according, for example, to gender. The fact that trained costs provide a clustering that is more closely associated with a sociologically unequivocal reference classification than either identity costs or knowledge-based costs illustrates the ability of the training procedure to discover structural features of the data that are sociologically relevant.
Second, based on likelihood ratio tests of multinomial logistic regressions, we compared the associations between cluster solutions (response variables) and a set of relevant sociostructural variables (intervening variables) for the three cost-setting strategies across the three data sets at hand. Here again, the training procedure led to better results than the identity and the knowledge-based costs did. That is, the data-driven costs of substitution contributed to classifications that fit better with widely recognized sociological models of women's labor market participation than the two other strategies. Taking into account the actual structure of the data provides models that fit better with external factors than undifferentiated or knowledge-based cost schemes. Finally, the ability of the training procedure to discover actual internal relationships in the data, and therefore to offer an efficient and empirically grounded way to determine costs of substitution, is demonstrated in another way: it is able to accurately identify the closeness of two formally identical, but artificially differentiated, substitution costs (here, between two occupational statuses). Moreover, the degree of closeness between the substitution costs is also informative about the relative proximity of the symbols and the sociological reality they represent. The training procedure offers significant improvements compared with the methods generally used until now in the social sciences. By revealing every symmetric relationship among those symbols, this procedure avoids assigning a cost based on prior knowledge that would later appear to be erroneous when compared with the actual data.
The results show that for any pair of symbols of a given alphabet, the trained costs of substitution produced remain remarkably similar from one data set to another. This means that those costs do reflect some important information concerning the actual (in this case, social) significance of the symbols constituting the sequences and do not represent just abstract values varying from data set to data set (or from one training session to another). Therefore, these costs also constitute a predictive feature, in the sense that two different symbols with low substitution costs can be predicted to substitute easily for one another in real life. Identification of these low substitution costs therefore makes it possible to predict situations likely to occur in similar contexts, at similar ages, and at similar frequencies. In comparison with approaches based on transition costs, which are computed within each single sequence taken separately, the proposed method aims to determine substitution costs by looking for a match or mismatch at each specific position throughout all pairs of sequences. In this sense, the latter method is based on richer information and grants a higher importance to time (i.e., to age and social age) and to the relations between sequences than cost schemes based on transition rates. There is, on one hand, a constant and clear similarity between the results stemming from the three cost-setting strategies (identity, knowledge based, and training) and, on the other hand, a significant improvement in the tests of internal and external validity of the results provided by the training procedure. The conditions under which the method is most appropriate remain to be systematically tested. The experiments presented in this article point in several directions.
First, the method provides strong leverage when no or few theoretical arguments can be brought to bear in support of a cost solution, or when contradictory theories propose different cost schemes. In other words, it is best suited for an exploratory research design. Second, this method is ideal whenever too many statuses have been used to characterize the data. We show, for instance, in this study that the proposed procedure reveals the identity between two statuses that may have been coded separately. Finally, the cost estimation provides a means for quantifying the relationships among symbols; as such, it can be used to identify and discover equivalences among categories. In itself, this means of quantification may prove to be a useful investigative tool for the social sciences. There are several limitations to the solution proposed in this article. First, the method deals poorly with symbols occurring rarely in sequences. Whenever this happens, the estimations of substitution costs are less accurate and more variable. Second, a key property of the optimal matching algorithm is its reliance on the assumption that the events defining a life trajectory are chronologically ordered and collinear among the considered sequences. This is, of course, a simplification, but it seems to hold reasonably well when considering sequences with a high percentage of identity. However, if recurrent subsequences were to be found scattered in different periods of life, they could probably be recovered using techniques related to the one that we describe here, such as Gibbs sampling (Lawrence et al. 1993; Abbott and Barman 1997) or the local alignment algorithm (Smith and Waterman 1981). Third, this algorithm, like other optimal matching algorithms, assumes the independence of each position constituting a sequence. This may be an oversimplification, as one can argue that life trajectories are not homogeneous.
They may be substructured into smaller units (life stages, transitions, turning points, specific life events, etc.), whose sizes may vary but should be kept intact in the alignments. This issue is likely to arise when comparing very distinct sequences. When this situation occurs, it may be worthwhile to modify the proposed algorithm. Nevertheless, the issue remains of automatically identifying meaningful borders defining those subsequences. In biology, multiple sequence alignments have been used successfully to identify the exact extent of subsequences conserved across related sequences (Notredame, Higgins, and Heringa 2000). It is certainly worthwhile to explore the potential of this method in the social sciences.

Notes

1. This freeware is available from the Ruhr-Universität Bochum Web site at http://steinhaus.stat.ruhr-uni-bochum.de/tda.html.
2. This freeware is available from the University of Chicago Web site at http://home.uchicago.edu/~aabbott/om.html.
3. This freeware is available from the Strasbourg Bioinformatics Platform Web site at ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX.
4. "Thus, while substitution must be carefully handled, it is not a supersensitive task whose errors will be compounded by later stages in the analysis" (Abbott and Hrycak 1990:176).
5. Student's t tests performed on the 10 values generated by the training procedures for each cost of substitution reveal that those values do not differ from the mean (p < .0001, df = 9).
6. Hotelling's T2 is a statistical measure of the multivariate distance of each observation from the center of the data set. This is an analytical way to find the most extreme points in the data.
7. This is the ratio between interclass variance and total variance.
8. This data set is for public use.
Access to the data is provided by the Swiss Household Panel (SHP) Web site at http://www.swisspanel.ch.
9. Following the availability of the data, the range considered is 16 to 65 years old for the Social Stratification, Cohesion, and Conflict in Contemporary Families and SHP data, and 20 to 56 years old for the Wisconsin Longitudinal Study data.
10. Spearman correlation coefficient = .734 (p < .01).
11. We used PROC CATMOD of the SAS software.
12. At p ≤ .05, the tested model is not statistically different from the saturated one.

References

Abbott, Andrew. 1984. "Event Sequence and Event Duration: Colligation and Measurement." Historical Methods 17:192-204.
Abbott, Andrew. 1990a. "Conceptions of Time and Events in Social Science Methods: Causal and Narrative Approaches." Historical Methods 23:140-50.
Abbott, Andrew. 1990b. "A Primer on Sequence Methods." Organization Science 1:375-92.
Abbott, Andrew. 1995a. "A Comment on 'Measuring the Agreement Between Sequences.'" Sociological Methods & Research 24:232-43.
Abbott, Andrew. 1995b. "Sequence Analysis: New Methods for Old Ideas." Annual Review of Sociology 21:93-113.
Abbott, Andrew. 2001. Time Matters: On Theory and Method. Chicago: University of Chicago Press.
Abbott, Andrew and Emily Barman. 1997. "Sequence Comparison Via Alignment and Gibbs Sampling: A Formal Analysis of the Emergence of the Modern Sociological Article." Sociological Methodology 27:47-87.
Abbott, Andrew and John Forrest. 1986. "Optimal Matching Methods for Historical Sequences." Journal of Interdisciplinary History XVI:471-94.
Abbott, Andrew and Alexandra Hrycak. 1990. "Measuring Resemblance in Sequence Data: An Optimal Matching Analysis of Musicians' Careers." American Journal of Sociology 96:144-85.
Abbott, Andrew and Angela Tsay. 2000.
"Sequence Analysis and Optimal Matching Methods in Sociology." Sociological Methods & Research 29:3-33.
Aisenbrey, Silke. 2000. Optimal Matching Analyse: Anwendungen in den Sozialwissenschaften (Optimal Matching Analysis: Applications in the Social Sciences). Opladen, Germany: Leske + Budrich.
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David Lipman. 1997. "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs." Nucleic Acids Research 25:3389-3402.
Bateman, Alex, Ewan Birney, Richard Durbin, Sean R. Eddy, Robert D. Finn, and Erik L. Sonnhammer. 1999. "Pfam 3.1: 1313 Multiple Alignments and Profile HMMs Match the Majority of Proteins." Nucleic Acids Research 27:260-62.
Billari, Francesco C. 2001. "Sequence Analysis in Demographic Research and Applications." Canadian Studies in Population 28:439-58.
Bird, Katherine and Helga Krüger. 2005. "The Secret of Transitions: The Interplay of Complexity and Reduction in Life Course Analysis." Pp. 173-94 in Towards an Interdisciplinary Perspective on the Life Course, vol. 10, edited by R. Levy, P. Ghisletta, J.-M. Le Goff, D. Spini, and E. Widmer. Amsterdam: Elsevier JAI.
Blair-Loy, Mary. 1999. "Career Patterns of Executive Women in Finance: An Optimal Matching Analysis." American Journal of Sociology 104:1346-97.
Blossfeld, Hans-Peter and Sonja Drobnic. 2001. Careers of Couples in Contemporary Society: From Male Breadwinner to Dual-Earner Families. New York: Oxford University Press.
Bock, Hans H. 1985. "On Some Significance Tests in Cluster Analysis." Journal of Classification 2:77-108.
Calinski, Tadeusz and Joachim Harabasz. 1974. "A Dendrite Method for Cluster Analysis." Communications in Statistics 3:1-27.
Chan, Tak Wing. 1995. "Optimal Matching Analysis: A Methodological Note on Studying Career Mobility." Work and Occupations 22:467-90.
Dayhoff, Margaret O., Robert M. Schwartz, and Bruce C. Orcutt. 1978.
"A Model of Evolutionary Change in Proteins." Pp. 345-52 in Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, edited by M. O. Dayhoff. Washington, DC: National Biomedical Research Foundation.
Dijkstra, Wil and Toon Taris. 1995. "Measuring the Agreement Between Sequences." Sociological Methods & Research 24:214-31.
Duda, Richard O. and Peter E. Hart. 1973. Pattern Classification and Scene Analysis. New York: John Wiley.
Durbin, Richard, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.
Elder, Glen H. 1985. Life Course Dynamics: Trajectories and Transitions, 1968-1980. Ithaca, NY: Cornell University Press.
Erzberger, Christian and Gerald Prein. 1997. "Optimal-Matching-Technik: Ein Analyseverfahren zur Vergleichbarkeit und Ordnung individuell differenter Lebensverläufe" (Optimal Matching Technique: An Analytical Procedure for Comparing and Ordering Individually Different Life Courses). ZUMA-Nachrichten 40:52-80.
Everitt, Brian S. 1979. "Unresolved Problems in Cluster Analysis." Biometrics 35:169-81.
Forrest, John and Andrew Abbott. 1989. "The Optimal Matching Method for Studying Anthropological Sequence Data: An Introduction and Reliability Analysis." Journal of Quantitative Anthropology 1:151-70.
Giddens, Anthony, Mitchell Duneier, and Richard P. Appelbaum. 2003. Introduction to Sociology. New York: W. W. Norton.
Giele, Janet Z. and Glen H. Elder. 1998. Methods of Life Course Research: Qualitative and Quantitative Approaches. Thousand Oaks, CA: Sage.
Giuffre, Katherine A. 1999. "Sandpiles of Opportunity: Success in the Art World." Social Forces 77:815-32.
Goodman, Leo A. and William H. Kruskal. 1979. Measures of Association for Cross Classifications. New York: Springer.
Graur, Dan and Wen-Hsiung Li. 2000. Fundamentals of Molecular Evolution. Sunderland, MA: Sinauer.
Halpin, Brendan and Tak Wing Chan. 1998. "Class Careers as Sequences: An Optimal Matching Analysis of Work-Life Histories." European Sociological Review 14:111-30.
Hartigan, John A. 1985. "Statistical Theory in Clustering." Journal of Classification 2:63-76.
Henikoff, Steven and Jorja G. Henikoff. 1992. "Amino Acid Substitution Matrices From Protein Blocks." Proceedings of the National Academy of Sciences 89:10915-19.
Höpflinger, François, Maria Charles, and Annelies Debrunner. 1991. Familienleben und Berufsarbeit (Family Life and Professional Work). Zurich, Switzerland: Seismo.
Hughey, Richard and Anders Krogh. 1996. "Hidden Markov Models for Sequence Analysis: Extension and Analysis of the Basic Method." Computer Applications in the Biosciences 12:95-107.
Kohli, Martin. 1986. "The World We Forgot: A Historical Review of the Life Course." Pp. 271-303 in Later Life: The Social Psychology of Aging, edited by V. W. Marshall. London: Sage.
Krüger, Helga and René Levy. 2001. "Linking Life Courses, Work and the Family: Theorizing a Not So Visible Nexus Between Women and Men." Canadian Journal of Sociology 26:145-66.
Kruskal, Joseph. 1983. "An Overview of Sequence Comparison." Pp. 1-44 in Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, edited by D. Sankoff and J. Kruskal. Toronto, Canada: Addison-Wesley.
Lawrence, Charles E., Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, and John C. Wootton. 1993. "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment." Science 262:208-14.
Levine, Joel H. 2000. "But What Have You Done for Us Lately?" Sociological Methods & Research 29:34-40.
Levitt, Barbara and Clifford Nass. 1989. ‘‘The Lid on the Garbage Can: Institutional Constraints on Decision Making in the Technical Core of College-Text Publishers.’’ Administrative Science Quarterly 34:190-207.
Lévy, René. 1977. Der Lebenslauf als Statusbiographie. Die weibliche Normalbiographie in makrosoziologischer Perspektive. [The life course as a sequence of statuses. The female standard biography in a macrosociological perspective]. Stuttgart, Germany: Enke.
Lévy, René, Jacques-Antoine Gauthier, and Eric Widmer. 2006. ‘‘Entre contraintes institutionnelle et domestique: Les parcours de vie masculins et féminins en Suisse.’’ [Between institutional and domestic constraints: the life courses of women and men in Switzerland] Revue canadienne de sociologie 31:461-89.
Lévy, René, Eric Widmer, and Jean Kellerhals. 2002. ‘‘Modern Family or Modernized Family Traditionalism? Master Status and the Gender Order in Switzerland.’’ Electronic Journal of Sociology 6(4).
Manning, Gerard, David B. Whyte, Ricardo Martinez, Tony Hunter, and Sucha Sudarsanam. 2002. ‘‘The Protein Kinase Complement of the Human Genome.’’ Science 298:1912-34.
Milligan, Glenn W. and Martha C. Cooper. 1985. ‘‘An Examination of Procedures for Determining the Number of Clusters in a Data Set.’’ Psychometrika 50:159-79.
Milligan, Glenn W. and Martha C. Cooper. 1987. ‘‘Methodology Review: Clustering Methods.’’ Applied Psychological Measurement 11:329-54.
Moen, Phyllis. 1985. ‘‘Continuities and Discontinuities in Women’s Labor Force Activity.’’ Pp. 113-55 in Life Course Dynamics: Trajectories and Transitions, 1968-1980, edited by G. H. Elder. Ithaca, NY: Cornell University Press.
Moen, Phyllis. 2003. It’s About Time: Couples and Careers. Ithaca, NY: Cornell University Press.
Moen, Phyllis and Yan Yu. 2000. ‘‘Effective Work/Life Strategies: Working Couples, Work Conditions, Gender, and Life Quality.’’ Social Problems 47:291-326.
Mott, Frank L. 1978. Women, Work and Family.
Lexington, MA: Lexington Books.
Müller, Tobias and Martin Vingron. 2000. ‘‘Modeling Amino Acid Replacement.’’ Journal of Computational Biology 7:761-76.
Myrdal, Alva and Viola Klein. 1956. Women’s Two Roles: Home and Work. London: Routledge.
Nargundkar, Satish and Timothy J. Olzer. 1998. ‘‘An Application of Cluster Analysis in the Financial Services Industry.’’ Presented at the sixth annual conference of the South East SAS Users Group, September 13-15, Norfolk, VA.
Needleman, Saul B. and Christian D. Wunsch. 1970. ‘‘A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.’’ Journal of Molecular Biology 48:443-53.
Ng, Pauline C., Jorja G. Henikoff, and Steven Henikoff. 2000. ‘‘PHAT: A Transmembrane-Specific Substitution Matrix. Predicted Hydrophobic and Transmembrane.’’ Bioinformatics 16:760-66.
Notredame, Cédric, Philipp Bucher, Jacques-Antoine Gauthier, and Eric Widmer. 2005. T-Coffee/saltt: User Guide and Reference Manual. Lausanne: Swiss Institute of Bioinformatics. Retrieved from http://www.tcoffee.org/saltt.
Notredame, Cédric, Desmond G. Higgins, and Jaap Heringa. 2000. ‘‘T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment.’’ Journal of Molecular Biology 302:205-17.
Notredame, Cédric, Liisa Holm, and Desmond G. Higgins. 1998. ‘‘Coffee: An Objective Function for Multiple Sequence Alignments.’’ Bioinformatics 14:407-22.
Olszak, Michael and Gilbert Ritschard. 1995. ‘‘The Behavior of Nominal and Ordinal Partial Association Measures.’’ Statistician 44:195-212.
Pentland, Brian T., Malu Roldan, Ahmed A. Shabana, Louise L. Soe, and Sidne G. Ward. 1998. ‘‘Lexical and Sequential Variety in Organizational Processes.’’ School of Labor and Industrial Relations, Michigan State University, East Lansing. Unpublished manuscript.
Punj, Girish and David W.
Stewart. 1983. ‘‘Cluster Analysis in Marketing Research: Review and Suggestions for Application.’’ Journal of Marketing Research 20:134-48.
Rapkin, Bruce D. and Douglas A. Luke. 1993. ‘‘Cluster Analysis in Community Research: Epistemology and Practice.’’ American Journal of Community Psychology 21:247-77.
Rohwer, Götz and Ulrich Pötter. 2002. TDA User’s Manual. Bochum, Germany: Ruhr-Universität Bochum. Retrieved from http://www.stat.ruhr-uni-bochum.de/pub/tda/doc/tman63/tman-pdf.zip.
Rohwer, Götz and Heike Trappe. 1997. ‘‘Describing Life Courses. An Illustration Based on NLSY Data.’’ Pp. 30 in POLIS Project Conference. Florence, Italy: European University Institute.
SAS Institute, Inc. 2004. SAS/STAT User’s Guide. Cary, NC: Author.
Schaeper, Hildegard. 1999. ‘‘Erwerbsverläufe von Ausbildungsabsolventinnen und -absolventen: Eine Anwendung der Optimal-Matching-Technik.’’ [Employment history of girls and boys after completion of vocational education and training: an application of optimal matching technique]. Sonderforschungsbereich 186, Universität Bremen, Germany.
Scherer, Stefani. 2001. ‘‘Early Career Patterns: A Comparison of Great Britain and West Germany.’’ European Sociological Review 17:119-44.
Sheridan, Jennifer T. 1997. ‘‘The Effects of the Determinants of Women’s Movement Into and Out of Male-Dominated Occupations on Occupational Sex Segregation.’’ CDE Working Paper 97-07, Department of Sociology, University of Wisconsin, Madison.
Smith, Temple F. and Michael S. Waterman. 1981. ‘‘Identification of Common Molecular Subsequences.’’ Journal of Molecular Biology 147:195-97.
Stovel, Katherine and Marc Bolan. 2004. ‘‘Residential Trajectories: Using Optimal Alignment to Reveal the Structure of Residential Mobility.’’ Sociological Methods & Research 32:559-98.
Stovel, Katherine, Michael Savage, and Peter Bearman. 1996. ‘‘Ascription Into Achievement: Models of Career Systems at Lloyds Bank, 1890-1970.’’ American Journal of Sociology 102:358-99.
Thompson, Julie, Desmond G. Higgins, and Toby Gibson. 1994. ‘‘Clustal W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice.’’ Nucleic Acids Research 22:4673-80.
Turner, Jonathan H. 2001. ‘‘Sociological Theory Today.’’ Pp. 1-17 in Handbook of Sociological Theory, edited by J. H. Turner. New York: Kluwer Academic.
Widmer, Eric, Jean Kellerhals, and René Lévy. 2003. Couples contemporains: Cohésion, régulation et conflits. [Contemporary couples: cohesion, regulation, conflicts] Zürich: Seismo.
Widmer, Eric, Jean Kellerhals, and René Lévy. 2004. ‘‘Quelle pluralisation des relations familiales?’’ [What pluralization of family relations?] Revue française de sociologie 45:37-67.
Widmer, Eric, René Lévy, Alexandre Pollien, Raphaël Hammer, and Jacques-Antoine Gauthier. 2003. ‘‘Entre standardisation, individualisation et sexuation: une analyse des trajectoires personnelles en Suisse’’ [Between standardization, individualization and gendering: an analysis of personal life courses in Switzerland] Revue suisse de sociologie 29:35-67.
Wilson, W. Clarke. 1998. ‘‘Activity Pattern Analysis by Means of Sequence-Alignment Methods.’’ Environment and Planning A 30:1017-38.
Wu, Lawrence L. 2000. ‘‘Some Comments on ‘Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect.’ ’’ Sociological Methods & Research 29:41-64.
Yu, Yi-Kuo and Stephen F. Altschul. 2005. ‘‘The Construction of Amino Acid Substitution Matrices for the Comparison of Proteins With Non-Standard Compositions.’’ Bioinformatics 21:902-11.

Jacques-Antoine Gauthier is a senior lecturer in sociology at the University of Lausanne and a member of the Center for Life Course and Lifestyle Studies (Pavie).
He has worked in the fields of health, addiction, and family sociology. His latest publications have appeared in the Canadian Journal of Sociology, European Journal of Operational Research, and the Swiss Journal of Sociology.

Eric D. Widmer is a professor of sociology at the University of Geneva, with an appointment at the Center for Life Course and Lifestyle Studies (Pavie). His long-term interests include life course research, family research, and social networks. His latest publications have appeared in the Journal of Social and Personal Relationships, European Sociological Review, and Journal of Marriage and Family.

Philipp Bucher is a group leader at the Swiss Institute for Experimental Cancer Research and a founding member of the Swiss Institute of Bioinformatics. His long-term interests include the development of algorithms for the analysis of molecular sequences and the application of these algorithms in various areas of biomedical research. His latest publications have appeared in PLoS Computational Biology and Nucleic Acids Research.

Cédric Notredame is a group leader at the Centre for Genomic Regulation in Barcelona (Spain) and a research investigator for the Centre National de la Recherche Scientifique (France). The focus of his work is the development and improvement of multiple sequence alignment algorithms. His latest publications have appeared in the Journal of Molecular Biology and Nucleic Acids Research. He is also the coauthor, with J. M. Claverie, of a popular introductory textbook in bioinformatics, Bioinformatics for Dummies (New York: Wiley, 2003).

BIOINFORMATICS Vol. 17 no. 0 2001 Pages 1–3

Mocca: semi-automatic method for domain hunting
Cédric Notredame
Information Génétique et Structurale, CNRS-UMR 1889, 31 Ch.
Joseph Aiguier, 13 402 Marseille, France
Received on Month xx, 2000; revised and accepted on Month xx, 2000

ABSTRACT
Motivation: Multiple OCCurrences Analysis (Mocca) is a new method for repeat extraction. It is based on the T-Coffee package (Notredame et al., JMB, 302, 205–217, 2000). Given a sequence or a set of sequences, and a library of local alignments, Mocca extracts every segment of sequence homologous to a pre-specified master. The implementation is meant for domain hunting and makes it fast and easy to test new boundaries or extend known repeats in an interactive manner. Mocca is designed to deal with highly divergent protein repeats (less than 30% amino acid identity) of more than 30 amino acids.
Availability: Mocca is available on request (cedric.notredame@gmail.com). The software is free of charge and comes along with complete documentation.

INTRODUCTION
Many proteins consist of separately evolved, independent structural units called modules or domains. The great diversity of protein functions is partly due to the vast number of possibilities to arrange a finite number of those basic units (Chothia, 1992). It is generally agreed that a domain is a self-folding unit made of a minimum of 25 amino acids (Bairoch et al., 1997; Corpet et al., 1998). Many of these domains appear as homologous subsequences repeated within a sequence or within a set of sequences, hence the importance of repeat identification in the course of domain hunting. Many tools exist for discovering and extracting these repeats; without being exhaustive, one can cite PSI-BLAST (Altschul et al., 1997), dot matrices (Junier and Pagni, 2000), Repro (Heringa and Argos, 1993) and the Gibbs sampler (Lawrence et al., 1993). More recently, Heger and Holm developed a method meant to scan databases for repeats without manual intervention (Heger and Holm, 2000). These automatic methods all share the same drawback: while none of them is 100% accurate, they give the user little scope for testing their own hypotheses in a seamless manner. Multiple OCCurrences Alignment (Mocca) addresses that specific problem. Given some approximate information concerning the whereabouts of one of the repeats (the master repeat), it allows the user to tune the parameters describing the repeat family (i.e. start position, length of the master repeat and stringency of the search) and to extract the other occurrences of that repeat within the dataset. The procedure is fast and simple.

© Oxford University Press 2001

METHODS
Mocca uses a pair-wise sequence alignment algorithm (Durbin et al., 1998). The cost associated with the alignment of each pair of residues uses the ‘library extension’ developed for T-Coffee (Notredame et al., 1998, 2000). Figure 1 outlines the strategy used to generate the T-Coffee scoring scheme. Firstly, a primary library is compiled; it contains a series of local alignments obtained using Lalign, an implementation of the Sim algorithm (Huang and Miller, 1991). Given two sequences, Lalign extracts the N top-scoring non-overlapping local alignments. We used a modified version that compares two sequences (or a sequence with itself) and extracts every top-scoring alignment longer than ten residues with an average level of identity higher than 30%. Lalign reports each alignment along with a score that indicates its statistical significance. In our primary library, such local alignments appear as a series of pairs of residues, where each pair receives a weight equal to the score of the alignment it comes from. Given a set of N sequences, the library contains the result of all the possible pair-wise comparisons (including the self-comparisons).
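The primary-library construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: a plain Smith-Waterman local aligner with identity scoring and a linear gap penalty stands in for Lalign, only the single best alignment per sequence pair is kept (Lalign reports the N best non-overlapping ones, and the real method also filters on length and percent identity and includes self-comparisons), and the raw alignment score stands in for Lalign's statistical significance score.

```python
# Toy primary library in the spirit of Mocca/T-Coffee: residue pairs
# taken from local alignments, each weighted by the score of the
# alignment they come from.  Simplifications are noted in the lead-in.

def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment of a and b; returns (score, residue pairs)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_ij = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_ij = H[i][j], (i, j)
    pairs, (i, j) = [], best_ij          # trace back from the best cell
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, list(reversed(pairs))

def primary_library(seqs):
    """Map (seq_x, pos_x, seq_y, pos_y) -> weight over all sequence pairs."""
    library = {}
    for x in range(len(seqs)):
        for y in range(x + 1, len(seqs)):
            score, pairs = smith_waterman(seqs[x], seqs[y])
            for i, j in pairs:
                library[(x, i, y, j)] = score
    return library
```

Each stored pair records that position i of sequence x aligned with position j of sequence y, with a weight expressing how much support that pairing has; the library extension then turns these pairwise constraints into the position-specific scores used during extraction.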
This library is fed into T-Coffee to generate the position-specific scoring scheme using the ‘library extension’ algorithm (Notredame et al., 2000). In Mocca, a pre-requisite to repeat extraction is the estimation of at least one basic repeat unit among the sequences being analysed (the master repeat). In the context of this work, we made the estimation using Dotlet, a Java-based dot matrix method (Junier and Pagni, 2000). The master repeat is a sub-string selected within the sequence(s) used to build the library. Mocca extracts every sub-string homologous to the master in a single pass over the target sequences. It is the library extension that makes it possible for a single repeat to ‘recognize’ each of its homologues (even the distant ones). The extraction process relies on a very efficient dynamic programming procedure known as repeated matches (Durbin et al., 1998). This algorithm reports a series of non-overlapping sub-strings, each of them having an alignment to the master associated with a score higher than some pre-specified threshold Th. Th is empirically set to be a function of the master repeat length L:

Th = S × L

where S has a value between 0 and 1. By default, S = 0.05, but its value can be modified interactively. Two other parameters can also be modified to increase sensitivity and accuracy: the gap opening penalty and the gap extension penalty.

Mocca is part of the T-Coffee package. It is written in Perl and ANSI C. It runs on any UNIX or Linux platform. It is available free of charge along with documentation. Copies can be obtained on request by sending an e-mail to cedric.notredame@gmail.com. The main computational requirement is the Lalign library, O(N2 L2); the motif extraction itself only requires little time (12 s on an IRIX O2 station for 20 sequences totalling 5000 residues). If the position of one of the repeats is known, the procedure can also be run automatically from the command line. It is recommended to use Mocca in conjunction with other means for the initial estimation of the repeat boundaries (PSI-BLAST, Altschul et al., 1997; Dotlet, Junier and Pagni, 2000; Dotter, Sonnhammer and Durbin, 1995; . . . ). Our tests show that Mocca can properly deal with sets of repeats whose multiple alignment indicates less than 15% average identity. While we currently use Lalign as a source of local information, any other sensible source could be considered. For instance, structural information could easily be added to our procedure, using off-the-shelf libraries of local structural similarities such as the Dali Domain Dictionary (Holm and Sander, 1998). The input format of Mocca is straightforward and well documented. Mocca is a refinement tool for the discovery and the establishment of new domains. If the master repeat is replaced with a profile or a collection of known characterized repeats, Mocca could also be used to improve the model of a given repeat family and extend the predictive power of its profiles.

Fig. 1. Layout of the Mocca strategy. The main steps required to extract a repeat with the Mocca method are shown. Square blocks designate procedures while rounded blocks indicate data structures.

ACKNOWLEDGEMENTS
The author wishes to thank the following people: Des Higgins for very helpful comments; Jaap Heringa, Philipp Bucher and Kay Hoffmann for useful discussions and advice at an early stage of the project; Hiroyuki Ogata for helpful comments on the program.

REFERENCES
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bairoch,A., Bucher,P. and Hofmann,K. (1997) The PROSITE database, its status in 1997. Nucleic Acids Res., 25, 217–221.
Chothia,C. (1992) Proteins: 1000 families for the molecular biologist. Nature, 357, 543–544.
Corpet,F., Gouzy,J. and Kahn,D. (1998) The ProDom database of protein domain families. Nucleic Acids Res., 26, 323–326.
Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge.
Heger,A. and Holm,L. (2000) Rapid automatic detection and alignment of repeats in protein sequences. Proteins, 41, 224–237.
Heringa,J. and Argos,P. (1993) A method to recognise distant repeats in protein sequences. Proteins: Struct. Funct. Genet., 17, 391–411.
Holm,L. and Sander,C. (1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96.
Huang,X. and Miller,W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math., 12, 337–357.
Junier,T. and Pagni,M. (2000) Dotlet: diagonal plots in a web browser. Bioinformatics, 16, 178–179.
Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S., Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214.
Notredame,C., Holm,L. and Higgins,D.G. (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14, 407–422.
Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel algorithm for multiple sequence alignment. JMB, 302, 205–217.
Sonnhammer,E.L. and Durbin,R. (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene, 167, GC1–10.

Optimization of ribosomal RNA profile alignments
E. A. O’Brien, C. Notredame and D. G. Higgins

Motivation: Large alignments of ribosomal RNA sequences are maintained at various sites. New sequences are added to these alignments using a combination of manual and automatic methods. We examine the use of profile alignment methods for rRNA alignment and try to optimize the choice of parameters and sequence weights.
Results: Using a large alignment of eukaryotic SSU rRNA sequences as a test case, we empirically compared the performance of various sequence weighting schemes over a range of gap penalties. We developed a new weighting scheme which gives most weight to the sequences in the profile that are most similar to the new sequence. We show that it gives the most accurate alignments when combined with a more traditional sequence weighting scheme.
Availability: The source code of all software is freely available by anonymous ftp from chah.ucc.ie in the directory /home/ftp/pub/emmet, in the compressed file PRNAA.tar.
Contact: emmet@chah.ucc.ie, des@chah.ucc.ie

Introduction

Ribosomal RNA sequences (rRNA) are widely used to estimate the phylogenetic relatedness of groups of organisms (e.g. Sogin et al., 1986; Pawlowski et al., 1996), especially that of the small subunit (SSU rRNA). The SSU rRNA has been sequenced from thousands of different species, and large alignments are maintained at several sites (Maidak et al., 1997; Van de Peer et al., 1997). The alignments are large and complex, and the addition of new sequences is a demanding task, either for the alignment curators or for individuals who wish to align new sequences with existing aligned sequences. In simple cases, automatic alignment programs such as Clustal W (Thompson et al., 1994a) may be used to align groups of closely related sequences or as a prelude to manual refinement. There may be large stretches of unambiguous alignment with high sequence identity which may be useful for phylogenetic purposes. The fully automated, accurate alignment of rRNA sequences remains a difficult problem, however. In principle, one can use profile alignment methods (Gribskov et al., 1987) which use dynamic programming algorithms (Needleman and Wunsch, 1970; Gotoh, 1982) to align a new sequence against an existing ‘expert’ alignment.
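The core of such a profile alignment can be sketched as follows. This is a deliberately simplified illustration, not the paper's implementation: each column is summarised by its residue frequencies (with gaps counted as a residue class), a match scores the frequency of the matched residue, and a flat linear gap penalty replaces Gotoh's affine, position-specific gap model; residues that fall between profile columns are simply skipped rather than opening a new column.

```python
# Minimal profile alignment: a reference alignment becomes per-column
# residue frequencies, and a new sequence is threaded onto the columns
# with Needleman-Wunsch-style dynamic programming.

def build_profile(alignment):
    """Per-column relative frequencies; gaps count as a residue class."""
    profile = []
    for c in range(len(alignment[0])):
        col = [row[c] for row in alignment]
        profile.append({r: col.count(r) / len(col) for r in set(col)})
    return profile

def align_to_profile(seq, profile, gap=-0.5):
    """Globally align seq to the profile; returns one char per column."""
    n, m = len(seq), len(profile)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = F[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0)
            F[i][j] = max(match, F[i - 1][j] + gap, F[i][j - 1] + gap)
    out, i, j = [], n, m                  # traceback
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                F[i][j] == F[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0)):
            out.append(seq[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            i -= 1                        # unplaced residue (simplification)
        else:
            out.append('-'); j -= 1       # gap against this column
    return ''.join(reversed(out))
```

Sequence weighting plugs into this scheme by replacing the raw counts in build_profile with weighted counts.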
For example, one could take an alignment of all SSU rRNA sequences from one of the rRNA collections and use it as a guide, aligning each new sequence in turn and treating the large alignment as a profile. This approach has the advantage of simplicity and speed, but the final accuracy may be limited by the lack of any ability to use secondary structure information. The RNALIGN approach (Corpet and Michot, 1994) and the stochastic context-free grammar approach (Eddy and Durbin, 1994; Sakakibara et al., 1994) provide elegant methods for the alignment of rRNA sequences taking both primary sequence and secondary structure into account. These methods, however, are very demanding in computer resources and cannot deal easily with pseudoknots, so their immediate application to the alignment of SSU rRNA sequences is not trivial. In this paper, we examine, empirically, the effectiveness of profile alignment methods for the alignment of RNA sequences. We remove test sequences from existing ‘expert’ alignments and measure the extent to which they can be realigned with the original alignment, automatically. We use the eukaryotic SSU rRNA sequences from Van de Peer et al. (1997) as a test case. For a range of test sequences, we measure the number of positions that can be correctly realigned over a range of different parameters (gap opening and gap extension penalties). Sequence weighting has been shown to increase the reliability of profile alignments using amino acid sequences (Thompson et al., 1994b). This can be used to give less weight to clusters of closely related sequences and increased weight to sequences with no close relatives, in order to counteract the effect of unequal sampling across a phylogenetic tree of possible sequences. We examine the effectiveness of one commonly used scheme (Thompson et al., 1994b).
We also propose a new weighting scheme which is designed to give increased weight to those sequences in the profile (reference alignment) which are closest (highest sequence identity) to the new sequence being aligned. If a new mammalian sequence is being aligned, for example, it makes most sense to give a high weight to other mammalian sequences and decreasing weights to sequences that are more and more distantly related. Some sections of SSU rRNA sequences are from regions whose secondary structure is conserved across many species. These conserved, ‘core’, regions are relatively easy to align with high accuracy but are interspersed with less conserved regions that may be very difficult to align. We empirically determine which regions of the eukaryotic reference alignment can be aligned with high accuracy by a simple jack-knife experiment. We remove each sequence, one at a time, and try to realign it with the rest. It is then a simple matter to count how often each nucleotide of each sequence is correctly realigned. This gives a definition of conserved core regions that is purely empirical and which can be used by users to delimit regions of alignment which can be safely used in phylogenetic research. Finally, we examine the effect of G+C content of each sequence on the accuracy of alignment. Sequences of high or low G+C may be expected to be more difficult to align than those with more balanced nucleotide compositions.

332 Oxford University Press
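The jack-knife bookkeeping described above reduces to comparing column indices. A minimal sketch, assuming both the reference row and the realigned row are gapped strings over the same underlying sequence (the realignment step itself is done elsewhere, e.g. by profile alignment):

```python
# A residue is "correctly realigned" when it occupies the same column
# as in the reference alignment; per-column tallies across all
# jack-knifed sequences expose the empirically alignable core regions.

def residue_columns(aligned_row):
    """Column index of each residue, in sequence order."""
    return [c for c, ch in enumerate(aligned_row) if ch != '-']

def correctly_placed(reference_row, realigned_row):
    """Columns where a residue sits in the same place in both rows."""
    ref = residue_columns(reference_row)
    new = residue_columns(realigned_row)
    assert len(ref) == len(new), "rows must carry the same sequence"
    return [r for r, n in zip(ref, new) if r == n]

def core_column_counts(row_pairs, ncol):
    """How often each column is correctly realigned across sequences."""
    counts = [0] * ncol
    for ref_row, new_row in row_pairs:
        for c in correctly_placed(ref_row, new_row):
            counts[c] += 1
    return counts
```

Columns with consistently high counts are candidate 'core' regions; the per-sequence accuracy is len(correctly_placed(...)) divided by the number of residues in the sequence.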
System and methods

Small subunit ribosomal RNA

An alignment of eukaryotic, nuclear SSU rRNA sequences (dated May 6, 1997) was obtained from the World Wide Web server at http://www-rrna.uia.ac.be/ssu/index.html (Van de Peer et al., 1997). After removal of the columns which consist only of gaps, the two incomplete sequences of Butomus umbellatus and the unaligned sequence of Babesia bovis, the alignment contains 1517 sequences and is 5370 characters long. Individual sequences vary widely in length, from <1300 nucleotides to >2500. Sixteen test sequences were removed from and realigned with the reference alignment in order to measure the accuracy with which it was possible to recreate their original alignment. The sequences used were Drosophila melanogaster, Xenopus laevis, Homo sapiens, Caenorhabditis elegans, Saccharomyces cerevisiae, Oryza sativa, Dictyostelium discoideum, Euglena gracilis, Ammonia beccarii, Physarum polycephalum, Entamoeba histolytica 1, Vahlkampfia lobospinosa, Giardia sp., Naegleria gruberi, Hexamita sp. and Trypanosoma brucei. These sequences were chosen based on a phylogenetic tree of all the sequences in the alignment, in order to give a spread of test cases over a wide range of different positions in the tree. Re-alignment was carried out over a range of gap penalties and using a number of sequence weighting schemes, as described below.

Dynamic programming

The reference alignment was converted into a profile (Gribskov et al., 1987) which contains information on the frequency of each residue and of gaps at each position. The test sequences were aligned with this using a dynamic programming algorithm (Needleman and Wunsch, 1970). We used Gotoh’s algorithm (Gotoh, 1982) and maximized the similarity between the sequence and the profile. A homogeneous column in the profile (just one of the four residues, with no gaps) will get a score of 1.0 when aligned with the same residue in the test sequence and a score of 0 otherwise. Other columns score in proportion to the frequency of each of the four residue types. In positions in the profile where one or more of the sequences has a gap, gaps were treated as a class of residue for frequency calculations. Other methods have been proposed for generating profiles using the natural logarithms of residue frequencies, which may be normalized by overall residue frequencies to give log-odds scores (see Henikoff and Henikoff, 1996 for a review). We carried out some tests using the latter scheme and found that performance was comparable, although slightly inferior to that using simple frequencies. Therefore we only present results obtained using the frequencies.

Gap penalties

A range of gap opening and extension penalties were used in alignment generation. For each test sequence and each weighting scheme, a total of 81 alignments were carried out. Gap opening penalties ranged from 1 to 9 in increments of 1, and gap extension penalties from 0.1 to 0.9 in increments of 0.1. This range of ratios between gap penalties and residue match scores was chosen as it encompasses values empirically shown to give alignments of biological relevance. Terminal gaps were penalized solely with an extension penalty. Position-specific gap opening penalties were derived from the frequency of gaps at each position along the alignment. At each position, a value equal to the number of residues (non-gap characters) in the column divided by the number of sequences in the alignment was derived. This value was then multiplied by the gap opening penalty, as taken from the range above, to give a specific gap opening penalty at each position. This gives gap opening penalties which are higher at positions mostly occupied by residues than at positions mostly occupied by gaps.

Sequence weighting

By default, each sequence in the existing alignment will have an equal effect on the alignment of new sequences with the profile. If additional information is available concerning the relationships of the sequences within the alignment to each other and to the sequence being aligned, this may not be optimal. For example, if a new sequence is identical to a sequence already in the alignment, the correctly aligned position of each residue in the new sequence could be deduced solely from that one identical sequence, and no information concerning the other sequences is necessary. Further, sampling bias can lead to an unequal representation of taxa within the alignment (e.g. there might be very many sequences from some taxa and very few from others), and it is possible to use sequence weighting to correct for this also. Three different weighting schemes were applied to the sequences in the SSU rRNA alignment, and compared with the default of equal weights.

The first weighting scheme, referred to as tree-based weights, is based on a phylogenetic tree of the sequences in the alignment. A neighbour-joining tree (Saitou and Nei, 1987) of all the sequences in the profile was generated using the DNADIST and NEIGHBOR programs of the PHYLIP package (Felsenstein, 1989). Weights were then derived from the branch lengths as described by Thompson et al. (1994b). These weights are then normalized to have a mean of 1.0. This gives a total weight for the profile equal to that where each sequence is weighted equally, which is necessary in order to keep the effects of changing gap penalties congruent across the different schemes. The general effect of these tree-based weights is to downweight sequences with many close relatives, in order to prevent the more densely populated regions of the tree exerting a disproportionate effect on the alignment of sequences from other regions of the tree.

The second weighting scheme is based on the level of similarity between the sequence being aligned and each individual sequence in the alignment, and is referred to as identity-based weighting. The new sequence is first aligned with the profile using equal weights. A distance is then calculated between the new sequence and each other sequence in the alignment, equal to the mean number of differences per site in this initial approximate alignment. This is the percent difference divided by 100, and there is no correction for multiple hits or unequal rates of transition and transversion. The reciprocal of this distance is used as a weight for each sequence, and these are again normalized to give a mean of 1.0. This weighting scheme has the effect of upweighting sequences more similar to the sequence being added relative to those that are more distantly related. The upweighting effect increases as the sequences become more similar to the sequence being aligned.

The third scheme is a combination of these weighting schemes, in which the weight derived for each sequence based on branch lengths is multiplied by the weight derived from sequence identities, and the values are again renormalized. This scheme is referred to as combination weights. Table 1 shows the values given by the various weighting schemes for the case shown in the example tree in Figure 1. The tree-based weights are independent of the new sequence that is to be added, being derived wholly from the structure of the existing data. Weights are calculated using the method of Thompson et al. (1994b) and then renormalized to give a mean of 1, leaving the values shown.

333 E.A.O’Brien, C.Notredame and D.G.Higgins

Fig. 1. Tree of the sequences that were used as test cases. The weights for these sequences under different weighting schemes are given in Table 1.
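The identity-based and combination weights described above are straightforward to compute. A minimal sketch under the paper's definitions (distance = mean differences per site over gap-free column pairs, weight = reciprocal distance, weights renormalised to a mean of 1.0); note that it does not guard against a zero distance, which would need special-casing when the new sequence is identical to a profile sequence:

```python
# Identity-based weights upweight profile sequences similar to the new
# sequence; combination weights multiply them with tree-based weights.

def fraction_different(row_a, row_b):
    """Mean differences per site where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != '-' and b != '-']
    return sum(a != b for a, b in pairs) / len(pairs)

def normalise(weights):
    """Rescale so the weights have a mean of 1.0 (total weight kept)."""
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]

def identity_weights(new_row, alignment):
    """Reciprocal-distance weight of each profile row vs the new sequence."""
    return normalise([1.0 / fraction_different(new_row, row)
                      for row in alignment])

def combination_weights(tree_weights, id_weights):
    """Product of tree-based and identity-based weights, renormalised."""
    return normalise([t * i for t, i in zip(tree_weights, id_weights)])
```

Because every scheme is renormalised to a mean weight of 1.0, the total weight of the profile, and hence the effective gap-penalty-to-match-score ratio, is the same under all of them.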
The identity-based weights are derived by taking the distance of each sequence in the tree from the new sequence, defined as the mean number of differences per aligned pair of residues, ignoring any pairs with a gap in either sequence. The reciprocals of these values are renormalized around 1 to give the figures shown. For the final set of combination weights, the product is taken of the weights in each of the preceding columns and again renormalized to give a mean of 1.

rRNA profiles

Table 1. The weights assigned to the sequences in the test tree shown in Figure 1 when the sequences Mus musculus and Plasmodium gallinaceae were added

                            (a)     (b)     (c)     (d)     (e)     (f)
Ammonia beccarii          1.000   0.746   0.273   1.256   0.379   0.991
Caenorhabditis elegans    1.000   0.974   0.289   1.008   0.522   1.038
Dictyostelium discoideum  1.000   0.875   0.250   1.049   0.406   0.968
Drosophila melanogaster   1.000   0.727   0.349   1.054   0.470   0.809
Entamoeba histolytica     1.000   1.194   0.225   0.984   0.500   1.241
Euglena gracilis          1.000   1.519   0.198   0.809   0.557   1.298
Giardia sp.               1.000   1.340   0.193   0.773   0.481   1.094
Hexamita sp.              1.000   1.266   0.206   0.854   0.484   1.141
Homo sapiens              1.000   0.411  10.628   1.053   8.088   0.456
Naegleria gruberi         1.000   1.212   0.204   0.942   0.459   1.205
Oryza sativa              1.000   0.511   0.390   1.235   0.370   0.667
Physarum polycephalum     1.000   1.435   0.205   0.856   0.547   1.298
Saccharomyces cerevisiae  1.000   0.516   0.377   1.302   0.361   0.708
Trypanosoma brucei        1.000   1.488   0.211   0.846   0.583   1.329
Vahlkampfia lobospinosa   1.000   1.398   0.196   0.889   0.508   1.313
Xenopus laevis            1.000   0.383   1.798   1.082   1.278   0.438

Columns represent the following schemes: (a) equal sequence weights; (b) tree-based sequence weights; (c) identity-derived weights for each sequence for the alignment of Mus musculus; (d) identity-derived sequence weights for each sequence for the alignment of Plasmodium gallinaceae; (e) combination of tree and identity-derived weights for Mus musculus; (f) combination of tree and identity-derived weights for Plasmodium gallinaceae.

For each of the three
defined weighting schemes and the default of equal weights, alignments were generated using position-specific gap-opening penalties across the range of gap extension penalties and base gap opening penalties described above. This procedure was repeated for each of the test sequences. The number of residues correctly placed in each alignment was determined by comparison with the sequence as originally aligned in the reference alignment; this was then divided by the total number of residues in the sequence to give a percentage score for the alignment. From the scores for the alignments across the range of gap opening and gap extension penalties for each test case, the gap penalties giving the best performance across all or most of the test cases were obtained.

Results

The performance of a set of weights was judged by its efficacy across the range of gap opening and gap extension penalties used. Both the peak score and the range of gap penalties giving a comparable score were taken into account in making this judgement (Table 2). For scoring purposes, each residue is counted as distinct, and is only considered correctly aligned if it is in the same position as the same residue in the reference sequence. The score for a sequence is the percentage of the total number of residues in the sequence that have been correctly realigned.

The main results are presented in Table 2. In the first column, the percentage alignment accuracy is given for each of the 16 test cases. These scores are the best obtained across the range of gap opening and extension penalties with no sequence weights. The scores are low, ranging from 43% (Euglena) up to 88% (Oryza). The addition of position-specific gap penalties has a dramatic effect: the scores all increase by about 10–15 percentage points, which represents several hundred additional correctly aligned residues per sequence.
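The generate-and-score sweep described in the procedure above can be sketched as follows. `realign` is a hypothetical stand-in for the real profile aligner (it just returns fixed column placements so the loop runs), while the scoring rule — the percentage of residues returned to their reference column — is the one defined in the text.

```python
def percent_correct(test_columns, reference_columns):
    """% of residues whose column index matches the reference placement."""
    correct = sum(1 for t, r in zip(test_columns, reference_columns) if t == r)
    return 100.0 * correct / len(reference_columns)

def realign(sequence, gap_open, gap_extend):
    # Placeholder: a real implementation would realign `sequence` to the
    # profile under the given penalties and return each residue's column.
    return [0, 1, 2, 4, 5]

# Column index of each residue of the test sequence in the reference.
reference_columns = [0, 1, 2, 4, 6]

# Sweep the penalty grid and keep the best-scoring parameter pair.
best = max(
    (percent_correct(realign("ACGUA", go, ge), reference_columns), go, ge)
    for go in [5.0, 6.0, 7.0]
    for ge in [0.1, 0.2]
)
```

With the real aligner plugged in, `best` would report the peak score together with the gap-opening and gap-extension penalties that produced it, which is how the optima in Table 4 are obtained.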
The use of sequence weights yields further improvements, although not as dramatic. It should be noted that an improvement in score of just 1% is the equivalent of 20 residues in a molecule of 2000 nucleotides. We only give the peak scores from across the full range of gap opening and extension penalties. These were all obtained with a gap opening penalty of between 5.0 and 7.0 and a gap extension penalty of either 0.1 or 0.2.

Implementation

Programs were developed and/or run on DEC Alpha workstations running DEC UNIX. All new code was written in the C programming language and is freely available by anonymous FTP (log in as anonymous to chah.ucc.ie and transfer the compressed tar archive PRNAA.tar). The code is not designed for portability, and users will have to download their own rRNA alignments and build their own profiles; a Java version of the programs is being developed which will be used to provide future access to all the methods via the Internet.

Table 2. The highest percentage identity between the reference alignment and the realigned sequence obtained using each of the weighting schemes

                   (a)      (b)      (c)      (d)      (e)
A.beccarii        71.65   84.19*   83.66    84.05    83.96
C.elegans         69.26   83.98    83.98    86.99    87.84*
D.discoideum      64.42   78.95    78.95    79.59*   79.06
D.melanogaster    70.14   82.72    82.97    81.11    84.02*
E.histolytica     55.68   73.50    74.83    75.04    78.17*
E.gracilis        43.12   60.22    60.22    60.22    61.08*
Giardia sp.       55.00   73.89    73.96    76.81    77.29*
Hexamita sp.      56.13   73.10    73.61    78.39*   77.16
H.sapiens         79.88   91.01    92.88*   91.49    92.30
N.gruberi         50.37   63.60    63.74    67.81    67.86*
O.sativa          88.85   97.08    97.13    96.69    97.35*
P.polycephalum    53.62   65.02    64.66    68.64*   67.52
S.cerevisiae      86.71   93.94    94.55*   93.38    94.10
T.brucei          47.62   62.86    63.39    64.77    65.04*
V.lobospinosa     46.23   56.20    55.69    56.20    58.96*
X.laevis          82.47   93.59    95.18*   94.25    95.07

(a) Fixed gap penalties and equal sequence weights; (b) position-specific gap penalties and equal sequence weights; (c) position-specific gap penalties and identity-based weights; (d) position-specific gap penalties and tree-based weights; (e) position-specific gap penalties and combination weights. Values marked with an asterisk (underlined in the original) are the absolute maximum scores obtained for each sequence.

Table 3. Alignment percentage accuracy scores for various weighting schemes and gap penalties. In each panel, rows give the gap opening penalty (1–9) and columns give the gap extension penalty (0.1–0.9); Trypanosoma brucei and Vahlkampfia lobospinosa are shown separately.

(a) Trypanosoma brucei
1: 46 32 16 13  4  2  2  1  1
2: 47 32 17 12  4  2  2  1  1
3: 47 33 17 12  5  2  2  1  1
4: 47 33 17 12  5  2  2  1  1
5: 47 33 16 12  5  2  2  1  1
6: 47 33 16 12  5  2  2  1  1
7: 47 33 16 12  5  2  2  1  1
8: 47 33 16 12  5  2  2  1  1
9: 47 33 16 12  5  2  2  1  1

(a) Vahlkampfia lobospinosa
1: 45 31 17 10  6  5  4  3  3
2: 46 33 16 12  6  5  4  3  3
3: 46 32 16 12  6  5  4  3  3
4: 46 32 16 12  6  5  4  3  3
5: 46 32 16 12  6  5  4  3  3
6: 46 32 16 12  6  5  4  3  3
7: 46 32 16 12  6  5  4  3  3
8: 46 32 16 12  6  5  4  3  3
9: 46 32 16 12  6  5  4  3  3

(b) Trypanosoma brucei
1: 58 58 59 58 58 58 58 57 57
2: 59 59 60 59 59 59 60 60 60
3: 60 61 62 60 60 61 61 61 61
4: 62 62 63 61 61 62 62 62 62
5: 62 62 63 63 63 63 62 62 61
6: 61 62 63 63 63 63 62 62 61
7: 61 62 63 63 63 63 62 62 61
8: 61 62 63 63 63 63 62 62 61
9: 61 62 63 63 63 63 62 62 61

(b) Vahlkampfia lobospinosa
1: 51 50 51 51 51 51 52 51 51
2: 53 53 53 54 53 53 53 54 54
3: 55 54 55 54 54 54 54 54 54
4: 56 55 55 55 55 55 55 55 55
5: 56 55 55 55 55 55 55 55 55
6: 56 55 55 55 55 55 55 55 55
7: 56 55 55 55 55 55 55 55 55
8: 56 55 55 55 55 55 55 55 55
9: 56 55 55 55 55 55 55 55 55

(c) Trypanosoma brucei
1: 58 59 59 58 58 58 58 57 57
2: 59 59 60 58 59 59 60 60 60
3: 60 61 62 60 60 61 61 61 61
4: 62 62 63 61 61 61 62 62 62
5: 62 62 63 63 63 63 62 62 62
6: 61 62 63 63 63 63 62 62 62
7: 61 62 62 62 62 63 62 61 61
8: 61 62 62 62 62 63 62 61 61
9: 61 62 62 62 62 63 62 61 61

(c) Vahlkampfia lobospinosa
1: 51 50 51 51 51 51 52 51 51
2: 53 53 53 54 54 53 53 54 54
3: 55 54 54 54 54 54 54 54 54
4: 56 55 55 55 55 55 55 55 55
5: 56 55 55 55 55 55 55 55 55
6: 56 55 55 55 55 55 55 55 55
7: 56 55 55 55 55 55 55 55 55
8: 56 55 55 55 55 55 55 55 55
9: 56 55 55 55 55 55 55 55 55

(d) Trypanosoma brucei
1: 62 61 60 61 60 59 59 59 59
2: 63 63 63 62 61 61 61 61 61
3: 65 64 63 63 62 62 61 61 61
4: 65 64 64 64 64 63 62 62 61
5: 64 64 64 64 64 63 61 61 61
6: 64 64 64 64 64 63 61 61 61
7: 64 64 64 64 64 63 61 61 61
8: 64 64 64 64 64 63 61 61 61
9: 64 64 64 64 64 63 61 61 61

(d) Vahlkampfia lobospinosa
1: 51 50 51 51 51 51 52 51 51
2: 53 53 53 54 53 53 53 54 54
3: 55 54 55 55 54 54 54 54 54
4: 56 55 55 55 55 55 55 55 55
5: 56 55 55 55 55 55 55 55 55
6: 56 55 55 55 55 55 55 55 55
7: 56 55 55 55 55 55 55 55 55
8: 56 55 55 55 55 55 55 55 55
9: 56 55 55 55 55 55 55 55 55

(e) Trypanosoma brucei
1: 62 60 61 60 60 59 59 59 59
2: 64 62 62 62 60 61 61 61 62
3: 64 63 63 62 62 63 62 62 62
4: 65 64 64 64 63 63 63 62 62
5: 65 64 64 64 64 63 63 62 62
6: 65 64 64 64 64 62 63 62 62
7: 65 64 64 64 64 62 63 62 62
8: 65 63 64 64 64 63 63 62 62
9: 65 63 64 64 64 63 63 62 62

(e) Vahlkampfia lobospinosa
1: 56 56 55 55 55 55 55 55 55
2: 58 57 57 57 57 57 58 57 57
3: 58 58 57 57 58 57 58 57 57
4: 58 58 57 57 58 57 58 57 57
5: 58 58 57 57 58 57 58 57 57
6: 58 58 57 57 58 57 58 57 57
7: 58 58 57 57 58 57 58 57 57
8: 58 58 57 57 58 57 58 57 57
9: 58 58 57 57 58 57 58 57 57

Italics in the original represent those regions at or above the highest score attainable with equal sequence weights; underlining represents the highest score attained across all the different parameters. Parameter sets are: (a) fixed gap penalties and equal sequence weights; (b) position-specific gap penalties and equal sequence weights; (c) position-specific gap penalties and identity-based sequence weights; (d) position-specific gap penalties and tree-derived sequence weights; (e) position-specific gap penalties and weights derived from the combination of tree-based and identity-based weights.

In nine out of the 16 test cases, the single best alignment score generated across the ranges of gap penalties was obtained using the combined weights (column (e) of Table 2). In three of the remaining cases, tree-based weights give the best performance (column (d)). The identity-based weights give the highest score in three cases (column (c)), and Ammonia beccarii is aligned most accurately with equal weights. Both identity-based and tree-based sequence weighting are shown to improve over equal weights in most cases, with the combination of the two weights giving the best overall performance.

Two examples are shown in detail in Table 3. Here the scores for all values of the gap opening and gap extension penalties are given for each weighting scheme for just two of the test cases: Vahlkampfia lobospinosa and Trypanosoma brucei. In both cases, the results with uniform gap penalties, shown in panel (a), are very poor and depend strongly on the exact value of the parameters. There is a huge improvement in panel (b), where the values for position-specific gap penalties are shown.
Here, the values are much higher than in panel (a) and there is almost no dependence on the exact values chosen for the gap penalties. In the case of Vahlkampfia there is no noticeable difference between the use of tree-based or identity-based weights [the results are shown in panels (c), (d) and (b)]. Use of the combined weighting scheme, shown in panel (e), gives a consistent improvement of 2% across the entire range of gap penalties. In the case of Trypanosoma the relative performance of each weighting scheme is more distinct. Comparing identity weights to equal weights in this case, there is improvement for some values of gap penalty. The effect of using tree-based weights is to produce improvement across a larger range of gap penalties, particularly for gap extension penalties <0.3. The combination of the two weighting schemes again shows a synergistic effect, with a further increase visible across the range of gap penalties.

The values of the gap opening and gap extension penalties giving the maximum scores for each test case are given in Table 4. These are the optimum parameters when using the combined weighting scheme with position-specific gap penalties. They all fall in a very narrow range.

Table 4. Gap opening and extension penalties giving optimum alignment scores for each test case using combined weights

                 Gap opening   Gap extension
A.beccarii           6.0           0.2
C.elegans            6.0           0.1
D.discoideum         6.0           0.1
D.melanogaster       5.0           0.2
E.histolytica        6.0           0.1
E.gracilis           6.0           0.1
Giardia sp.          7.0           0.1
Hexamita sp.         5.0           0.2
H.sapiens            6.0           0.1
N.gruberii           6.0           0.1
O.sativa             6.0           0.1
P.polycephalum       6.0           0.2
S.cerevisiae         6.0           0.2
T.brucei             6.0           0.1
V.lobospinosa        6.0           0.1
X.laevis             6.0           0.2

In order to tell which sections of the reference alignment may be reliably aligned, each of the 1517 sequences in turn was removed from the alignment and realigned with the remaining sequences. Each column of the original reference alignment was scored according to the percentage of its residues that can be realigned in the correct positions. Figure 2 shows the estimated secondary structure of the Saccharomyces cerevisiae nuclear SSU rRNA, with those positions from the full alignment which can be realigned with ≥95% accuracy marked in black and those which realign with <95% accuracy in grey. Stems forming pseudoknots are not displayed in this representation. This is a conservative estimate of the regions that may be reliably aligned, as there are some positions that are not found in this molecule, and sequences from some taxonomic groupings may be aligned almost perfectly.

Fig. 2. Secondary structure of Saccharomyces cerevisiae SSU rRNA with stable regions indicated in black, generated using the ESSA program (Chetouani et al., 1997).

Figure 3 shows the accuracy with which each sequence can be realigned, compared to its original alignment, as a function of G+C content. The realignment accuracy is greatest for sequences with average G+C contents (~50%). As expected, sequences with extreme nucleotide compositions (very high or very low G+C content) tend to be less easy to align accurately. High levels of a particular nucleotide increase the chance that a residue in the sequence being aligned may align with the wrong column in the profile. The test cases cover a range of G+C content from 38.4% (Entamoeba histolytica) to 68.5% (Giardia sp.).

Fig. 3. Graph of percentage of sequence correctly re-aligned against G+C content for each of the 1517 sequences in the reference alignment.

Discussion

The generation of alignments under various parameters shows that position-specific gap-opening penalties have a very strong positive effect on the accuracy with which alignments can be generated. Fixed gap penalties perform extremely poorly, particularly at high values of the gap extension penalty. This corresponds to situations in which the long gaps that occur in certain regions of virtually all sequences, corresponding to long insertions in a few sequences, are penalized so heavily that they do not occur in the optimum-scoring alignment. Experimentation with position-specific gap extension penalties did not give any further improvement.

Sequence weighting can have a further positive effect on alignment quality. Both weighting schemes based on sequence identity and those based on the tree structure and branch lengths are seen to have generally positive effects. As expected, the tree-based weights perform at their best for sequences which are quite distant from the main taxa, with few or no close relatives, such as Hexamita, and are of least benefit to alignment quality for sequences which have many close relatives, such as O.sativa. With identity-based weights the greatest positive effects are seen in sequences within highly represented taxa, such as S.cerevisiae. These two weighting schemes have opposing effects on the values of the sequence weights in the case of sequences aligning into densely populated regions of the tree, and so the net result of combining them, in cases such as S.cerevisiae, may not perform any better than either of the weighting schemes used individually. The examples given (Table 3) indicate that there are cases where tree-based and identity-based weights show a synergistic effect when combined, the combination outperforming either of the schemes applied individually. The combined weights give the best result in more than half of the test cases, and the average difference between the score generated with the combined weights and the overall best score is substantially less than the corresponding difference for any of the other weighting schemes. This synergy occurs most strongly in sequences which are distant from the main bulk of the alignment and therefore more difficult to align correctly. Those which are located in highly represented taxa do not show such strong effects from any of the weighting schemes, but these tend to be the sequences which have the best alignments initially.

We have shown how to improve the accuracy of alignment of rRNA sequences using some simple methods. It is quite possible that alignments of 100% accuracy will not be achievable, owing to errors introduced manually into the reference alignment. Nonetheless, we can already see that some sequences may be aligned with >95% accuracy (Oryza and Xenopus), and across the entirety of the alignment 89.84% of all residues can be realigned correctly. Some sequences are still disappointing, and this can partly be explained by very biased G+C content (e.g. Giardia). Others come from poorly sampled parts of the overall eukaryote phylogenetic tree, and these will become easier to align as new sequences are added. Nonetheless, it may be difficult for users to evaluate the quality of a new alignment. We provide one extremely simple method for choosing regions of the overall alignment that can be reliably aligned in almost all cases. This covers about half of the positions in any given molecule and provides a selection of sites which can be reliably chosen for phylogenetic purposes. This site selection can be fine-tuned by looking at regions which may be reliably aligned in specific taxa. Finally, it is very obvious that these methods could benefit from some consideration of secondary structure, which could be used for evaluation of alignments or as part of the alignment process. We are investigating the use of genetic algorithms to optimize the quality of profile alignments where secondary structure is considered (Notredame et al., 1997).
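The leave-one-out reliability screen described earlier can be sketched as follows. The toy alignment and the "realignment" results are invented, and column membership is compared by character rather than by tracked residue indices — a simplification of the real bookkeeping.

```python
def column_reliability(ref_alignment, realigned):
    """For each column, % of non-gap residues realigned to that column."""
    ncols = len(next(iter(ref_alignment.values())))
    scores = []
    for c in range(ncols):
        total = correct = 0
        for name, row in ref_alignment.items():
            if row[c] == "-":            # gap in the reference: not scored
                continue
            total += 1
            if realigned[name][c] == row[c]:
                correct += 1
        scores.append(100.0 * correct / total if total else None)
    return scores

ref = {"s1": "AC-GU", "s2": "ACAGU", "s3": "AC-GU"}
# Hypothetical result of removing and realigning each sequence in turn.
out = {"s1": "AC-GU", "s2": "AC-AG", "s3": "AC-GU"}

scores = column_reliability(ref, out)
# Columns realigned with >= 95% accuracy would be marked black in Fig. 2.
reliable = [c for c, s in enumerate(scores) if s is not None and s >= 95.0]
```

Applied to the full 1517-sequence alignment, the `reliable` set corresponds to the conservatively alignable positions used for phylogenetic site selection.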
We will use a genetic algorithm to optimize the quality function of Corpet and Michot (1994), but based on profiles rather than pairs of sequences.

Acknowledgements

The authors thank Richard Durbin for suggesting the use of the 1/d weights. We also thank Manolo Gouy for his help with rRNA sequences in general. This work was supported by a grant (BIO4-CT95-0130) from the EU Biotechnology programme.

References

Chetouani,F., Monestie,P., Thebault,P., Gaspin,C. and Michot,B. (1997) ESSA: an integrated and interactive computer tool for analysing RNA secondary structure. Nucleic Acids Res., 25, 3514–3522.
Corpet,F. and Michot,B. (1994) RNAlign program: alignment of RNA sequences using both primary and secondary structures. Comput. Applic. Biosci., 10, 389–399.
Eddy,S. and Durbin,R. (1994) RNA sequence analysis using covariance models. Nucleic Acids Res., 22, 2079–2088.
Felsenstein,J. (1989) Cladistics, 5, 164–166.
Gotoh,O. (1982) J. Mol. Biol., 162, 705.
Gotoh,O. (1995) A weighting system and algorithm for aligning many phylogenetically related sequences. Comput. Applic. Biosci., 11, 543–551.
Gribskov,M., McLachlan,A. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Henikoff,J. and Henikoff,S. (1996) Using substitution probabilities to improve position-specific scoring matrices. Comput. Applic. Biosci., 12, 135–143.
Luthy,R., Xenarios,I. and Bucher,P. (1994) Improving the sensitivity of the sequence profile method. Protein Sci., 3, 139–146.
Maidak,B., Olsen,G., Larsen,N., Overbeek,R., McCaughey,M. and Woese,C. (1997) The Ribosomal Database Project (RDP). Nucleic Acids Res., 25, 109–111.
Needleman,S. and Wunsch,C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Neefs,J.-M., Van de Peer,Y., Hendriks,L. and De Wachter,R. (1990) Database on the structure of small subunit ribosomal RNA.
Nucleic Acids Res., 18, 2237–2317.
Notredame,C., O'Brien,E.A. and Higgins,D.G. (1997) RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res., 25, 4570–4580.
Pawlowski,J., Bolivar,I., Fahrni,J.F., Cavalier-Smith,T. and Gouy,M. (1996) Early origin of Foraminifera suggested by SSU rRNA gene sequences. Mol. Biol. Evol., 13, 445–450.
Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.
Sakakibara,Y., Brown,M., Hughey,R., Mian,I.S., Sjolander,K., Underwood,R.C. and Haussler,D. (1994) Stochastic context-free grammars for tRNA modelling. Nucleic Acids Res., 22, 5112–5120.
Sogin,M., Elwood,H. and Gunderson,J. (1986) Evolutionary diversity of eukaryotic small-subunit rRNA genes. Proc. Natl Acad. Sci. USA, 83, 1383–1387.
Thompson,J., Higgins,D. and Gibson,T. (1994a) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Thompson,J., Higgins,D. and Gibson,T. (1994b) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Applic. Biosci., 10, 19–29.
Van de Peer,Y., Jansen,J., De Rijk,P. and De Wachter,R. (1997) Database on the structure of small ribosomal subunit RNA. Nucleic Acids Res., 25, 111–116.

REVIEW

Recent progresses in multiple sequence alignment: a survey

Cédric Notredame
Information Génétique et Structurale, UMR 1889, 31 Chemin Joseph Aiguier, 13006 Marseille, France
Tel.: +33 (0)4 911 646 06
Fax: +33 (0)4 911 645 49
E-mail: cedric.notredame@igs.cnrs-mrs.fr

The assembly of a multiple sequence alignment (MSA) has become one of the most common tasks when dealing with sequence analysis. Unfortunately, the wide range of available methods and the differences in the results given by these methods make it hard for a non-specialist to decide which program is best suited for a given purpose.
In this review we briefly describe existing techniques and expose the potential strengths and weaknesses of the most widely used multiple alignment packages.

Introduction

Sequence alignment is by far the most common task in bioinformatics. Procedures relying on sequence comparison are diverse and range from database searches [1] to secondary structure prediction [2]. Sequences can be compared two by two to scour databases for homologues, or they can be multiply aligned to visualize the effect of evolution across a whole protein family. In this study we will focus on the latter methods, dedicated to the global simultaneous comparison of more than two sequences. Special emphasis will be given to the most recently described techniques.

The many uses of MSAs

Multiple alignments constitute an extremely powerful means of revealing the constraints imposed by structure and function on the evolution of a protein family. They make it possible to ask a wide range of important biological questions, and their main uses will each be discussed in turn.

Phylogenetic analyses

Phylogenetic trees are instrumental in elucidating the evolutionary relationships that exist among various organisms. Nowadays, highly accurate phylogenetic trees rely on molecular data. Their computation typically involves four steps:

• collection of a set of orthologous sequences in a database
• multiple alignment of the sequences
• measurement of pair-wise phylogenetic distances on the multiple alignment and computation of a distance matrix
• computation of the tree by applying a clustering algorithm [3] to the distance matrix

As an alternative to the last two steps, the tree may also be computed using maximum likelihood [4]. In both cases, the role of multiple alignment is to provide a very accurate estimation of pair-wise distances and to make it possible to estimate the reliability of each branch by bootstrapping [5].

Identification of conserved motifs and domains

MSAs make it possible to identify motifs preserved by evolution that play an important role in the structure and function of a group of related proteins. Within a multiple alignment, these elements often appear as columns with a lower level of variation than their surroundings. When coupled with experimental data, these motifs constitute a very powerful means of characterizing sequences of unknown function. Important databases like PROSITE [6] or PRINTS [7] rely on this principle. When a motif is too subtle to be defined with a standard pattern, one may use another type of descriptor known as a profile [8] or a hidden Markov model (HMM) [9]. These are meant to exhaustively summarize (column by column) the properties of a protein family or a domain. Profiles and HMMs make it possible to identify very distant members of a protein family when searching a database. Their sensitivity and specificity are much higher than those provided by a single sequence or a pattern. In practice, one can derive a profile from a multiple alignment using packages such as the PFTOOLS [10], use pre-established collections like Pfam [11], or compute profiles on the fly with PSI-BLAST [12], the position-specific version of BLAST. The specificity and sensitivity of a profile are tightly correlated to the biological quality of the multiple alignment it was derived from.

Structure prediction

Pharmacogenomics (2002) 3 (1) © 2001 Ashley Publications Ltd ISSN 1462-2416 www.ashley-pub.com

Structure prediction is another important use of multiple alignments. Secondary and tertiary structure prediction aim at predicting the role a residue plays in a protein structure (buried or exposed, helix or strand etc.).
Secondary structure predictions based on a single sequence yield a low accuracy (in the order of 60%) [13], while predictions based on an MSA reach much higher accuracy (in the order of 75%) [2,14,15]. The rationale behind such improvements is that the pattern of substitutions observed in a column directly reflects the type of constraints imposed on that position in the course of evolution. In the context of tertiary structure determination, or when predicting non-local contacts, multiple alignments can also help to identify correlated mutations. This approach has given only limited results when applied to proteins [16]; it has been much more successful in RNA analysis, where it allows highly accurate predictions [17] well confirmed by structural analysis.

Altogether, these very important applications explain the amount of attention dedicated to the MSA problem, and any biologist should be aware that very few bioinformatics protocols bypass the multiple alignment stage. Unfortunately, available tools are only heuristics providing an approximate solution to a problem that remains largely open. These many heuristics are based on different paradigms, each well suited to a limited range of situations.

A complicated problem

MSA is a complicated problem. It stands at the crossroads of three distinct technical difficulties:

• the choice of the sequences
• the choice of an objective function (i.e., a comparison model)
• the optimization of that function

Altogether, properly solving these three problems would require an understanding of statistics, biology and computer science that lies far beyond our grasp.

The choice of the sequences

The methods reviewed here (i.e., global MSA methods) only make sense if they are assumed to be dealing with a set of homologous sequences, i.e., sequences sharing a common ancestor. Furthermore, with the exception of DiAlign [18], global methods require the sequences to be related over their whole length (or at least most of it). When that condition is not met, one should consider the use of local MSA methods such as the Gibbs sampler [19], Match-Box [20] or MACAW [21]. In any case, one should always be aware that, given inappropriate sequences, most multiple alignment routines will nonetheless produce an alignment. It will be the responsibility of the biologist to realize that this alignment is meaningless. This is not an easy task, and a few years ago Henikoff reviewed a series of problems that can occur when one forces multiple alignments on unrelated sequences [22].

In order to recruit a set of homologous sequences, it is common practice to use one of the BLAST programs (WU-BLAST, PSI-BLAST, GAPPED BLAST etc.) [12] for searching within a database for all the sequences similar to some query sequence. When doing so, an observed similarity is considered good when it is unlikely to arise by chance (given the database and the amino-acid frequencies). To make this estimation, BLAST uses powerful statistical models developed by Altschul and Karlin [23]. Of course, these statistical models merely approximate the biological reality, and homology may be misrepresented by similarity, leading to the incorporation of improper sequences within a multiple alignment.

The choice of an objective function

This is purely a biological problem that lies in the definition of correctness. What should a biologically correct alignment look like? Can we define its expected properties, and will we recognize it when we see it? These intricate questions can only be answered by means of a mathematical function able to measure an alignment's biological quality. We name this function an objective function (OF) because it defines the mathematical objective of the search. Given a perfect function, the mathematically optimal alignment would also be biologically optimal. Yet this is rarely the case: while the function defines a mathematical optimum, we rarely have an argument that this optimum will also be biologically optimal.
Defining a proper objective function is a highly non-trivial task and an active research field in its own right. In theory, an OF should incorporate everything that is known about the sequences, including their structure, function and evolutionary history. This information is rarely at hand and is hard to use, so it is usually replaced with sequence similarity. Thus, a very simple general function is often used: the weighted sums-of-pairs with affine gap penalties [24]. Under this model, each sequence receives a weight proportional to the amount of independent information it contains [25], and the cost of the multiple alignment is equal to the sum of the costs of all the weighted pair-wise substitutions. The substitution costs are evaluated using a predefined evolutionary model known as a substitution matrix [26], in which a score is assigned to every possible substitution or conservation according to its biological likelihood (i.e., mutations observed more rarely than would be expected by chance receive a negative score, while those observed more often receive a positive score). Insertions and deletions are scored using affine gap penalties, which penalize a gap once for opening and then proportionally to its length. This penalty scheme is a major source of concern because it requires two parameters:

• the gap opening penalty
• the gap extension penalty

whose adequate values can only be set empirically and may vary from one set of sequences to the next [27]. Although this function is clearly wrong from an evolutionary point of view [24], because it assumes every sequence within the set to be an ancestor of every other sequence, the ease of its implementation has made it popular with the most widely used MSA packages [28-30].
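A minimal version of this weighted sums-of-pairs score with affine gap penalties might look as follows. The two-valued match/mismatch table stands in for a real substitution matrix such as BLOSUM62, the gap costs are arbitrary, and the alignment and weights are invented for the example.

```python
def pair_score(a, b, subst, gap_open=-10.0, gap_extend=-1.0):
    """Score one aligned pair of rows with affine gap penalties."""
    score, in_gap = 0.0, False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if x == "-" and y == "-":          # shared gap column: ignored
                continue
            # a new gap pays opening + extension; a continued gap, extension
            score += gap_extend if in_gap else gap_open + gap_extend
            in_gap = True
        else:
            score += subst[(x, y)]
            in_gap = False
    return score

def weighted_sum_of_pairs(msa, weights, subst):
    """Sum of all pairwise scores, each weighted by its sequence weights."""
    names = sorted(msa)
    return sum(weights[p] * weights[q] * pair_score(msa[p], msa[q], subst)
               for i, p in enumerate(names) for q in names[i + 1:])

# Toy substitution table: +2 for a conservation, -1 for a substitution.
subst = {(a, b): (2.0 if a == b else -1.0) for a in "ACGT" for b in "ACGT"}
msa = {"s1": "ACG-T", "s2": "ACGAT", "s3": "ACG-T"}
weights = {"s1": 1.0, "s2": 1.0, "s3": 1.0}
total = weighted_sum_of_pairs(msa, weights, subst)
```

The two gap parameters feed directly into `pair_score`, which is exactly why their empirical tuning matters so much: every pairwise term, and hence the whole objective, shifts with them.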
This choice was recently validated by a more thorough benchmarking [31], which indicated that packages relying on the sums-of-pairs are reasonable performers as judged by the biological quality of the alignments they produce. Very recently, a new variant of the sums-of-pairs function has been introduced that seems less likely to over-estimate evolutionary events [32]. Over the last few years, new OFs have been described that seem to be less sensitive to gap penalty estimation thanks to the incorporation of local information. These include the segment-based evaluation of DiAlign [33] and the consistency objective function of T-Coffee [34]. HMMs [9,35] constitute another recently explored line of thought. HMMs describe the multiple alignment in a statistical context, using a Bayesian approach. Although from a formal point of view they provide us with the most attractive solution, their performance for ab initio alignments has so far been disappointing, and recent work shows that carefully tuned HMM packages barely outperform ClustalW [36]. Other statistically based methods that attempt to associate a P-value with the multiple alignment have been described [19,37]. Unfortunately, these measures are restricted to ungapped MSAs.

All things considered, one should be well aware that there is no such thing as the ideal OF, and every available scheme suffers from major drawbacks. In an ideal world, a perfect OF would be available for every situation. In practice, this is not the case, and the user is always left to make a decision when choosing the method most suitable to the problem.

Computational

The third problem associated with MSAs is computational. Assuming we have at our disposal an adequate set of sequences and a biologically perfect objective function, the computation of a mathematically optimal alignment is too complex a task for an exact method to be used [38].
Even if the function we were interested in was as simple as maximizing the number of perfect identities within each column, the problem would already be out of reach for more than three sequences. This is why all current implementations of multiple alignment algorithms are heuristics, none of which guarantees full optimization. Considering their most obvious properties, it is convenient to classify existing algorithms into three main categories: exact, progressive and iterative. Exact algorithms are high-quality heuristics that deliver an alignment usually very close to optimality [28,39], sometimes but not always within well-defined boundaries. They can only handle a small number of sequences (<20) and are limited to the sums-of-pairs objective function. Progressive alignments are by far the most widely used [34,40,41]. They depend on a progressive assembly of the multiple alignment [42-44], where sequences or alignments are added one by one, so that never more than two sequences (or multiple alignments) are simultaneously aligned using dynamic programming [45]. This approach has the great advantage of speed and simplicity combined with reasonable sensitivity, even if it is by nature a heuristic that does not guarantee any level of optimization. Other progressive alignment methods exist, such as DiAlign [18] or Match-Box [20], which assemble the alignment in a sequence-independent manner by combining segment pairs in an order dictated by their score, until every residue of every sequence has been incorporated in the multiple alignment. Iterative alignment methods depend on algorithms able to produce an alignment and to refine it through a series of cycles (iterations) until no more improvements can be made. Iterative methods can be deterministic or stochastic, depending on the strategy used to improve the alignment. The simplest iterative strategies are deterministic.
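To make the computational argument concrete: a generalized Needleman & Wunsch procedure over k sequences of length n fills a dynamic programming table of (n + 1)^k cells, which is why exact simultaneous alignment stalls beyond a handful of sequences. A minimal sketch of that growth:

```python
# Why exact simultaneous alignment does not scale: a generalized
# Needleman & Wunsch procedure over k sequences of length n fills a
# dynamic programming table of (n + 1)**k cells, each cell looking
# back at up to 2**k - 1 predecessors.

def dp_cells(n, k):
    return (n + 1) ** k

for k in (2, 3, 5, 10):
    print(f"{k} sequences of length 300: {dp_cells(300, k):.3e} cells")
```

For ten sequences of length 300, the table already holds roughly 6 x 10^24 cells, far beyond any practical memory.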
Deterministic iterative strategies involve extracting sequences one by one from a multiple alignment and re-aligning them to the remaining sequences [46,47]; some of these methods can even be a mixture of progressive and iterative strategies [48]. The procedure is terminated when no more improvement can be made (convergence). Stochastic iterative methods include HMM training [49] and simulated annealing or genetic algorithms [50-56]. Their main advantage is to allow a good conceptual separation between the optimization process and the OF. Recent examples of algorithms belonging to these three categories are reviewed in the next section.

Review

The number of available MSA methods has steadily increased over the last 20 years. Being exhaustive is not possible within the scope of this work, and this review should be seen as complementary to another recent review [57]. Furthermore, it should be pointed out that only a minority of the methods described in the literature have found their way into regular usage. There are many reasons for failure, but the main one stems from a simple fact: there is no satisfactory theoretical framework in sequence analysis, and in this context an algorithm is only as good as it is useful. Improvements are driven by results rather than theory, so that programs with badly designed interfaces or poor portability have been discarded by natural selection, leaving their algorithms to be reinvented by later generations. Over the last few years, the field of MSA has undergone drastic evolutionary changes with the introduction of several new algorithms and new evaluation methods. Some of the methods used for multiple sequence alignments are listed in Table 1. Among all this, two new trends have emerged:
• the increasing use of iterative optimization strategies (stochastic or non-stochastic)
• the use of consistency-based scoring schemes
In this section, we review some of these new algorithms, their main characteristics and potential shortcomings.
Another major trend, which will not be extensively covered here, has been the introduction of HMM methods [9,35]. A very detailed account of HMM-based methods for MSAs may be found in [58].

Table 1. Some recent and less recent available methods for MSAs.
Name                Algorithm                       URL                                                      Ref.
MSA                 Exact                           http://www.ibc.wustl.edu/ibc/msa.html                    [28]
DCA                 Exact (requires MSA)            http://bibiserv.techfak.uni-bielefeld.de/dca             [39]
OMA                 Iterative DCA                   http://bibiserv.techfak.uni-bielefeld.de/oma             [61]
ClustalW, ClustalX  Progressive                     ftp://ftp-igbmc.u-strasbg.fr/pub/clustalW or clustalX    [29]
MultAlin            Progressive                     http://www.toulouse.inra.fr/multalin.html                [41]
DiAlign             Consistency-based               http://www.gsf.de/biodv/dialign.html                     [18]
ComAlign            Consistency-based               http://www.daimi.au.dk/~ocaprani                         [75]
T-Coffee            Consistency-based/progressive   http://igs-server.cnrs-mrs.fr/~cnotred                   [66]
Praline             Iterative/progressive           jhering@nimr.mrc.ac.uk                                   [48]
IterAlign           Iterative                       http://giotto.Stanford.edu/~luciano/iteralign.html       [70]
Prrp                Iterative/Stochastic            ftp://ftp.genome.ad.jp/pub/genome/saitama-cc/            [47]
SAM                 Iterative/Stochastic/HMM        rph@cse.ucsc.edu                                         [84]
HMMER               Iterative/Stochastic/HMM        http://hmmer.wustl.edu/                                  [68]
SAGA                Iterative/Stochastic/GA         http://igs-server.cnrs-mrs.fr/~cnotred                   [51]
GA                  Iterative/Stochastic/GA         czhang@watnow.uwaterloo.ca                               [52]

The progressive algorithms

Progressive alignment constitutes one of the simplest and most effective ways of multiply aligning a set of sequences in little time and with little memory. This algorithm was initially described by Hogeweg [42] and later re-invented by Feng [43] and Taylor [44]. The most widely used MSA packages are based on an implementation of this algorithm; these include Pileup, a part of the GCG package [59], MultAlin [41] and ClustalW [29], which has become the standard method for multiple alignments. ClustalW is a non-iterative, deterministic algorithm that attempts to optimize the weighted sums-of-pairs with affine gap penalties. It is a straightforward progressive alignment strategy where sequences are added one by one to the multiple alignment according to the order indicated by a pre-computed dendrogram. Sequence addition is made using a pair-wise sequence alignment algorithm [45]. The main shortcoming of this strategy is that once a sequence has been aligned, its alignment will never be modified, even if it conflicts with sequences added later in the process, as shown in Figure 1. ClustalW also includes many highly specialized heuristics meant to maximally exploit sequence information:
• local gap penalties
• automatic substitution matrix choice
• automatic gap penalty adjustment
• the delaying of the alignment of distantly related sequences
Benchmarking tests have been carried out on BAliBASE [31], a database of reference multiple sequence alignments. In general, ClustalW performs better when the phylogenetic tree is relatively dense, without any obvious outlier. It does not matter how widely the sequences are spread, as long as every sequence remains close enough to another (a bit like crossing a river stepping from stone to stone). Long insertions or deletions also cause trouble, due to the intrinsic limitation of the affine penalty scheme used by ClustalW. The latest improvement to the progressive alignment algorithm is T-Coffee, a novel strategy where sequences are aligned in a progressive manner but using a consistency-based objective function that makes it possible to minimize potential errors, especially in the early stages of the alignment assembly. T-Coffee is reviewed in more detail in the consistency-based algorithm section.

Exact algorithms

As mentioned earlier, progressive alignment is only an approximate solution. In order to use the signal contained in the sequences properly, one would like to simultaneously align them, rather than adding them one by one to a multiple alignment.
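The pair-wise dynamic programming step [45] that progressive methods apply at every stage, and that exact methods attempt to generalize to many sequences at once, can be sketched as follows (linear gap penalty for brevity; real aligners use substitution matrices and affine penalties):

```python
# Minimal Needleman & Wunsch global alignment with a linear gap
# penalty. The scores are toy values; production aligners use
# substitution matrices and affine gap penalties instead.

def needleman_wunsch(s, t, match=2, mismatch=-1, gap=-2):
    n, m = len(s), len(t)
    # F[i][j] holds the best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Traceback to recover one optimal alignment.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i and j and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); j -= 1
    return ''.join(reversed(a)), ''.join(reversed(b)), F[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))
```

Generalizing this recursion to k sequences turns the table into a k-dimensional hyperspace, which is precisely what the exact methods below try to prune.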
Simultaneous alignment would be especially useful when dealing with sets of extremely divergent sequences whose pair-wise alignments are all likely to be incorrect. Unfortunately, to align several sequences one would need to generalize the Needleman and Wunsch algorithm [45] to a multidimensional space, and for practical reasons (time and memory) this is only possible for a maximum of three sequences. That limit can be pushed a bit further if one finds a way to identify in advance the portion of the hyperspace that does not contribute to the solution and exclude it from the computation. This is achieved in the MSA program, an implementation of the Carrillo and Lipman algorithm [60] that makes it possible to align up to ten closely related sequences [28]. It should be stressed here that, contrary to a widespread belief, the MSA program is only a heuristic implementation of the Carrillo and Lipman algorithm and is not guaranteed to reach the mathematical optimum: MSA uses lower and upper bounds tighter than the guaranteed ones (Altschul, personal communication). Even so, the high memory requirements, the lengthy computation time and the limitation on the number of sequences explain why the MSA program quickly gave way to ClustalW. Yet MSA met with popularity again when Stoye described a new divide-and-conquer algorithm, DCA [39], that sits on top of MSA and extends its capabilities. The DCA algorithm cuts the sequences into subsets of segments that are small enough to be fed to MSA. The sub-alignments are later reassembled by DCA. The trick is to cut the sequences at the right points so that the produced alignment remains as close as possible to optimality. The way this is done in DCA is slightly heuristic, albeit fairly accurate. Benchmarking on BAliBASE indicated that the DCA strategy does slightly better than ClustalW, even if the four largest BAliBASE test sets could not be computed with DCA (Notredame, unpublished results).
Even when MSA is coupled to DCA, strong limitations remain on the number of sequences that can be handled (20–30) and on their phylogenetic spread. Recently, an iterative implementation of DCA [61], optimal multiple alignment (OMA), was described that is meant to speed up the DCA strategy and decrease its memory requirements.

Iterative algorithms

Iterative algorithms are based on the idea that the solution to a given problem can be computed by modifying an already existing sub-optimal solution. Each 'modification' step is an iteration. In the examples considered here, modifications can be made using dynamic programming or various random protocols. While the dynamic programming-based protocols can also include elements of randomization, we distinguish them from more traditional stochastic iterative methods such as simulated annealing (SA) [62] or genetic algorithms (GA) [63].

Figure 1. Limits of the progressive strategy. This example shows how a progressive alignment strategy can be misled. In the initial alignment of sequences 1 and 2, ClustalW has a choice between aligning CAT with CAT and making an internal gap, or making a mismatch between C and F and having a terminal gap. Since terminal gaps are much cheaper than internal ones, the ClustalW scoring scheme prefers the former. In the next stage, when the extra sequence is added, it turns out that properly aligning the two CATs in the previous stage would have led to a better-scoring sums-of-pairs multiple alignment.

Stochastic iterative algorithms

SA was the first stochastic iterative method described for simultaneously aligning a set of sequences.
Various schemes have been published [50,64], which all involve the same chain of processes: an alignment is randomly modified; its score is assessed; it is kept or discarded according to an acceptance function that gets more stringent as the iteration number increases (by analogy with a decreasing temperature during crystallization); and the process goes on until a finishing criterion such as convergence is met. In practice, despite being intellectually very attractive, SA is too slow for ab initio alignment and can only be used as an alignment improver. GAs constitute an interesting alternative to SA, as shown by SAGA [51], a GA dedicated to MSAs. Like SA, SAGA is an optimization black box in which any OF one invents can be tested. The principle of SAGA is very straightforward and follows closely the 'simple GA' [65]: randomly generated multiple alignments of a given set of sequences evolve under some selection pressure. These alignments are in competition with each other for survival (survival of the fittest) and reproduction. Within SAGA, fitness depends on the score measured by the objective function (the better the score, the fitter the multiple alignment). Over a series of cycles known as generations, alignments die or survive depending on their fitness. They can also improve and reproduce through stochastic modifications known as mutations and crossovers. Mutations randomly insert or shift gaps, while crossovers combine the content of two alignments (Figure 2). Overall, 20 operators co-exist in SAGA and compete for usage. The program does not guarantee optimality but has been shown to equal or outperform MSA from a mathematical point of view on 13 test sets (using exactly the same OF in both programs).
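The one-point crossover just mentioned (Figure 2) can be sketched as follows; this is an illustrative reconstruction, not SAGA's actual code:

```python
import random

# Sketch of a SAGA-style one-point crossover: the first parent is cut
# straight at a random column; the second parent is cut row by row so
# that each row keeps the same residues on each side, and the halves
# are swapped after padding with gaps to generate compatible ends.
# This is an illustrative reconstruction, not SAGA's actual code.

def residues_before(row, col):
    return sum(1 for c in row[:col] if c != '-')

def cut_after_residue(row, k):
    """Split a gapped row right after its k-th residue."""
    count = 0
    for pos, c in enumerate(row):
        if c != '-':
            count += 1
            if count == k:
                return row[:pos + 1], row[pos + 1:]
    return (row, '') if k else ('', row)

def one_point_crossover(parent1, parent2, rng):
    col = rng.randrange(1, len(parent1[0]))          # straight cut
    lefts1 = [row[:col] for row in parent1]
    rights1 = [row[col:] for row in parent1]
    lefts2, rights2 = [], []
    for r1, r2 in zip(parent1, parent2):
        left, right = cut_after_residue(r2, residues_before(r1, col))
        lefts2.append(left)
        rights2.append(right)

    def pad(rows, left_side):
        w = max(len(r) for r in rows)
        return [('-' * (w - len(r)) + r) if left_side
                else (r + '-' * (w - len(r))) for r in rows]

    child1 = [l + r for l, r in zip(pad(lefts1, False), pad(rights2, True))]
    child2 = [l + r for l, r in zip(pad(lefts2, False), pad(rights1, True))]
    return child1, child2

p1 = ["GAR-FIELD", "GARF-IELD"]
p2 = ["-GARFIELD", "GARFIELD-"]
c1, c2 = one_point_crossover(p1, p2, random.Random(1))
print(c1)
print(c2)
```

Whatever the cut point, each child row keeps its residues in order, so the children are always valid alignments of the same sequences.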
The complete disconnection between the operators and the original OF made it possible to seamlessly modify the original OF in order to test SAGA with a new OF named COFFEE (Consistency Objective Function For alignmEnt Evaluation) [66]. This series of studies revealed the suitability of GAs as investigation tools, but also made it clear that GAs were too slow a strategy for large-scale projects or everyday use. Another similar MSA GA was later introduced by Zhang and Wong [52]. The authors report a very high efficiency for their GA, but these results must be considered with care, since their strategy (especially the mutations) is driven by the presence of completely conserved segments that guide the assembly of the alignments. The assumption that such segments will always exist when aligning proteins is not realistic. This method appears to be appropriate when comparing very long, highly similar sequences (such as portions of genomes). SAGA was later parallelized by two independent groups [67,53], in order to improve its efficiency. The model described in SAGA has been met with considerable interest in the evolutionary programming community and, in recent years, at least three algorithms based on the SAGA principle have been published [54-56].

Figure 2. One-point crossover in SAGA. This figure illustrates the manner in which two alignments are combined into one in SAGA, a genetic algorithm that evolves a population of alignments toward optimality. The principle is to cut one of the alignments straight and to cut the second one so that compatible ends are generated.

The Gibbs sampler is another interesting stochastic iterative strategy [19]. It is a local multiple alignment method that finds ungapped motifs among a set of unaligned sequences. From a multiple alignment perspective, the most interesting feature of the Gibbs sampler is its OF. The algorithm aims to build an alignment with a good P-value (i.e., a low probability of having been generated by chance). At each iteration, segments are removed or added according to the probability that the current model (the rest of the alignment) could have generated them. If that probability is high enough, the model is updated with the new segments and the algorithm proceeds to the next iteration. The overall result is an alignment that has a good P-value and maximizes the probability of the data it contains (i.e., each sequence fits well within the alignment). This Bayesian idea of simultaneously maximizing the data and the model is also central to HMMs [9,35]; it is thus not surprising that HMMs can also be trained by expectation maximization [49,68]. However, like GAs, HMMs proved rather disappointing when it came to ab initio alignments. Today, HMMs such as those found in Pfam [11] are no longer generated from unaligned sequences. State-of-the-art protocols are much more inclined toward turning a pre-computed alignment into an HMM and further refining it using HMMER [49] or SAM [68].

Non-stochastic iterative algorithms

The first non-stochastic iterative algorithms date back to the origins of MSAs [46]. The idea is simple and attractive: since mistakes may arise in the early stages of a progressive alignment, why not correct them later by re-aligning each sequence in turn to the multiple alignment using standard dynamic programming algorithms [45]?
The procedure terminates when iterations consistently fail to improve the alignment. This very simple algorithm constitutes the core of most of the iterative strategies described in the early 1990s. The main scope for variation is the way sequences are divided into two groups before being re-aligned. In AMPS [46], sequences are chosen according to their input order and re-aligned one by one. In the algorithm of Berger and Munson [69], the choice is made in a random manner and sequences are divided into two groups that can contain more than one sequence. The element of randomization makes the algorithm more robust and improves its accuracy. Few of these early iterative methods have been properly benchmarked, making it hard to estimate their true biological significance. The most sophisticated DP-based iterative algorithm available was recently described by Gotoh [47]. It is a double-nested iterative strategy with randomization that optimizes the weighted sums-of-pairs with affine gap penalties (Figure 3). The originality of this algorithm is that the weights and the alignment are simultaneously optimized. The inner iteration optimizes the weighted sums-of-pairs, while the outer iteration optimizes the weights, which are calculated on a phylogenetic tree estimated from the current alignment [25]. The algorithm terminates when the weights have converged. Prrp was the first multiple alignment program to be extensively benchmarked, using JOY, a database of structural alignments. The results were confirmed on BAliBASE [31,34]. Prrp significantly outperforms most of the traditional progressive methods as well as some of the most recent iterative strategies (Table 2). Two other iterative alignment methods were recently described: Praline [48] and IterAlign [70]. These two methods share very similar protocols. They both start with a preprocessing of the sequences to align.
In IterAlign, sequences are 'ameliorated' (sic): each sequence is locally compared to the others, and every segment that shows high similarity with other proteins is replaced by a consensus. One round of 'amelioration' constitutes one iteration. Further iterations are run on the new set of 'ameliorated' sequences, until the collection of consensus sequences converges. Consistent blocks are then extracted from the consensus collection, and these blocks are chained in order to produce the final alignment. Praline uses a very similar protocol: sequences are replaced with a complete profile made from a multiple alignment that only includes their closest relatives. That profile step is iterated until the collection of profiles converges. This collection of profiles is conceptually similar to the 'ameliorated' set of sequences used by IterAlign. The multiple alignment is then assembled using a straightforward progressive algorithm where sequences are replaced with profiles. One of the most interesting consequences of the protocol used in Praline is the possibility of measuring the consistency between the final alignment and the collection of profiles used for its assembly. There may be some correlation between this measure and the true accuracy of the alignment. Regardless of the potential performance of these two methods (neither has been properly benchmarked), some emphasis should be given to the novel concepts they incorporate:
• the first is the use of local information in IterAlign, in order to decrease sensitivity to the gap penalty parameterization
• the second key concept is consistency
Sequences are preprocessed so that the regions consistently conserved across the family see their signal enhanced and become more likely to drive the alignment. This search for consistency has been one of the strongest trends in recent developments of MSA. It is also central to the non-iterative methods.
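The leave-one-out scheme shared by the deterministic iterative methods above can be sketched as follows, with toy scores and a simple sequence-against-block dynamic programming step standing in for the real algorithms:

```python
# Sketch of leave-one-out iterative refinement: each sequence is
# removed in turn and re-aligned, by dynamic programming, against the
# block of remaining rows; the result is kept when it improves a
# sums-of-pairs-like score. All scores here are toy values.

def sp_score(aln, match=1, mismatch=0):
    total = 0
    for col in zip(*aln):
        res = [c for c in col if c != '-']
        for i in range(len(res)):
            for j in range(i + 1, len(res)):
                total += match if res[i] == res[j] else mismatch
    return total

def realign(seq, rows, gap=-1):
    """Globally align a gap-free sequence against a fixed block of rows."""
    cols = list(zip(*rows))
    def col_score(c, col):
        return sum(1 if c == x else -1 for x in col if x != '-')
    n, m = len(seq), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_score(seq[i - 1], cols[j - 1]),
                          F[i - 1][j] + gap,   # residue against a new all-gap column
                          F[i][j - 1] + gap)   # gap inserted in the re-aligned row
    new_rows, new_seq, i, j = [''] * len(rows), '', n, m
    while i > 0 or j > 0:           # traceback rebuilds the whole block
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_score(seq[i - 1], cols[j - 1]):
            new_seq = seq[i - 1] + new_seq
            new_rows = [rows[r][j - 1] + new_rows[r] for r in range(len(rows))]
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            new_seq = seq[i - 1] + new_seq
            new_rows = ['-' + new_rows[r] for r in range(len(rows))]
            i -= 1
        else:
            new_seq = '-' + new_seq
            new_rows = [rows[r][j - 1] + new_rows[r] for r in range(len(rows))]
            j -= 1
    return new_rows, new_seq

def refine(aln, max_rounds=5):
    best = list(aln)
    for _ in range(max_rounds):
        improved = False
        for k in range(len(best)):
            rows = [r for i, r in enumerate(best) if i != k]
            # Drop the columns left entirely gapped by the removal.
            keep = [j for j in range(len(rows[0]))
                    if any(r[j] != '-' for r in rows)]
            rows = [''.join(r[j] for j in keep) for r in rows]
            new_rows, new_seq = realign(best[k].replace('-', ''), rows)
            cand = new_rows[:k] + [new_seq] + new_rows[k:]
            if sp_score(cand) > sp_score(best):
                best, improved = cand, True
        if not improved:            # convergence
            break
    return best

aln = ["ACG-T", "AC-GT", "A-CGT"]
refined = refine(aln)
print(refined, sp_score(refined))
```

The score can only go up or stay put at each acceptance, which is both the strength (guaranteed convergence) and the weakness (local optima) of the deterministic variants.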
Figure 3. Layout of Prrp. This figure shows the layout of Prrp, a double-nested strategy for optimizing multiple alignments. When the inner iteration has converged, new sequence weights are estimated. The convergence of these weights is the criterion for the outer iteration to stop.

Table 2. Some elements of validation on BAliBASE.
Method     Ref1   Ref2   Ref3   Ref4   Ref5   Total
DiAlign    71.0   25.2   35.1   74.7   80.4   57.3
ClustalW   78.5   32.2   42.5   65.7   74.3   58.7
Prrp       78.6   32.5   50.2   51.1   82.7   59.0
T-Coffee   80.7   37.3   52.9   83.2   88.7   68.7
Each method in the Method column was used to align the 141 test sets contained in BAliBASE. The alignments were then compared with the reference BAliBASE alignments using aln_compare [34]. Ref1–5 indicate the five BAliBASE categories; results obtained in each category were averaged. All the observed differences are statistically significant, as assessed by the Wilcoxon rank-based test [34,47]. Ref1 contains homogeneous sets of sequences; ref2 contains a homogeneous group of sequences and an outlier; ref3 contains two distantly related groups of sequences; ref4 contains sequences that require long internal gaps to be properly aligned; and ref5 contains sequences that require long terminal gaps to be properly aligned. Total is the average of ref1–5.

Consistency-based algorithm

The first consistency-based MSA method was described by Kececioglu in the 1980s [71]. Given a set of sequences, the optimal MSA is defined as the one that agrees the most with all the possible optimal pair-wise alignments. Computing that alignment is an NP-complete problem that can only be solved for a small number of related sequences, using an MSA-like algorithm. Nonetheless, there are at least three good reasons that make consistency-based OFs very attractive:
• firstly, they do not depend on a specific substitution matrix but rather on any method, or collection of methods, able to align two sequences at a time
• secondly, the consistency-based scheme is position dependent, given the collection of pair-wise alignments. This means that the score associated with the alignment of two residues depends on their indexes (positions within the protein sequences) rather than their individual nature
• the third reason is more general and has to do with consistency itself. Experience shows that, given a set of independent observations, the most consistent ones are often closer to the truth
This principle generally holds well in biology and can be loosely connected to the observation that, given a series of measurements, noise spreads while signal accumulates. Although the first consistency-based OF was described in 1983, it took several more years to develop heuristic algorithms able to deal with its optimization, and it is only recently that a GA (SAGA [51]) was used to show the biological advantages of such a function, COFFEE [66], which emulates the maximum weight trace problem. In SAGA-COFFEE, the collection of weighted pair-wise alignments is named a library, and SAGA is used to compute the alignment that has the highest level of consistency with the library. In practice, the library may contain more than one alignment for each pair of sequences; the information it contains may be redundant or conflicting and may originate from sources as various as one wishes (structure analysis, sequence comparison, database searches, experimental knowledge etc.). Although SAGA-COFFEE yielded interesting results, the GA was too slow for everyday use. This prompted the development of a new heuristic algorithm to optimize the COFFEE function in a time-efficient manner: T-Coffee (Figure 4). In T-Coffee, the COFFEE library is turned into a so-called 'extended library', a position-specific substitution matrix where the score associated with each pair of residues depends on the compatibility of that pair with the rest of the library. T-Coffee uses a procedure reminiscent of Vingron's dot matrix multiplication [72] and Morgenstern's overlapping weights [73]. The multiple alignment is assembled using a progressive alignment algorithm similar to the one used in ClustalW:
• pair-wise distances are computed
• a neighbour-joining tree is estimated [3]
• the sequences are aligned one by one following the topology of the tree
The main difference between T-Coffee and ClustalW is that in T-Coffee the extended library replaces the substitution matrix. Another important characteristic of T-Coffee is that its primary library is made of a mixture of global alignments (produced with ClustalW) and local alignments (produced with Lalign [74]). The benchmarking carried out on BAliBASE shows that this combination of local and global information makes the T-Coffee implementation able to outperform Prrp, ClustalW and DiAlign on the five categories of test sets contained in this reference database [34]. These results were obtained without tuning, since T-Coffee does not have any parameters of its own. Due to the library extension, T-Coffee does more than simply compute a consensus alignment. Nonetheless, given a collection of multiple alignments, it can be interesting to combine them into a single consensus multiple alignment. This is what the ComAlign program does [75], by combining several multiple alignments into a single, often improved, multiple alignment.
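A COFFEE-style consistency evaluation can be sketched as follows; pair weights and the library extension step are omitted, and the hand-written library stands in for real pair-wise alignments:

```python
# Simplified sketch of a consistency (COFFEE-like) evaluation: the
# library is a set of residue pairs produced by pair-wise alignments,
# and an MSA is scored by the fraction of library pairs it realizes.
# Pair weights and the library extension step are omitted.

def residue_pairs(msa):
    """All (seq_i, pos_i, seq_j, pos_j) residue pairs realized by an MSA."""
    pairs = set()
    counters = [0] * len(msa)       # position within each ungapped sequence
    for col in zip(*msa):
        idx = [(i, counters[i]) for i, c in enumerate(col) if c != '-']
        for i, c in enumerate(col):
            if c != '-':
                counters[i] += 1
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                (si, pi), (sj, pj) = idx[a], idx[b]
                pairs.add((si, pi, sj, pj))
    return pairs

def consistency(msa, library):
    """Fraction of library residue pairs present in the MSA."""
    return len(library & residue_pairs(msa)) / len(library)

# Library written out by hand; in a real system it would be produced by
# local and global pair-wise aligners.
library = {(0, 0, 1, 0), (0, 1, 1, 1), (0, 2, 1, 2),
           (0, 0, 2, 0), (1, 0, 2, 0)}
print(consistency(["ACG", "ACG", "A--"], library))   # all library pairs realized
```

The progressive optimizer then searches for the MSA maximizing this kind of agreement, rather than a matrix-based substitution score.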
Although the details differ, T-Coffee bears some similarity to DiAlign [73], another consistency-based algorithm that attempts to use local information in order to guide a global multiple alignment. DiAlign starts with an identification of highly homologous segment pairs. The weight of each of these pairs is defined by a P-value comparable to the P-values used in BLAST. Each of these segment pairs receives another score proportional to its compatibility with the complete set of segment pairs. This score is named an overlapping weight, and segment pairs weighted this way are very reminiscent of the extended library. The multiple alignment is then progressively assembled by adding the pairs of segments according to their weight. Assembly is made in a sequence-independent order, as opposed to the ClustalW-style progressive alignment strategy. Non-compatible segment pairs are discarded, hence the importance of the order induced by the weights. According to the authors, DiAlign is especially good at properly aligning sequences where local homology is the driving signal. This has been confirmed by BAliBASE benchmarking [31,34]. Overall, DiAlign is not as accurate as ClustalW or Prrp, but it does very well in categories 4 and 5 of BAliBASE, which require very long insertions to be properly aligned. Over the past few years, the DiAlign algorithm has been modified on numerous occasions for improved efficiency [76].

Figure 4. Layout of T-Coffee. This figure indicates the layout of T-Coffee. Local and global pairwise alignments are first computed and then combined into a primary library that is extended in order to be used for computing the multiple sequence alignment in a progressive manner.

Conclusion and expert opinion

Ten years ago, when schemes such as MSA were developed, there was very little data available and the main problem was to use every bit of available information properly. Today the situation has dramatically changed. We are overwhelmed by 'relevant' information, and in fact there is so much of it that, by choosing the data, one can suit the needs of almost any method (progressive, iterative etc.). Ironically, one could be tempted to say that data has improved faster than multiple alignment methods. As a consequence, the real challenge is not so much the multiple alignment itself but rather the choice of a subset of sequences that will yield the most biologically correct and informative alignment, given one method or another. There are two good reasons for not using all the available sequences:
• Alignments with a large number of sequences are slow to compute and hard to analyze. Whenever possible, an alignment should fit on a single sheet of A4 paper.
• Limitations of existing programs. Although they all use weighting schemes meant to minimize the effect of similar or highly correlated sequences, none of these schemes is entirely satisfactory, and over-represented sub-groups always end up dominating the alignment or the profile. This can prevent the proper alignment of less well represented sequence sub-groups that may be just as important.
Careful trimming by the user is still the best available way around that effect. Unfortunately, the increased sensitivity of database search tools, coupled with the increase in database size, has rendered this process very tedious. The second major change that has occurred over the last years is the increasing number of available 3D structures. Although the proportion of protein sequences with a known 3D structure is getting smaller and smaller, the situation is very different from a protein family perspective, and the proportion of protein families where at least one member has a known 3D structure increases regularly. This means that in most cases, multiple alignment modelling could benefit from the incorporation of 3D structural information, in order to enhance very remote homologies or to guide the choice of local penalties [77]. Very few of the available packages are able to mix structures and sequences within a multiple alignment. While ClustalW is able to use SwissProt secondary structure information for gap penalty estimation, a proper tool is still lacking for the simultaneous alignment of sequences and structures. Two of the methods introduced here are good candidates for such a combination. The consistency-based algorithms have the advantage of having few requirements on the origin of their libraries. For instance, DALI, the database of structural multiple alignments [78], relies on T-Coffee to assemble the collection of pair-wise alignments produced by the DALI algorithm into a multiple alignment. The double dynamic programming algorithm introduced by Taylor [79] is also a good candidate for that purpose. While it has been shown that this algorithm is suitable for structure-to-structure alignments [80], recent results indicate that it could also be used in the context of MSAs, and possibly as a means to mix sequences and structures [81]. The third major obstacle on the road toward an informative multiple alignment is the processing of repeats. Repeated sequences (in tandem or not) are renowned for confusing all existing MSA methods. When dealing with sequences that contain such repeats, the only solution is to pre-process the sequences, extract the repeats and only align homologous regions. This extraction can be made using any local multiple alignment tool, such as the Gibbs sampler [19], Mocca [82] or Repro [83].
Unfortunately, none of these repeat-extraction tools is well integrated within a global multiple alignment procedure. The Gibbs sampler and Mocca have the advantage of providing the user with some estimate of the biological relevance of their output.
The fourth point that needs to be raised here is computation. While elegant solutions have been found for parallelizing database searches, the parallelization of an MSA algorithm remains a difficult task. The operations involved in these algorithms require complex memory-sharing schemes that are not well suited to Linux farms and other clusters. When dealing with large sets of long sequences, supercomputers are still required for multiple alignment programs.
The last important point is the estimation of local accuracy. A common property of all the methods introduced here is that no single one is the best: each may be out-performed by the others on one protein family or another. For that reason, we feel that it is more important to be able to assess the exact level of accuracy of an alignment than to improve the average performance of each method. To our knowledge, only four packages incorporate an estimation of local alignment quality: ClustalX (the X-Window interface of ClustalW), Praline [48], T-Coffee [34] and Match-Box [20]. None of these estimators has been thoroughly benchmarked or properly validated.
Highlights
• MSAs are essential bioinformatics tools. They are required for phylogenetic analysis, for scanning databases for remote members of a protein family, and for structure prediction.
• No perfect method exists for assembling a multiple sequence alignment; all the available methods rely on approximations (heuristics).
• The most commonly used methods for building multiple sequence alignments use a progressive alignment algorithm (ClustalW [31]).
• Recent progress has focused on the design of iterative (Prrp [30], SAGA [51]) and consistency-based methods (DiAlign [33], T-Coffee [34], Praline [48] and IterAlign [70]).
• Benchmarking on a collection of reference alignments (BAliBASE [31]) indicates that ClustalW [31] performs reasonably well over a wide range of situations, while DiAlign is more appropriate for sequences with long insertions/deletions. These tests also indicate that T-Coffee [34] is, on average, the best of the methods evaluated that way.
• Future methods should be able to integrate structural information within multiple alignments and to provide some estimate of their local reliability.
To conclude, a multiple alignment is merely a very constrained model. It is a powerful way to spot inconsistencies within a data set and to visualize relationships that may exist among seemingly independent pieces of information. A multiple alignment may be driven by any available source of information: structure, sequence, experimental knowledge and so on.
Outlook
Are multiple sequence alignments here to stay? The answer is yes, without any doubt. As we enter the era of comparative genomics, the simultaneous comparison of large numbers of homologous biological objects will become more and more important in our understanding of biology, and there is no doubt in my mind that in 5 to 10 years MSAs will remain as central to sequence-based biology as they are now. This being said, MSA methods will also need to evolve. They will need to integrate heterogeneous information such as structures, results of database searches, experimental data and, in general, anything that may come from expression data and proteomic analysis, including regulatory information. Integrating such heterogeneous information is a complex task: when the data are heterogeneous, knowing who is right and who is wrong becomes an art. Addressing this type of question will be both difficult and essential. The appropriate method will have to do so in a transparent way, letting the user control every bit of extra information that goes into the alignment.
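Returning to the estimation of local accuracy discussed above, the idea behind such estimators can be illustrated with a toy consistency score, in the spirit of consistency-based evaluation but not the actual implementation of any of the four packages cited: each column of the alignment is scored by the fraction of its residue pairs that are also supported by a library of pairwise alignments.

```python
def column_reliability(msa, library):
    """msa: list of equal-length gapped strings.
    library: set of ((seq_index, ungapped_pos), (seq_index, ungapped_pos))
    pairs taken from independent pairwise alignments.
    Returns one reliability value per column."""
    pos = [0] * len(msa)                       # ungapped position per sequence
    scores = []
    for col in zip(*msa):
        residues = []
        for i, c in enumerate(col):
            if c != '-':
                residues.append((i, pos[i]))
                pos[i] += 1
        pairs = [(a, b) for k, a in enumerate(residues) for b in residues[k + 1:]]
        hits = sum((a, b) in library or (b, a) in library for a, b in pairs)
        scores.append(hits / len(pairs) if pairs else 0.0)
    return scores
```

Columns whose pairs disagree with the library get low scores and can be flagged as unreliable before any downstream analysis.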
This ideal method should also allow users to inject some of their own knowledge into the model, and doing so should be made easy. These ideas have been central to the development of the underlying philosophy of the T-Coffee package [34]. In any case, these future methods are bound to be memory- and CPU-hungry. Compared with database searches, multiple sequence alignment protocols are hard to optimize: special hardware may need to be adapted and the code may have to be redesigned. Several computer manufacturers are currently looking at this problem. One can easily imagine that a powerful multiple sequence alignment server will soon be a feature of most laboratories, just as PCR machines made their appearance in the 1990s.
Acknowledgements
The author wishes to thank Dr Jean-Michel Claverie for helpful advice and comments. He also wishes to thank the two referees for their interesting comments and for bringing to his attention several of the most recent references included in this work.
Bibliography
Papers of special note have been highlighted as either of interest (•) or of considerable interest (••) to readers.
1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol. 215, 403-410 (1990).
2. Rost B, Sander C, Schneider R: PHD - an automatic server for protein secondary structure prediction. CABIOS 10, 53-60 (1994).
3. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406-425 (1987).
4. Saitou N: Maximum likelihood methods. Meth. Enzymol. 183, 584-598 (1990).
5. Felsenstein J: Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783-791 (1985).
6. Bairoch A, Bucher P, Hofmann K: The PROSITE database, its status in 1997. Nucleic Acids Res. 25, 217-221 (1997).
7. Attwood TK, Croning MD, Flower DR et al.: PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28(1), 225-227 (2000).
8. Gribskov M, Luethy R, Eisenberg D: Profile analysis. Meth. Enzymol. 183, 146-159 (1990).
9. Haussler D, Krogh A, Mian IS, Sjölander K: Protein modeling using hidden Markov models: analysis of globins. In: Proceedings of the 26th Hawaii International Conference on System Sciences, Wailea, HI, USA. Los Alamitos, CA: IEEE Computer Society Press (1993).
10. Luthy R, Xenarios I, Bucher P: Improving the sensitivity of the sequence profile method. Protein Sci. 3(1), 139-146 (1994).
11. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer E: The Pfam protein families database. Nucleic Acids Res. 28(1), 263-266 (2000).
12. Altschul SF, Madden TL, Schaffer AA et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389-3402 (1997).
13. Garnier J, Gibrat J-F, Robson B: GOR method for predicting protein secondary structure from amino acid sequence. Meth. Enzymol. 266, 540-553 (1996).
14. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292(2), 195-202 (1999).
15. Rost B: Review: protein secondary structure prediction continues to rise. J. Struct. Biol. 134(2-3), 204-218 (2001).
16. Goebel U, Sander C, Schneider R, Valencia A: Correlated mutations and residue contacts in proteins. Proteins: Structure, Function and Genetics 18(4), 309-317 (1994).
17. Gutell RR, Weiser B, Woese CR, Noller HF: Comparative anatomy of 16S-like ribosomal RNA. Prog. Nucleic Acid Res. Mol. Biol. 32, 155-216 (1985).
18. Morgenstern B, Dress A, Werner T: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA 93, 12098-12103 (1996).
•• The first method described that does not require arbitrary gap penalties.
19. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208-214 (1993).
20. Depiereux E, Baudoux G, Briffeuil P et al.: Match-Box server: a multiple sequence alignment tool placing emphasis on reliability. Comput. Appl. Biosci. 13(3), 249-256 (1997).
21. Schuler GD, Altschul SF, Lipman DJ: A workbench for multiple alignment construction and analysis. Proteins 9(3), 180-190 (1991).
22. Henikoff S: Playing with blocks: some pitfalls of forcing multiple alignments. The New Biologist 3(12), 1148-1154 (1991).
23. Karlin S, Bucher P, Brendel V, Altschul SF: Statistical methods and insights for protein and DNA sequences. Annu. Rev. Biophys. Biophys. Chem. 20, 175-203 (1991).
24. Altschul SF, Lipman DJ: Trees, stars, and multiple biological sequence alignment. SIAM J. Appl. Math. 49, 197-209 (1989).
25. Altschul SF, Carroll RJ, Lipman DJ: Weights for data related by a tree. J. Mol. Biol. 207, 647-653 (1989).
26. Dayhoff MO, Schwarz RM, Orcutt BC: A model of evolutionary change in proteins. Detecting distant relationships: computer methods and results. In: Atlas of Protein Sequence and Structure. Dayhoff MO (Ed.), National Biomedical Research Foundation: Washington, DC, USA, 353-358 (1979).
27. Vingron M, Waterman MS: Sequence alignment and penalty choice. J. Mol. Biol. 235, 1-12 (1994).
28. Lipman DJ, Altschul SF, Kececioglu JD: A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA 86, 4412-4415 (1989).
29. Thompson J, Higgins D, Gibson T: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4690 (1994). •• The most widely used method for making multiple sequence alignments.
30. Gotoh O: Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Comput. Appl. Biosci. 10(4), 379-387 (1994). •• The first attempt to systematically assess the accuracy of an MSA method by comparison with a reference structural alignment; also the most complex dynamic-programming-based iterative method.
31. Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27(13), 2682-2690 (1999).
32. Gonnet GH, Korostensky C, Benner S: Evaluation measures of multiple sequence alignments. J. Comput. Biol. 7(1-2), 261-276 (2000).
33. Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15(3), 211-218 (1999).
34. Notredame C, Higgins DG, Heringa J: T-Coffee: a novel algorithm for multiple sequence alignment. J. Mol. Biol. 302, 205-217 (2000).
35. Baldi P, Chauvin Y, Hunkapiller T, McClure MA: Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. USA 91, 1059-1063 (1994).
36. Karplus K, Hu B: Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics 17(8), 713-720 (2001).
37. Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15(7-8), 563-577 (1999).
38. Wang L, Jiang T: On the complexity of multiple sequence alignment. J. Comput. Biol. 1(4), 337-348 (1994).
39. Stoye J, Moulton V, Dress AW: DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment. Comput. Appl. Biosci. 13(6), 625-626 (1997).
40. Higgins DG, Sharp PM: Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-244 (1988).
41. Corpet F: Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 16, 10881-10890 (1988).
42. Hogeweg P, Hesper B: The alignment of sets of sequences and the construction of phylogenetic trees: an integrated method. J. Mol. Evol. 20, 175-186 (1984). • The first description of the progressive algorithm.
43. Feng D-F, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351-360 (1987).
44. Taylor WR: A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28, 161-169 (1988).
45. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453 (1970).
46. Barton GJ, Sternberg MJE: A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 198, 327-337 (1987).
47. Gotoh O: Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264(4), 823-838 (1996).
48. Heringa J: Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Computers and Chemistry 23, 341-364 (1999).
49. Krogh A, Brown M, Mian IS, Sjölander K, Haussler D: Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235, 1501-1531 (1994).
50. Kim J, Pramanik S, Chung MJ: Multiple sequence alignment using simulated annealing. Comput. Appl. Biosci. 10(4), 419-426 (1994).
51. Notredame C, Higgins DG: SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res. 24, 1515-1524 (1996). • One of the first attempts to apply genetic algorithms to sequence analysis.
52. Zhang C, Wong AK: A genetic algorithm for multiple molecular sequence alignment. Comput. Appl. Biosci. 13(6), 565-581 (1997).
53. Anabarasu LA: Multiple sequence alignment using parallel genetic algorithms.
In: The Second Asia-Pacific Conference on Simulated Evolution and Learning (SEAL-98), Canberra, Australia (1998).
54. Gonzalez RR: Multiple protein sequence comparison by genetic algorithms. In: SPIE-98 (1999).
55. Chellapilla K, Fogel GB: Multiple sequence alignment using evolutionary programming. In: Congress on Evolutionary Computation (1999).
56. Cai L, Juedes D, Liakhovitch E: Evolutionary computation techniques for multiple sequence alignment. In: Congress on Evolutionary Computation (2000).
57. Duret L, Abdeddaim S: Multiple alignment for structural, functional or phylogenetic analyses of homologous sequences. In: Bioinformatics: Sequence, Structure and Databanks. Higgins D, Taylor W (Eds), Oxford University Press: Oxford, UK (2000).
58. Durbin R et al.: Biological Sequence Analysis. Cambridge University Press: Cambridge, UK (1998). •• One of the most comprehensive textbooks on the algorithms dedicated to sequence analysis.
59. Devereux J, Haeberli P, Smithies O: GCG package. Nucleic Acids Res. 12, 387-395 (1984).
60. Carrillo H, Lipman DJ: The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48, 1073-1082 (1988).
61. Reinert K, Stoye J, Will T: An iterative method for faster sum-of-pairs multiple sequence alignment. Bioinformatics 16(9), 808-814 (2000).
62. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1092 (1953).
63. Holland JH: Adaptation in Natural and Artificial Systems. University of Michigan Press: Ann Arbor, MI (1975).
64. Ishikawa M, Toya T, Hoshida M, Nitta K, Ogiwara A, Kanehisa M: Multiple sequence alignment by parallel simulated annealing. Comput. Appl. Biosci. 9, 267-273 (1993).
65. Goldberg DE: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley: New York, USA (1989).
66. Notredame C, Holm L, Higgins DG: COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14(5), 407-422 (1998).
67. Notredame C, O'Brien EA, Higgins DG: RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res. 25(22), 4570-4580 (1997).
68. Eddy SR: Multiple alignment using hidden Markov models. In: Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England. Menlo Park, CA: AAAI Press (1995).
69. Berger MP, Munson PJ: A novel randomized iterative strategy for aligning multiple protein sequences. Comput. Appl. Biosci. 7, 479-484 (1991).
70. Brocchieri L, Karlin S: Asymmetric-iterated multiple alignment of protein sequences. J. Mol. Biol. 276, 249-264 (1998).
71. Kececioglu JD: The maximum weight trace problem in multiple sequence alignment. Lecture Notes in Computer Science 684, 106-119 (1993).
72. Vingron M, Argos P: Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol. 218, 33-43 (1991).
73. Morgenstern B, Frech K, Dress A, Werner T: DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14(3), 290-294 (1998).
74. Huang X, Miller W: A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. 12, 337-357 (1991).
75. Bucka-Lassen K, Caprani O, Hein J: Combining many multiple alignments in one improved alignment. Bioinformatics 15(2), 122-130 (1999).
76. Lenhof HP, Morgenstern B, Reinert K: An exact solution for the segment-to-segment multiple sequence alignment problem. Bioinformatics 15(3), 203-210 (1999).
77. Jennings AJ, Edge CM, Sternberg MJ: An approach to improving multiple alignments of protein sequences using predicted secondary structure. Protein Eng. 14(4), 227-231 (2001).
78. Dietmann S, Park J, Notredame C, Heger A, Lappe M, Holm L: A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res. 29(1), 55-57 (2001).
79. Taylor WR, Saelensminde G, Eidhammer I: Multiple protein sequence alignment using double-dynamic programming. Comput. Chem. 24(1), 3-12 (2000).
80. Orengo CA, Taylor WR: A rapid method of protein structure alignment. J. Theor. Biol. 147, 517-551 (1990).
81. Eidhammer I, Jonassen I, Taylor WR: Structure comparison and structure patterns. J. Comput. Biol. 7(5), 685-716 (2000).
82. Notredame C: Mocca: semi-automatic method for domain hunting. Bioinformatics 17(4), 373-374 (2001).
83. Heringa J, Argos P: A method to recognise distant repeats in protein sequences. Proteins: Structure, Function and Genetics 17, 391-411 (1993).
84. Hughey R, Krogh A: Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci. 12, 95-107 (1996).

USE OF GENETIC ALGORITHMS FOR THE ANALYSIS OF BIOLOGICAL SEQUENCES
Cédric Notredame
PhD in Bioinformatics, February 1998
Université Paul Sabatier, France
Thesis supervisor: Prof. François Amalric

Acknowledgements
I wish to thank the EMBL for funding this work through an EMBL grant. The work was carried out in the lab of Des Higgins, first at the European Molecular Biology Laboratory in Heidelberg, Germany, and later at the EMBL outstation, the European Bioinformatics Institute, in Hinxton, UK. Des has been a constant source of support; I wish to thank him and express all my gratitude for teaching me most of what I know in bioinformatics, and so much more about being a scientist. I also wish to thank Thure Etzold, who gave me a place in his team when Des had to leave for Ireland; thanks to his expertise in the field, Thure has had a lot of influence on my work. A special thanks to the system managers, namely Roy Omond, Rodrigo Lopez and Petteri Jokinen, who were always supportive, allowing me to overload the machines whenever I needed to. Without their help, this work could not have been achieved. I also wish to thank Miguel Andrade, Inge Jonassen and Burkhard Rost for stimulating discussions and friendship.
Chris Sander and Liisa Holm have always been available to share their extensive experience of the field with me; I wish to express my gratitude to them. Michelle Magraine has proved an invaluable ally in the fight against my weaknesses in English grammar, and I wish to thank her for that. I also wish to thank Rob Harper for joining me in my everyday fight against PostScript files. My stay at the EBI has been extremely enjoyable. This work is dedicated to my friends and family, for their constant support over all these years.

THESIS SUMMARY
A large part of fundamental research in molecular biology relies on the study of proteins and nucleic acids. These extremely complex molecules result from the combination of simpler elements: amino acids and nucleotides. Twenty amino acids make up the great majority of proteins; five nucleotides make up most nucleic acids. The term "sequence" designates the chain of nucleotides constituting a nucleic acid, or the chain of amino acids constituting a protein. Most proteins are encoded by the genes contained in chromosomal DNA. In recent years, numerous technical advances have made large-scale sequencing of the genomes of several bacterial and eukaryotic species possible. These DNA sequences are stored in specialized databases (SwissProt, GenBank, the EMBL nucleotide database...) whose growth in size is now exponential. Bioinformatics is the sub-discipline of biology devoted to the analysis of these data by computational means. The basic principle of such an approach is the notion of a relationship between sequence and function: the goal is to extrapolate data obtained experimentally on some sequences to other sequences for which no experimental data are available.
While it is clear that two proteins (or nucleic acids) with the same sequence probably have the same function, the correlation becomes harder to establish when the sequences show only partial homology. The need to exploit this type of information is the main motivation behind the development of the comparison methods in use today, and the work presented in this thesis is essentially devoted to this aspect of bioinformatics. One of the most widely used means of comparing sequences is alignment. An alignment identifies the conserved regions between two sequences. These regions may correspond to structural or functional motifs whose identification allows predictions to be made about the putative function of the sequences under analysis. More generally, an alignment identifies regions subject to various constraints that impose the maintenance of certain properties. A good alignment also allows the evolutionary distance separating two organisms, or two proteins, to be estimated. In complex cases, however, the amount of information contained in two sequences is not sufficient, and it becomes necessary to extend the comparison to several sequences. This is the purpose of multiple sequence alignments. The problem they raise is twofold. It is first a biological problem: given a group of sequences, the properties of the optimal alignment must be defined. The simplest rule is to try to generate as many identities as possible in the columns while limiting the number of insertions/deletions (gaps). In practice, however, the rules used are more complex and may take into account the nature of the aligned amino acids (proteins) or the secondary structure of the sequences (RNA).
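The pairwise alignments discussed above are classically computed by dynamic programming. Below, a score-only sketch of the Needleman & Wunsch recurrence, with toy match/mismatch/gap parameters chosen for illustration (real programs use substitution matrices and affine gap penalties):

```python
def nw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, computed row by row
    with O(len(b)) memory (score only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]    # first row: all gaps
    for i, x in enumerate(a, 1):
        cur = [i * gap]                            # first column: all gaps
        for j, y in enumerate(b, 1):
            cur.append(max(prev[j - 1] + (match if x == y else mismatch),
                           prev[j] + gap,          # gap in b
                           cur[j - 1] + gap))      # gap in a
        prev = cur
    return prev[-1]
```

Adding a traceback over the same matrix recovers the alignment itself rather than just its score.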
The custom is to give this list of rules a mathematical form that associates a score with each alignment; one then speaks of an objective function. A large number of such functions have been described over recent years. Broadly, they can be divided into two groups: functions based on substitution matrices and insertion/deletion penalties, and global functions such as HMMs (hidden Markov models). One of the most important properties of an objective function is its biological meaning: ideally, a function should assign to an optimal alignment a score reflecting the biological interest of the information it contains. The second aspect is purely computational. It is not enough to have an objective function; one must also be able to optimize its score (i.e. to produce the alignment with the best score). This problem is far from trivial: the optimization of most objective functions belongs to the class of so-called NP-complete problems. Consequently, the optimization can only be carried out with heuristic methods, which do not guarantee an optimal solution. The work presented in this thesis covers all of these questions. In the first part, a global optimization method based on a genetic algorithm is proposed. This method is implemented in a program named SAGA (Sequence Alignment by Genetic Algorithm). Genetic algorithms are optimization strategies based on an analogy with natural selection. In theory, this method can be applied to any type of objective function.
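The first family of objective functions can be illustrated with a minimal sums-of-pairs score; toy match/mismatch values and a linear gap cost stand in here for a real substitution matrix and the affine penalties used in practice:

```python
def sum_of_pairs(msa, match=2, mismatch=-1, gap=-2):
    """Sums-of-pairs score of a multiple alignment: every pair of rows is
    scored column by column.  A residue against a gap costs `gap`;
    two gaps facing each other cost nothing."""
    total = 0
    for col in zip(*msa):                       # iterate over columns
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                a, b = col[i], col[j]
                if a == '-' and b == '-':
                    continue
                total += gap if '-' in (a, b) else (match if a == b else mismatch)
    return total
```

Maximizing this quantity over all possible gap placements is exactly the NP-complete optimization problem mentioned above.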
The second part of the work consisted of defining a new objective function (COFFEE: Consistency based Objective Function For alignmEnt Evaluation) and optimizing it with SAGA, in order to show that COFFEE can induce better alignments than alternative methods. The third application focused on the alignment of ribosomal RNAs, with the definition of an objective function suited to taking secondary interactions into account; this program, adapted from SAGA, was named RAGA (RNA Alignment by Genetic Algorithm). One of the main limitations of RAGA lies in the simplicity of the objective function used. To remedy this, reference alignments were analysed in order to determine the parameters that could help define an objective function that models the alignment constraints more realistically. This work constitutes the fourth application presented in this thesis. Overall, this work establishes the usefulness of genetic algorithms in the context of multiple sequence alignment problems. SAGA is currently the best-performing algorithm for optimizing the objective functions commonly used for multiple sequence alignments. For protein sequences, SAGA is the only algorithm capable of globally aligning more than ten sequences. For ribosomal RNA, RAGA is the only program able to align sequences longer than two thousand base-pairs while taking pseudo-knots into account. In addition, the COFFEE function is one of the few able to induce alignments that are biologically more accurate than those obtained with ClustalW (one of the most popular alignment methods).
SUMMARY OF THE APPENDICES
Document 1. SAGA: Sequence Alignment by Genetic Algorithm
In this article, a new approach is proposed for solving the multiple sequence alignment problem. A genetic algorithm was designed and implemented in a program named SAGA. The method involves the evolution of a population of alignments. In this context, evolution means that the quality of the alignments is gradually improved over a succession of cycles (generations) comprising steps of random modification (operators) and steps of score-based selection. The degree of improvement is judged by evaluating the score of each alignment with the objective function. SAGA uses an automatic control scheme to regulate the simultaneous use of twenty operators designed either to recombine alignments with one another (crossovers) or to modify them individually (mutations). To test SAGA, we used as a reference the program MSA (Multiple Sequence Alignment), which can optimize one of the most commonly used objective functions (sums-of-pairs with affine insertion/deletion penalties). Used in this context, SAGA gives better results than MSA in terms of optimization (score of the resulting alignment). Moreover, the alignments produced by SAGA are biologically more accurate, judging by their similarity to alignments of the same sequences obtained by structure comparison. In total, SAGA was tested on thirteen groups of sequences for which a structure-based reference alignment is available in the 3D_ali database.
Document 2. COFFEE: A New Objective Function for Multiple Sequence Alignments
In this work, we present a new way of evaluating multiple sequence alignments. This function is named COFFEE.
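The evolutionary scheme just described can be caricatured in a few lines: a single toy gap-insertion operator stands in for SAGA's twenty operators, and plain truncation selection for its score-based selection. Everything here (function names, the objective, the parameters) is illustrative, not SAGA's actual implementation.

```python
import random

def match_count(aln):
    """Toy objective: number of identical residue pairs per column."""
    total = 0
    for col in zip(*aln):
        res = [c for c in col if c != '-']
        total += sum(res[i] == res[j] for i in range(len(res))
                     for j in range(i + 1, len(res)))
    return total

def evolve(seqs, objective, generations=200, pop_size=20, seed=0):
    """Evolve a population of alignments by random modification and
    score-based selection; returns the best alignment found."""
    rng = random.Random(seed)

    def pad(rows):
        width = max(len(r) for r in rows)
        return tuple(r + '-' * (width - len(r)) for r in rows)

    def mutate(aln):
        rows = list(aln)
        i = rng.randrange(len(rows))
        r = rows[i].replace('-', '')            # strip, then reinsert one gap
        k = rng.randrange(len(r) + 1)
        rows[i] = r[:k] + '-' + r[k:]
        return pad(rows)

    pop = [pad(seqs)] * pop_size
    for _ in range(generations):
        pop = pop + [mutate(rng.choice(pop)) for _ in range(pop_size)]
        pop.sort(key=objective, reverse=True)
        pop = pop[:pop_size]                    # truncation selection
    return pop[0]
```

Because parents are retained before selection, the best score never decreases from one generation to the next, so the result is always at least as good as the naive right-padded alignment it starts from.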
COFFEE measures the degree of consistency between a multiple sequence alignment and a reference library containing the same sequences aligned two by two. It is shown that the COFFEE score can be efficiently optimized by SAGA. The function was used on eleven groups of sequences for which a reference alignment is available in the 3D_ali database. In nine cases out of eleven, SAGA used with COFFEE produces better alignments than those obtained with ClustalW (judging by their similarity to the reference alignments). We also showed that the score assigned by COFFEE can be used to evaluate the quality of a multiple alignment, either locally or globally. Finally, the reference library can be built from pairwise alignments obtained by structure comparison (for instance, alignments extracted from FSSP); in that case, SAGA-COFFEE is able to produce multiple structural alignments of very high quality. In theory, COFFEE should make it possible to apply to multiple alignments any method suited to the alignment of pairs of sequences.
Document 3. RAGA: RNA Sequence Alignment by Genetic Algorithm
This article describes a new approach for aligning two homologous RNA sequences when the secondary structure of one of the two molecules is known. To this end, two programs were developed: RAGA (RNA sequence Alignment by Genetic Algorithm) and PRAGA (Parallel RAGA). Both are essentially based on SAGA. Parallelization is achieved by synchronizing a defined number of active copies of RAGA, which exchange part of their populations following a topology defined as a tree with multiple branches and variable depth. This method makes it possible to optimize an objective function that takes into account the primary and secondary information contained in the two sequences.
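The consistency measure can be written down in a few lines. This sketch scores an alignment by the unweighted fraction of its aligned residue pairs that are present in the pairwise library; the published COFFEE function additionally weights each pair (e.g. by sequence identity), which is omitted here:

```python
def coffee_score(msa, library):
    """msa: list of equal-length gapped strings.
    library: set of ((seq_index, ungapped_pos), (seq_index, ungapped_pos))
    pairs taken from a reference collection of pairwise alignments.
    Returns the fraction of aligned pairs supported by the library."""
    pos = [0] * len(msa)                 # ungapped position per sequence
    aligned = supported = 0
    for col in zip(*msa):
        residues = []
        for i, c in enumerate(col):
            if c != '-':
                residues.append((i, pos[i]))
                pos[i] += 1
        for k, a in enumerate(residues):
            for b in residues[k + 1:]:
                aligned += 1
                supported += (a, b) in library or (b, a) in library
    return supported / aligned if aligned else 0.0
```

A score of 1.0 means every pair of residues placed in the same column is also aligned in the library; structure-derived libraries (e.g. built from FSSP pairs) simply change the content of `library`, not the scoring.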
One of the most interesting properties of RAGA is that it can handle both classical stem-loops and the pseudo-knots present in ribosomal RNA. RAGA was tested against nine reference alignments extracted from expert alignments; these alignments, made of small-subunit ribosomal RNAs, served as references. In every case, PRAGA outperforms, in accuracy, the alternative methods based on dynamic programming. This holds even when the phylogenetic distance separating the two sequences to be aligned is very large (as between human and Saccharomyces cerevisiae).
Document 4. Optimisation of RNA Profile Alignments
This project complements Document 3 and is part of a broader effort to create the tools needed for the automatic maintenance of ribosomal RNA databases. Several databases of this type exist around the world; they are essentially maintained by hand, and in the long term the creation of automatic methods will become an absolute necessity. In Document 3, we proposed a procedure for aligning two sequences. Using only two sequences amounts to ignoring the vast amount of information contained in the multiple alignments established by expert groups. The goal of this project was to define some of the ways this information can be used. Different weighting methods were tested and implemented in a dynamic programming framework. The methods were evaluated by testing the quality of the alignment obtained when a sequence is extracted from, then realigned to, a multiple alignment. This strategy takes only primary constraints into account.
The results show that, with an adequate weighting scheme and a suitable system of insertion/deletion penalties, the quality of the alignment between a profile and a sequence can be improved considerably.

1 INTRODUCTION
2 THE SCOPE OF SEQUENCE ALIGNMENTS
3 EVALUATING ALIGNMENTS
4 MAKING MULTIPLE SEQUENCE ALIGNMENT
5 COMPARISON OF THE METHODS
6 SUMMARY OF THE CONTRIBUTIONS
CONCLUSION

TABLE OF CONTENTS

1 INTRODUCTION
1.1 BIOINFORMATICS AND BIOLOGY
1.2 COMPARING SEQUENCES
1.3 OUR APPROACH
2 THE SCOPE OF SEQUENCE ALIGNMENTS
2.1 WHAT IS A SEQUENCE ALIGNMENT?
2.2 WHAT IS THE USE OF A SEQUENCE ALIGNMENT?
2.3 WHAT IS A 'GOOD' ALIGNMENT?
2.4 HOW TO BUILD A 'GOOD' SEQUENCE ALIGNMENT
3 EVALUATING ALIGNMENTS
3.1 PROTEIN PAIRWISE ALIGNMENTS
3.1.1 Substitution Matrices
3.1.2 Gap Penalties
3.1.3 Database Searches
3.2 EVALUATING PROTEIN MULTIPLE SEQUENCE ALIGNMENT
3.2.1 Using Substitution Matrices and Gap Penalties: SP Alignments
3.2.2 Weights
3.2.3 Profiles
3.2.4 Hidden Markov Models
3.3 RNA ALIGNMENTS: TAKING INTO ACCOUNT NON-LOCAL INTERACTIONS
4 MAKING MULTIPLE SEQUENCE ALIGNMENT
4.1 COMPLEXITY OF THE PROBLEM
4.2 DETERMINISTIC GREEDY APPROACHES
4.2.1 Aligning Two Sequences
4.2.2 Aligning Two Alignments: Progressive Alignment Methods
4.3 DETERMINISTIC APPROACHES FOR NON-PROGRESSIVE MULTIPLE ALIGNMENTS
4.3.1 The Carrillo and Lipman Algorithm
4.3.2 Other Approximation Techniques
4.4 STOCHASTIC HEURISTICS
4.4.1 What is a Stochastic Method?
4.4.2 Iterative Alignments and Expectation-Maximization Strategies
4.4.3 Simulated Annealing
4.5 GENETIC ALGORITHMS
4.5.1 What is a Genetic Algorithm?
4.5.2 Applications of GAs in Sequence Analysis
5 COMPARISON OF THE METHODS
6 SUMMARY OF THE CONTRIBUTIONS
6.1 SAGA: MAKING MULTIPLE SEQUENCE ALIGNMENT BY GENETIC ALGORITHM
6.2 COFFEE: IMPROVING ON EXISTING OBJECTIVE FUNCTIONS
6.3 RAGA: THREADING RNA SECONDARY STRUCTURES
6.4 OPTIMIZING RIBOSOMAL RNA PROFILE ALIGNMENTS
CONCLUSION
APPENDIX: RESEARCH PAPERS

Annex 1: "SAGA: Sequence Alignment by Genetic Algorithm", Notredame and Higgins, 1996
Annex 2: "COFFEE: A new Objective Function for Multiple Sequence Alignments", Notredame, Holm and Higgins, 1998
Annex 3: "RAGA: RNA Alignment by Genetic Algorithm", Notredame, O'Brien and Higgins, 1997
Annex 4: "Optimisation of Ribosomal RNA Profile Alignments", O'Brien, Notredame and Higgins, 1998

1 INTRODUCTION

1.1 BIOINFORMATICS AND BIOLOGY

Life as we know it is a complex arrangement of biological structures designed to interact with each other. As complex as they may appear, even the most elaborate living structures can be described as arrangements of smaller, less complex building blocks (such as cells), which are themselves the result of the combination of even smaller basic blocks (such as metabolites, proteins and nucleic acids). Identifying these structures and characterizing their functions is a major aim of biology. Questions can be addressed at any level of organization one may wish to study (from populations down to atoms, in fact). In such a top-to-bottom approach, molecular biology is almost at the bottom. It deals with biological structures at the molecular level, trying to understand how they are created and how they interact with one another to perform the basic functions of life.
The search for ordered systems is an important part of the biological methodology. Ordered systems usually make it possible to establish general rules allowing a global understanding of otherwise disparate collections of facts. In this respect, the discovery of the structure of DNA and the understanding of RNA and protein synthesis have been two of the most important milestones of modern molecular biology. They have allowed a deep and precise understanding of some of the most central cellular mechanisms. We now know that proteins and RNA molecules are involved in virtually all steps of biological processes. We also know that these key components of cellular life are encoded in DNA sequences, in an almost universal manner.

These DNA sequences are contained in the genomes of living organisms. At the lowest level, a genome can be described as a long string of nucleotides. It could be compared to a very long text made of four letters. As in a text, the letters are not distributed at random but organized in words. In the context of a genome, a word is any sequence having a function. One of the difficulties in trying to identify these 'words' stems from the fact that nature uses spaces and punctuation in a very personal way. This means that not only do we not know beforehand the function of the 'words', we also do not know where they start and finish. Add to this the fact that there are many different classes of 'words'. Some allow the binding of other molecules to the DNA; others are transcribed into RNAs, which can in turn be translated into proteins. A protein or an RNA molecule may contain motifs that have functions of their own (binding sites, catalytic sites...). The genome also contains some very specific combinations of words, such as genes (enhancers, promoters, introns, exons...).
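The 'word search' described above can be made concrete with a toy example. The motif and genome string below are invented for illustration; real motif discovery is far harder precisely because the words and their boundaries are not known in advance:

```python
# Toy illustration of finding a known 'word' in a genome 'text'.
# The motif and the sequence are invented examples, not real
# annotated data.

def find_motif(genome: str, motif: str) -> list:
    """Return every 0-based position where the motif occurs."""
    hits = []
    for i in range(len(genome) - len(motif) + 1):
        if genome[i:i + len(motif)] == motif:
            hits.append(i)
    return hits

genome = "GGCTATAATGCGTATAATCC"
print(find_motif(genome, "TATAAT"))  # exact occurrences of the motif
```

Exact string matching of this kind is only the degenerate, easy case of the problem sketched in the text above.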
Bioinformatics and molecular biology are two complementary techniques with similar aims: the identification of biological structures and sub-structures at the molecular level, and the characterization of their function. 'Function' is a very general concept. If we look at it from an experimental point of view, such as genetics, a function can be defined with respect to a gene and will often be deduced from what happens when this gene is inactivated or modified by a mutation. Often, this is not enough to gain a truly deep understanding of the mechanisms involved. To do so, one will have to know whether this function is performed by a protein, an RNA, or a regulatory sequence. If it is a protein, then the next question is 'how does the protein perform its function?'. If it is an enzyme, we will want to know where the catalytic site is, whether it looks like any other known site, which residues are involved in the site and how they perform their function. The protein (enzyme or not) may also interact with other proteins, nucleic acids or metabolites. Here again, we will want to know which portions of the protein are involved in these interactions and what the potential partners are. In most cases, this information will be much easier to understand and predict when a 3D model of the protein is available. In a broad sense, the function of a protein (or of a nucleic acid or a DNA/RNA regulatory element) is defined by the sum of all these elements.

Until recently, the only way to gather these pieces of information together was to use wet-lab techniques: genetic analysis, cloning, sequencing, interaction experiments and so on. Although the results obtained in this way can usually be regarded as strong biological evidence, they suffer from a major drawback: the cost of wet-lab techniques is extremely high, in terms of both time and money. This means that there is a limit on the number of functions that can be thoroughly investigated through such techniques.
The problem has become especially severe now that the improvement of sequencing techniques gives us access to far more sequences than it will ever be possible to analyse in the wet lab. It is this situation that has prompted the massive development of bioinformatics techniques over the last few years. Bioinformatics could be regarded as an approach diametrically opposed to the traditional experimental ones: instead of starting from a phenotype, one starts with a sequence and tries to gather as much information as possible by comparing this sequence to others for which experimental evidence is available. But the difference between the two approaches is much less acute than it seems. Bioinformatics relies on the same basic assumptions as classic biology. It is a method of inquiry based on a series of comparisons that lead to classifications and predictions. The main paradigm of bioinformatics is that sequence conservation is correlated with function conservation. Under this framework, the aim is to extrapolate, as far as possible, the information acquired experimentally. The process follows a traditional feedback scheme where models are built and then validated or invalidated by experiments made in the wet lab or in silico.

Darwinian laws of evolution and the notion of parsimony often underlie the bioinformatics approach. The assumption is that biological systems have evolved from the same origin, constantly reusing some basic building blocks (such as metabolic pathways) and adapting them to respond to their environmental constraints. If, each time a new constraint appeared, a new biological system were created from scratch, the bioinformatics approach would probably be bound to fail. Fortunately, in most cases, this is not what happens.
Through the cycles of mutation and selection that constitute evolution, new functions have been created by reusing pieces of already existing machinery, and existing functions have evolved to become better adapted to the environment in which they are needed. In terms of sequences, this means that two sequences responsible for similar functions may differ, depending on how long they have been diverging (i.e. how long ago the original sequence was duplicated, or how long ago the two organisms containing these sequences started diverging). Nevertheless, if the distance separating them is small enough, an evolutionary scenario can be reconstructed that shows how related these sequences are. Depending on what is known for one of these sequences (or for other sequences of the same category), it then becomes possible to make assumptions about the function. On the other hand, if the sequences are evolutionarily too far apart, it may prove difficult to analyse their relationship accurately by comparing the sequences alone. The signal they contain may have to be enhanced using other techniques, such as structure prediction.

Sequences are only conceptual objects. As such, they have no function in a cell. In fact, even the distinction between RNA, proteins and DNA is artificial. As far as the cell is concerned, all these elements only exist as complex 3D arrangements of atoms. It is because of its precise 3D structure that a molecule has the mechanical and chemical properties it needs to perform its function (catalytic activity, interactions...). The relation between structure and function is probably one of the oldest paradigms of molecular biology. We also accept that, broadly speaking, structure is induced by sequence, although we know for a fact that very different amino acid sequences can code for similar 3D folds.
This last point helps to explain why proteins with different sequences can have similar function and structure: natural selection acts on the active 3D structures rather than on the sequences (i.e. evolution gives more freedom to the sequence than to the structure). As a consequence, relationships between proteins (or RNAs) are usually easier to analyse when the structures are known. Unfortunately, structures are difficult to determine experimentally, and prediction from sequence alone (ab initio folding, threading) is still one of the main challenges of computational biology. It is true, however, that useful tools exist that can help supplement weak sequence identity.

Developing new techniques for automatically analysing sequences is one of the main purposes of research in bioinformatics. It is a point of crucial importance. Today, all the major databases of nucleotide or protein sequences are growing in size at an exponential rate (doubling every year or so). This means that the proportion of sequences for which experimental data are available is decreasing. For this reason, targeting the points at which experiments are needed has become more important than ever. Such a goal will only be achieved by gaining a better understanding of the ways in which information can be extrapolated from one sequence (or a set of sequences) to another. This is the only way to make any use of the DNA sequencing results (at least in a reasonable amount of time). It is for this reason that sequence comparison tools are at the heart of the bioinformatics approach.

Figure 1. Growth of the EMBL Nucleotide database over the 1985-1997 period. The last release of the EMBL nucleotide sequence database (Rel. 52, October 1997) contained 1,181,167,498 nucleotides. The last release of the Swiss-Prot database (Rel. 32, October 1996) contained 21,210,389 amino acids in 59,021 entries.
For comparison, the last PDB release (proteins with known structures) of December 1997 contained a total of 6,731 entries.

1.2 COMPARING SEQUENCES

In most cases, the problem facing the user takes the following form: a new sequence is available, and it is desirable to search the database to find out whether one or more close relatives of this sequence have already been reported. If so, one may wish to extrapolate some of the experimental data gathered in this way to the new sequence. In such a case, the solution is to compare the sequence of interest to all the sequences contained in the database, keeping track of the most similar. Two very popular tools are used to perform basic database similarity searches: FASTA (1) and BLAST (2).

Sequence comparison can also be much more complex. For instance, by combining experimental data contained in the databases and sequence analysis, one may want to know whether a specific motif is sufficient for a protein to bind zinc. If several types of alternative motifs emerge from such an analysis, one may want to build a classification. Further experimental data may also be available that allow the establishment of functional differences between the zinc-binding motifs (some may be associated with RNA-binding proteins and others with DNA-binding proteins). This, for instance, is one of the ideas developed in the Prosite database (3). Once such results have been established, new sequences can be scanned for the known motifs they contain.

As simple and trivial as they may seem, such strategies raise various very complex difficulties that need to be overcome. First of all, one has to define what 'comparing' means. This is typically a biological problem. The features one is interested in when looking at two or more sequences will obviously depend on the aim and the scope of the comparison.
Do we want to compare two proteins because they have the same sequence, the same function, related functions, because they are expressed in the same circumstances, or because they have similar folds? Are these questions equivalent? When it comes to making a detailed analysis, sequence alignments appear as one of the most powerful solutions. In the simplest case, they only involve two sequences (pairwise alignments), but they can be extended to a larger number (multiple sequence alignments).

Having decided that we want to use sequence alignments, the problem of defining what a 'good' sequence alignment is remains. It is a difficult question that requires a deep understanding of the biological information one wishes to extract from such alignments. In most cases, the criterion that allows evaluation of the quality of an alignment will take the form of a mathematical function (an objective function) associating a value with an alignment. But even so, the problem is not solved. Having a criterion for alignment quality is not enough: one also needs to be able to build the best-scoring alignment according to this quality criterion. In most cases, this is far from easy. Many of the problems in bioinformatics, and more specifically in sequence alignment, are NP-complete. This means that the number of potential solutions rises exponentially with the number of sequences and their length, and that no algorithm is known that finds the optimal solution in polynomial time and space. The need to overcome such severe limitations requires the development of powerful algorithms.

1.3 OUR APPROACH

In the work presented here, the problem of sequence alignment was approached through the two aspects mentioned above:
- defining new objective functions for sequence alignment;
- developing new ways to optimize these functions.
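To make the notion of an objective function concrete, the sketch below scores a multiple alignment with a simple sum-of-pairs scheme. The match/mismatch/gap values are arbitrary illustrations, not the scoring scheme proposed or evaluated in this work:

```python
# Minimal sum-of-pairs objective function for a multiple alignment.
# Scheme (illustrative only): match +1, mismatch -1, residue against
# gap -2, gap against gap 0.

def sum_of_pairs(alignment):
    """Map an alignment (list of equal-length gapped strings) to a score."""
    score = 0
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        # score every pair of residues in the column
        for i in range(len(column)):
            for j in range(i + 1, len(column)):
                a, b = column[i], column[j]
                if a == "-" and b == "-":
                    continue            # gap/gap: 0
                elif a == "-" or b == "-":
                    score -= 2          # residue/gap
                elif a == b:
                    score += 1          # match
                else:
                    score -= 1          # mismatch
    return score

print(sum_of_pairs(["GARFIELD-",
                    "GARF-ELDS",
                    "GARFIELDS"]))
```

Once such a function exists, 'aligning' becomes an optimization problem: find, among the astronomically many candidate alignments, one that maximizes the score.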
One of the main concerns of this approach was the fact that there is no use in defining new sequence comparison schemes if no tool is available to use them and to allow a judgment to be made on their potential relevance. To test an objective function, one must be able to optimize it and to compare the quality of the alignments it provides with that of other methods. The new optimization scheme proposed here is a genetic algorithm named SAGA, for Sequence Alignment by Genetic Algorithm (Annex 1). This algorithm was used to evaluate the biological relevance of COFFEE (Consistency Based Objective Function for alignmEnt Evaluation), a new objective function designed for protein multiple sequence alignment (Annex 2). SAGA was also adapted to RNA alignments (RAGA, RNA Alignment by Genetic Algorithm) using an objective function that takes into account secondary-structure interactions in RNA (Annex 3). In order to improve RAGA's objective function, a new function was designed for aligning an RNA sequence to a large multiple RNA sequence alignment (Annex 4). This function was only tested using a traditional optimization method (dynamic programming).

The following sections deal with the three main concepts associated with sequence alignment: what it is useful for, how to define a sequence alignment and, finally, how to build a sequence alignment. The last section puts the four contributions in their relative context.

2 THE SCOPE OF SEQUENCE ALIGNMENTS

2.1 WHAT IS A SEQUENCE ALIGNMENT?

A sequence alignment is the representation of two sequences in a way that reflects their relationship. If the alignment is correct, two residues aligned with one another are homologous. The definition of homology depends on the criterion used for the alignment. For instance, if the aim is to identify the relationship between two structures, two residues will be aligned because they are equivalent in the 3D structures.
If the alignment is designed to reflect phylogenetic relationships, two residues will be aligned when they originate from the same residue in the common ancestor. The definition of a pairwise alignment can be extended to multiple sequence alignments. In this case, several sequences are aligned together and each column contains homologous residues. However, homologous residues do not necessarily exist in each sequence for each position of the alignment. If a given sequence lacks a residue, a gap will be inserted in its place at the corresponding position. Gaps usually take the form of strings of null signs. In an evolutionary context, a null sign means that a residue was inserted in one of the sequences, or deleted in the other, while the sequences were diverging from their common ancestor.

There are two types of alignments: global and local. In a local alignment, the only portions that are aligned are those which are clearly homologous; the rest of the sequence is ignored. In a global alignment, the whole sequences are aligned, regardless of the local level of similarity. The scope of global and local alignments is usually different. Local alignments are more appropriate when the sequences analysed are remotely related and may share only a few domains. Global alignments are mostly designed to analyse sequences that are known to be homologous to one another. In this thesis, I will mostly consider global sequence alignments.

It is also important to realize that, given a set of sequences, there are a great many alternative alignments. For instance, given two sequences of 1000 residues each, there are about 10^764 different possible alignments. This rules out any naive enumeration strategy for identifying the correct one! Instead, we will see that several strategies have been developed that allow more or less efficient computation in polynomial time.

2.2 WHAT IS THE USE OF A SEQUENCE ALIGNMENT?
Quantitatively, the most widely used application of sequence alignment is database searching, where the aim is to find, for a given query sequence, all the related sequences contained in a database. The principle is very straightforward: the query sequence is aligned in turn with every member of the database, and the results are ranked according to some similarity criterion. FASTA(1) and BLAST(2) are the most popular tools of this type. They rely on local rather than global alignments. However important the results obtained in this way, there is a clear limit to the quality of the alignments that can be deduced from these searches. Firstly, the sequence alignment algorithms implemented in these programs are only crude approximations of the standard sequence alignment algorithm; this is necessary in order to search very large databases in a reasonable amount of time. Secondly, in most cases the searches are based on pairwise alignments, which means that they only contain a limited amount of information.

Although database searching is at the heart of many approaches, pushing the analysis further may require important refinements, such as structure prediction, identification of new motifs or domains, generalization of the family properties (i.e. combining the information contained in the known sequences in order to identify distant members), or phylogenetic analysis. For all these applications, pairwise alignments are of limited use: a way to simultaneously combine the information contained in several sequences is needed. Such a need is the main motivation for building multiple sequence alignments.

Multiple alignments are very important for phylogenetic analyses because they provide a way to compute evolutionary distances and phylogenetic trees. Trees are computed from sets of pairwise distances using clustering algorithms such as the neighbor-joining method(4).
When computing a tree, it is very important to have accurate pairwise distances, hence the use of a multiple alignment, in which the pairwise alignment of two sequences depends on the information contained in all the sequences of the set.

Another fundamental application of multiple sequence alignments is the identification of motifs or domains. In a multiple alignment, these elements often appear as portions under constraints that limit divergence. If some of the sequences are experimentally characterized, these motifs can be used for function prediction. This is, for instance, one of the aims of the Prosite(3) and ProDom(5) databases. The information contained in a multiple sequence alignment can also be generalized in order to produce a profile(6) or a hidden Markov model(7) that can be used for identifying new family members.

The other important use of multiple alignments is structure prediction. In a given protein, residues do not all evolve in the same fashion, depending on their role in the structure (buried/exposed, helix/beta strand/loop...). It is very hard to extract such information from a sequence alone, while it can be accessed through the analysis of multiple alignments by looking at the distribution of the substitutions. Using multiple sequence alignments instead of sequences alone has had a dramatic effect on this area of sequence analysis, boosting the accuracy of protein secondary structure predictions from 55%(8, 9) to 75%(10). One can also go further and try to identify correlated mutations in multiple sequence alignments. This has been attempted on many occasions in proteins, with limited success(11-13). In contrast, in RNA analysis, the identification of correlated mutations has been of great help, allowing accurate prediction of secondary structures and even tertiary structures(14-16).
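The per-column distribution of substitutions mentioned above can be summarized numerically, for instance with Shannon entropy, a common conservation measure. A toy sketch (the function name and the decision to count gaps as a 21st symbol are my choices, not a published predictor):

```python
import math
from collections import Counter

def column_entropies(alignment):
    """Shannon entropy (in bits) of each column of a multiple alignment.

    Low entropy = conserved column; high entropy = freely substituting
    position.  Gaps, if present, are counted as one more symbol here
    for simplicity.
    """
    ncol = len(alignment[0])
    entropies = []
    for j in range(ncol):
        counts = Counter(seq[j] for seq in alignment)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

msa = ["MKLV", "MKIV", "MRLV"]   # toy alignment: 3 sequences x 4 columns
print(column_entropies(msa))     # columns 0 and 3 are fully conserved (0 bits)
```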
Finally, a less challenging but very important application of multiple sequence alignments is the localization of highly conserved areas for the design of efficient PCR primers, in order to clone new members of a family. All these examples reflect the importance of multiple sequence alignments in the domain of sequence analysis. We will show here that making good multiple sequence alignments is a multi-step task.

2.3 WHAT IS A 'GOOD' ALIGNMENT?

A scoring function associates a score with an alignment. Ideally, the better the score, the more biologically accurate the alignment. An alignment with the best possible score is said to be optimal, whether it is biologically relevant or not; an optimal alignment always exists. Being able to distinguish between biologically relevant and non-relevant alignments is an important issue, especially when analyzing databases. Powerful statistical tools have been developed for this purpose, allowing hits to be assessed for their potential biological relevance(2). The problem remains when aligning sequences known to be related. For instance, two homologous domains may surround a loop that is different in the two structures. In this case, any alignment of the residues contained in these loops will be meaningless, even though a mathematical optimum exists.

Another important problem, common to many areas of computational biology, has to do with the choice of parameters. Most objective functions come with complex sets of parameters. In many cases, one has to rely on empirical values known to lack robustness (i.e. small changes of the parameter values may induce very different alignments), which may lead to inaccurate alignments. This explains why a large amount of the work dedicated to objective function definition has focused on parameter elimination. Ironically, so much work has been done in this field that the choice of a scoring scheme can almost be regarded as one more parameter requiring optimization.
Among the countless existing methods, we will only describe some of the most important schemes, focusing on those related to the work carried out for this thesis. There are two types of alignments: sequence and structure alignments. Sequence alignments do not require any non-local interactions to be taken into account and are therefore less complex (algorithmically speaking) than structural alignments. For the problem of sequence alignment we will mostly talk about protein sequences, while the problem of structure alignment will be addressed through the example of RNA secondary structures. The problem of protein structure alignment will not be analysed in depth: its complexity and the amount of literature available on the subject put it beyond the scope of this dissertation, which is mostly oriented toward sequence analysis rather than structure.

2.4 HOW TO BUILD A 'GOOD' SEQUENCE ALIGNMENT

In many cases, building a sequence alignment takes the form of a compromise between biological relevance, mathematical optimality and efficiency. Given an objective function, it may be very hard to produce the mathematically optimal alignment. We already mentioned that, done naively through enumeration, sequence alignment is beyond the reach of any computer. For this reason, trade-offs need to be made on both sides: objective functions need to be defined in such a way that they fit existing optimization techniques, and optimization techniques need to be improved in order to accommodate the complexity of the problem. For instance, in its most general form, the problem of aligning two sequences is NP-complete(17). However, if formulated under certain constraints, it can be solved with a technique known as dynamic programming(18), but it becomes NP-complete again with the number of sequences (i.e. when trying to align more than two sequences simultaneously)(19). Structure alignments (i.e.
sequence alignments taking into account non-local interactions) are also NP-complete, even for two sequences, and can only be addressed in a simplified form (i.e. using an objective function that does not reflect all the known constraints)(20). Because of this NP-completeness, most of the algorithms developed in the context of sequence alignment are heuristics(21-24). This means that they do not guarantee a mathematically optimal solution, but rather a good approximation. In many cases, this trade-off is reasonable and allows the computation of multiple sequence alignments in an efficient manner. In this thesis, I will describe some of the optimization methods currently used, with a special emphasis on genetic algorithms.

3 EVALUATING ALIGNMENTS

3.1 PROTEIN PAIRWISE ALIGNMENTS

3.1.1 Substitution Matrices

The twenty amino acids commonly found in proteins have very specific physico-chemical properties such as size, charge and hydrophobicity. The role of a residue in a protein mostly depends on these properties. For this reason, substitutions do not occur at random but in a way that reflects physico-chemical constraints in the 3D structure. It is therefore a very intuitive idea to try to associate with each possible substitution a cost depending on its probability. This information can be stored in what is known as a substitution matrix, a 20 by 20 table giving the relative cost, or the probability, of each possible amino acid substitution. Although many types of matrices have been proposed, the most successful are those derived empirically(25, 26). The principle often involves the statistical analysis of a large set of alignments. Interestingly, these matrices tend to be in general agreement with what would be expected from the physico-chemical properties of the residues (e.g. substitutions conserving charge, size or hydrophobicity have lower costs). We will review here the most popular of these matrices and their relative strengths and weaknesses.
The simplest possible substitution matrices are those that only reward identities in an alignment. Considering their simplicity, they do remarkably well in a variety of cases, probably because they put a very drastic threshold on background noise, only allowing the identification of very strong signals(27). On the other hand, these matrices disregard a large part of the information, which proves a big disadvantage when making pairwise alignments between sequences with a low level of identity but a clear homology. The need to show that two sequences with a low level of identity can still be significantly similar has been one of the main motivations behind the development of more complex substitution matrices that take into account similarity as well as identity.

The Dayhoff matrices(25), also known as PAM matrices, are among the most widely used. The principle on which they are built is quite straightforward. Alignments of closely related proteins (more than 85% identity) are made; at such a high level of identity, alignments are usually straightforward and accurate. In these alignments, the frequency of each possible substitution is measured. The table of frequencies obtained in this way is turned into a probability model (a log-odds matrix). This model can be used to define weight matrices appropriate for comparing sequences of any degree of divergence. The distance is measured in PAM (point accepted mutations per 100 residues), and one can have matrices from 1 PAM up to 500 PAM.
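The extrapolation step can be illustrated on a toy alphabet: the short-distance mutation probability matrix is raised to the n-th power (this is exactly the independence assumption the PAM model relies on), and scores are then taken as log-odds against background frequencies. All numbers below are invented for illustration; real PAM matrices are 20 by 20 and derived from counted substitutions:

```python
import math

# Toy 3-letter "1 PAM"-style matrix: M1[i][j] is the probability that
# residue i is replaced by residue j over one PAM unit (invented values).
M1 = [[0.99, 0.006, 0.004],
      [0.008, 0.99, 0.002],
      [0.005, 0.005, 0.99]]
freqs = [0.4, 0.35, 0.25]        # background residue frequencies (invented)

def mat_mult(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam(n):
    """Extrapolate the 1-PAM model to n PAM units: M1 ** n.

    Each PAM unit is applied as an independent Markov step, which is the
    assumption criticized in the text.
    """
    result = M1
    for _ in range(n - 1):
        result = mat_mult(result, M1)
    return result

def log_odds(mat):
    """Score s(i, j) = log2( P(j | i) / f(j) ), the usual log-odds form."""
    return [[math.log2(mat[i][j] / freqs[j]) for j in range(len(mat))]
            for i in range(len(mat))]

scores250 = log_odds(pam(250))
# Even at 250 PAM, identities still score above substitutions.
print(scores250[0][0] > scores250[0][1])
```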
[Figure 2. A log-odds matrix computed from mutation data. This is a PAM 250 matrix, extrapolated from the original PAM 15. Each entry indicates the cost for aligning two residues. The worst substitution costs are usually associated with tryptophan (W).]

Originally, the Dayhoff matrices were established using 71 sets of aligned protein sequence pairs with 1572 point mutations (amino acid substitutions). The main limit of this approach is the fact that the information content of alignments at 85% sequence identity is low. It therefore appears risky to extrapolate such limited information to large evolutionary distances such as PAM 250. Furthermore, the extrapolation of the PAM model to an arbitrary PAM distance rests on the assumption that mutations are independent events. This hypothesis was challenged by several alternative methods. The most popular alternative to PAM-based scoring functions is the BLOSUM scheme(26).
It is based on a library of blocks extracted from sequences of related proteins. A block is a local multiple alignment that does not contain any gaps. About 2000 blocks were used for establishing the matrices. Given a set of substitution frequencies, BLOSUM and PAM matrices are computed in a similar fashion. The main difference is that in BLOSUM the frequencies are measured in a way that takes sequence identity into account. This way, using the collection of blocks, several sets of matrices can be generated, ranging from 80 to 45% average identity, without any extrapolation. BLOSUM matrices have been shown on various occasions to outperform PAM and other matrices(26-28).

The main reproach that can be made to these two types of matrices is that they attempt to be general while relying only on alignments with a low information content (sequences more than 85% identical) or on domains that can be aligned without gaps (blocks). The question of the existence of biases in these matrices has often been discussed. For instance, consider alpha helices and beta strands. The types of substitutions observed in these two types of structural elements are known to be slightly different (this is the basis of some efficient secondary structure prediction algorithms(29)). As a consequence, if a matrix is built using a data set that contains more helices than beta strands, it will be biased and will perform poorly when aligning portions of sequences forming beta strands. On the other hand, if the data set is balanced between the two types of structures, the matrix will simply be an average, and will not be as good as it could be in either case. In an attempt to compensate for this type of potential problem, several alternative matrices were built for helices(30), beta strands(30) or transmembrane domains(31). The problem is that the way in which such matrices should be used is far from obvious.
Solving this problem often amounts to solving structure prediction, unless a structure is available for some of the homologous sequences of interest. Overall, about 40 different substitution matrices have been proposed, using no fewer than 20 different methods, with different training sets in most cases. Recently, two general studies attempted to understand the fundamental differences between these schemes(27, 32). The main motivation in the work of Vogt et al. was to assess these matrices through the accuracy of the alignments they induce using dynamic programming (see Section 4.2.1). Correctness was judged using the structural alignments contained in 3D_ali(33). Their work shows that there is very little difference between the best matrices (Gonnet(34), BLOSUM(26) and Benner(35)). They also concluded that the matrices able to identify remote homologues in database searches were the ones leading to the most correct alignments. Finally, from an algorithmic point of view (see 4.2.1), they concluded that these matrices are better suited to global alignments than to local alignments.

The second study(32) focused on understanding the way these matrices reflect amino acid properties. The authors found that PAM units are significantly correlated with volume and hydrophobicity, while other matrices are much more biased toward identity. Interestingly, their results indicate that, despite the different methods and data sets used for their construction, most of the substitution matrices based on sequence analysis are highly correlated with Dayhoff's. When these matrices are grouped according to their level of correlation, the global matrices fall into the same cluster (i.e. are very similar), while structure-specific matrices fall into separate clusters. This is further evidence supporting the idea that matrices should be applied in a way that takes structural information into account.
We will see later that Dirichlet mixtures(36) provide an interesting alternative to this problem (see Section 3.2.4). The matrix comparison problem was also addressed in a different context, and with a smaller set of matrices, by Henikoff et al., who compared matrices for their ability to discriminate sequences in a database search using FASTA or BLAST (i.e. local alignments). The results obtained that way confirm that matrices based on the alignment of distantly related sequences (such as BLOSUM) or structures(37) perform much better than PAM matrices. Finally, a point on which all these studies agree is the necessity of using appropriate gap penalties when scoring an alignment with a matrix. Most of the results obtained using substitution matrices can be dramatically affected by the set of penalties used to score gaps. In the next section, we review some of the concepts underlying the definition and scoring of gaps when aligning two sequences.

3.1.2 Gap penalties

Substitutions are not the only events affecting sequences while they diverge: insertions and deletions also occur. This means that two sequences may contain unrelated portions that should not be aligned. Such an event is represented by a gap in the sequence that did not receive the insertion, or in which the deletion occurred:

               Deletion                Terminal gap
    XXXXXXX-----------------XXXXXXXXXX--------
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
             or Insertion

It is obvious that a cost should be assigned to these events when scoring an alignment. Unfortunately, the choice of a scoring scheme often implies some evolutionary model, and we still lack a good understanding of the underlying biology of insertions and deletions. For instance, unless a very reliable phylogenetic scenario is available, it is often very hard to distinguish between insertion and deletion(38, 39), hence the word indel, which describes a position where an insertion OR a deletion occurred.
Some useful information about indels can be gathered by looking at structural alignments(39). As one would expect, indels do not happen at random, but tend to concentrate in the portions of the structure with fewer steric constraints, such as loops. It is in theory possible to extrapolate this information to sequences of unknown tertiary structure. For instance, Chou and Fasman(40) proposed using secondary structure propensity to derive local gap penalties. Another method, proposed by Pascarella and Argos(39), involves measuring, on reference structural alignments, the probability of having a gap after a given residue. This scoring scheme has been implemented in ClustalW(21), where a gap can be given a cost depending on the residue after which it occurs. Similarly, some heuristics implemented in ClustalW attempt to locate areas more prone to gap insertion, such as stretches of hydrophilic residues, usually exposed in loops. However, these methods are both empirical and very general; they do not take into account the specificity of the sequences one is interested in aligning (for instance, owing to structural constraints, some positions may be more conserved than others). We will later discuss the way information derived from a multiple sequence alignment can be used to establish more reliable local gap propensities (cf. the profile section, 3.2.3).

The question "where do gaps occur?" is only one part of the problem. When scoring a gap, one also has to ask: "is this gap long enough?". There is clear evidence that gaps should be penalized according to their sizes. Analyses made on mammals(41) suggest that a logarithmic scheme could be quite appropriate, with a gap opening penalty and an extension penalty that is a function of the length of the gap (i.e. a penalty per residue decreasing with the length of the gap). These results confirm those obtained by Benner et al.
(38) and suggest that linear gap penalties (a penalty cost increasing linearly with the gap length) are less than realistic. Results also suggest that insertions and deletions should be distinguished from one another(38, 41). However, even if there is wide agreement that non-linear schemes would probably be biologically more relevant(42), alternative solutions suffer from a major drawback: their implementation poses significant algorithmic problems when it comes to optimizing alignments(43, 44). For this reason, in practice, gap penalties are usually optimized under a simplified form known as the "affine gap penalty". It can be formalised as follows:

    cost = gap opening penalty + gap extension penalty * length    (eq. 1)

This gives penalties as a linear function of gap length, and an efficient algorithm for optimizing this scheme was introduced by Gotoh(45). There is no real justification for using this type of model, apart from the fact that it performs reasonably well, especially when the gaps are small (fewer than 20 residues). Since the size of indels is known to be a function of the evolutionary distance(38), linear gap costs will be acceptable when aligning closely related sequences. However, since affine gap penalties are not empirically based, they raise the problem of defining the values of the two parameters: the gap opening penalty (GOP) and the gap extension penalty (GEP). There is no guaranteed way to choose these values so that they fit the sequences one is interested in. A popular practice is to give the opening penalty a value equal to the average of the values contained in the substitution matrix used for comparison (excluding the main diagonal), and to set the extension penalty to a tenth of the opening value. It is also common practice not to penalize terminal gaps, at least for opening. When making an alignment, there is a competition between gap insertion and substitution.
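Gotoh's method keeps three dynamic programming matrices (substitution, gap in one sequence, gap in the other) so that affine costs of the form of eq. 1 can be optimized in O(mn) time. A score-only sketch under a simple match/mismatch scoring (all parameter values are invented for illustration; a real implementation would use a substitution matrix):

```python
NEG = float("-inf")

def gotoh_score(a, b, match=2, mismatch=-1, gop=10, gep=1):
    """Score-only global alignment with affine gap penalties,
    where a gap of length L costs gop + gep * L (as in eq. 1),
    using Gotoh's three-state recurrence.

    M: alignment ends with a substitution column;
    X: ends with a gap in b;  Y: ends with a gap in a.
    """
    m, n = len(a), len(b)
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    X = [[NEG] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = -(gop + gep * i)        # leading gap in b
    for j in range(1, n + 1):
        Y[0][j] = -(gop + gep * j)        # leading gap in a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] - gop - gep, X[i - 1][j] - gep)
            Y[i][j] = max(M[i][j - 1] - gop - gep, Y[i][j - 1] - gep)
    return max(M[m][n], X[m][n], Y[m][n])

# One long gap (one opening) is cheaper than several short ones:
print(gotoh_score("ACGTACGT", "ACGT"))   # 4 matches - (10 + 4) = -6
```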
To some extent, gap penalties can be regarded as thresholds used to decide whether a stretch of residues has a homologue in the other sequence. In this context, it makes sense to modulate the penalties with some local information (secondary structure propensity, profile information...). But even so, a major problem remains: when the aligned sequences are only remotely related, the gap penalties lack robustness. A study by Vingron and Waterman showed that slight variations of the GOP and GEP values can induce very different optimal alignments(46). Under such conditions, it may be hard to decide which alignment is biologically the most relevant. An attempt to increase the robustness of the penalty parameters has been proposed by Taylor: "score run-length enhancement"(47). It originates from the observation that in biologically relevant alignments, gaps are usually clustered in a few parts of the sequences and separated by long uninterrupted blocks. The technique proposed by Taylor involves enhancing the score of long ungapped portions in order to prevent them from being interrupted by gaps. This work shows that under this scoring scheme, a correct guess for the penalty values becomes much less critical than previously reported.

There is little doubt that the correct treatment of gaps, and a deep understanding of their biology, are critical to making accurate sequence alignments. However, as shown here, the problem is mostly algorithmic: it is possible to define gap penalties that describe sequence relations in a realistic way, it is simply not practical to optimize them. The problem of practicality becomes even more acute when it comes to scanning databases containing hundreds of thousands of sequences.

3.1.3 Database Searches

An exhaustive treatment of database search methods is beyond the scope of this thesis.
The most commonly used methods are briefly described here because they involve specific scoring schemes designed for evaluating the statistical significance of an alignment. The principle of a database search is quite straightforward: a query sequence is aligned in turn with all the other sequences of the database and, depending on their score, alignments are kept as relevant or discarded.

BLAST(2) is probably the most popular method for database searches. Given two proteins, the method involves finding the high-scoring pairs of residues (i.e. short stretches of aligned residues, with no gaps, that have high scores). The scores are evaluated using a substitution matrix (PAM120, for instance). These segments are found by looking for words of a specified size(48) and extending these words. Since the method does not allow gaps, it will in many cases be restricted to relatively small segments. In such a context, the raw score alone is not informative enough to decide whether or not a high-scoring pair is significant; scores need to be normalized in order to become comparable from a statistical point of view. This normalization takes into account the size of the database and the size of the sequences, and yields the number of matches of equal or better score expected by chance. This value is called the E value and is used to rank the hits.

The other popular tool for database searches is FASTA(49). As in BLAST, FASTA starts by looking for high-scoring segments using the Wilbur and Lipman method(48). The segments the method is interested in are those having a high proportion of identities. They are scored using the main diagonal of a substitution matrix such as PAM 250 (i.e. only considering identities, but giving them a score that depends on the amino acids). The ten best diagonals found in this way are then re-scored using a full substitution matrix.
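The word-based seeding shared by both tools can be sketched as follows: index every word of the subject sequence, look up the query's words, and extend each hit without gaps until the score drops too far below the best seen. Everything here (function name, scoring values, single-direction extension) is a simplification for illustration; real BLAST extends in both directions, uses a substitution matrix, and adds further heuristics:

```python
def ungapped_hits(query, subject, w=3, match=1, mismatch=-2, dropoff=3):
    """Toy seed-and-extend search: exact word seeds of length w,
    then rightward ungapped extension with an X-drop-style cutoff.
    Returns (query_pos, subject_pos, best_score, hit_length) tuples.
    """
    # Index every word of the subject.
    words = {}
    for j in range(len(subject) - w + 1):
        words.setdefault(subject[j:j + w], []).append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in words.get(query[i:i + w], []):
            score = best = w * match      # seed score
            right = 0                     # extension past the seed
            qi, sj = i + w, j + w
            while qi < len(query) and sj < len(subject):
                score += match if query[qi] == subject[sj] else mismatch
                if score > best:
                    best, right = score, qi - (i + w) + 1
                elif best - score > dropoff:
                    break                 # score dropped too far: stop
                qi, sj = qi + 1, sj + 1
            hits.append((i, j, best, w + right))
    return hits

print(ungapped_hits("ACGTTTT", "ACGTTAA"))
```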
In a second step, non-collinear segments are joined by dynamic programming, using a segment-joining penalty (analogous to a gap penalty). The resulting scores are then used to rank the matches. In order to prevent this chaining step from decreasing selectivity, it is only applied when the best-scoring segment is above some empirical threshold. The mean and the standard deviation of the score distribution are then evaluated and used to decide on a final threshold separating spurious hits from real ones. Of course, both these methods lead to false negatives and false positives, partly because they do not use the most accurate method for local alignments(50), which has been shown to significantly outperform FASTA(51), and also because some background noise is difficult to avoid, especially when dealing with very large databases. The reason why the statistics behind FASTA are less developed than those behind BLAST is that assessing the statistical significance of a gapped alignment is much harder than that of an ungapped segment, as in BLAST. This may change soon. Vingron and Waterman have recently proposed a scheme that allows the estimation of probabilities for gapped alignments(52). Furthermore, very recently, a new, gapped version of BLAST(53) has been published that incorporates some of these results and allows the statistical ranking of gapped alignments when scanning a database. Other statistical scoring systems include the one described by Bucher(54).

3.2 EVALUATING PROTEIN MULTIPLE SEQUENCE ALIGNMENTS

3.2.1 Using Substitution Matrices and Gap Penalties: SP Alignments

Extending the definition of pairwise costs to multiple sequence alignments can become complicated. To keep within the framework used for pairwise alignments, we seek a multiple alignment cost that is the sum of the substitution costs (costs based on substitution matrices and gap penalties). Nonetheless, alternative costs can be defined, based on different evolutionary scenarios.
Because genetic mutations are binary events that change one protein or nucleic acid sequence into another, substitution costs for multiple alignments are generally defined in terms of those for pairwise alignments(42). Two different approaches have been described. The first is to define the substitution cost for a set of elements as the sum of the substitution costs for all pairs of elements chosen from the set(55-57). Thus, for three sequences i, j and k, the cost of the alignment of i, j and k will be equal to the sum of the costs of each pairwise alignment induced by the multiple alignment (cost(i,j) + cost(i,k) + cost(j,k)). These induced pairwise alignments are also called pairwise projections. An alignment defined in this way is called an SP ("sum-of-pairs") alignment. In such a context, the evaluation of a multiple alignment amounts to measuring the dissimilarities within a set of letters. Although this approach has the advantage of being straightforward and intuitive, its main disadvantage is that it has no clear foundation in the theory of molecular evolution.

Sankoff(58) proposed an approach in closer agreement with biological intuition. In his model, an evolutionary tree is assumed in which each sequence is a leaf; the internal nodes are occupied by reconstructed sequences. If the tree has k nodes, then substitution costs are defined on k-tuples and equal the sum of the pairwise substitution costs associated with each edge of the tree. An alignment defined in this way is a tree alignment. It must not be confused with progressive alignment (see Section 4.2.2), which often relies on estimated phylogenetic trees for the computation of an approximate SP alignment. In the case where the tree has only one central node, the result is named a star alignment, being based on a star phylogeny.
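The sum-of-pairs evaluation can be written directly from the definition: score every pairwise projection and add the results. A minimal sketch with a simple column-wise gap cost rather than affine penalties (function name and scoring values are mine):

```python
def sp_score(alignment, match=1, mismatch=0, gap=-1):
    """Sum-of-pairs score of a multiple alignment: the sum, over every
    pair of sequences, of the score of the pairwise alignment the MSA
    induces (its pairwise projection).  Columns where both sequences
    carry a null are dropped from the projection.
    """
    total = 0
    nseq = len(alignment)
    for p in range(nseq):
        for q in range(p + 1, nseq):
            for x, y in zip(alignment[p], alignment[q]):
                if x == "-" and y == "-":
                    continue              # null-null columns are removed
                elif x == "-" or y == "-":
                    total += gap
                else:
                    total += match if x == y else mismatch
    return total

msa = ["ACG-T",
       "ACGGT",
       "AC--T"]
print(sp_score(msa))   # projections score 3, 2 and 1: total 6
```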
Despite the fact that tree alignments are biologically more realistic than SP alignments, they have not become very popular, mostly because their construction presents serious algorithmic difficulties. A majority of the multiple alignment methods based on pairwise substitutions attempt to produce optimal SP alignments. In an SP context, defining gap costs is not necessarily straightforward, as can be gathered from the variety of alternative schemes(55, 56, 59, 60).

[Figure 3. SP, tree and star alignment substitution costs for five one-letter sequences (from Altschul(42)). The reconstructed sequences are indicated by circles at the tree nodes. Dashed lines indicate a substitution cost of 0 while plain lines indicate a cost of 1.]

The simple implementation of pairwise costs in SP evaluation is known as 'natural gap costs'. The idea is to consider the costs of the gaps in each pairwise projection of a multiple alignment (any column of nulls is removed from the pairwise projection). Although these gap costs seem to be the obvious companions of an SP alignment, Altschul was the first to formally propose them(42). Most of the alternative schemes were introduced for simultaneously aligning three sequences. For instance, Gotoh(56) proposed defining a gap as a set of columns having a null in identical positions; for Murata(55), a gap is a set of adjacent columns, each containing at least one null. The main motivation behind these schemes was algorithmic: these approximations were made in order to make gap cost evaluation easier when computing an SP alignment. It is for this same reason that Altschul proposed a simplified version of the natural gap costs named 'quasi-natural gap costs'(42). Quasi-natural gap costs are very similar to natural ones.
The main difference is that when a pairwise projection is considered, columns of nulls are not removed, and an extra gap is counted as opening when a null run in one sequence starts after and finishes before a null run in the other sequence, such as:

Sequence 1 XXXXXX----------XXXXXXX
Sequence 2 XXXXXXXX-----XXXXXXXXXX

This leads to counting two gaps when in practice there is only one between the two sequences. The motivations behind this approximation are purely algorithmic and have to do with efficiency requirements when implementing the Carrillo and Lipman algorithm for multiple sequence alignment(57) (see Section 4.3.1). This scheme induces a bias that favors similar gaps in aligned sequences. Nonetheless, the approximation is reasonable because indels seem to be rare events that tend to be kept unchanged through evolution(38). As a consequence, in most multiple alignments, the number of cases where the quasi natural scheme induces alignments different from the natural one should be fairly limited. Natural and quasi natural gap costs can also be applied to tree or star alignments. This type of gap penalty in the context of an SP alignment constitutes one of the most widely used objective functions for multiple sequence alignments. It is often referred to as "sums of pairs with affine gap penalties". Some of the drawbacks of this method have already been discussed in the previous section. The main one stems from the fact that substitution matrices are general descriptions that do not take into account local constraints; the same is true of gap penalties, which should incorporate local information. An attempt to do so has been made in ClustalW, where penalties are locally reassessed using some information from the other sequences being aligned(21). Another weakness of the SP function has to do with the fact that sequences are considered in pairs only.
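The natural gap cost of a single pairwise projection can be sketched as follows. The affine open/extend values are illustrative assumptions, not parameters taken from the literature:

```python
def natural_gap_cost(s1, s2, open_cost=10, extend_cost=1):
    """Natural affine gap cost of a pairwise projection: columns of
    nulls (a gap in both sequences) are removed first, then each
    maximal gap run costs open_cost + (run length - 1) * extend_cost."""
    # Drop null columns, i.e. positions gapped in both sequences.
    pairs = [(a, b) for a, b in zip(s1, s2) if not (a == "-" and b == "-")]
    cost = 0
    for seq in (0, 1):
        in_gap = False
        for pair in pairs:
            if pair[seq] == "-":
                cost += extend_cost if in_gap else open_cost
                in_gap = True
            else:
                in_gap = False
    return cost

# The null column at position 3 is removed, leaving one length-1 gap
# in each sequence: 10 + 10 = 20.
print(natural_gap_cost("AC--GT", "A--CGT"))  # 20
```

The quasi natural variant would skip the null-column removal and add an extra opening cost whenever one null run is strictly contained within the other, which is exactly the two-gaps-for-one situation illustrated above.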
It probably makes more sense to consider a column in a multiple alignment as a distribution of amino acids generated by evolution. In the next sections about profiles and hidden Markov models (Sections 3.2.3 and 3.2.4), we will present some methods that attempt to take this fact into account when scoring multiple sequence alignments. Finally, a potential weakness inherent in any scoring scheme is the problem of non-representative information. The sequences used for building an alignment rarely constitute a representative set. They are often biased by the composition of the database used for collecting them. In such a case, the alignment of the sequences that are in an isolated minority will suffer from the fact that the information they contain may be buried among the rest. We will see that several weighting schemes have been designed in order to overcome this problem.

3.2.2 Weights

With most of the multiple sequence scoring systems, weighting of the sequences is necessary. Weights are designed to correct for unequal representation among a set of sequences. For instance, when aligning globin sequences found by querying a database like Swiss-Prot(61), large numbers of sequences will be identical, while others will be quite different from the rest of the set and will have no close relative. However, if we want our alignment to be representative of the globin family in general, it is important to avoid a complete domination by the vertebrate myoglobin and hemoglobin sequences, simply because the database contains far more of them. Such an alignment, made by giving each sequence the same weight, would be biased. Weights are used to avoid this type of bias. Several methods have been proposed; they can be separated into two groups, depending on whether they rely on an alignment or on an estimated phylogenetic tree. Alignment-based weighting methods do not require the sequences to be related at all.
Therefore, complex issues of tree topology and root placement are avoided. These methods can be based on pairwise distances(62) or on the distances from some average generalized sequence(63). Whatever the method, the general trend is similar and results in an up-weighting of the sequences which are poorly related to the rest of the set, while sequences which, on average, are more similar to the other sequences have their weights accordingly lowered. The tree-based weights assume that the sequences are related through evolution and that a reasonably correct tree can be deduced from pairwise distances(64). Two schemes of this type have been proposed: branch-length proportional weights(65) and the Altschul-Carroll-Lipman (ACL) weights, which are based on a statistical analysis of the tree topology(66). In the ACL scheme, a sequence receives a low weight if it is far from the tree root or if it has close neighbors in the tree. ACL weights have the advantage of correcting for duplicated information without biasing the alignment toward very divergent sequences. The underlying assumption is that although a very divergent sequence contains a lot of information, it is hard to exploit that information without bringing in too much extra noise. The main weakness of the ACL method is that when the topology of the tree is hard to establish, mistakes can be made regarding the position of the root. The weights proposed by Thompson et al.(65) provide an alternative solution. They also rely on a phylogenetic tree, but under this scheme, sequences only get down-weighted for having closely related neighbors. In an SP context, applying pairwise or sequence weights to score a multiple alignment is straightforward. The weighted sum-of-pairs alignment score can be formulated as follows. Given N sequences S1..SN, a weight W(i,j) can be estimated for each possible pair of sequences Si, Sj.
This pairwise weight will be obtained directly through computation, or will result from the combination of individual sequence weights. Each pairwise projection of the sequences Si and Sj in the alignment has a cost COST(Ai,j) evaluated using a substitution matrix and a set of gap penalties. Given these definitions, the global weighted SP score is equal to:

SCORE = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j} \cdot COST(A_{i,j})    (eq. 2)

As with matrices, an important issue for weights is to decide which scheme should be applied. Each of these weights has desirable properties and unwanted side effects. Vingron and Sibbald proposed a systematic way of comparing five different methods(62). Their conclusion was that when sequences are related through a robust phylogenetic tree, the ACL weights do better than alignment-based methods. On the other hand, when the relationship between the sequences is harder to estimate, leading to an inaccurate phylogenetic tree, the Sibbald and Argos(63) or related methods(67) are preferable. Similar results were more recently established by Henikoff and Henikoff(68) using an empirical evaluation method. These authors found that for phylogenetically related sequences, tree-based methods are preferable and that Thompson's scheme slightly outperforms ACL. It is not clear, however, to what extent these findings are method dependent, especially if one considers that most of these weights are used in different heuristic alignment strategies. Gotoh pointed out that weighting schemes are very likely to be method dependent(69). For instance, an important difference between the ACL weights and Thompson's is that the ACL method produces pairwise weights while the other gives individual sequence weights. As a consequence, the ACL weights contain more information, since the pairwise weights they rely on are not necessarily correlated (i.e. there is not always a set of sequence weights corresponding to a given set of pairwise weights).
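Eq. 2 translates directly into code. In the sketch below, the identity-based pairwise cost is a stand-in for a real COST built from a substitution matrix and gap penalties, and the pairwise weights are assumed to be supplied as a matrix W:

```python
def pair_cost(s1, s2):
    """Toy pairwise projection cost: +1 per identical aligned residue
    (a real COST would use a substitution matrix and gap penalties)."""
    return sum(1 for a, b in zip(s1, s2) if a == b and a != "-")

def weighted_sp_score(alignment, W):
    """Weighted SP score of eq. 2: sum over all pairs i < j of
    W[i][j] times the cost of the pairwise projection."""
    n = len(alignment)
    return sum(W[i][j] * pair_cost(alignment[i], alignment[j])
               for i in range(n - 1) for j in range(i + 1, n))

aln = ["ACGT", "AC-T", "ACGA"]
W = [[0, 1.0, 0.5], [0, 0, 0.5], [0, 0, 0]]  # only i < j entries are used
print(weighted_sp_score(aln, W))  # 1.0*3 + 0.5*3 + 0.5*2 = 5.5
```

Setting all weights to 1 recovers the plain SP score, which makes the role of the weights easy to see: a pair of near-identical, over-represented sequences simply contributes less to the total.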
On the other hand, as we will see later, these two types of weights are used in very different contexts. Thompson's weights are mostly used for progressive alignments, where they are probably very appropriate since remotely related sequences usually have little effect on the overall multiple alignment(21). ACL weights are used in the program MSA (Multiple Sequence Alignment(22)), which does global simultaneous alignments where each sequence is given a chance to affect the overall alignment. We will now see that weights can also be useful for the construction and the use of sequence profiles (i.e. generalized alignments used to describe a protein family or a domain).

3.2.3 Profiles

Multiple sequence alignments can be used to provide position-specific scoring matrices known as profiles(6). The procedure of turning an alignment into a profile is fairly straightforward. It involves counting the residue frequencies in each column of the multiple alignment and deducing from these measures a table of substitution costs for each position of the profile. A local cost is also evaluated for gap insertion and extension. The term profile refers to the collection of costs associated with each position of the alignment. A profile can be treated as a single sequence and aligned to any other sequence (or profile), using the profile substitution costs and penalties instead of a single matrix.

[Figure 4. Example of a profile (adapted from Gribskov et al.(6)).]
For each position (POS) of a multiple alignment (ALN, presented in a vertical format with each line corresponding to a column), a substitution cost is calculated for any amino acid that would be aligned to this position. A gap penalty is also evaluated. Note that at positions 21 and 22 of the profile, the gap penalty is lowered because the alignment used for the profile contains gaps at these positions. A profile is specific for a family (or a domain). One of the main uses of profiles is to search databases for new members of a family. In such a context, the desirable properties are sensitivity and selectivity (i.e. recognizing very remote homologues and discriminating against false positives). Such a result can only be achieved if the profile induces very good alignments. This in turn depends on the quality of the profile itself and can be affected by several factors: (i) the choice of the sequences, (ii) the method used for building the multiple alignment, (iii) the method used to turn the multiple alignment into a profile (especially the treatment of the gaps and the method used to describe background frequencies). In many cases, the choice of the sequences is directly imposed by the database, and one of the best ways to remove this type of bias is to use a weighting scheme when aligning the sequences (see previous section). However, if one wishes to build a very specific profile, it is also possible to select the appropriate sequences, using techniques such as the one described by Neuwald et al.(70). Weights also need to be applied when the alignment is turned into a profile. On various occasions, it has been shown that many of the schemes used for sequence alignments do as well when used to build profiles, and help to increase the level of generalization(65, 71). Accumulation of gaps is another side effect that appears when many sequences are used to build a profile.
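The core of the alignment-to-profile procedure can be sketched as follows. The tiny two-letter substitution matrix, the averaging rule and the gap-penalty scaling are all illustrative assumptions; published profile methods use full 20-letter matrices and more careful frequency estimates:

```python
def build_profile(alignment, matrix, alphabet, base_gap_penalty=10):
    """Turn a multiple alignment into a profile: for each column,
    average the matrix score of every alphabet letter against the
    observed residues, and scale the gap penalty down where the
    column already contains gaps (as at positions 21-22 of Figure 4)."""
    nseq = len(alignment)
    profile = []
    for column in zip(*alignment):
        scores = {a: sum(matrix[a, r] for r in column if r != "-") / nseq
                  for a in alphabet}
        scores["gap"] = base_gap_penalty * (1 - column.count("-") / nseq)
        profile.append(scores)
    return profile

# Hypothetical two-letter matrix for illustration.
matrix = {("A", "A"): 2, ("A", "C"): -1, ("C", "A"): -1, ("C", "C"): 2}
profile = build_profile(["AC", "A-"], matrix, "AC")
print(profile[1]["gap"])  # 5.0: the penalty is halved where one of two rows is gapped
```

A sequence can then be aligned against such a profile exactly as against another sequence, looking up the position-specific scores instead of a single matrix entry.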
Because the number of gaps tends to increase with the number of sequences, schemes have to be used for down-weighting the effect of their occurrence(65, 72, 73). Profiles are not only important for database searches; their computation is also a crucial step for some multiple sequence alignment strategies based on a progressive approach(67, 74, 75). In this context, to build the full multiple sequence alignment, partial multiple alignments need to be aligned in intermediate steps. The best way to do so is often to turn these alignments into profiles and to align these profiles with one another. Methods for doing so have been extensively described by Higgins(76, 77) and Gotoh(69, 73). Finally, since profiles are involved in database searches, some significant work has been done on establishing the statistical meaning of alignment scores(78, 79). This important aspect of the problem has received much more attention in the context of the hidden Markov model based approaches that will now briefly be reviewed.

3.2.4 Hidden Markov Models

A hidden Markov model (HMM) describes a series of observations generated by a "hidden" stochastic process (a Markov process)(80). HMMs have been used extensively in speech recognition. HMMs designed for sequence alignments are related to profiles. Their aim is to provide a statistical model representative of a given family of proteins(7). In theory, one of the main advantages of HMMs (as opposed to profiles) is that they provide a way to estimate a model directly from unaligned sequences. However, in practice, most of the methods available for HMM optimization require the computation of multiple alignments. Nevertheless, HMMs have some interesting features that distinguish them from profiles. A key concept in HMMs is the notion of states. An HMM is a chain of elements with different possible states. The number of possible states is arbitrarily defined. In SAM(81), for instance, there are three states: align, insert and delete.
When going through a model, probabilities are given to each possible transition. The values of these transition probabilities are evaluated by training the model with known members of a protein family. Sequences can be aligned to a trained model using a variant of dynamic programming(18) known as the Viterbi algorithm(80). An alignment between a sequence and an HMM is called a path, in the sense that it joins different states in order to produce the path with the highest probability. To a large extent, aligning a sequence to a model can be regarded as equivalent to aligning a sequence to a profile. There is, however, a fundamental conceptual difference. A new sequence is not 'aligned' to an HMM. What is measured is the probability for a given HMM to generate the optimally aligned sequence (i.e. the sequence with the right pattern of gaps/unaligned residues/aligned residues).

[Figure 5. A linear hidden Markov model (from Hughey and Krogh(82)). This model has three different states (M, I, D). Each state is connected to the others by a transition probability (arrows). Assigning the weights to each transition is the purpose of the training.]

The number of sequences and their range of identities are critical factors that will influence the model. In their simplest expression, HMMs do not require any prior information (as opposed to profiles, which require a multiple alignment made using a substitution matrix). In HMMs, residues are described as 'letters' and the training relies only on identities to establish the parameters. If there are enough sequences in the training set, this will result in a sensible model, since the substitution constraints will be 'discovered' by the model on a position per position basis. The accuracy and sensitivity of a model trained that way will be highly dependent on the number of sequences and their information content.
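The Viterbi recursion itself is compact. The sketch below runs on a generic toy HMM with two hypothetical states and illustrative probabilities, not on SAM's align/insert/delete architecture, but the path-finding logic is the same:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for an observation sequence."""
    # V[t][s] = (best probability of ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p][0] * trans_p[p][s])
            row[s] = (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o], prev)
        V.append(row)
    # Backtrack through the stored predecessors from the best final state.
    path = [max(states, key=lambda s: V[-1][s][0])]
    for row in reversed(V[1:]):
        path.append(row[path[-1]][1])
    return path[::-1]

states = ("H", "C")  # two hypothetical hidden states
start_p = {"H": 0.6, "C": 0.4}
trans_p = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.3, "C": 0.7}}
emit_p = {"H": {"a": 0.9, "b": 0.1}, "C": {"a": 0.1, "b": 0.9}}
print(viterbi("aab", states, start_p, trans_p, emit_p))  # ['H', 'H', 'C']
```

The returned path is exactly the "alignment" of the observations to the model: each observed symbol is attributed to the hidden state most likely to have generated it, given the whole sequence.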
To overcome a possible lack of information (missing data), pseudocounts are incorporated in order to simulate background frequencies(82, 83). Their influence on the model decreases with the amount of information present in the sequences. The actual values of the pseudocounts can be estimated using various methods. One can measure the probability of each amino acid in the training set, or derive background probabilities from a standard substitution matrix. Generally speaking, the smaller the number of sequences used for the training, the more critical the values of the pseudocounts. In this context, Dirichlet mixtures(36) have proved extremely useful. A Dirichlet mixture is a mathematical tool that, given an observed amino acid distribution and a set of reference distributions, allows the computation of a probability for the observed distribution. In a hidden Markov model context, these mixtures can be regarded as the equivalent of a substitution matrix. They have been shown to be more sensitive to sequence conservation or variation than traditional substitution matrices(36). As with profiles, HMMs can be used to generate multiple sequence alignments or to scan databases. Scoring can be made by combining the probabilities of all the different alignments of a sequence to a model, which is equivalent to calculating the total probability of the sequence given the model. This can be done efficiently using the forward algorithm(80). Such a score is called the NLL score, for Negative Log Likelihood score. The NLL score measures how far a sequence is from its model (in other words, the statistical cost of forcing a given model to produce an aligned sequence). The problem with NLL scores is that they depend on the size of both the sequence and the model. One way to overcome this is to measure the Z score, the number of standard deviations an NLL score lies from the average NLL score of unrelated sequences of the same length.
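The pseudocount idea can be sketched as follows. The mixing weight and the background frequencies are illustrative; a Dirichlet mixture would replace the single background distribution with several weighted reference distributions:

```python
def smoothed_frequencies(counts, background, pseudo_weight=5.0):
    """Blend observed residue counts at one model position with
    background frequencies: the pseudocounts dominate when few
    sequences have been observed and fade as real counts accumulate."""
    n = sum(counts.values())
    return {a: (counts.get(a, 0) + pseudo_weight * background[a])
               / (n + pseudo_weight)
            for a in background}

background = {"A": 0.5, "C": 0.5}
# With only 5 observations, the estimate stays pulled toward background:
print(smoothed_frequencies({"A": 5}, background))  # {'A': 0.75, 'C': 0.25}
```

With 500 observations of A instead of 5, the same call would return frequencies very close to the raw observed ones, illustrating why pseudocount values matter most for small training sets.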
A complete algorithm for the computation of Z scores is described in (82). This study of multiple sequence alignment scoring systems is far from exhaustive. A large variety of alternative methods have been described; they roughly fall into two distinct categories: those relying on SP evaluation and those (like HMMs) that consider distributions of amino acids rather than pairs. Although considering distributions seems a more realistic approach, the main reason why SP schemes have so far been more popular has mostly to do with the algorithmic problems associated with distribution-based methods. Both types of method only deal with one aspect of the problem: the use of local information. We know that since proteins fold into active 3D structures, there must be more information in the sequences (i.e. tertiary structure interactions...) than what we have discussed so far. If used in an appropriate manner, there is no doubt it could help to improve the quality of the alignments. Few methods have been proposed to deal with these non-local interactions. There are two good reasons for that: this type of signal is usually very weak in proteins, and the algorithmic problem is even more complex than when taking into account primary sequences only. We will now see that the problem is different with RNA, hence the use of more complex types of objective functions.

3.3 RNA ALIGNMENTS: TAKING INTO ACCOUNT NON LOCAL INTERACTIONS

When reliable non-local interaction information is available, it makes sense to incorporate it into the scoring scheme. This is rarely the case with proteins, hence the difficulties encountered when doing structure threading (threading a protein sequence onto a known structure)(84). Fortunately, in the case of RNA the problem is different, and the rules that govern the formation of non-local interactions are better understood(85, 86). A large part of the secondary structures encountered in RNAs are due to Watson-Crick interactions.
The existence of mathematical models that describe these interactions on a thermodynamic basis makes RNAs good candidates for 'ab initio' folding predictions. However, so many parameters influence the structure of an RNA molecule (local conditions, interacting proteins, unknown tertiary interactions...) that accurate predictions based on thermodynamic models are still very imperfect, despite some recent improvements(87, 88). This does not mean that the thermodynamic approach is wrong, but it may mean that the information it uses is not sufficiently detailed considering our knowledge of the RNA folding process. However, when combined with phylogenetic information (such as can be taken from multiple sequence alignments), predictions can reach a high level of accuracy(16).

[Figure 6. Some of the motifs commonly found in RNA secondary structure. Base pairings are usually made through Watson-Crick interactions, although non-canonical pairs have often been reported.]

Multiple sequence alignments have the advantage of revealing constraints without the need for hypotheses on their origin. Such an analysis, performed on ribosomal RNA sequences(16), made it possible to predict the secondary structures of these molecules with an accuracy outperforming traditional energy minimization schemes like that of Zuker(89). Unfortunately, these types of RNA alignments are difficult to build automatically due to complex algorithmic problems. Several functions have been proposed that incorporate RNA secondary structure information into their evaluation of multiple sequence alignment quality(90-92). Kim et al.(90) proposed a function that takes into account the probability of all the potential secondary structures contained in a multiple sequence alignment.
It is a scheme that has the advantage of being very flexible (for instance, it allows pseudo-knots) and of not requiring the actual computation of a secondary structure (structures are not assessed for compatibility). Its main drawback is that evaluation time is quadratic in the length of the alignment. This can prove a severe limitation when dealing with very long sequences (like ribosomal RNAs) while using an iterative or stochastic optimization method such as simulated annealing. Eddy and Durbin(91) proposed a different type of approach. In their method, the secondary structure is expressed as a binary tree in which each node stands for a column in a multiple sequence alignment. This tree can be seen as a path through a generalized HMM (an HMM with bifurcations) named a Covariance Model (CV). A CV can be trained like an HMM. Once this has been done, the sequences are aligned to the model in order to produce the multiple sequence alignment. This approach is very similar to the stochastic context free grammar (SCFG) methods(93, 94), where the aim is to express the structure using a special type of regular expression. The alignment is made by parsing the sequences through the proper expression, which has been obtained by training on the sequences. CV and SCFG methods suffer from the same drawback: they can only allow nested structures and cannot take into account pseudoknots(95, 96), as opposed to the method proposed by Kim. Furthermore, their computation is very expensive, which restricts these methods to small sequences (<200 nucleotides). An option for decreasing the complexity of the problem is to do threading: assume that some master structure is known and thread the sequences onto it. In several important cases, like ribosomal RNAs, this is a realistic assumption. This approach can of course be taken using an SCFG-based objective function. Alternatively, one can consider a simpler function such as the one described by Corpet and Michot(92).
In this case, the evaluation of the alignment is split into two steps: (i) evaluating the primary sequence alignment; (ii) evaluating the quality of the fold induced by the sequence of known structure onto the sequence of unknown structure. To generate the overall score, these two terms are combined with one another. Although this scheme has no real theoretical justification, as opposed to those previously described, it has the merit of being conceptually simple. It can also accommodate a range of interactions such as pseudo-knots and other non-nested structures. In this case, the alignment problem is known to be NP-complete(17). It is in order to provide a reasonable heuristic solution for long sequences (>1000 nucleotides) that we developed the package RAGA(97) (see Section 5.3 and Annex 3).

4 MAKING MULTIPLE SEQUENCE ALIGNMENTS

4.1 COMPLEXITY OF THE PROBLEM

So far, we have focused on reviewing some of the objective functions needed for evaluating the quality of sequence alignments. However, as pointed out earlier, this is only one side of the coin. The other one, which constitutes in fact the main bottleneck, is optimization. In other words, given an objective function, is it possible to optimize it by producing the best scoring alignment? There are at least two good reasons for designing efficient and accurate optimization strategies. The first one is obvious: making the alignments that are needed for whatever purpose. The second reason is less direct but of extreme importance. The evaluation schemes described above are only theoretically justified using phylogenetic or structural criteria. As such, they do not constitute any proof and must therefore be validated through empirical analysis (i.e. how well they perform). The optimization methods required for these two purposes do not necessarily need to be equivalent. One can, for instance, use a very robust but expensive (in computer time and memory) method to compare and validate alternative scoring schemes.
If one of these schemes proves useful, it may later become appropriate to develop a very specific heuristic method that approximates the optimization reasonably well while being efficient enough for production purposes. Needless to say, whatever direction one wishes to take, the design of an optimization technique will always prove to be a very demanding problem. We already mentioned that, even for two sequences of moderate length, a naive approach to alignment computation can lead to impractical enumeration problems. Fortunately, the situation does not have to be that bad, and will depend on the scoring scheme one wishes to optimize. In many cases, there are shortcuts that allow efficient computation of an alignment, given some specific objective functions. We will see in Section 4.2.1 that dynamic programming(18) is one of these techniques; it allows the computation of pairwise alignments in time proportional to the product of the lengths of the two sequences. This essential technique constitutes the core of many alignment methods. In theory, it is not restricted to two sequences, but since its complexity is a function of the product of the lengths of the sequences to align, it can hardly be used for more than three sequences at a time(55). This does not mean that multiple sequence alignments cannot be computed automatically with dynamic programming, but it means that, to do so, one has to rely on heuristic algorithms. Heuristic methods do not guarantee an optimal solution but may perform well and sometimes even guarantee the solution to be within given boundaries. Generally speaking, multiple sequence alignment algorithms can be divided into two classes: (i) the greedy algorithms, which usually rely on sequence clustering and dynamic programming for making progressive alignments; (ii) the non-progressive algorithms, which attempt to align all the sequences simultaneously.
These non-progressive algorithms themselves fall into two distinct sub-categories: deterministic heuristics and stochastic heuristics. In the following sections, the underlying principles of these algorithms and their main differences will be briefly explained. More emphasis will be given to the genetic algorithm techniques on which the SAGA package is based(98) (see Section 5.1 and Annex 1).

4.2 DETERMINISTIC GREEDY APPROACHES

4.2.1 Aligning Two Sequences

The main algorithm for aligning two sequences, often referred to as the Needleman and Wunsch(18) or dynamic programming (DP) algorithm, is one of the oldest and most important tools in bioinformatics. Over the last 30 years, it has been used in one form or another in most of the methods developed for sequence comparison. When applying dynamic programming to two sequences, it is possible to compute the best scoring alignment between them using an amount of memory and time proportional to the product of the lengths of the two sequences. This is a dramatic improvement over the naive approach, which would require enumerating all the possible alignments. An important advantage of dynamic programming is that it is a very general scheme. Given substitution costs (e.g. a matrix or a profile...) and a scheme for scoring gaps, the algorithm can compute the alignment with the best score. In practice, DP can accommodate any context-independent scoring scheme. The algorithm is based on recursively extending the best scoring alignment until all the residues of each sequence have been aligned. In practice, this means finding the best path through a matrix constructed from the scores of all pairs of elements between the two sequences. Let us consider two sequences, A of length m and B of length n, a matrix that assigns a score di,j to the substitution of residue i in A by residue j in B, and a gap penalty g.
Computation of the optimal score is achieved by incrementally extending each path with a locally optimal step. For example, element di,j can extend any path terminating in the preceding row (i−1, m:m
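The recursion just described can be sketched as follows, with a linear gap penalty g and illustrative match/mismatch scores standing in for the substitution matrix d:

```python
def needleman_wunsch_score(A, B, match=1, mismatch=-1, g=-2):
    """Needleman & Wunsch DP: best global alignment score of A and B
    in O(m * n) time and memory, using a linear gap penalty g."""
    m, n = len(A), len(B)
    # F[i][j] = best score aligning A[:i] with B[:j].
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * g          # A's prefix aligned against gaps
    for j in range(1, n + 1):
        F[0][j] = j * g          # B's prefix aligned against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if A[i - 1] == B[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + d,  # substitute A[i] with B[j]
                          F[i - 1][j] + g,      # gap in B
                          F[i][j - 1] + g)      # gap in A
    return F[m][n]

# Three matches plus one gap: 3 * 1 + (-2) = 1.
print(needleman_wunsch_score("GATT", "GAT"))  # 1
```

Keeping backpointers alongside F and tracing them from cell (m, n) back to (0, 0) recovers the alignment itself, not just its score; the score-only version above is enough to show the O(m * n) structure of the recursion.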