doi:10.1016/j.jmb.2004.04.058 J. Mol. Biol. (2004) 340, 385–395

3DCoffee: Combining Protein Sequences and Structures within Multiple Sequence Alignments

Orla O'Sullivan1, Karsten Suhre2, Chantal Abergel2, Desmond G. Higgins1 and Cédric Notredame2,3*

1 Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland
2 Information Génomique et Structurale, CNRS UPR-2589, 31 Chemin Joseph Aiguier, 13402 Marseille, France
3 Swiss Institute of Bioinformatics, Chemin des Boveresse 155, 1066 Epalinges, Switzerland

Most bioinformatics analyses require the assembly of a multiple sequence alignment. It has long been suspected that structural information can help to improve the quality of these alignments, yet the effect of combining sequences and structures has not been evaluated systematically. We developed 3DCoffee, a novel method for combining protein sequences and structures in order to generate high-quality multiple sequence alignments. 3DCoffee is based on TCoffee version 2.00, and uses a mixture of pairwise sequence alignment and pairwise structure comparison methods to generate multiple sequence alignments. We benchmarked 3DCoffee using a subset of HOMSTRAD, the collection of reference structural alignments. We found that combining TCoffee with the threading program Fugue makes it possible to improve the accuracy of our HOMSTRAD dataset by four percentage points when using only one structure per dataset. Using two structures yields an improvement of ten percentage points. The measures carried out on HOM39, a HOMSTRAD subset composed of distantly related sequences, show a linear correlation between multiple sequence alignment accuracy and the ratio of the number of provided structures to the total number of sequences. Our results suggest that in the case of distantly related sequences, a single structure may not be enough for computing an accurate multiple sequence alignment. © 2004 Elsevier Ltd. All rights reserved.
*Corresponding author

Keywords: multiple alignment; structural superposition; TCoffee; threading; SAP

Abbreviations used: MSA, multiple protein sequence alignment(s); S-MSA, structure-based MSA; DP, dynamic programming; NW, Needleman & Wunsch; CS, column score. E-mail address of the corresponding author: cedric.notredame@gmail.com

Introduction

It has long been assumed that using structural information can increase the accuracy of multiple protein sequence alignments (MSA).1 Recent results2,3 suggest that accurate MSAs obtained this way are useful for making functional assignments. These findings are quite exciting in a context where a structure may soon be available for each protein family (transmembrane proteins excepted).4 However, making the best out of this wealth of data will require the development of new automatic methods able to efficiently incorporate protein structure information within MSAs. The incentive for doing so is very strong, considering the critical role MSAs play in so many sequence analysis applications,5 like phylogenetic reconstruction, structure prediction, functional characterization, database searches and non-synonymous single nucleotide polymorphism characterization.6 Despite their usefulness, accurate MSAs remain difficult to compute, owing to reasons that are both computational7 and biological.8 From a computational point of view, the assembly of an optimal MSA is a complex problem, and an exact solution can be computed only for small sets of related sequences.9 This is the reason why most packages use an approximate heuristic, the progressive alignment algorithm,10 that gives no guarantee of delivering an optimal solution but can rapidly align large sets of sequences. On the biological side, one is limited by the lack of an objective and accurate criterion to assess MSA quality.8 As a consequence, most methods use
sequence similarity (assessed with a substitution matrix) as a criterion for optimization. However, similarity is not informative enough to drive the correct alignment of distantly related sequences, a situation that typically requires using structure comparison methods so that a structure-based MSA (S-MSA) can be derived. S-MSAs constitute the de facto standard of truth for assessing sequence alignment accuracy, and several established S-MSA collections11–13 are used routinely to evaluate MSA packages.14–17 Although one may argue that these highly accurate MSAs (as judged from structural analysis) are not always optimal from an evolutionary point of view, they usually reflect well the structural and functional relationships between the considered proteins. With 3DCoffee, we show that using a small amount of structural information when assembling an MSA makes it possible to improve alignment accuracy and emulate the computation of an S-MSA. Combining sequences and structures in this manner requires the integration of three types of methods: (i) sequence alignment methods; (ii) methods for comparing two or more structures and deducing a sequence alignment; (iii) methods for comparing sequences and structures, often referred to as threading. Sequence–sequence comparison methods rely mostly on the dynamic programming (DP) algorithm to compute an alignment where gaps are disposed in such a manner that similarity is maximized between the two sequences.18,19 Given a substitution matrix and a gap penalty scheme, DP can be used to compute global or local alignments,20,21 but accurate alignments can be obtained only with pairs of sequences that are at least 30% identical.22 Structure–structure comparison has been approached using a wide variety of heuristics,23,24 and to this day more than 30 algorithms have been reported.
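The global DP alignment mentioned above is easy to sketch. The following is a minimal Needleman & Wunsch implementation with a linear gap penalty; the match/mismatch scores and gap cost are illustrative defaults, not the parameters used by any of the packages discussed here.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Global DP alignment: fill an (len(a)+1) x (len(b)+1) score matrix,
    then trace back to recover the gapped alignment."""
    n, m = len(a), len(b)
    # score matrix, with gap-initialised first row and column
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # substitution
                          S[i - 1][j] + gap,       # gap in b
                          S[i][j - 1] + gap)       # gap in a
    # traceback from the bottom-right corner
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                S[i][j] == S[i - 1][j - 1]
                + (match if a[i - 1] == b[j - 1] else mismatch)):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

The same matrix-filling scheme, restarted from zero and traced back from the best cell, yields the local (Smith & Waterman) variant used by programs such as SIM and Lalign.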
The simplest, like LSQman,25 use rigid-body superposition and let the algorithm look for an optimal superposition where intermolecular distances are minimized between superposed positions in the two structures. These methods perform well on similar structures where the 3D relationships of residues have been well preserved by evolution. These structures are usually encoded by closely related sequences. When dealing with more distantly related sequences, the residue equivalences can be worked out iteratively, as done in STAMP,26 where the equivalences are used to drive a superposition that is used, in turn, to compute a distance matrix. The algorithm uses this updated matrix to refine the set of residue equivalences and make a new superposition. The process is carried out until it converges. SAP27 uses a similar principle, although rather than being iterative, the algorithm computes the series of rigid superpositions associated with forcing the superposition of every possible pair of residues. The final alignment is computed by DP, using the summed distance matrices of all the superpositions considered. DALI produces alignments of comparable accuracy, computed by considering the local comparison of the distance maps associated with the considered structures.28 Most of these methods make it possible to use structures for aligning sequences that are less than 30% identical. Although they diverge slightly in the alignments they produce, it is hard to establish which one (if any) performs better than the others. Sequence–structure comparisons (or threading) can be achieved using two categories of methods.29,30 One may use techniques inspired from molecular replacement to check whether a sequence is compatible with a 3D fold,31 or sophisticated DP where the algorithm analyses the 3D-structure to determine local gap penalties and local substitution costs.
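The rigid-body superposition at the core of methods like LSQman can be illustrated with the classic Kabsch procedure: center both coordinate sets, find the optimal rotation by SVD, then read residue equivalences off the superposed coordinates with a distance cutoff. This toy sketch assumes the two structures come as equal-length, already-paired coordinate arrays, which is precisely the correspondence problem the real methods must solve iteratively.

```python
import numpy as np

def kabsch(P, Q):
    """Superpose coordinate set P onto Q (both N x 3) after centering,
    using the Kabsch SVD solution for the optimal rotation.
    Returns the rotated, centered P and the centered Q."""
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(U @ Vt))   # guard against an improper rotation (reflection)
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return Pc @ R, Qc

def equivalences(P, Q, cutoff=3.0):
    """Indices of paired positions closer than `cutoff` (same units as the
    coordinates) after superposition -- a toy stand-in for turning a
    structure superposition into a sequence alignment."""
    Pr, Qc = kabsch(P, Q)
    dist = np.linalg.norm(Pr - Qc, axis=1)
    return [i for i, d in enumerate(dist) if d < cutoff]
```

With real structures the cutoff would be expressed in Å; the 3 Å threshold is the one used later in this work when converting LSQman superpositions into alignments.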
Fugue is based on this principle and turns a structure into a position-specific substitution matrix, so that a sequence–structure alignment can be delivered using DP.32 Many of the structure-based alignment methods have been extended to generate S-MSAs. For instance, the double DP strategy of SAP has been coupled with a progressive algorithm to align more than two structures.33 At least two other pairwise structural alignment methods have been incorporated in a progressive alignment strategy: STAMP and COMPARER. COMPARER34 was used to assemble HOMSTRAD, the collection of multiple structural alignments used in this work for validation purposes. Other multiple structural alignment methods exist that use more specific procedures. For instance, DALI produces S-MSAs by aligning several structures to a master structure. One may use Fugue in a similar fashion by aligning several sequences to a single structural template. MNYFIT computes a consensus structure and uses it as a master to align all the others.35 The lack of method-independent reference datasets makes it difficult to benchmark these packages accurately and establish their respective strengths and weaknesses. Yet they all share a common drawback: they are all built around a specific pairwise alignment algorithm, making it difficult to combine the respective strengths of several algorithms into a single model. Furthermore, none of the available methods can seamlessly handle a mixture of sequences and structures; when such a mixture must be handled, the most common strategy is to start by aligning the structures into an S-MSA, before adding the sequences in a semi-manual fashion.2 We designed 3DCoffee to address this problem. 3DCoffee uses the TCoffee v2.00 MSA package. TCoffee computes MSAs using pre-compiled libraries of pairwise alignments. The libraries can be compiled using any method able to generate pairwise alignments, like threading and structure superposition.
This makes the library a powerful means to incorporate structural information into the MSA assembly process. Using methods like SAP or Fugue, we studied the effect of compiling the library with a mixture of sequences and structures. Our methodology could easily be extended to incorporate methods that have not yet been considered, so that biologists can integrate and combine their techniques of choice.

Principle of the 3DCoffee method

Computation of TCoffee multiple sequence alignments

We used TCoffee version 2.00 to compute non-structure-based MSAs (default mode), as well as S-MSAs. In its default mode, TCoffee does not use structures; it takes sequences as input and makes pairwise comparisons to compile a primary library. This primary library is a list of weighted pairs of residues.36 A residue pair appears in the library when it has been observed in one of the pre-compiled pairwise alignments. The pairwise alignments compiled in the primary library can be computed using any method one finds suitable. By default, TCoffee computes for each pair of sequences a global pairwise alignment obtained with the Needleman & Wunsch (NW)18 algorithm, and the ten best-scoring local alignments as given by the SIM algorithm.37 The weight associated with every residue pair obtained this way is set to the average percentage identity within the primary alignment (local or global). When two alignments contribute the same pair of aligned residues, the weights are added. The weights within the primary library are then re-estimated according to the library self-consistency,36 and the re-weighted library (named an extended library) is used as a position-specific substitution matrix to carry out a progressive multiple alignment.38 Doing so involves computing a distance matrix by comparing every pair of sequences and using this matrix to compute a neighbor-joining guide tree.39 The tree topology determines the order in which the sequences are incorporated within the MSA, using standard DP and the extended library as a position-specific substitution matrix.

Incorporation of structural information within the TCoffee library

Structural information is incorporated within the library by means of structure-based pairwise sequence alignments. We used three methods, now fully integrated within TCoffee, providing the associated package is installed. Fugue is a threading method that aligns a protein sequence with a 3D-structure.32 3DCoffee directly submits sequence/structure pairs to the official Fugue server (http://www.cryst.bioc.cam.ac.uk/~fugue/) and retrieves the resulting pairwise alignments, which are integrated into the primary library using the standard TCoffee weighting scheme. SAP uses double DP to compute a pairwise alignment based on a non-rigid structure superposition.27 When integrating these alignments within the primary library, we set to 100 the weight associated with each pair of aligned residues. This is the maximum weight an individual constraint can receive in a TCoffee primary library. Although this value is meant to reflect the high reliability of SAP, it only makes it more likely for these pairs to be aligned in the final MSA, without explicitly forcing them to be so. Not forcing every pair of the structural alignments to find its way into the final alignment is important, as some portions of the SAP alignments correspond to non-superposable portions of the structures and are therefore unreliable. These segments usually have a low consistency within the primary library, and are therefore down-weighted at the extension stage. LSQman is a rigid-body structure superposition package that makes structure-based sequence alignments.40 When turning an LSQman structure superposition into a sequence alignment, we considered two residues to be aligned if they were less than 3 Å apart in the superposition. LSQman constraint weights are set to 100, like those of SAP and for similar reasons.

Producing multiple sequence–structure alignments

We adapted TCoffee so that, given a collection of sequences and structures, one may specify which structures must be used and which methods should be applied to each possible pair. For instance, given a peptide file, 3DCoffee considers in turn every possible sequence pair within the dataset. For a given pair, the program computes a global alignment using NW and a series of local alignments using Lalign. If both sequences have an available structure, a pairwise alignment is computed using SAP and another one using LSQman. If only one sequence has a known structure, an alignment is made using the threading method Fugue. All these alignments are added to the TCoffee library using the standard procedure described above.

Benchmarking procedure

We used the February 2002 release of HOMSTRAD11 to design a benchmark strategy for 3DCoffee. HOMSTRAD is a hand-curated database of high-quality S-MSAs built around the multiple structure alignment package COMPARER. We selected within HOMSTRAD the most demanding alignments using two criteria: at least four sequences and less than 25% average identity within the MSA. This yields a collection of 43 MSAs, four of which had to be discarded (FAD-Oxidase_C, FAD-Oxidase_NC, TPR and bv) because they are impossible to align with any of the available methods and are therefore uninformative for the analysis. The 39 remaining MSAs (245 sequences) constitute our HOM39 dataset. It has the advantage of being both compact and discriminative.
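The primary-library weighting scheme described above (each aligned residue pair weighted by the percent identity of the alignment that contributed it, weights summed when several alignments contribute the same pair, and structure-based pairs given the maximal weight of 100) can be sketched as follows. The dictionary-based library and the helper names are illustrative, not TCoffee's actual data structures.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over the gap-free columns of a pairwise alignment
    given as two equal-length gapped strings."""
    pairs = [(x, y) for x, y in zip(aln_a, aln_b) if x != '-' and y != '-']
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def add_alignment(library, aln_a, aln_b, weight=None):
    """Add one pairwise alignment to the primary library.
    Each aligned residue pair (i, j) receives the alignment's percent
    identity as weight (or a fixed weight, e.g. 100 for structure-based
    pairs); weights of duplicated pairs are summed."""
    w = percent_identity(aln_a, aln_b) if weight is None else weight
    i = j = 0   # residue indices in the two ungapped sequences
    for x, y in zip(aln_a, aln_b):
        if x != '-' and y != '-':
            library[(i, j)] = library.get((i, j), 0.0) + w
        if x != '-':
            i += 1
        if y != '-':
            j += 1
    return library
```

The extension step (not sketched here) then re-weights each pair according to how consistently it is supported by alignments through third sequences.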
We assessed the biological quality of our MSAs by comparing them with their HOM39 reference MSA, using the aln_compare package,36 which computes the column score (CS), a measure of the fraction of columns aligned identically between two MSAs.41 We checked whether sequences without a known structure could benefit from being aligned with sequences whose structure is known. We named this measure the induced improvement, and measured it by removing the provided structure(s) from the reference and the target MSAs before comparing them.

System and packages

Academic licences (free of charge) to run TCoffee 2.00, SAP and LSQman were obtained for each package. These were installed on an SGI O2, running Irix 6.2. The protocols used here are now part of the TCoffee documentation.

Results

Improving MSA accuracy with a single structure

Single structures can be incorporated into an MSA only by using a threading method like Fugue. Before doing so, we evaluated the accuracy of Fugue as a pairwise method on the entire HOM39 dataset. Figure 1(a) shows a comparison between Fugue and TCoffee (TCoffee uses SIM and NW by default) where the relative performances of the two methods are assessed by comparison with the HOM39 reference. Fugue clearly outperforms TCoffee when making pairwise alignments. For instance, when comparing Fugue and TCoffee on all pairs of sequences from HOM39 (Figure 1(a)), we found Fugue to be three percentage points more accurate than TCoffee (61.8% accuracy for Fugue against 58.8% for TCoffee). The difference is significant, with a P-value of 10^-9 (Wilcoxon signed-rank test). We then computed each HOM39 MSA while providing TCoffee with one structure via the -struc_to_use flag. In each test case, we chose the most distantly related sequence (as judged by the average percentage identity in the HOM39 reference). The extent of identity between the selected structures and the rest of their MSA ranged between 12% and 24%.
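A minimal version of the column score (CS) used throughout the evaluation can be written as follows: each MSA column is encoded as the tuple of ungapped residue indices it aligns, and the score is the fraction of reference columns reproduced exactly in the test MSA. This is a sketch of the measure as described in the text; aln_compare's exact handling of edge cases may differ.

```python
def columns(msa):
    """Encode each column of an MSA (a list of equal-length gapped strings)
    as the tuple of ungapped residue indices it aligns; None marks a gap."""
    counters = [0] * len(msa)
    cols = []
    for col in zip(*msa):
        key = []
        for s, c in enumerate(col):
            if c == '-':
                key.append(None)
            else:
                key.append(counters[s])
                counters[s] += 1
        cols.append(tuple(key))
    return cols

def column_score(test, reference):
    """Fraction of reference columns aligned identically in the test MSA."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)
```

The induced improvement is obtained by simply deleting the rows of the provided structures from both MSAs before calling `column_score`.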
A new flavor of TCoffee (TC-Fugue) was designed that uses three pairwise alignment methods: SIM, NW and Fugue (Table 1A). We also used TCoffee associated with the Fugue method only (Fugue) as a control. This last procedure amounts to aligning the sequences one after the other onto the sequence with known structure, using the Fugue algorithm. Two other controls were set up using TCoffee in default mode and CLUSTAL W version 1.83 (CW-183).

Figure 1. Performances of pairwise structure-based sequence alignment methods. Each dot corresponds to a pairwise alignment taken from HOM39 (see Methods). The vertical axis represents the difference of alignment accuracy (column score) between TCoffee and (a) Fugue, (b) SAP and (c) LSQman. The horizontal axis shows the percent identity between the two sequences being considered, as measured on the reference HOM39 MSA.

Table 1. Direct (A) and induced (B) improvement when providing one structure per HOM39 dataset

A. Direct improvement
Method      N str.   Avg. acc.   Difference with TCoffee   P-value (Wilcoxon signed-rank test)
TCoffee     0        42.24       –                         –
CW-183      0        38.43       −3.8                      2 × 10^-2
Fugue       1        31.26       −10.9                     2 × 10^-4
TC-Fugue    1        46.33       +4.1                      1 × 10^-3

B. Induced improvement
TCoffee     0        52.83       –                         –
CW-183      0        45.75       −7.1                      1 × 10^-3
Fugue       1        35.53       −17.3                     3 × 10^-4
TC-Fugue    1        54.73       +1.9                      4 × 10^-1

Method indicates the method being used: TCoffee (TCoffee with NW and SIM), CW-183 (CLUSTAL W 1.83), TC-Fugue (TCoffee with NW, SIM and Fugue), Fugue (TCoffee + Fugue, without NW or SIM). N str. indicates the number of structures provided. Avg. acc. indicates the average accuracy as measured with the CS score by comparison with the HOM39 reference alignments. P-value estimates the statistical significance of the observed difference between the considered method and the default TCoffee.
Our results (Table 1A) show that providing a structure to TC-Fugue improves MSAs by four percentage points over TCoffee (or by a little less than eight percentage points over CLUSTAL W). The difference is significant, with a P-value of 10^-3, and an observed improvement on 23 of the 31 alignments that are not tied between the two methods. We found (Figure 2(a)) that the amount of reported improvement depends loosely on the structure/sequence ratio, with high ratios yielding greater improvements. The low performance of the Fugue control is probably explained by the stringency of the CS measure, which requires every sequence to be aligned correctly and is not well adapted to the pairwise-based alignment method used here. We measured the induced improvement in the TC-Fugue alignments by removing the provided structure, and found the average TC-Fugue accuracy to remain higher than that of TCoffee (Table 1B and Figure 2(b)), although in this case the difference is not statistically significant, as the observed difference is associated with a P-value of only 0.4. Note that the values in Table 1B are higher than the corresponding values in Table 1A because in Table 1B the evaluation is carried out while ignoring the provided structure (usually the hardest sequence to align).

Improving MSA accuracy with two structures

Using two structures offers the possibility of making structure–structure (SAP, LSQman) as well as structure–sequence comparisons. Before using these methods to compute an MSA, we evaluated their pairwise accuracy (Figure 1(b) and (c)). As expected, we found SAP and LSQman to outperform TCoffee significantly. A measure made on the SAP alignments of every HOM39 pair of sequences (Figure 1(b)) indicates an average accuracy of 86.3%. The difference with TCoffee is highly significant, with a P-value of 10^-11 (Wilcoxon signed-rank test).
Under the same conditions, LSQman outperforms TCoffee by 12 points, with an average accuracy of 70.3%, and a difference that is also highly significant. We computed every HOM39 MSA while providing TCoffee with two structures: the one used previously with TC-Fugue and its most distantly related homologue (lowest percentage identity) within the considered HOM39 MSA. An attempt to use the most informative pairs guided this choice. In order to judge the individual contribution of each of the three structure-based methods (Fugue, SAP and LSQman) to the overall accuracy of 3DCoffee, we used them separately, each time in conjunction with SIM and NW (Table 2A). These three new flavors of TCoffee are named TC-Fugue, TC-SAP and TC-LSQ, and the combination of all the available pairwise methods (Fugue, SAP, LSQman, SIM and NW) constitutes the new 3DCoffee method (TC-3D in the Tables). As expected, TC-Fugue, TC-SAP and TC-LSQ all outperform TCoffee (Table 2A). Furthermore, TC-3D outperforms every alternative flavor and, given two structures, it produces MSAs on average ten percentage points better than TCoffee and 4.5 percentage points better than TC-Fugue (Table 2A). As indicated in Table 2A, all the differences reported between the new methods and TCoffee are statistically significant. Here as well, the extent of the improvement depends on the structure/sequence ratio (Figure 3(a)). Similar trends were observed when measuring the induced improvement (Figure 3(b)), which amounts to slightly less than 3.5 percentage points when comparing TC-3D with TCoffee (Table 2B). Although limited in amplitude, this improvement is also statistically significant.

Improving MSA accuracy with many structures

We examined the effect of varying the structure/sequence ratio for every HOM39 MSA. We did so by applying TC-3D on each HOM39 dataset, using structural sets that contained between one and N structures (N being the total number of sequences).
Figure 2. Comparative performances of TC-Fugue and TCoffee when using one structure. (a) Direct improvement. Each dot corresponds to an MSA taken from HOM39. The vertical axis indicates the difference of accuracy between a TC-Fugue and a TCoffee MSA. The horizontal axis indicates the ratio between the number of provided structures (one structure) and the total number of sequences contained in the MSA. (b) Induced improvement. Similar to (a), except that the MSA accuracy is measured while ignoring the contribution of the provided structure.

The structural sets were assembled in an incremental manner. Given an MSA, one starts with the most distantly related structure (as described above) before adding the structures of the least similar remaining sequences one by one, until N structural sets are defined for each HOM39 MSA. We then realigned every HOM39 MSA with each of its associated structural sets and compared the resulting alignments with the HOM39 reference. This makes a total of 200 MSAs (between four and 15 for each HOM39 protein family) that were used to compute the data presented in Figure 4(a), and 161 for Figure 4(b). The results are presented in the form of a boxplot in Figure 4(a) (direct improvement) and Figure 4(b) (induced improvement).

Table 2. Direct (A) and induced (B) improvement when providing two structures per HOM39 dataset

A. Direct improvement
Method      N str.   Avg. acc.   Difference with TCoffee   P-value (Wilcoxon signed-rank test)
TCoffee     0        42.24       0.0                       1.0
CW-183      0        38.43       −3.8                      2 × 10^-2
TC-Fugue    2        46.39       +4.0                      5 × 10^-3
TC-SAP      2        50.81       +8.5                      6 × 10^-6
TC-LSQ     2        47.26       +5.0                      2 × 10^-3
TC-3D       2        52.52       +10.3                     1 × 10^-5

B. Induced improvement
TCoffee     0        56.12       0.0                       1.0
CW-183      0        50.22       −5.9                      1 × 10^-1
TC-Fugue    2        58.07       +1.9                      2 × 10^-1
TC-SAP      2        58.49       +2.4                      2 × 10^-1
TC-LSQ     2        57.52       +1.4                      4 × 10^-1
TC-3D       2        59.55       +3.4                      2 × 10^-2

Direct improvement is measured on the complete alignment, including the used structures. The induced improvement is measured only on the sequences whose structures were not used. Method indicates the method being used: TCoffee (TCoffee with SIM and NW), CW-183 (CLUSTAL W version 1.83), TC-Fugue (TCoffee + NW + SIM + Fugue), TC-SAP (TCoffee + SIM + NW + SAP), TC-LSQ (TCoffee + SIM + NW + LSQman), TC-3D (TCoffee + SIM + NW + Fugue + SAP + LSQman). N str. indicates the number of structures provided. Avg. acc. indicates the average accuracy as measured with the CS score by comparison with the HOM39 reference alignments. P-value estimates the statistical significance of the difference between the considered method and TCoffee default, using the Wilcoxon signed-rank test.

Figure 4(a) indicates the existence of a reasonable correlation between the structure/sequence ratio and the MSA accuracy, although the data are not distributed evenly. One gains roughly ten percentage points in accuracy with every 20 percentage points increase of the structure/sequence ratio. An individual analysis of each protein family suggests that this trend is consistent across most of the HOM39 dataset, although the phenomenon varies in amplitude. When using 3DCoffee and all the available structures, in a procedure that amounts to assembling a multiple structural alignment, we obtained a score of 71.9% accuracy, a value short of the theoretical maximum of 100 that might have been expected if the unreliable regions of HOM39 had been removed from the evaluation. This value is an estimate of the correlation between the two-structure superposition method SAP and COMPARER, rather than an estimate of accuracy. The induced improvement follows a similar trend, albeit more modestly (Figure 4(b)), and yields a gain of roughly two percentage points for every 20 percentage points of ratio increase.
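The incremental construction of structural sets described above can be sketched as follows, assuming a precomputed matrix of pairwise percent identities. The ordering criterion (lowest average identity first) follows the text; the tie-breaking is an arbitrary choice of this sketch.

```python
def average_identity(seq_index, identity_matrix):
    """Mean percent identity of one sequence to all the others."""
    row = identity_matrix[seq_index]
    others = [v for j, v in enumerate(row) if j != seq_index]
    return sum(others) / len(others)

def incremental_structure_sets(identity_matrix):
    """Order sequences from most to least distant (lowest average identity
    first) and return the N nested structural sets used in the benchmark:
    set k contains the k most distant sequences."""
    n = len(identity_matrix)
    order = sorted(range(n), key=lambda i: average_identity(i, identity_matrix))
    return [order[:k] for k in range(1, n + 1)]
```

Each nested set is then passed to TC-3D as the list of sequences whose structures may be used, yielding one MSA per structure/sequence ratio for every family.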
The distribution of this induced improvement is even less regular than that of the direct improvement. It indicates that, in the HOM39 dataset, sequences benefit only modestly from the incorporation of the 3D information associated with one of their remote homologues.

Conclusion

3DCoffee is a novel method that takes advantage of structural information for aligning sequences. We benchmarked 3DCoffee using HOM39, a collection of high-quality reference S-MSAs. We used the TCoffee package to mix sequences, structures and structure/sequence alignment methods, and found this new protocol to improve MSA accuracy in a manner that depends on the structure/sequence ratio within the considered dataset. Our results suggest that using structures can improve the alignment accuracy of sequences without a known structure. The 3DCoffee protocol offers several advantages. It is relatively fast: given all the pairwise alignments, it takes a few seconds to align ten sequences 200 residues long on a standard workstation. It is also very flexible and could easily be adapted to include any structure analysis method able to deliver a sequence alignment. We show here that one can effectively use this protocol to combine the output of methods based on different principles, like a rigid structure superposition method (LSQman) and a non-rigid one (SAP). This makes 3DCoffee a versatile tool that could easily be used in MSA computation the way meta-methods are used in structure prediction.42 Yet, this study lends itself to a more paradoxical conclusion. Although structural information clearly helps improve MSA accuracy, it is surprising to find that its usage lacks the dramatic effect one may have expected. For instance, using one structure on a dataset of distantly related sequences increases the average accuracy by only four percentage points (and a maximum of ten).
One may have hoped that the first one or two structures would have delivered a larger share of the potential improvement. Yet this does not happen, and every extra structure has about the same effect as the others on the overall accuracy, thus yielding a quasi-linear correlation between the structure/sequence ratio and the overall MSA accuracy. This finding suggests that the standard methods we used here are not yet able to let the structural information diffuse optimally onto distantly related sequences. As a consequence, the best way to obtain a highly accurate MSA of remote homologues is to use more than one structure and, if possible, one structure for each sequence (or group of closely related sequences). On the basis of these results, one may argue that, given current methods, the "one structure for every protein family" strategy43 may prove short of solving all the alignment problems faced by homology modeling. Achieving this purpose will require either better sequence comparison methods or more structures.

Figure 3. Comparative performances of TC-3D and TCoffee when using two structures. (a) Direct improvement. Each dot corresponds to an MSA taken from HOM39 (see Methods). The vertical axis indicates the difference in accuracy between a TC-3D and a TCoffee MSA. The horizontal axis indicates the ratio between the number of provided structures (two structures) and the total number of sequences contained in the MSA. (b) Induced improvement. Similar to (a), with the MSA accuracy computed on the sequences without known structure.

Figure 4. Alignment accuracy and structure/sequence ratio. (a) Each box indicates the average accuracy difference between TC-3D and TCoffee when computing HOM39 MSAs with various structure/sequence ratios: [0–20] (6 values), [21–40] (27 values), [41–60] (44 values), [61–80] (44 values), [81–100] (20 values).
The vertical axis shows the average difference of accuracy and the horizontal axis the average structure/sequence ratio. The boxplot was generated with the R package using standard settings. Each box stretches from its lower hinge (defined as the 25th percentile) to its upper hinge (the 75th percentile). The median is shown as a line across the box. The bottom whisker indicates the smallest data value larger than the lower inner fence. The lower inner fence (not drawn) is equal to the 25th percentile minus 1.5 times the spread. Values below the lower inner fence are plotted as dots. The upper whisker is plotted in a similar fashion, using the upper hinge as reference. (b) Induced improvement. Identical to (a), with the measure of accuracy made on the sequences without known structure only.

Acknowledgements

Orla O'Sullivan was paid from Enterprise Ireland, and Hewlett Packard provided some support. We thank Willie Taylor for helping us with setting up SAP, and Kenji Mizuguchi for helping with Fugue. The comments of both referees were very helpful in improving the manuscript. We thank Jean-Michel Claverie for his many suggestions in improving and clarifying this manuscript.

References

1. Lesk, A. M. & Chothia, C. (1980). How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol. 136, 225–270.
2. Al-Lazikani, B., Sheinerman, F. B. & Honig, B. (2001). Combining multiple structure and sequence alignments to improve sequence detection and alignment: application to the SH2 domains of Janus kinases. Proc. Natl Acad. Sci. USA, 98, 14796–14801.
3. Marchler-Bauer, A., Panchenko, A. R., Ariel, N. & Bryant, S. H. (2002). Comparison of sequence and structure alignments for protein domains. Proteins: Struct. Funct. Genet. 48, 439–446.
4. Brenner, S. E. (2001). A tour of structural genomics. Nature Rev. Genet. 2, 801–809.
5. Duret, L.
& Abdeddaim, S. (2000). Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. In Bioinformatics, Sequence, Structure and Databanks (Higgins, D. & Taylor, W., eds), pp. 135–147, Oxford University Press, Oxford.
6. Ng, P. C. & Henikoff, S. (2002). Accounting for human polymorphisms predicted to affect protein function. Genome Res. 12, 436–446.
7. Wang, L. & Jiang, T. (1994). On the complexity of multiple sequence alignment. J. Comput. Biol. 1, 337–348.
8. Thompson, J. D., Plewniak, F., Ripp, R., Thierry, J. C. & Poch, O. (2001). Towards a reliable objective function for multiple sequence alignments. J. Mol. Biol. 314, 937–951.
9. Lipman, D. J., Altschul, S. F. & Kececioglu, J. D. (1989). A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA, 86, 4412–4415.
10. Hogeweg, P. & Hesper, B. (1984). The alignment of sets of sequences and the construction of phylogenetic trees. An integrated method. J. Mol. Evol. 20, 175–186.
11. Mizuguchi, K., Deane, C. M., Blundell, T. L. & Overington, J. P. (1998). HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7, 2469–2471.
12. Raghava, G. P., Searle, S. M., Audley, P. C., Barber, J. D. & Barton, G. J. (2003). OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4, 47.
13. Thompson, J. D., Plewniak, F. & Poch, O. (1999). BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88.
14. Katoh, K., Misawa, K., Kuma, K. & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucl. Acids Res. 30, 3059–3066.
15. Lassmann, T. & Sonnhammer, E. L. (2002). Quality assessment of multiple alignment programs. FEBS Letters, 529, 126–130.
16. Lee, C., Grasso, C. & Sharlow, M. F. (2002). Multiple sequence alignment using partial order graphs. Bioinformatics, 18, 452–464.
17.
Thompson, J. D., Plewniak, F. & Poch, O. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucl. Acids Res. 27, 2682–2690.
18. Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.
19. Pearson, W. R. & Miller, W. (1992). Dynamic programming algorithms for biological sequence comparison. Methods Enzymol. 210, 575–601.
20. Huang, X., Hardison, R. C. & Miller, W. (1990). A space-efficient algorithm for local similarities. Comput. Appl. Biosci. 6, 373–381.
21. Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
22. Brenner, S. E., Chothia, C. & Hubbard, T. J. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.
23. Eidhammer, I., Jonassen, I. & Taylor, W. R. (2000). Structure comparison and structure patterns. J. Comput. Biol. 7, 685–716.
24. Sillitoe, I. & Orengo, C. (2002). Protein structure comparison. In Bioinformatics: Genes, Proteins and Computers (Orengo, C., Jones, D. & Thornton, J., eds), pp. 250–265, BIOS Scientific Publishers, Oxford.
25. Kabsch, W. (1978). A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallog. sect. A, 34, 827–828.
26. Russell, R. B. & Barton, G. J. (1992). Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins: Struct. Funct. Genet. 14, 309–323.
27. Taylor, W. R. & Orengo, C. A. (1989). Protein structure alignment. J. Mol. Biol. 208, 1–22.
28. Holm, L. & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123–138.
29. Jones, D. T., Orengo, C. A. & Thornton, J. M. (1996).
Protein folds and their recognition from sequence. In Protein Structure Prediction (Sternberg, M. J. E., ed.), 1st edit., vol. 170, pp. 173–206, Oxford University Press, Oxford.
30. Cristobal, S., Zemla, A., Fischer, D., Rychlewski, L. & Elofsson, A. (2001). A study of quality measures for protein threading models. BMC Bioinform. 2, 5.
31. Bryant, S. H. & Lawrence, C. E. (1993). An empirical energy function for threading protein sequence through the folding motif. Proteins: Struct. Funct. Genet. 16, 92–112.
32. Shi, J., Blundell, T. L. & Mizuguchi, K. (2001). FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310, 243–257.
33. Taylor, W. R., Flores, T. P. & Orengo, C. A. (1994). Multiple protein structure alignment. Protein Sci. 3, 1858–1870.
34. Sali, A. & Blundell, T. L. (1990). Definition of general topological equivalence in protein structures. J. Mol. Biol. 212, 403–428.
35. Sutcliffe, M. J., Haneef, I., Carney, D. & Blundell, T. L. (1987). Knowledge based modelling of homologous proteins. Part I: three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Eng. 1, 377–384.
36. Notredame, C., Higgins, D. G. & Heringa, J. (2000). T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217.
37. Huang, X. & Miller, W. (1991). A time-efficient, linear-space local similarity algorithm. Advan. Appl. Math. 12, 337–357.
38. Thompson, J., Higgins, D. & Gibson, T. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673–4690.
39. Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425.
40. Jones, T. A. & Kleywegt, G.
J. (1999). CASP3 comparative modeling evaluation. Proteins: Struct. Funct. Genet. Suppl. 3, 30–46.
41. Karplus, K. & Hu, B. (2001). Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 17, 713–720.
42. Bourne, P. E. (2003). CASP and CAFASP experiments and their findings. Methods Biochem. Anal. 44, 501–507.
43. Vitkup, D., Melamud, E., Moult, J. & Sander, C. (2001). Completeness in structural genomics. Nature Struct. Biol. 8, 559–566.

Edited by J. Thornton

(Received 14 November 2003; received in revised form 20 April 2004; accepted 22 April 2004)

Nucleic Acids Research, 2004, Vol. 32, Web Server issue W37–W40, DOI: 10.1093/nar/gkh382

3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment

Olivier Poirot1, Karsten Suhre1, Chantal Abergel1, Eamonn O'Toole3 and Cédric Notredame1,2,*

1Information Génomique et Structurale UPR2589-CNRS, CNRS, 31, Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France, 2Swiss Institute of Bioinformatics, Lausanne University, Chemin des Boveresses, 1066 Epalinges, Switzerland and 3hp High Performance Technical Computing Division, Hewlett Packard, BallyBrit, Galway, Ireland

Received February 14, 2004; Accepted March 16, 2004

ABSTRACT

This paper presents 3DCoffee@igs, a web-based tool dedicated to the computation of high-quality multiple sequence alignments (MSAs). 3D-Coffee makes it possible to mix protein sequences and structures in order to increase the accuracy of the alignments. Structures can be either provided as PDB identifiers or directly uploaded to the server. Given a set of sequences and structures, pairs of structures are aligned with SAP while sequence–structure pairs are aligned with Fugue. The resulting collection of pairwise alignments is then combined into an MSA with the T-Coffee algorithm. The server and its documentation are available from http://igs-server.cnrs-mrs.fr/Tcoffee/.
INTRODUCTION

The assembly of an accurate multiple sequence alignment (MSA) is a key step in many sequence analysis procedures. One could cite, among others: the identification of a protein signature such as a Prosite pattern (1), the building of a domain profile (or HMM) needed for identifying the most remote members of a protein family (2), structure prediction and homology modeling (3) and phylogenetic analysis (4). More recently, MSAs have also proven useful to characterize nsSNPs (non-synonymous Single Nucleotide Polymorphisms) (5,6). The success of such applications depends very much on the MSA quality, hence the importance of accuracy when computing an alignment. In practice, structurally correct alignments are considered to be a good starting point for most MSA applications (with the possible exception of phylogenetic reconstruction), and established collections of reference structural alignments are widely used to benchmark and train existing MSA packages (7,8). However, when state-of-the-art packages are applied to sets of distantly related sequences, they deliver alignments that are only partly correct from a structural point of view (8), thus suggesting that sequence-based alignment procedures can still be greatly improved. In the current situation, the best way to produce a high-quality MSA remains the assembly of a multiple structural alignment. Unfortunately, few examples exist where enough related structures are available to carry out such a task. An elegant alternative to the use of many structures is to mix sequences and structures, in the hope that the 3D information contained within the structures will help deliver a better alignment of the other sequences. Such a mix also constitutes a realistic solution, considering the increasing proportion of sequences without a known structure and the decreasing proportion of protein families not associated with at least one structure.
However, the problem of combining sequences and structures has not yet been extensively addressed, and only a handful of methods are available that allow the seamless combining of sequences and structures (9) while appropriately using 3D information. Here we present 3DCoffee@igs, a web server especially designed to combine sequences and structures by seamlessly integrating into T-Coffee (10) the three types of alignment methods needed for this purpose: sequence–sequence, sequence–structure and structure–structure alignment methods. When using one or more structures, the alignments thus produced are more accurate than similar alignments based on sequence information alone, as judged by comparison with reference structure-based alignments (O. O'Sullivan, K. Suhre, D. Higgins and C. Notredame, submitted for publication). The inclusion of a threading method (sequence–structure alignment) makes it possible to use as few as one structure.

METHODS

Standard T-Coffee sequence alignment assembly

We use T-Coffee to mix sequences and structures. Given a set of sequences, the regular T-Coffee procedure involves two steps.

*To whom correspondence should be addressed. Tel: +33 491 164 606; Fax: +33 491 164 549; Email: cedric.notredame@gmail.com
The time complexity previously reported for this procedure is O(N^3L^2), N being the number of sequences and L their average length. However, in 3D-Coffee, SAP is the limiting factor, with a time complexity in the order of O(L^3).

USING THE TCOFFEE@IGS SERVER

3D-Coffee is a new service available through the previously presented Tcoffee@igs server (17). It is maintained by IGS (Information Génomique et Structurale) and runs on a dedicated quad-processor Alpha ES45 server. It supports the analysis of a maximum of 100 sequences with a maximum of 2000 residues each. The 3D-Coffee service is provided in two versions, a regular and an advanced version. The regular version requires limited input from the user, while the advanced version offers more possibilities, such as uploading personal PDB structures and controlling the methods used to compute the library.

Tcoffee@igs server

The homepage of the server (http://igs-server.cnrs-mrs.fr/Tcoffee/) contains pointers to the four types of computation performed: (i) The Make a Multiple Alignment section gives access to the standard computation of a T-Coffee MSA, using the default parameters of the program, as described in (10). (ii) The Evaluate a Multiple Alignment section provides an alignment evaluation using the CORE method, as described in (17). (iii) The Combine Multiple Alignments section makes it possible to combine several alignments into one. The advanced section of each server offers extra control over the library computation (choice of the methods) as well as a larger number of output options. These servers have all been previously described in (16). (iv) The last section, Align Structures (3D-Coffee), is new and is described in the next paragraph.

Align structures and sequences with 3DCoffee::regular

The 3DCoffee::regular server takes as input a set of sequences in FASTA format. Among the sequences, those with a 3D structure must be named according to their PDB identifier.
If the PDB file contains several chains, the chain index (letter or number) must be added to the name (e.g. 1pptA). If the sequence provided in the FASTA file is a subsequence of the indicated chain, T-Coffee aligns the provided sequence with its full PDB counterpart and makes sure that only the appropriate 3D information is used for the alignment computation. This comparison also handles slight sequence discrepancies between the PDB and the user-provided sequence. In the regular mode of 3D-Coffee, the handling of the structures is entirely under T-Coffee control, which uses the FASTA information to gather the structures and chop them to the relevant portion. For users familiar with the stand-alone version of T-Coffee, we give the corresponding command line:

t_coffee -in seq:fasta Msap_pair Mfugue_pair Mslow_pair Mlalign_id_pair

sap_pair, fugue_pair, slow_pair and lalign_id_pair are pairwise methods used to compute the T-Coffee library. The first step of the regular procedure is the computation of a collection of pairwise alignments where, for each possible pair of sequences in the dataset, the program computes the best global alignment and the 10 best local alignments [using the Sim algorithm from the Lalign package (11)]. This collection of pairwise alignments is named a library. The second step of the procedure involves the assembly of an MSA with a high level of consistency with the alignments contained in the library. Since T-Coffee uses a heuristic, the optimality of this process is not guaranteed, although the results are generally satisfactory as judged by comparison with alternative optimization methods (12). The assembly procedure is very similar to that described for ClustalW (13); extensive details are available in the original publication (10).

3D-Coffee protocol

The 3D-Coffee protocol takes advantage of the method-independent manner in which T-Coffee uses its libraries.
Rather than filling the library with sequence-based pairwise alignments, 3D-Coffee compiles it using three types of pairwise methods: sequence–sequence, structure–structure and structure–sequence (threading) alignment procedures. Among the vast variety of structure comparison algorithms, we selected SAP (14) for the structure–structure alignments and Fugue (15) for the structure–sequence comparisons. A full validation of these choices is detailed in O. O'Sullivan, K. Suhre, D. Higgins and C. Notredame (submitted for publication). Our main criterion was the relatively high accuracy of these two methods and their ease of integration within the T-Coffee framework. It is nonetheless worth pointing out that any method with similar characteristics (i.e. able to deliver a sequence alignment) could easily be added to the procedure we describe here. In practice, given a sequence dataset, the program starts by identifying the sequences that are associated with a structure and those that are not. It then considers all the possible pairs and applies the appropriate methods to each pair. For instance, given a pair of structures, the program will successively make a global pairwise sequence alignment, a local pairwise sequence alignment and a structure-based sequence alignment with SAP. If only one of the two sequences has a known structure, Fugue will be used instead of SAP. The resulting pairwise alignments are compiled into a list of weighted pairs of aligned residues found in the individual alignments. Each pair receives a weight equal to the average level of identity within the pairwise alignment where it occurred. When two or more alignments contribute the same pair, their respective weights are added to yield the final weight. The collection of weighted residue pairs constitutes the T-Coffee library. T-Coffee uses the library to assemble a standard progressive alignment in a ClustalW-like manner.
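The library compilation described above can be sketched in a few lines. The sketch below is an illustration of the weighting scheme only, not the actual T-Coffee code; the input layout (tuples of sequence identifiers, aligned residue index pairs and a percent-identity weight) is an assumption made for the example.

```python
from collections import defaultdict

def build_library(pairwise_alignments):
    """Compile a T-Coffee-style library from pairwise alignments.

    Each alignment is given as (seq_a, seq_b, aligned_pairs, pct_id),
    where aligned_pairs lists (residue_index_a, residue_index_b) tuples
    and pct_id is the percent identity of that alignment. Every residue
    pair is weighted by the identity of the alignment it came from;
    pairs contributed by several alignments have their weights summed,
    as described in the text."""
    library = defaultdict(float)
    for seq_a, seq_b, pairs, pct_id in pairwise_alignments:
        for res_a, res_b in pairs:
            library[(seq_a, res_a, seq_b, res_b)] += pct_id
    return dict(library)
```

A pair supported by both a global and a local alignment thus ends up with a higher weight than a pair seen only once, which is what lets consistent signals dominate during the later progressive assembly.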
The program starts by computing the distance matrix of the sequences and uses it to estimate a guide tree. The guide tree controls the order in which the sequences are included one by one in the MSA. Each sequence is incorporated using the library in place of a substitution matrix. A recent modification of the T-Coffee algorithm (to be described elsewhere) has made it possible to significantly reduce the time complexity of the algorithm, down to O(N^2L^2) from the previously reported O(N^3L^2).

Figure 1. Typical output of a standard 3D-Coffee computation. Five structures have been aligned with a sequence (Q53396). The display is the ESPript (18) output produced by the Tcoffee@igs server. The CORE index is displayed on the alignment and indicates the relative reliability of the various sections (color code: blue, unreliable; green, low reliability; red, highly reliable portion of the alignment). DSSP (19) is used to determine the secondary structures from the PDB coordinates. Blue, green and yellow portions are mostly incorrectly aligned, as judged by comparison with the HOMSTRAD reference alignment (9).

Once the computation is over, the server returns a page of links to the produced result files. An ESPript (18) post-processing step makes it possible to visualize the secondary structure elements within the used structures (Figure 1). The returned alignment is a sequence alignment, albeit generally improved by the use of structural information. Systematic benchmarking, carried out on a subset of HOMSTRAD (O. O'Sullivan, K. Suhre, D. Higgins and C. Notredame, submitted for publication), indicates that the accuracy of mixed sequence/structure alignments increases proportionally to the amount of structural information provided.

The 3DCoffee::advanced server

The advanced server makes it possible to upload user-defined PDB structures (up to three).
The sequences of the uploaded structures should not be included within the FASTA sequences. The limitation to three private structures is arbitrary and will be increased upon request. In case the file contains more than one chain, the program extracts only the first one. It is the user's responsibility to provide the correct chain. The advanced server also makes it possible to control the computation of the T-Coffee library by selecting the methods one wishes to include. For instance, if all the sequences have a known 3D structure, it is advisable to use only sap_pair, the structure–structure alignment method, to generate a structure-based MSA.

CONCLUSION

In this paper, we present 3D-Coffee, a major extension of the Tcoffee@igs server. This new feature of the server makes it possible to combine sequences and structures within an MSA, thus producing high-quality MSAs. The method we present here is versatile and easy to use. It affords the possibility of seamlessly combining structure and sequence information, and private and public data, without the need to install additional programs such as SAP and Fugue locally. It certainly constitutes an adequate means to efficiently use available structural data. Future plans involve the addition of new modules to make it easier to map structural information onto sequence data. We strongly encourage users to send us their feedback.

REFERENCES

1. Bairoch,A., Bucher,P. and Hofmann,K. (1997) The PROSITE database, its status in 1997. Nucleic Acids Res., 25, 217–221.
2. Mulder,N.J., Apweiler,R., Attwood,T.K., Bairoch,A., Barrell,D., Bateman,A., Binns,D., Biswas,M., Bradley,P., Bork,P. et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.
3. Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
4. Phillips,A., Janies,D.
and Wheeler,W. (2000) Multiple sequence alignment in phylogenetic analysis. Mol. Phylogenet. Evol., 16, 317–330. 5. Ng,P.C. and Henikoff,S. (2002) Accounting for human polymorphisms predicted to affect protein function. Genome Res., 12, 436–446. 6. Ramensky,V., Bork,P. and Sunyaev,S. (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res., 30, 3894–3900. 7. Thompson,J.D., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88. 8. O’Sullivan,O., Zehnder,M., Higgins,D., Bucher,P., Grosdidier,A. and Notredame,C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19, I215–I221. 9. de Bakker,P.I., Bateman,A., Burke,D.F., Miguel,R.N., Mizuguchi,K., Shi,J., Shirai,H. and Blundell,T.L. (2001) HOMSTRAD: adding sequence information to structure-based alignments of homologous protein families. Bioinformatics, 17, 748–749. 10. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. 11. Huang,X. and Miller,W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math., 12, 337–357. 12. Notredame,C., Holm,L. and Higgins,D.G. (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14, 407–422. 13. Thompson,J., Higgins,D. and Gibson,T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690. 14. Taylor,W.R. and Orengo,C.A. (1989) Protein structure alignment. J. Mol. Biol., 208, 1–22. 15. Shi,J., Blundell,T.L. and Mizuguchi,K. (2001) FUGUE: sequence– structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol., 310, 243–257. 16. 
Poirot,O., O'Toole,E. and Notredame,C. (2003) Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments. Nucleic Acids Res., 31, 3503–3506.
17. Notredame,C. and Abergel,C. (2003) Using multiple sequence alignments to assess the quality of genomic data. In Andrade,M. (ed.), Bioinformatics and Genomes: Current Perspectives. Horizon Scientific Press, Norfolk, UK, pp. 30–50.
18. Gouet,P., Robert,X. and Courcelle,E. (2003) ESPript/ENDscript: extracting and rendering sequence and 3D information from atomic structures of proteins. Nucleic Acids Res., 31, 3320–3323.
19. Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577–2637.

BIOINFORMATICS Vol. 19 no. 1 2003, pages i1–i7, DOI: 10.1093/bioinformatics/btg1029

APDB: a novel measure for benchmarking sequence alignment methods without reference alignments

Orla O'Sullivan1, Mark Zehnder3, Des Higgins1, Philipp Bucher3, Aurélien Grosdidier3 and Cédric Notredame2,3,*

1Department of Biochemistry, University College, Cork, Ireland, 2Information Génétique et Structurale, CNRS UMR-1889, 31, Chemin Joseph Aiguier, 13402 Marseille, France and 3Swiss Institute of Bioinformatics, Chemin des Boveresse, 155, 1066 Epalinges, Switzerland

Received on January 6, 2000; revised on Month xx, 2000; accepted on February 20, 2000

ABSTRACT

Motivation: We describe APDB, a novel measure for evaluating the quality of a protein sequence alignment, given two or more PDB structures. This evaluation does not require a reference alignment or a structure superposition. APDB is designed to efficiently and objectively benchmark multiple sequence alignment methods.
Results: Using existing collections of reference multiple sequence alignments and existing alignment methods, we show that APDB gives results that are consistent with those obtained using conventional evaluations. We also show that APDB is suitable for evaluating sequence alignments that are structurally equivalent. We conclude that APDB provides an alternative to more conventional methods used for benchmarking sequence alignment packages.
Availability: APDB is implemented in C; its source code and its documentation are available free on request from the authors.
Contact: cedric.notredame@gmail.com

INTRODUCTION

We introduce APDB (Analyze alignments with PDB), a new method for benchmarking and improving multiple sequence alignment packages with minimal human intervention. We show how it is possible to avoid the use of reference alignments when PDB structures are available for at least two homologous sequences in a test alignment. Using this method it should become possible to systematically benchmark or train multiple sequence alignment methods using all known structures, in a completely automatic manner. There are strong justifications for improving multiple sequence alignment methods. Many sequence analysis techniques used in bioinformatics require the assembly of a multiple sequence alignment at some point. These include phylogenetic tree reconstruction, detection of remote homologues through the use of profiles or HMMs, secondary and tertiary structure prediction and, more recently, the identification of the nsSNPs (non-synonymous Single Nucleotide Polymorphisms) that are most likely to alter a protein's function. All of these important applications demonstrate the need to improve existing multiple sequence alignment methods and to establish their true limits and potential.

*To whom correspondence should be addressed.
Doing so is complicated, however, because most multiple sequence alignment methods rely on a complex combination of greedy heuristic algorithms meant to optimize an objective function. This objective function is an attempt to quantify the biological quality of an alignment. Almost every multiple alignment package uses a different empirical objective function of unknown biological relevance. In practice, most of these algorithms are known to perform well on some protein families and less well on others, but it is difficult to predict this in advance. It can also be very hard to establish the biological relevance of a multiple alignment of poorly characterized protein families. See Duret and Abdeddaim (2000) and Notredame (2002) for two recent reviews of the wide variety of techniques that have been used to make multiple alignments. Given such a wide variety of methods and such poor theoretical justification for most of them, the main option for a rational comparison is systematic benchmarking. This is usually accomplished by comparing the alignments produced by various methods with 'reference' alignments of the same sequences assembled by specialists with the help of structural information. Barton and Sternberg (1987) made an early systematic attempt to validate a multiple sequence alignment method using structure-based alignments of globins and immunoglobulins. Later on, Notredame and Higgins (1996) used another collection of such alignments assembled by Pascarella and Argos (1992). More recently, it has become common practice to use BAliBASE (Thompson et al., 1999), a collection of multiple sequence alignments assembled by specialists and designed to systematically address the different types of problems that alignment programs encounter, such as the alignment of a distant homologue or long insertions and deletions.
In this work, we examined two such reference collections: BAliBASE and HOMSTRAD (Mizuguchi et al., 1998), a collection of high-quality multiple structural alignments. There are two simple ways to use a reference alignment for the purpose of benchmarking (Karplus and Hu, 2001). One may count the number of pairs of aligned residues in the target alignment that also occur in the reference alignment and divide this number by the total number of pairs of residues in the reference. This is the Sum of Pairs Score (SPS). Its main drawback is that it is not very discriminating and tends to even out differences between methods. The more popular alternative is the Column Score (CS), where one measures the percentage of columns in the target alignment that also occur in the reference alignment. This is widely used and is considered to be a stringent measure of alignment performance. In order to avoid the problem of unalignable sections of protein sequences (i.e. segments that cannot be superimposed), it is common practice to annotate the most reliable regions of a multiple structural alignment and to consider only these core regions for the evaluation. In BAliBASE, the core regions make up slightly less than 50% of the total number of alignment columns. Such use of multiple sequence alignment collections for benchmarking is very convenient because of its simplicity. However, a major problem is the heavy reliance on the correctness of the reference alignment. This is serious because, by nature, these reference alignments are at least partially arbitrary. Although structural information can be handled more objectively than sequence information, the assembly of a multiple structural alignment remains a very complex problem for which no exact solution is known. As a consequence, any reference multiple alignment based on structure will necessarily reflect some bias from the methods and the specialist who made the assembly.
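The two reference-based scores described above are simple to compute. The sketch below is an illustration rather than code from any benchmarking package; the column representation (each alignment as a list of columns, each column holding one residue index per sequence, or None for a gap) is an assumption made for the example.

```python
def aligned_pairs(alignment):
    """All aligned residue pairs, as (seq_i, res_i, seq_j, res_j) tuples,
    from an alignment given as a list of gap-annotated columns."""
    pairs = set()
    for column in alignment:
        filled = [k for k, res in enumerate(column) if res is not None]
        for a in range(len(filled)):
            for b in range(a + 1, len(filled)):
                i, j = filled[a], filled[b]
                pairs.add((i, column[i], j, column[j]))
    return pairs

def sum_of_pairs_score(test, reference):
    """SPS: fraction of the reference's residue pairs recovered in the test
    alignment (assumes the reference contains at least one pair)."""
    ref_pairs = aligned_pairs(reference)
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)

def column_score(test, reference):
    """CS: fraction of reference columns reproduced exactly in the test."""
    test_columns = {tuple(c) for c in test}
    return sum(tuple(c) in test_columns for c in reference) / len(reference)
```

For instance, a test alignment that turns one aligned reference column into two gapped columns loses that column under both measures, but CS also penalizes any column whose full residue combination differs, which is why it is the more stringent of the two.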
The second drawback is that, given a set of structures, there can be more than one correct alignment. This plurality results from the fact that a structural superposition does not necessarily translate unambiguously into one sequence alignment. For instance, if we consider that the residues to be aligned correspond to the residues whose alpha carbons are the closest in the 3-D superposition, it is easy to imagine that sometimes an alpha carbon can be equally close to the alpha carbons of two potentially homologous residues. Most structure-based sequence alignment procedures break this tie in an arbitrary fashion, leading to a reference alignment that represents only one possible arrangement of aligned residues. This problem becomes most serious when the sequences one is considering are distantly related (less than 30% identity). Unfortunately, this is also the most interesting level of similarity, where most sequence alignment methods make errors and where it is important to accurately benchmark existing algorithms. The APDB method that we describe in this work has been designed to specifically address this problem and to remove, almost entirely, the need for arbitrary decisions when using structures to evaluate the quality of a multiple sequence alignment. In APDB, a target alignment is not evaluated against a reference alignment. Rather, we measure the quality of the structural superposition induced by the target alignment, given any structures available for the sequences it contains. By treating the alignment as the result of some sort of structure superposition, we simply measure the fraction of aligned residues whose structural neighborhoods are similar. This makes it possible to avoid the most expensive and controversial element of the MSA benchmarking methods: the reference multiple sequence alignment. APDB requires just three parameters.
This is tiny if we compare it with any reference alignment, where each pair of aligned residues can arguably be considered as a free parameter. In this work we show how the APDB measure was designed and characterized on a few carefully selected pairs of structures. Among other things, we explored its sensitivity to parameter settings and to various sequence and structure properties, such as similarity, length, or alignment quality. Finally, APDB was used to benchmark known methods using two popular data sets: BAliBASE and HOMSTRAD. These were either used as standard reference alignments or as collections of structures suitable for APDB. It should be noted that there are several methods for evaluating the quality of structure models and predictions using known structures. The development of these has been driven by the need to evaluate entries in the CASP protein structure prediction competition, and they have been reviewed by Cristobal et al. (2001). They all depend on generating a structure superposition between the model and the target and evaluating the quality of the match using, for example, the RMSD between the two, or some measure of the number of alpha carbons that superimpose well (e.g. MaxSub by Siew et al. (2000)). In principle, this could also be used to benchmark alignment methods. However, one serious disadvantage is the requirement for a superposition, which is itself a difficult problem. A second disadvantage is the way RMSD measures behave with different degrees of sequence divergence, and their sensitivity to local or global alignment differences. We have carefully designed APDB so that on the one hand it remains very simple, but on the other hand it is able to measure the similarity of the structural environments in a manner that lends itself to measuring alignment quality.

SYSTEM AND METHODS

The APDB scoring function

APDB is a measure designed to evaluate how consistent an alignment is with the structure superposition this alignment implies. Let us imagine that A and B are two homologous structures. If the structure of sequence A tells us that the residues X and Z are 9 Å apart, then we expect to find a similar distance between the two residues Y and W of sequence B that are aligned with X and Z. The difference between these two distances is an indicator of the alignment quality.

A  aaaaaaaaaaaXaaaaaaaaaaaaaaaZaaaaaaa
              |_____ 9 Å _____|
B  bbbbbbbbbbbYbbbbbbbbbbbbbbbWbbbbbbb
              |_____ 9 Å? ____|

In APDB we take this idea further by measuring the differences of distances between X:Y (X aligned with Y) and Z:W within a bubble of fixed radius centered around X and Y. The bubble makes APDB a local measure, less sensitive than a classic RMSD measure to the existence of non-superposable parts in the structures being considered. Furthermore, it ensures that a bad portion of the alignment does not dramatically affect the overall alignment evaluation. The typical radius of this bubble is 10 Å, and it contains 20 to 40 amino acids. We consider two residues to be properly aligned if the distances from these two residues to the majority of their neighbors within the bubble are consistent between the two structures. In other words, we check whether a structural neighborhood is supportive of the alignment of the two residues that sit at its center. This can be formalized as follows:

X:Y is a pair of aligned residues in the alignment.
N is the number of aligned pairs of residues.
d(X, Z) is the distance between the Cα of the two residues X and Z within one structure.
Brad is the radius of the bubble set around residues X and Y (Brad = 10 Å by default).
T1 is the maximum difference of distance between d(X, Z) and d(Y, W) (T1 = 1 Å by default).
T2 is the minimal percentage of residues that must respect the criterion set by T1 for X and Y to be considered correctly aligned (70% by default).

considered_X:Y(Z:W) is equal to 1 if the pair Z:W is in the bubble defined by the pair X:Y:

    considered_X:Y(Z:W) = 1 if d(X, Z) < Brad and d(Y, W) < Brad    (1)

correct_X:Y(Z:W) is equal to 1 if d(X, Z) and d(Y, W) are sufficiently similar, as set by T1:

    correct_X:Y(Z:W) = 1 if d(X, Z) < Brad and d(Y, W) < Brad and |d(X, Z) - d(Y, W)| < T1    (2)

aligned(X:Y) is equal to 1 if most pairs Z:W in the X:Y bubble are correct, as set by T2:

    aligned(X:Y) = 1 if [Σ_Z:W correct_X:Y(Z:W) / Σ_Z:W considered_X:Y(Z:W)] × 100 > T2    (3)

Finally, the APDB measure for the entire alignment is defined as:

    APDB Score = Σ_X:Y aligned(X:Y) / N    (4)

Given a multiple alignment of sequences with known structures, the APDB score can easily be turned into a sum-of-pairs score by summing the APDB score of each pair of structures and dividing it by the total number of sequence pairs considered.

Design of a benchmark system for APDB

In order to study the behavior of APDB, we used two established collections of reference alignments: BAliBASE (Thompson et al., 1999) and HOMSTRAD (Mizuguchi et al., 1998). First, we extracted nine structure-based pairwise sequence alignments from HOMSTRAD, which we refer to as HOM 9. These reference alignments were chosen so that their sequence identities (as measured on the HOMSTRAD reference alignments) evenly cover the range 17 to 90%. These alignments are between 200 and 300 residues long and are used for the detailed analysis and parameterization of APDB. The PDB names of the pairs of structures are given in the legend of Figure 2. Next, in order to assemble a discriminating test set, we selected the most difficult alignments from HOMSTRAD. We chose alignments which had at least four sequences and where the average percent identity was 25% or less.
This resulted in a selection of 43 alignments, which we refer to as HOM 43. BAliBASE version 1 has 141 alignments divided into 5 reference groups. We chose all alignments where 2 or more of the sequences had a known structure. This resulted in a subset of 91 alignments from the first 4 reference groups of BAliBASE, which we refer to as BALI 91. Minor adjustments had to be made to ensure consistency between BAliBASE sequences and the corresponding PDB files. The HOM 43 and BALI 91 test sets are available in the APDB distribution.

A second method for generating sub-optimal alignments was based on the PROSUP package (Lackner et al., 2000). PROSUP takes two structures, makes a rigid-body superposition and generates all the sequence alignments that are consistent with this superposition, thus producing alternative alignments that are equivalent from a structural point of view. Typically, PROSUP yields 5 to 25 alternative alignments within a very narrow range of RMSDs.

Comparison of APDB with other standard measures

In order to compare the APDB measure with more conventional measures, we used the Column Score (CS) measure as provided by the aln_compare package (Notredame et al., 2000). CS measures the percentage of columns in a test alignment that also occur in the reference alignment. In BAliBASE this measure is restricted to those columns annotated as core regions in the reference. Although alternative measures have recently been introduced (Karplus and Hu, 2001), CS has the advantage of being one of the most widely used and the simplest method available today.

Fig. 1. Tuning of Brad, the bubble radius, using sub-optimal alignments of two sequences from HOM 9. Each graph represents the correlation between CS and APDB for 4 different bubble radius values (Brad of 6, 8, 10 and 12 Å). In each graph, each dot represents a sub-optimal alignment from HOM 9, sampled from the genetic algorithm.
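For readers who prefer code to formulas, the APDB score of a pair of structures (equations (1)-(4) above) can be sketched as follows. This is a minimal illustration, not the APDB package itself: it takes plain Cα coordinate lists rather than PDB files, and all function and variable names are mine.

```python
from math import dist  # Euclidean distance, Python >= 3.8

def apdb_score(coords_a, coords_b, pairs, brad=10.0, t1=1.0, t2=70.0):
    """Sketch of the APDB score for two aligned structures.

    coords_a, coords_b: lists of C-alpha (x, y, z) tuples, one per residue.
    pairs: (i, j) index pairs aligned between structure A and structure B.
    Returns the fraction of aligned pairs judged correct (equation 4).
    """
    aligned = 0
    for x, y in pairs:
        considered = correct = 0
        for z, w in pairs:
            if (z, w) == (x, y):
                continue
            dxz = dist(coords_a[x], coords_a[z])
            dyw = dist(coords_b[y], coords_b[w])
            if dxz < brad and dyw < brad:        # equation (1): Z:W is inside the bubble
                considered += 1
                if abs(dxz - dyw) < t1:          # equation (2): the two distances agree
                    correct += 1
        if considered and 100.0 * correct / considered > t2:   # equation (3)
            aligned += 1
    return aligned / len(pairs)                  # equation (4)
```

On two identical toy structures every pair is supported by its whole bubble and the score is 1.0; displacing a single residue in one structure only removes that residue's pair from the count, which illustrates the local character of the measure.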
Generation of multiple alignments

We compared the performance of APDB on two different multiple alignment methods. We tested the widely used ClustalW version 1.80 (Thompson et al., 1994). We also tested the more recent T-Coffee version 1.37 (Notredame et al., 2000), using default parameters.

Generation of sub-optimal alignments

In order to evaluate the sensitivity of APDB to the quality of an alignment, we used an improved version of the genetic algorithm SAGA (Notredame and Higgins, 1996) to generate populations of sub-optimal alignments. In each case, a pair of sequences was chosen in HOM 9 and 50 random alignments were generated and allowed to evolve within SAGA so that their quality gradually improved (as measured by their similarity with the HOMSTRAD reference alignment). Ten alignments were sampled at each generation in order to build a collection of alternative alignments with varying degrees of quality. The algorithm was stopped when optimality was reached, typically yielding collections of a few hundred alignments.

RESULTS AND DISCUSSION

Fine tuning of APDB

Three parameters control the behaviour of APDB: Brad (the bubble radius), T1 (the difference-of-distance threshold) and T2 (the fraction of the bubble neighbourhood that must support the alignment of two residues). We exhaustively studied the tuning effect of each of these parameters using HOM 9 and parameterised APDB so that its behaviour is as consistent as possible with the behaviour of CS on HOM 9. In Figure 1 we show the relationship between CS and APDB for 250 sub-optimal alignments generated by genetic algorithm for one of the 9 test cases from HOM 9, over 4 different settings of Brad, the bubble radius. While the two scoring schemes are in broad agreement, the correlation improves dramatically as Brad increases. This trend can be summarised using the correlation coefficient measured on each of the graphs similar to those shown in Figure 1.
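The correlation coefficient used to summarise each graph can be computed as, for example, the standard Pearson coefficient between the CS and APDB scores of a population of sub-optimal alignments (the paper does not name the exact coefficient, so Pearson is my assumption here, and the score lists below are invented for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists.

    No special handling for constant lists (zero variance) in this sketch.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# One point of Figure 2: correlate the CS score of every sub-optimal alignment
# with its APDB score under a given bubble radius (made-up numbers).
cs_scores = [20.0, 35.0, 50.0, 65.0, 80.0, 95.0]
apdb_scores = [18.0, 30.0, 48.0, 60.0, 75.0, 90.0]
r = pearson(cs_scores, apdb_scores)
```

Repeating this computation for each Brad setting and each test case yields one dot per (test set, Brad) combination, as plotted in Figure 2.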
The overall results for all nine HOM 9 test cases are shown in Figure 2. These results clearly show that the behaviour of APDB is best for values of Brad of 10 Å or above. With these values, the level of correlation between CS and APDB increases, and so does the agreement across all 9 test cases. We chose 10 Å as the default value in order to ensure a proper behaviour while retaining as much as possible the local character of the measure. Given the default value of 10 Å for Brad, we examined T1 and T2 in a similar fashion and found the most appropriate values to be 1 Å for T1 and 70% for T2.

Fig. 2. Correlation between the Column Score measure (CS) and APDB on HOM 9. Each HOM 9 test set is labelled according to its average percent sequence identity as measured on the HOMSTRAD reference. The horizontal axis indicates the value of Brad. The vertical axis indicates the correlation coefficient between CS and APDB as measured on a population of sub-optimal alignments similar to the ones in Figure 1. Each dot indicates a correlation coefficient measured on one HOM 9 test set, using the indicated value of Brad. Each HOM 9 test set is an alignment between two sequences whose PDB names are as follows: 17: 2gar versus 1fmt, 18: 1jfl versus 1b74, 33: 1isi versus 11be, 43: 2cev versus 1d3v, 52: 1aq0 versus 1ghs, 63: 2gnk versus 2pii, 71: 1hcz versus 1cfm, 82: 1dvg versus 1qq8, 89: 1k25 versus 1qme.

Sensitivity of APDB to sequence and structure similarity

It is important to verify that the behaviour of APDB remains consistent across a wide range of sequence similarity levels. It is especially important to make sure that when two different alignments of the same sequences are evaluated, the best one (as judged by comparison with the HOMSTRAD reference) always gets the best APDB score. In order to check for this, we used the genetic algorithm to generate sub-optimal alignments for each test case in HOM 9.
In each case, we gathered a collection of 250 sub-optimal alignments with CS scores of 0–40%, 41–60%, 61–80% and 81–100%. The CS score measures the agreement between an alignment and its reference in HOMSTRAD. We then measured the average APDB score in each of these collections. Each of these measures corresponds to a dot in Figure 3, where vertically aligned series of dots correspond to different measures made on the same HOM 9 test set. Figure 3 clearly shows that regardless of the percent identity within the HOM 9 test set being considered, alignments with better CS scores always correspond to better APDB scores (as a result, the lines never cross one another in Figure 3). We did a similar analysis using the RMSD as measured on the HOMSTRAD alignment in place of sequence identity. The behaviour was the same and clearly indicates that APDB gives consistent results regardless of the structural similarity between the structures being considered.

Suitability of APDB for analysing sub-optimal alignments

Collections of sub-optimal alignments for each of the nine HOM 9 test sets were generated using SAGA and evaluated for their CS and APDB scores. These results were pooled and are displayed on the graph shown in Figure 4. This figure indicates good agreement between the CS and the APDB scores, regardless of the level of optimality of the alignment being considered. It suggests that APDB is informative over the complete range of CS values. It also confirms that APDB is not 'too generous' with sub-optimal alignments. We also checked whether sequence alignments that are structurally equivalent obtain similar APDB scores even if they are different at the sequence level. For this purpose, we used PROSUP (Lackner et al., 2000). Given a pair of structures, PROSUP generates several alignments that are equally good from a structural point of view (similar RMSD), but can be very different at the sequence level (different Column Score).
We manually identified two such test sets in HOMSTRAD and the results are summarized in Table 1. For each of these two test sets, we selected in the output of PROSUP two alignments (aln1 and aln2) to which PROSUP assigns similar RMSDs. aln1 is used as a reference and therefore gets a CS score of 100, while the CS score of the second alignment (aln2) is computed by direct comparison with its aln1 counterpart.

Fig. 3. Estimation of the sensitivity of APDB to sequence identity. On this graph, each set of vertically aligned dots corresponds to a single HOM 9 test set. The 9 HOM 9 test sets are arranged according to their average identity (17–89%, see Figure 2). Each dot represents the average APDB score of a population of 250 sub-optimal alignments (generated by genetic algorithm) with a similar CS score (binned in four groups representing CS of <40%, 41–60%, 61–80% and 81–100%) generated for one of the 9 HOM 9 test sets.

Table 1. Evaluating PROSUP sub-optimal alignments with APDB

Set  St1    St2    ALN   RMSD     CS     APDB
1    1e96B  1a17   aln1  1.45 Å   100.0  80.2
1    1e96B  1a17   aln2  1.50 Å   65.6   80.7
2    1cd8   1qfpa  aln1  2.95 Å   100.0  18.7
2    1cd8   1qfpa  aln2  2.95 Å   55.1   17.9

Set indicates the test set index, St1 and St2 indicate the two structures being aligned by PROSUP, ALN indicates the alignment being considered, RMSD shows the RMSD associated with this alignment, and CS indicates its CS score, with the CS score of aln1 alignments being set to 100 because they are used as references. APDB indicates the APDB score.

Fig. 4. Correlation between CS and APDB on the complete HOM 9 test set. Each dot corresponds to a sub-optimal alignment of one of the HOM 9 test cases, generated by genetic algorithm. For each alignment, the graph plots the APDB score against its CS counterpart.

In both test sets, using aln1 as a reference for the CS measure leads to the conclusion that aln2 is mostly incorrect (cf. the CS column of Table 1). This is not true, since these alignments are structurally equivalent, as indicated by their RMSDs. In such a situation, APDB behaves much more appropriately and gives each aln1/aln2 couple scores that are consistent with their RMSDs, indicating that APDB can equally well reward two sub-optimal alignments when these are equivalent from a structural point of view.

Using APDB to benchmark alignment methods

Table 2 shows the average CS and APDB scores for the test sets in each of the four Bali 91 categories being considered here and in HOM 43. The highest scores in all cases, for both measures, come from the reference column (the last column). This is desirable, providing the reference alignments really are consistent with the underlying structures. If we now compare the columns two by two, we find that every variation of CS from one column to another agrees with the corresponding variation of APDB. For instance, in row 1 (Bali 91 Ref1), when T-Coffee/CS is lower than ClustalW/CS, T-Coffee/APDB is also lower. This observation is true for the whole table, regardless of the pair of results being considered. When considering the 134 alignments one by one, this observation remains true in more than 70% of the cases.

Table 2. Correlation between APDB and CS on BAliBASE and HOMSTRAD

Test set  N   ClustalW CS  ClustalW APDB  T-Coffee CS  T-Coffee APDB  Reference CS  Reference APDB
B91 R1    35  70.1         59.9           67.7         58.3           100           64.7
B91 R2    23  32.7         26.6           33.9         47.1           100           55.2
B91 R3    22  46.4         38.5           48.6         46.9           100           53.2
B91 R4    11  52.0         59.5           52.5         64.5           100           65.7
H43       43  35.4         60.2           38.9         61.6           100           72.9

Test set indicates the test set being considered, either one of the BaliBase 91 references (B91 R#) or HOM 43 (H43), a subset of HOMSTRAD. N indicates the number of test alignments in this category. ClustalW indicates a set of measures made on alignments generated with ClustalW. T-Coffee indicates similar measures made on T-Coffee generated alignments. Reference indicates measures made on the reference alignments as provided in BAliBASE or in HOMSTRAD. CS columns are the Column Score measures, while APDB columns indicate similar measures made using APDB.

CONCLUSION

This work introduces APDB, a novel method that makes it possible to evaluate the quality of a sequence alignment when two or more tertiary structures of the sequences it contains are available. This method does not require a reference alignment and it does not depend on any complex procedure such as structure superposition or sequence alignment. We show here that APDB sensitivity is comparable with that of CS, a well-established measure that compares a target alignment with a reference alignment. Our results also indicate that APDB can discriminate better than CS between structurally correct sub-optimal sequence alignments and structurally incorrect sequence alignments, even when the structures being considered are distantly related. Apart from the cost associated with their assembly, a serious problem with reference alignments is that they need to be annotated to remove from the evaluation regions that correspond to non-superposable portions of the structures. This is necessary because otherwise these regions (whose alignment cannot be trusted) would bias a CS evaluation toward rewarding the arbitrary alignment conformation displayed in the reference. Table 2 illustrates well the fact that such an annotation is not necessary in APDB. In our measure, thanks to the combination of local evaluation and the absence of a reference alignment, the only possible effect of non-superposable regions is to decrease the proportion of residues found aligned in a structurally optimal sequence alignment, thus yielding scores lower than 100 in the case of distantly related structures.

A key advantage of APDB is its simplicity. It only requires three parameters and a few PDB files. Most importantly, APDB does not require any arbitrary manual intervention such as the assembly of a reference alignment. In the short term, all the existing collections of reference alignments could easily be integrated and extended with APDB. In the longer term, APDB could also be used to evaluate and compare existing collections of alignments such as profiles, when structures are available.

REFERENCES

Barton,G.J. and Sternberg,M.J.E. (1987) A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol., 198, 327–337.
Cristobal,S., Zemla,A., Fischer,D., Rychlewski,L. and Elofsson,A. (2001) A study of quality measures for protein threading models. BMC Bioinformatics, 2, 5.
Duret,L. and Abdeddaim,S. (2000) Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. In Higgins,D. and Taylor,W. (eds), Bioinformatics, Sequence, Structure and Databanks. Oxford University Press, Oxford.
Karplus,K. and Hu,B. (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 17, 713–720.
Lackner,P., Koppensteiner,W.A., Sippl,M.J. and Domingues,F.S. (2000) ProSup: a refined tool for protein structure alignment. Protein Eng., 13, 745–752.
Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469–2471.
Notredame,C. (2002) Recent progress in multiple sequence alignments. Pharmacogenomics, 3, 131–144.
Notredame,C. and Higgins,D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24, 1515–1524.
Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel algorithm for multiple sequence alignment. J. Mol. Biol., 302, 205–217.
Pascarella,S. and Argos,P. (1992) A data bank merging related protein structures and sequences. Protein Eng., 5, 121–137.
Siew,N., Elofsson,A., Rychlewski,L. and Fischer,D. (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776–785.
Thompson,J., Higgins,D. and Gibson,T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690.
Thompson,J., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics, 15, 87–88.

BIOINFORMATICS APPLICATIONS NOTE  Sequence analysis  Vol. 22 no. 19 2006, pages 2439–2440  doi:10.1093/bioinformatics/btl404

APDB: a web server to evaluate the accuracy of sequence alignments using structural information

Fabrice Armougom1, Olivier Poirot1, Sébastien Moretti1, Desmond G. Higgins2, Phillip Bucher3, Vladimir Keduas1 and Cédric Notredame1,*

1CNRS UPR2589, Institute for Structural Biology and Microbiology (IBSM), Parc Scientifique de Luminy, 163 Avenue de Luminy, FR-13288, Marseille cedex 09, France, 2The Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Ireland and 3Institut Suisse de Recherche et d'Expérimentation sur le Cancer, Ch. des Boveresses 155, CH-1066 Epalinges, Switzerland

Received on March 31, 2006; revised on June 17, 2006; accepted on July 21, 2006
Associate Editor: Thomas Lengauer

ABSTRACT
Summary: The APDB web server uses structural information to evaluate the alignment of sequences with known structures. It returns a score correlated with the overall alignment accuracy, as well as a local evaluation. Any sequence alignment can be analyzed with APDB provided it includes at least two proteins with known structures.
Sequences without a known structure are simply ignored and do not contribute to the scoring procedure.
Availability: APDB is part of the T-Coffee suite of tools for alignment analysis; it is available on www.tcoffee.org. A standalone version of the package is also available as free open-source software from the same address.
Contact: cedric.notredame@gmail.com

1 INTRODUCTION

Structure-based sequence alignments have become a key component in the design and the improvement of sequence alignment methods. The extensive usage of structural information to align sequences results mostly from the observation that 3D folds evolve more slowly than primary sequences (Chothia and Lesk, 1986) and can be used to derive accurate sequence alignments, even when the sequences themselves have diverged beyond recognition. This property is often used to compute sequence alignments of remote homologues or to assemble collections of reference alignments that are used as gold standards to validate, benchmark and improve sequence alignment methods (Edgar, 2004; Thompson et al., 2005; Van Walle et al., 2005). Detailed analysis, however, shows that structure alignment methods often disagree on distantly related proteins (Kolodny et al., 2005). For instance, the alignments delivered by CE (Shindyalov and Bourne, 1998) and DALI (Holm and Sander, 1996) only agree on 70% of the positions, as judged on the 1682 pairs of homologous structures contained in the Prefab database (Edgar, 2004). These variations probably explain why established collections of structure-based alignments are sometimes inconsistent with one another. In previous work, we argued that it may sometimes be more reliable to evaluate sequence alignments by directly using the structural information they are associated with, rather than depending on an intermediate reference alignment that constitutes a potentially distorted interpretation of the original structural signal.
In an attempt to provide such a direct measure, we developed an algorithm named APDB (Analyze PDB) (O'Sullivan et al., 2003) that uses the structural information independently of any structural alignment or superposition. APDB relies on the simple observation that if two similar structures are aligned correctly, the intramolecular distances between equivalent Cα (as defined by the alignment) must be similar. By simply measuring the geometric compatibility of two structures according to their alignment, APDB makes no reference to any authoritative alignment and is therefore independent of any kind of optimization, unlike similar methods such as MaxSub (Siew et al., 2000), LSQman (Kleywegt and Jones, 1999) or TM-score (Zhang and Skolnick, 2004). This also makes APDB suitable for comparing alternative alignments of the same sequences, as long as the corresponding structures are available.

*To whom correspondence should be addressed.

2 USING THE APDB SERVER

The server is available on http://www.tcoffee.org/. It only makes sense to use the APDB server if the alignment one wishes to evaluate contains at least two sequences with a known structure. These sequences must be named according to their structure PDB identifier (with the chain index appended if appropriate). Whenever there is an imperfect match between the user's sequence and the PDB sequence, the program makes an automatic alignment-based reconciliation. This process explicitly fails when the sequences are less than 80% identical. Sequences without a known structure are simply ignored and do not contribute to the scoring procedure. The server returns the overall APDB scores and a color-coded alignment with local APDB scores (Fig. 1). The overall APDB score is an estimation of the fraction of columns likely to be correctly aligned within a pairwise structural alignment, and the color code estimates the potential correctness of each individual alignment position (red: very likely; orange: possible; green/blue: unlikely).

Fig. 1. Output of a standard APDB computation. The 1cvl_1tca Prefab dataset has been aligned with Muscle (Edgar, 2004) (a) and ClustalW (Thompson et al., 1994) (b). The resulting alignments were evaluated with the APDB server and the figure displays the local APDB score. Sequences in red and orange are considered correctly aligned by APDB.

Figure 1 shows the evaluation of two alternative alignments of the same structures. The first one was produced by Muscle (3.52) and is estimated to be 43.8% accurate as compared with its Prefab reference (Edgar, 2004). The second one, produced by ClustalW (1.81), is expected to have an accuracy of 55.7% according to the Prefab criterion. The score returned by APDB is in broad agreement with these figures (ClustalW APDB: 50.3%, Muscle APDB: 47.5%).

ACKNOWLEDGEMENTS

The authors thank Prof. Jean-Michel Claverie (head of IGS) for material support. The development of the server was supported by CNRS (Centre National de la Recherche Scientifique), Sanofi-Aventis Pharma SA, Marseille-Nice Génopole and the French National Genomic Network (RNG).

Conflict of Interest: none declared.

REFERENCES

Chothia,C. and Lesk,A.M. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J., 5, 823–826.
Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 283, 595–602.
Kleywegt,G.J. and Jones,T.A. (1999) Software for handling macromolecular envelopes. Acta Crystallogr. D Biol. Crystallogr., 55, 941–944.
Kolodny,R. et al. (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol., 346, 1173–1188.
O'Sullivan,O. et al. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19 (Suppl. 1), i215–i221.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Siew,N. et al. (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776–785.
Thompson,J. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690.
Thompson,J.D. et al. (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins, 61, 127–136.
Van Walle,I. et al. (2005) SABmark—a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21, 1267–1268.
Zhang,J. and Skolnick,J. (2004) Scoring function for automated assessment of protein structure template quality. Proteins, 57, 702–710.

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

BIOINFORMATICS  Vol. 14 no. 5 1998  Pages 407–422

COFFEE: an objective function for multiple sequence alignments

Cédric Notredame1, Liisa Holm1 and Desmond G. Higgins2

1EMBL Outstation – The European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1SD, UK and 2Department of Biochemistry, University College, Cork, Ireland

Received on January 19, 1998; revised and accepted on February 24, 1998

Abstract

Motivation: In order to increase the accuracy of multiple sequence alignments, we designed a new strategy for optimizing multiple sequence alignments by genetic algorithm. We named it COFFEE (Consistency based Objective Function For alignmEnt Evaluation). The COFFEE score reflects the level of consistency between a multiple sequence alignment and a library containing pairwise alignments of the same sequences.
Results: We show that multiple sequence alignments can be optimized for their COFFEE score with the genetic algorithm package SAGA. The COFFEE function is tested on 11 test cases made of structural alignments extracted from 3D_ali. These alignments are compared to those produced using five alternative methods. Results indicate that COFFEE outperforms the other methods when the level of identity between the sequences is low. Accuracy is evaluated by comparison with the structural alignments used as references. We also show that the COFFEE score can be used as a reliability index for multiple sequence alignments. Finally, we show that given a library of structure-based pairwise sequence alignments extracted from FSSP, SAGA can produce high-quality multiple sequence alignments. The main advantage of COFFEE is its flexibility. With COFFEE, any method suitable for making pairwise alignments can be extended to making multiple alignments.
Availability: The package is available along with the test cases through the WWW: http://www.ebi.ac.uk/~cedric
Contact: cedric.notredame@ebi.ac.uk

Introduction

Multiple alignments are among the most important tools for analysing biological sequences. They can be useful for structure prediction, phylogenetic analysis, function prediction and polymerase chain reaction (PCR) primer design. Unfortunately, accurate multiple alignments may be difficult to build. There are two main reasons for this. First of all, it is difficult to evaluate the quality of a multiple alignment. Secondly, even when a function is available for the evaluation, it is algorithmically very hard to produce the alignment having the best possible score (the optimal alignment). Cost functions or scoring functions roughly fall into two categories. First of all, there are those that rely on a substitution matrix. These are the most widely used.
They require a substitution matrix (Dayhoff, 1978; Henikoff and Henikoff, 1992) that gives a score to each possible amino acid substitution, a set of gap penalties that gives a cost to deletions/insertions (Altschul, 1989), and a set of sequence weights (Altschul et al., 1989; Thompson et al., 1994b). Under this scheme, an optimal multiple alignment is defined as the one having the lowest cost for substitutions and insertions/deletions. One of the most widely used scoring methods of this type is the ‘weighted sums of pairs with affine (or semi-affine) gap penalties’ (Altschul and Erickson, 1986). The main limitation of these scoring schemes is that they rely on very general substitution matrices, usually established by statistical analysis of a large number of alignments. These may not necessarily be adapted to the set of sequences one is interested in. To compensate for this drawback, a second type of scoring scheme was designed: profiles (Gribskov et al., 1987) and Hidden Markov Models (HMMs) (Krogh and Mitchison, 1995). Profiles allow the design of a sequence-specific scoring scheme that will take into account patterns of conservation and substitution characteristic of each position in the multiple alignment of a given family. To some extent, HMMs can be regarded as generalized profiles (Bucher and Hofmann, 1996). In HMMs, sequences are used to generate statistical models. The sequences of interest are then aligned to the model one after another to generate the multiple sequence alignment. The main drawback of HMMs is that to be general enough, the models require large numbers of sequences. However, this can be partially overcome by incorporating in the model some extra information such as Dirichlet mixtures (the equivalent of a substitution matrix in an HMM context) (Sjolander et al., 1996). Whatever scoring scheme one wishes to use, the optimization problem may be difficult.
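As a concrete, deliberately tiny illustration of the first category of scoring scheme, the sums-of-pairs score of a single alignment column can be sketched as follows. The substitution values and gap penalty are invented for the example, not taken from Dayhoff or BLOSUM, and only linear (not affine) gaps are shown:

```python
# Minimal sketch of a sums-of-pairs score for one alignment column.
# The toy substitution "matrix" and gap penalty are illustrative values only.

SUBST = {("A", "A"): 4, ("A", "G"): 0, ("G", "A"): 0, ("G", "G"): 6}
GAP_PENALTY = -4  # cost charged once per residue/gap pair (linear gaps)

def sum_of_pairs(column, weights=None):
    """Score one alignment column as the (weighted) sum over all residue pairs."""
    n = len(column)
    if weights is None:
        weights = [[1.0] * n for _ in range(n)]  # unweighted sums of pairs
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            a, b = column[i], column[j]
            if a == "-" and b == "-":
                continue  # gap/gap pairs are conventionally ignored
            elif a == "-" or b == "-":
                total += weights[i][j] * GAP_PENALTY
            else:
                total += weights[i][j] * SUBST[(a, b)]
    return total
```

The `weights` argument stands in for the sequence-weighting schemes cited above; with all weights equal to 1 the function reduces to the plain sums of pairs.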
There are two types of optimization strategies: the greedy ones that rely on pairwise alignments and those that attempt to align all the sequences simultaneously. The main tool for making pairwise alignments is an algorithm known as dynamic programming (Needleman and Wunsch, 1970), often used for optimizing the sums of pairs. The complexity of the algorithm makes it hard to apply to more than two sequences (or two alignments) at a time. Nevertheless, it allows greedy progressive alignments as described by Feng and Doolittle (1987) or Taylor (1988). In such a case, the sequences are aligned in an order imposed by some estimated phylogenetic tree. The alignment is called progressive because it starts by aligning together closely related sequences and continues by aligning these alignments two by two until the complete multiple alignment is built. Some of the most widely used multiple sequence alignment packages like ClustalW (Thompson et al., 1994a), Multal (Taylor, 1988) and Pileup (Higgins and Sharp, 1988) are based on this algorithm. They have the advantage of being fast and simple, as well as reasonably sensitive. Their main drawback is that mistakes made at the beginning of the procedure are never corrected and can lead to misalignments due to the greediness of the strategy. It is to avoid this pitfall that the second type of methods have been designed. They mostly involve aligning all the sequences simultaneously. For the sums of pairs, this is a difficult problem that has been shown to be NP-complete (Wang and Jiang, 1994). However, using the Carrillo and Lipman (1988) algorithm implemented in the Multiple Sequence Alignment program MSA (Lipman et al., 1989), one can simultaneously align up to 10 sequences.
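The pairwise dynamic programming step underlying these progressive methods can be sketched for two sequences as follows. The match/mismatch scores and linear gap penalty are illustrative; real packages use substitution matrices and affine gaps:

```python
# Minimal Needleman-Wunsch global alignment score (illustrative parameters).

def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    n, m = len(s), len(t)
    # F[i][j] = best score aligning the prefix s[:i] with the prefix t[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # substitution
                          F[i - 1][j] + gap,      # gap in t
                          F[i][j - 1] + gap)      # gap in s
    return F[n][m]
```

A traceback through `F` (omitted here for brevity) recovers the alignment itself; progressive methods apply this same recursion to profiles instead of single sequences.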
Other global alignment techniques using the sums of pairs cost function involve the use of stochastic heuristics such as simulated annealing (Ishikawa et al., 1993a; Godzik and Skolnick, 1994; Kim et al., 1994), genetic algorithms (Ishikawa et al., 1993b; Notredame and Higgins, 1996) or iterative methods (Gotoh, 1996). Simulated annealing can also be used to optimize HMMs (Eddy, 1995). The stochastic methods have two main advantages over the deterministic ones. First of all, they have a lower complexity. This means that they do not have strong limitations on the number of sequences to align or on the length of these sequences. Secondly, these methods are more flexible regarding the objective function they can use. For instance, MSA is restricted to an approximation of the sums of pairs using semi-affine gap penalties (Lipman et al., 1989) instead of the natural ones shown to be biologically more realistic (Altschul, 1989). This is not the case with simulated annealing (Kim et al., 1994). The main drawback of stochastic methods is that they do not guarantee optimality. However, in some previous work, we showed that with the Sequence Alignment Genetic Algorithm (SAGA), results similar to MSA could be obtained (Notredame and Higgins, 1996). We also showed that the package was able to handle test cases with sizes much beyond the scope of MSA. The robustness of SAGA as an optimizer was confirmed by results obtained on a different objective function for RNA alignment (Notredame et al., 1997) and motivated our choice to use SAGA for optimizing the new objective function described here. The main argument for aligning all the sequences simultaneously instead of making a greedy progressive alignment is that using all the available information should improve the final result. However, one limitation of such methods is that regions of low similarity may induce some noise that will weaken the signal of the correct alignment (Morgenstern et al., 1996).
In order to avoid this, one would like a scheme that filters some of the initial information and allows its global use. The approach we propose here is an attempt to do so. The underlying principle is to generate a set of pairwise alignments and look for consistency among these alignments. In this case, we define the optimal multiple alignment as the most consistent one and produce it using the SAGA package. The idea of using the consistency information in a multiple sequence alignment context is not new (Gotoh, 1990; Vingron and Argos, 1991; Kececioglu, 1993). In his scheme, Gotoh (1990) proposed the identification of regions that are fully consistent among all the pairwise alignments. These regions are used as anchor points in the multiple alignment, in order to decrease complexity. A similar strategy was described by Vingron and Argos (1991), allowing the computation of a multiple alignment from a set of dot matrices. Although very interesting, these methods had several pitfalls, including a sensitivity to noise (especially when some sequences are highly inconsistent with the rest of the set) and a high computational complexity. The work of Kececioglu (1993) bears a stronger similarity to the method we propose here. Kececioglu directly addressed the problem of finding a multiple alignment that has the highest level of similarity with a collection of pairwise alignments. Such an alignment is named ‘maximum weight trace alignment’ (MWT), and its computation was shown to be NP-complete. An optimization method was also described, based on dynamic programming and limited to a small number of sequences (six maximum). More recently, a method was described that allows the construction of a multiple alignment using consistent motifs identified over the whole set of sequences by a variation of the dynamic programming algorithm (Morgenstern et al., 1996). 
This algorithm should be less sensitive to noise than the one described by Vingron and Argos, but its main drawback is that it does rely on a greedy strategy for assembling the multiple alignment. An important aspect of multiple sequence alignment often overlooked is the estimation of reliability. Since all the alignment scoring functions available are known to be intrinsically inaccurate, identifying the biologically relevant portions of a multiple alignment may be more important than increasing the overall accuracy of this alignment. A few techniques have been proposed to identify accurately aligned positions in pairwise (Vingron and Argos, 1990; Mevissen and Vingron, 1996) and multiple sequence alignments (Gotoh, 1990; Rost, 1997). We show here that our method allows a reasonable estimation of a multiple alignment's local reliability. The measure we use for reliability is in fact very simple and could easily be extended much further to incorporate other methods such as the one described by Mevissen and Vingron (1996).

Methods

The overall approach relies on the definition of an objective function (OF) describing the quality of multiple protein sequence alignments. Given a set of sequences and an ‘all-against-all’ collection of pairwise alignments of these sequences (library), the score of a multiple sequence alignment is defined as the measure of its consistency with the library. This objective function was optimized with the SAGA package. Sets of sequences with a known structure and for which a multiple structural alignment is available were extracted from the 3D_ali database (Pascarella and Argos, 1992) and used in order to validate the biological relevance of the new objective function. Two other test cases were designed using the DALI server (Holm and Sander, 1996a) and aligned using libraries made of structural pairwise alignments extracted from the FSSP database (Holm and Sander, 1993).
Objective function

The OF is a measure of quality for multiple sequence alignments. Ideally, the better its score, the more biologically relevant the multiple alignment. The method proposed here requires two components: (i) a set of pairwise reference alignments (library), and (ii) the OF that evaluates the consistency between a multiple alignment and the pairwise alignments contained in the library. We named this objective function COFFEE (Consistency based Objective Function For alignmEnt Evaluation).

Creation of the library

A library is specific for a given set of sequences and is made of pairwise alignments. Taken together, these alignments should contain at least enough information to define a multiple alignment of the sequences in the set. In practice, given a set of N sequences, we included in the library a pairwise alignment for each of the (N² − N)/2 possible pairs of sequences. This choice is arbitrary since, in theory, there is no limit regarding the amount of redundancy one can incorporate into a library. For instance, instead of each pair of sequences being represented by a single pairwise alignment, one could use several alternative alignments of this pair, obtained by various methods. In fact, the library is mostly an interface between any method one can invent for generating pairwise alignments and the COFFEE function optimized by SAGA. However, the method follows the rule ‘garbage in/garbage out’, and the overall properties of the COFFEE function will most likely reflect the properties of the method used to build the library. The amount of time it takes to build the library depends on the alignment method used and increases quadratically with the number of sequences. Inside the evaluation algorithm, the library is stored in a look-up table. If each pair of sequences is represented only once, the amount of memory required for the storage increases quadratically with the number of sequences and linearly with the length of the alignments.
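A minimal sketch of such a library as a look-up table follows. The helper names and data layout are our own illustration, and `align_pair` stands for any pairwise method (ClustalW, a structure superposition, etc.) returning two gapped rows:

```python
from itertools import combinations

def aligned_residue_pairs(row_i, row_j):
    """Extract (pos_in_i, pos_in_j) pairs from one gapped pairwise alignment;
    positions are 0-based indices into the ungapped sequences."""
    pairs, x, y = set(), 0, 0
    for a, b in zip(row_i, row_j):
        if a != "-" and b != "-":
            pairs.add((x, y))
        x += a != "-"
        y += b != "-"
    return pairs

def build_library(sequences, align_pair):
    """One entry per pair of sequences, i.e. (N^2 - N)/2 pairwise alignments,
    stored as a look-up table keyed by the pair of sequence indices."""
    library = {}
    for i, j in combinations(range(len(sequences)), 2):
        row_i, row_j = align_pair(sequences[i], sequences[j])
        library[(i, j)] = aligned_residue_pairs(row_i, row_j)
    return library
```

Storing each alignment as a set of residue-index pairs makes the later consistency checks simple set intersections.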
For the analyses presented here, two types of libraries were built. The first one relies on ClustalW. Given a set of N sequences, each possible pair of sequences was aligned using ClustalW with default parameters. The collection of output files obtained that way constitutes the library (ClustalW library). The motivation for using ClustalW as a pairwise method stems from the fact that Clustal uses local gap penalties, even for two sequences. In order to show that COFFEE is not dependent on the method used to construct the library, a second category of library was created using the FSSP database (Holm and Sander, 1996b). FSSP is a database containing all the PDB structures aligned with one another in a pairwise manner. For each test case, a set of sequences was chosen and the (N² − N)/2 pairwise structure alignments involving these sequences were extracted from the FSSP database to construct an FSSP library. We also used as references the multiple alignments contained in FSSP. An FSSP entry is always based around a guide structure to which all the other structures are aligned in a pairwise manner. This collection of pairwise alignments can be regarded as a pairwise-based multiple alignment. This means that if one is interested in a set of N protein structures, FSSP contains the N corresponding pairwise-based multiple alignments, each using one structure of the set as a guide. Generally speaking, these N multiple alignments do not have to be consistent with one another, but only with the subset of the pairwise alignments that was used to produce them.

Evaluation procedure: the COFFEE function

Let us assume an alignment of N sequences and an appropriate library built for this set. Evaluation is made by comparing each pair of aligned residues (i.e. two residues aligned with each other or a residue aligned with a gap) observed in the multiple alignment to those present in the library (Figure 1).
In such a comparison, residues are identified by their position in the sequence (gaps are not distinguished from one another). In the simplest scheme, the overall consistency score is equal to the number of pairs of residues present in the multiple alignment that are also found in the library, divided by the total number of pairs observed in the multiple sequence alignment. This measure gives an overall score between 0 and 1. The maximum value a multiple alignment can have depends on the library. For the optimal score to be 1, all the alignments in the library need to be compatible with one another (e.g. when all the pairwise alignments have been extracted from the same multiple sequence alignment or when the sequences are almost identical). In practice, this scheme needs extra readjustments to incorporate some important properties of the sequence set. For instance, the significance of the information content of each pairwise alignment is not identical. Several schemes have been described in the literature for weighting sequences according to the amount of information they bring to a multiple alignment (Altschul et al., 1989; Sibbald and Argos, 1990; Vingron and Sibbald, 1993; Thompson et al., 1994a). In COFFEE, our main concern was to decrease the amount of noise produced by inaccurate pairwise alignments in the library. To do so, each pairwise alignment in the library is given a weight that is a function of its quality. For this purpose, we used a very simple criterion: the weight of a pairwise alignment is equal to the per cent identity between the two aligned sequences in the library. This may seem counterintuitive, since weighting schemes are normally used in order to decrease the amount of redundancy in a set of sequences (i.e. down-weighting sequences that have a lot of close relatives).
Doing so makes sense in the context of profile searches (Gribskov et al., 1987; Thompson et al., 1994b), where it is important to prevent domination of the profile by a given subfamily. However, in the case of multiple sequence alignments made by global optimization, it is more important to make sure that closely related pairs of sequences are correctly aligned, regardless of the background noise introduced by other, less related sequences. In such a context, a weight can be regarded as a constraint. The consequence is that the alignment of a given sequence will mostly be influenced by its closest relatives. On the other hand, if a sequence lacks any really close relative, its alignment will mostly be influenced by the consistency of its pairwise alignments with the rest of the library.

The COFFEE function can be formalized as follows. Given N aligned sequences S_1 ... S_N in a multiple alignment, A_{i,j} is the pairwise projection (obtained from the multiple alignment) of the sequences S_i and S_j, LEN(A_{i,j}) is the length of this alignment, SCORE(A_{i,j}) is the overall consistency (level of identity) between A_{i,j} and the corresponding pairwise alignment in the library, and W_{i,j} is the weight associated with this pairwise alignment. Given these definitions, the COFFEE score is defined as follows:

\[ \text{COFFEE score} = \left[ \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j}\,\text{SCORE}(A_{i,j}) \right] \Big/ \left[ \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j}\,\text{LEN}(A_{i,j}) \right] \tag{1} \]

with:

\[ \text{SCORE}(A_{i,j}) = \text{number of aligned pairs of residues shared between } A_{i,j} \text{ and the library} \tag{2} \]

The COFFEE function presents some similarities with the ‘weighted sums of pairs’ (Altschul and Erickson, 1986). Here as well, we consider all the pairwise substitutions in the multiple alignment, and weight these in a way that reflects the relationships between the sequences. The library plays the role of the substitution matrix.
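The COFFEE score defined in equation (1) can be sketched in a few lines of Python. This is our own illustration, not the SAGA implementation; the dictionary layout is an assumption, and LEN(A_ij) is approximated here by the number of residue-residue pairs in the projection:

```python
def coffee_score(projections, library_pairs, weights):
    """Weighted fraction of residue pairs in the multiple alignment that
    are also found in the library (cf. equation (1)).

    projections[(i, j)]   -> set of residue-position pairs in the pairwise
                             projection A_ij of the multiple alignment
    library_pairs[(i, j)] -> set of residue-position pairs in the library
    weights[(i, j)]       -> W_ij, e.g. per cent identity of the library pair
    """
    num = den = 0.0
    for key, proj in projections.items():
        num += weights[key] * len(proj & library_pairs[key])  # SCORE(A_ij)
        den += weights[key] * len(proj)                       # ~LEN(A_ij)
    return num / den if den else 0.0
```

Because the numerator can never exceed the denominator, the score is normalized between 0 and 1, as stated in the text.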
The main differences between the COFFEE function and the ‘weighted sums of pairs’ are that (i) no extra gap penalties are applied in our scheme, since this information is already contained in the library, (ii) the COFFEE score is normalized by the value of the maximum score (i.e. its value is between 0 and 1) and (iii) the cost of the substitutions is made position dependent, thanks to the library (i.e. two similar pairs of residues will have potentially different scores if the indices of the residues are different). Under this formulation, an alignment having an optimal COFFEE score will be equivalent to an MWT alignment using a ‘pairwise alignment graph’ (Kececioglu, 1993).

The score defined above is a global measure for an entire alignment. It can also be adapted for local evaluation. We have defined two types of local scores: the residue score and the sequence score. The residue score is given below. S_i^x is the residue x in sequence S_i and A_{i,j}^{x,y} is the pair of aligned residues S_i^x and S_j^y in the pairwise alignment A_{i,j}:

\[ \text{residue score}(S_i^x) = \sum_{j=1, j \neq i}^{N} W_{i,j}\,\text{OCCURRENCE}(A_{i,j}^{x,y}) \Big/ \sum_{j=1, j \neq i}^{N} W_{i,j} \tag{3} \]

OCCURRENCE(A_{i,j}^{x,y}) is equal to the number of occurrences of the pair A_{i,j}^{x,y} in the reference library (0 or 1 when using the libraries described here). The sequence score is the natural extension of the residue score. It is defined as the sum of the score of each residue in a sequence divided by the number of residues in that sequence.

Optimizing an alignment for its COFFEE score: SAGA-COFFEE

The aim is to create an alignment having the best possible COFFEE score (optimal alignment). Doing so is a difficult task. The computational complexity of a dynamic programming solution is known to be NP-complete (Kececioglu, 1993). For reasons discussed in the Introduction, we used SAGA V0.93 (Notredame and Higgins, 1996).

Fig. 1. COFFEE scoring scheme. This figure indicates how a column of an ALIGNMENT is evaluated by the COFFEE function using a REFERENCE LIBRARY. Each pair in the alignment is evaluated (SCORE MATRIX). In the score matrix, a pair receives a score of 0 if it does not appear in the library or a score equal to the WEIGHT of the pair of sequences in which it occurs in the PAIRWISE LIBRARY. Since the matrix is symmetrical, the column score is equal to the sum of half of the matrix entries, excluding the main diagonal. This value is divided by the maximum score of each entry (i.e. the sum of the weights contained in the library). The residue score is equal to the sum of the entries contained by one line of the matrix, divided by the sum of the maximum score of these entries.

Table 1. Accuracy of the prediction made on the category of substitution

Test case    Length  Nseq  Proportion   Avg Id.  COFFEE score     Accuracy (H+E) %  Accuracy (ov.) %  CPU time  N.G.
                           (H+E) (%)    (%)      Clustal  SAGA    Clustal  SAGA     Clustal  SAGA     (s)
ac_prot      248     14    57           21       0.48     0.56    39.2     50.2     35.2     45.9     21009     535
binding      500     7     68           31       0.72     0.84    50.0     64.5     50.0     61.7     1003      166
cytc         146     6     43           42       0.84     0.87    89.1     90.7     88.3     86.1     699       259
fniii        136     9     48           17       0.49     0.62    42.0     47.0     35.7     43.6     936       480
gcr          52      8     57           36       0.86     0.89    80.8     83.1     76.7     80.2     91        55
globin       183     17    74           24       0.78     0.80    86.4     85.2     82.1     81.7     28477     222
igb          194     37    53           24       0.63     0.67    74.8     78.1     65.6     69.4     110453    132
lzm          213     6     53           39       0.87     0.87    72.2     72.3     72.3     72.4     256       105
phenyldiox   90      8     67           22       0.59     0.65    58.5     64.7     60.4     61.4     388       110
sbt          331     7     57           61       0.96     0.97    96.9     96.9     93.6     93.6     644       127
s_prot       229     15    51           27       0.69     0.74    62.5     66.6     57.7     61.2     44978     744
ceo          882     7     /            14       /        /       /        /        /        /        13756     882
vjs          1207    8     /            12       /        /       /        /        /        /        43568     1400

Test case: generic name of the test case, taken from 3D_ali for the first 11 (ac_prot: acid proteases, binding: sugar/amino acid binding proteins, cytc: cytochrome c, fniii: fibronectin type III, gcr: crystallins, globin: globins/phycocyanins/collicins, igb: immunoglobulin fold, lzm: lysozymes/lactalbumin, phenyldiox: dihydroxybiphenyl dioxygenase, sbt: subtilisin, s_prot: serine protease fold) and from the DALI server for the last two. ceo includes: 1cbg, 1ceo, 1edg, 1byb, 1ghr, 1xyzA and vjs includes: 1cnv, 1vjs, 1smd, 2aaa, 1pamA, 2amg, 1ctn, 2ebn. Length: length of the reference alignment. Nseq: number of sequences in the alignment. Proportion (H+E): percentage of the substitutions involving E→E or H→H. Avg. Id.: average level of identity between the sequences. COFFEE score: score measured with the COFFEE function using a ClustalW library on the ClustalW or the SAGA alignments. Accuracy (H+E): percentage of the (H+E) substitutions found identical between the SAGA (or ClustalW) alignment and the reference. Accuracy (ov.): percentage of substitutions similar in the SAGA (or ClustalW) alignment and in the reference. CPU time: CPU time in seconds on an Alpha 8300 machine. N.G.: number of generations needed by SAGA to find the solution. The results of the analysis for the two last test cases are presented in Table 6.

SAGA follows the general principles of genetic algorithms as described by Goldberg (1989) and Davis (1991). In SAGA, a population of alignments evolves through recombination, mutation and selection. This process goes on through a series of cycles known as generations. At every generation, the alignments are evaluated for their score (COFFEE). This score is turned into some fitness measure.
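The fitness-proportional ("wheel") selection used in this kind of genetic algorithm can be sketched as follows. This is a generic illustration, not SAGA's code; the function and variable names are ours:

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Roulette-wheel selection: each individual gets a slice of the wheel
    proportional to its fitness, so fitter alignments are only statistically,
    not deterministically, favoured."""
    total = sum(fitnesses)
    spin = rng.uniform(0.0, total)  # spin the wheel once
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if spin <= running:
            return individual
    return population[-1]  # guard against floating-point drift
```

Because selection is stochastic rather than greedy, weaker alignments occasionally survive, which is what keeps the search from collapsing onto the first local minimum it encounters.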
In an analogy with natural selection, the fitter an alignment, the more likely it is to survive and produce offspring. From one generation to the next, some alignments will be kept unchanged (statistically the fittest), others will be randomly modified (mutations), combined with another alignment (cross-over) or will simply die (statistically the least fit). The new population generated that way will once again undergo the same chain of events, until the best scoring alignment cannot be improved for a specific number of generations (typically 100).

Operators play a central role in the GA strategy. They can be subdivided into two categories. First, the cross-overs, which combine the content of two multiple alignments. Thanks to them, and to the pressure of selection, good blocks tend to be merged into the same alignments. On their own, cross-overs cannot create new blocks; this needs to be done by the second category of operators: the mutations. These are specific programs that take an alignment as input and modify it by inserting or moving patterns of gaps. Mutations can be slightly greedy (attempt to make some local optimization) or completely random.

A key concept in the genetic algorithm strategy is that the fitness-based selection is not absolute but statistical. To select an individual, a virtual wheel is created. On this wheel, each individual is given a number of slots proportional to its fitness. To make a selection, the wheel is spun. Therefore, the best individuals are simply more likely to survive, or to be selected for a cross-over or a mutation. This form of selection protects the GA search from excessive greediness, hence preventing it from converging onto the first local minimum encountered during the search. SAGA V0.93 is mostly similar to the Version 0.91 described in Notredame and Higgins (1996).
Most of the changes between Version 0.93 and 0.91 have to do with some improvement in the implementation and the user interface, but do not affect the algorithm itself. To optimize the COFFEE scores, SAGA was run using the default parameters described for SAGA 0.91 in Notredame and Higgins (1996). SAGA was also modified so that it could evaluate any alignment (including a ClustalW alignment) using the COFFEE function.

Test cases

To assess the biological accuracy of the COFFEE function and the efficiency of its optimization by SAGA, 13 test cases were designed. They are based on sequences with known three-dimensional structures, for which a structural alignment is available. This choice was guided by the fact that structure-based alignments are usually biologically more correct than any other alternative, especially when they involve proteins with low sequence similarity. For this reason, we used these structure-based alignments as a standard of truth in our analyses. Eleven test cases were extracted from the 3D_ali Release 2.0 (Pascarella and Argos, 1992). Alignments were selected according to the following criteria: alignments with more than five structures and with a consensus length larger than 50. In 3D_ali, 18 alignments meet this requirement. Among these, we removed those for which ClustalW produces a multiple alignment >95% identical to its structural counterpart (four alignments). We also removed three alignments which were impossible to align accurately using ClustalW or SAGA/COFFEE. These consist of sets of very distantly related sequences with unusually long insertions/deletions (barrel, nbd and virus in 3D_ali). These alignments are beyond the scope of conventional global sequence alignment algorithms. This leaves a total of 11 alignments used in our analyses. Their characteristics are shown in Table 1. The two last test cases were extracted from the FSSP database.
As opposed to the 11 other test cases, they have been specifically designed for making a multiple sequence alignment using a structure-based reference library. This explains their low level of average sequence identity, as can be seen in Table 1 (the two last entries, vjs and ceo).

Evaluation of the COFFEE function accuracy

When evaluating a new OF with SAGA, two main issues are involved: the quality of the optimization and the biological relevance of the optimal alignment. Another aspect involves the comparison of the new OF with already existing methods. The evaluation of the biological relevance of the COFFEE function required the use of some references. The structural alignments described above were used for this purpose. Comparison between a sample alignment and its reference was made following the protocol described in Notredame and Higgins (1996), inspired by the method used by Vogt et al. (1995) for substitution matrix comparison and by Gotoh (1996). All the pairs (excluding gaps) observed in the sample alignment were compared to those present in the reference. The level of similarity is defined as the ratio of the number of identical pairs in the two multiple alignments to the total number of pairs in the reference. This procedure gives an overall comparison. It does not reflect the fact that in a global structural alignment, some positions are not correctly aligned because they cannot be aligned (this is true of any position where the two structures cannot be superimposed). In practice, structural alignment procedures may deal with these situations in different ways, producing sequence alignments that are sometimes locally arbitrary (especially in the loops). While in DALI these regions are explicitly excluded from the alignment, it is not so obvious to identify them in the multiple sequence alignments contained in 3D_ali. To overcome this type of noise, a procedure was designed that should be less affected by misalignments.
For this alternative measure of biological relevance, we only take into account substitutions that involve a conservation of secondary structural state in the reference alignment (helix to helix and beta strand to beta strand). In the text and the tables, this category of substitution is referred to as (H+E). In most of the test cases, the (H+E) category makes up the majority of the observed pairs, as can be seen in Table 1. For each of the first 11 test cases (3D_ali), the evaluation procedure involved making multiple alignments with five different methods (cf. the next section) and a ClustalW library (default parameters). The ClustalW library was used with SAGA for producing a multiple alignment having an optimal COFFEE score. The biological relevance of this alignment was then assessed by comparison with the structural reference, and compared to the accuracy obtained with the other methods on the same sequences. On the two last test cases (vjs and ceo), alignments were made using FSSP libraries. Alignment accuracy was assessed using the DALI scoring measure. Given a pairwise alignment, this is a measure of the quality of the structure superimposition implied by the alignment. The program used for this purpose (trimdali) returns the DALI score (Holm and Sander, 1993) and two other values: the length of the consensus (number of residues that could be superimposed) and the average RMS (the average deviation between equivalent Cα atoms). These values were computed for each possible pairwise projection of the multiple alignments and averaged. The scores obtained that way for the SAGA alignments were compared to similar scores measured on the FSSP pairwise-based multiple alignments.
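The pair-based accuracy comparison described above can be sketched as a one-line helper. This is our own illustration, operating on sets of residue-position pairs with gap pairs already excluded:

```python
def alignment_accuracy(sample_pairs, reference_pairs):
    """Fraction of residue pairs in the reference alignment that are
    reproduced in the sample alignment (both given as sets of
    (position_in_seq_a, position_in_seq_b) pairs, gaps excluded)."""
    if not reference_pairs:
        return 0.0
    return len(sample_pairs & reference_pairs) / len(reference_pairs)
```

Restricting both sets to pairs whose reference columns conserve secondary structure yields the (H+E) variant of the measure.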
Comparison of COFFEE with other methods

In total, six alignment methods were used to align the 3D_ali test cases: ClustalW v1.6 (Thompson et al., 1994a), SAGA with the MSA objective function (SAGA-MSA) (Notredame and Higgins, 1996), PRRP (V 2.5.1), the iterative alignment method recently described by Gotoh (1996), PILEUP (Higgins and Sharp, 1988) in GCG v9.1, SAM (v2.0), an HMM package (Hughey and Krogh, 1996), and SAGA-COFFEE. Apart from SAM, all these methods were used with the default parameters that came along with the package. In the case of SAM, since it is known that HMMs usually require large sets of sequences in order to evaluate a model, we used the Dirichlet mixture regularizer provided in the package, which is supposed to compensate for this type of problem. SAGA-MSA is the package previously described (Notredame and Higgins, 1996); Rationale 2 weights (Altschul et al., 1989) were computed using the MSA package (Lipman et al., 1989). When possible, MSA was used on the same sequences as SAGA-MSA in order to control the quality of the optimization. Results were consistent with those previously reported.

On the Alpha 8300 workstation, it takes ∼4 s/generation for the gcr test case and ∼7 min/generation for the igb test case. Unfortunately, establishing the complexity in terms of the number of generations needed to reach a global optimum is much harder. This depends on several factors: number of sequences, length of the consensus, relative similarity of the sequences, complexity of the pattern of gaps needed for optimality, and the operators used for mutations and cross-overs. Since the pattern of gaps is an unknown factor, it is impossible really to predict how many generations will be required for one specific test case. On the other hand, judging from the data in Table 1 (N.G. column), it seems that the length of the alignment has a stronger effect than the number of sequences.
Implementation

The COFFEE function and the procedure for building ClustalW pairwise libraries have been implemented in ANSI C. These programs have been integrated in version 0.93 of the SAGA package, also written in ANSI C, and are available upon request from the authors (http://www.ebi.ac.uk/∼cedric).

Comparison of the COFFEE function with other methods

Multiple alignments were produced with SAGA-COFFEE using ClustalW libraries (best-scoring alignment out of 10 runs). These alignments were compared with the structural references. Multiple alignments of the same sets, generated with five other methods, were also compared with the references in order to evaluate relative performance. Since, as used here, the COFFEE function depends heavily on ClustalW, special emphasis was given to the comparison of these two methods (Table 1). The results are unambiguous. In the overall comparison, SAGA improves over ClustalW in nine test cases (in two of these the improvement is >10%). The trend is similar when looking only at (E+H) substitutions, where 10 test cases out of 11 show an improvement. In the few cases where it occurs, the degradation caused by SAGA is always <2%. The extent of the observed improvements usually correlates well with the differences in the scores measured with the COFFEE function. Degradation is observed only in cases where the ClustalW alignment already has a high level of consistency with the reference library (>75%), as can be seen with the globin (COFFEE score of the ClustalW alignment = 0.78) and cytochrome c (COFFEE score of the ClustalW alignment = 0.84) test cases. In order to put SAGA-COFFEE in a wider context, comparisons were made with five other methods (Table 2). These results show that in most cases SAGA-COFFEE does reasonably well. When its alignment is not the best, it is usually within 3% of the best (except for the binding and gcr test cases, for which the difference is greater).
Apart from the HMM method (SAM), which performs poorly, it is relatively hard to rank the existing methods. PRRP is one of the newest methods available. It has been described as one of the most accurate (Gotoh, 1996) and happens to be the only one that significantly outperforms SAGA-COFFEE on some test cases. It is also interesting to note that SAGA-COFFEE is always among the best for test cases having a low level of identity. This trend is confirmed by the results shown in Table 3, where the sequences are grouped according to their average similarity with the rest of their family (as measured on the reference structural alignment). In this table, we analysed the overall performance of each method and compared it with SAGA-COFFEE by counting (i) the overall per cent of (E+H) residues correctly aligned and (ii) the number of sequences for which SAGA-COFFEE makes a better (b)/worse (w) alignment than a given method. [Note that the (b) and (w) counts do not necessarily add up to the overall number of sequences, because sequences scoring identically with the two methods compared are not counted.] Overall, the results confirm that SAGA-COFFEE seems to do better than the other methods when the sequences have a low level of identity with the rest of their set. The poor performance of SAM can probably be explained by two factors: the small number of sequences in each test case, and perhaps some inadequate default settings in the program (in practice, SAM is often used as an alignment improver rather than on its own).

Table 2. Method comparison on the 3D_ali test cases

Test case    Nseq.  Avg. id. (%)  SAGA-COFFEE (%)  PRRP (%)  ClustalW (%)  PILEUP (%)  SAGA-MSA (%)  SAM (%)
ac prot      14     21            50.2             48.8      39.2          40.9        *51.2         27.9
binding      7      31            64.5             *76.2     50.0          66.6        64.2          36.9
cytc         6      42            90.7             89.4      89.1          *94.6       67.3          67.3
fniii        9      17            *47.0            36.3      42.0          37.8        45.2          16.2
gcr          8      36            83.1             *92.8     80.8          80.8        80.8          85.7
globin       17     24            85.2             *87.0     86.4          72.6        78.0          67.8
igb          37     24            *78.1            74.9      74.8          52.4        70.1          67.2
lzm          6      39            *72.3            71.1      72.2          *72.3       *72.3         55.3
phenyldiox   8      22            *64.7            49.9      58.5          37.4        55.6          45.7
sbt          7      61            96.9             96.7      96.9          *97.4       96.0          90.6
s prot       15     27            66.6             64.3      62.5          57.9        *68.5         61.7

*Indicates the method performing best on a test case.
Test case: generic name of the test case, taken from 3D_ali (see 3D_ali for PDB identifiers); see Table 1 for more details. Nseq.: number of sequences in the alignment. Avg. id.: average level of identity between the sequences. SAGA-COFFEE: accuracy of the alignments obtained with SAGA-COFFEE, as judged by comparison with the structural alignment and considering only the (E+H) substitutions. ClustalW: similar, but with ClustalW alignments. PRRP: similar, but with alignments produced by the Gotoh PRRP algorithm (see the text). PILEUP: the pileup method from the GCG package. SAGA-MSA: SAGA using the MSA objective function. SAM: sequence alignment modelling by hidden Markov model.

Table 3. Method comparison on the 3D_ali test cases: global results

Sequence identity  Nseq.  Nres.   SAGA-COFFEE (%)  PRRP (%)  ClustalW (%)  PILEUP (%)  SAGA-MSA (%)  SAM (%)
[00.0–20.0]        28     3424    *63.3            62.2      49.7          42.4        56.9          36.4
[20.0–40.0]        88     12010   *76.2            74.6      66.1          60.2        69.7          59.1
[40.0–100.0]       18     3808    89.3             *90.9     84.6          89.8        87.8          64.3

*Indicates the method performing best on a given range of identity.
Sequence identity: minimum and maximum average identity of the sequences of each category with the rest of their alignment, as measured on the reference structural alignment. Nseq.: number of sequences in a category. Nres.: number of residues. SAGA-COFFEE: percentage of the (E+H) substitutions present in the reference structural alignment that are observed in the SAGA-COFFEE alignment. PRRP, ClustalW, PILEUP, SAGA-MSA, SAM: similar, for the alignments produced by the corresponding method.

These results also indicate that there is no such thing as an ideal method. Even if COFFEE seems to do better on average, one can see in Tables 2 and 3 that the alignments it produces are not always the best. In fact, it seems that, depending on the test case, any method can do better than the others. Unfortunately, as discussed by Gotoh (1996), it is hard to discriminate the factors that should guide the choice of a method. For this reason, being able to identify the correct portions of a multiple sequence alignment may be even more important than being able to produce a very accurate alignment.

Results

Accuracy and complexity of the optimization

Since our approach relies on the ability of SAGA to optimize the COFFEE function, we checked that this optimization was performed correctly. For each test case, a dummy library was created, containing sets of pairwise alignments identical to those observed in the reference multiple structure alignment. In such a case, the structural alignment has a score of 1, since it agrees completely with the library; the maximum score that can be reached by SAGA therefore also becomes 1. Since, under these artificial conditions, the score of the optimum is known, we could test the accuracy of SAGA's optimization. Several runs made on the same set reached the optimum value in an average of 5.4 runs out of 10. The lowest reproducibility was found with the largest test cases of Table 1 (igb and s prot, with a score of 1 being reached one and two times out of 10 runs, respectively). However, even when the optimal score is not reached, it is always possible to produce an alignment with a score better than 0.94. Although these results do not constitute a full proof, they support the assumption that SAGA is a good choice for optimizing the COFFEE function. An important aspect is the complexity of the program and the factors that influence it. As we previously reported when optimizing the sums of pairs with SAGA (Notredame and Higgins, 1996), establishing the complexity is not straightforward. The evaluation of a COFFEE score is quadratic in the number of sequences and linear in the consensus length. For a given population size, the time required for one generation will vary accordingly; representative per-generation timings are given above.

Fig. 2. Correlation between sequence score and alignment accuracy. (a) The average level of identity of each sequence with the rest of its alignment was measured on the reference structural alignment. The average level of accuracy of the SAGA-COFFEE alignment of each of these sequences was also estimated on the (E+H) category. The two values are plotted against one another. (b) For each sequence, the sequence score was measured on the SAGA-COFFEE alignment; this value was plotted against the accuracy of the sequence alignment. The coefficient of linear correlation was estimated on these points (r = 0.65).

Correlation between COFFEE score and alignment accuracy

As mentioned in Methods, the score can be assessed at a local level (sequence score or residue score). One of the benefits of such an evaluation is that local score and accuracy can be correlated, thus allowing the identification of potentially correct portions of an alignment with a known risk of error.
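The consistency score at the heart of this evaluation can be sketched as follows (a minimal illustration of a COFFEE-style score, not the SAGA implementation; the library format, weights and toy alignment are assumptions made for the example):

```python
# Minimal sketch of a COFFEE-style consistency score: the score of an MSA is
# the weighted fraction of its residue pairs that also occur in a library of
# pairwise alignments. Illustrative only (not the SAGA code); the library
# format, weights and sequences are invented for the example.

from itertools import combinations

def residue_indices(gapped):
    """Per-column residue index of a gapped sequence (None on gaps)."""
    out, idx = [], 0
    for ch in gapped:
        out.append(None if ch == '-' else idx)
        if ch != '-':
            idx += 1
    return out

def coffee_score(msa, library, weights):
    """Weighted fraction of the MSA's residue pairs present in the library."""
    cols = {name: residue_indices(seq) for name, seq in msa.items()}
    hit = tot = 0.0
    for a, b in combinations(sorted(msa), 2):
        w = weights[(a, b)]
        for ia, ib in zip(cols[a], cols[b]):
            if ia is not None and ib is not None:
                tot += w
                if (a, ia, b, ib) in library:
                    hit += w
    return hit / tot if tot else 0.0

msa = {'s1': 'AC-D', 's2': 'ACED'}
library = {('s1', 0, 's2', 0), ('s1', 1, 's2', 1), ('s1', 2, 's2', 3)}
weights = {('s1', 's2'): 1.0}
print(coffee_score(msa, library, weights))  # 1.0: every aligned pair is in the library
```

A SAGA-style optimizer would then search alignment space for the MSA maximizing this score; the library here plays the role of the ClustalW or FSSP libraries discussed in the text.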
The 3D_ali structure-based alignments were used once more to validate this approach. Generally speaking, a high residue score indicates that the pairs in which a given residue is involved are also found in the pairwise library. Conversely, if none of the pairings in which a given residue is involved is found in the library, that residue will be considered unaligned. We evaluated the COFFEE score of each sequence in each alignment. For each of these sequences the (E+H) average accuracy was also measured. The graph in Figure 2b shows the relationship between sequence score and (E+H) average accuracy. The correlation between these two quantities is reasonable (r = 0.65). When considering the values used for this graph, we found that for >85% of the sequences it is possible to predict the actual accuracy of the alignment within a ±10% error margin. In terms of prediction, this is a substantial improvement over what can be obtained by measuring the average level of identity between one sequence and its multiple alignment, as shown in Figure 2a.

Table 4. Average accuracy of the alignment of each sequence as a function of its sequence score (3D_ali test cases)

                 N. residues (%)      N. sequences (%)     Accuracy (E+H) (%)
Sequence score   ClustalW   SAGA      ClustalW   SAGA      ClustalW   SAGA
[0.00–0.33]      5.8        2.6       6.7        3.0       14.3       9.9
[0.33–0.66]      36.8       33.7      40.3       38.1      63.2       67.2
[0.66–1.00]      57.4       63.7      53.0       59.0      82.0       82.5
TOTAL            19 242 residues      134 sequences

Sequence score: minimum and maximum score of the sequences in each category. N. residues: percentage of residues belonging to each category, estimated on the SAGA or ClustalW alignments. N. sequences: percentage of the total sequences belonging to each category of score, as measured on the SAGA and ClustalW alignments. Accuracy (E+H): accuracy associated with each category of score in the SAGA and ClustalW alignments. TOTAL: total number of residues and sequences in the comparison.
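The sequence-level correlation reported above (r = 0.65 on the 3D_ali data) can be reproduced in miniature; the per-sequence scores and accuracies below are invented for illustration:

```python
# Sketch of the sequence-level analysis: correlate each sequence's consistency
# score with its alignment accuracy. The two lists are toy data; the paper
# reports r = 0.65 on the real 3D_ali sequences.

import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sequence_score = [0.29, 0.45, 0.61, 0.72, 0.88, 0.93]  # per-sequence COFFEE scores (toy)
accuracy       = [0.20, 0.41, 0.55, 0.70, 0.79, 0.95]  # per-sequence (E+H) accuracy (toy)
print(round(pearson(sequence_score, accuracy), 3))
```

On real data the scatter is wider, which is why the paper frames the prediction as a ±10% error margin rather than a deterministic mapping.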
Table 5. Accuracy of the prediction made on category 5 of substitution

             Accuracy (%)          Correct substitutions (%)
Test case    ClustalW   SAGA       ClustalW   SAGA
ac prot      56.8       68.2       9.6        15.7
binding      64.3       61.4       31.5       40.0
cytc         96.2       93.9       72.1       73.5
fniii        81.5       77.7       13.8       15.6
gcr          75.5       77.4       63.4       74.5
globin       97.2       95.0       63.1       66.5
igb          88.8       85.5       39.0       42.3
lzm          91.5       91.8       61.3       61.5
phenyl       78.0       72.5       34.3       40.0
s prot       82.2       82.4       45.2       50.1
sbt          89.7       89.7       85.2       87.0

Test case: generic name of the test case, taken from 3D_ali. Accuracy: fraction of the substitutions in category 5 (in the SAGA and ClustalW alignments) that are present in the reference alignment. Correct substitutions (%): percentage of the correct substitutions (over the total number, all substitution categories included) identified in category 5.

The correlation between score and accuracy becomes slightly more apparent when looking at the data in a more global way (Table 4). In this case, the sequences were grouped according to their score, and the accuracy of their alignment was measured. One can see that the higher the score of a sequence, the higher its average alignment accuracy. We also found that the distribution of the sequences among the three categories changes when using ClustalW instead of SAGA: SAGA produces more high-scoring sequences than ClustalW does. This means that not only are SAGA alignments more accurate than ClustalW's, it is also possible to identify them as being so. In practice, the sequence score, as imperfect as it may seem, provides a fast and simple way to identify sequences that do not really belong to a set, or that are so remotely related to the rest that their alignment should be considered with care. The sequence score is a global measure, however; it does not reflect the local variations that occur at the residue level.
To analyse this kind of data and assess its utility for predicting correct portions of an alignment, the score of each residue in each multiple sequence alignment was evaluated using equation (3). These scores were scaled to a range of 0 to 9: a residue has a score of 9 if >90% of the pairs in which it is involved are also present in the reference library, a score of 8 for [80–90%], and so on. Once residue scores have been evaluated, substitution classes can be defined. For instance, class 5 of substitutions includes all the residues of a multiple alignment having a residue score greater than or equal to 5 (Figure 3a), while class 0 includes all the residues in the alignment. Figure 3a gives an example of such an evaluation. In this alignment, each residue is replaced by its score, and the residues that belong to category 5 are boxed. Figure 3b shows the correctly aligned residues in this category. One can see that, using our measure, it is possible to identify some of the correct portions in the SAGA fniii alignment. As can be gathered from Table 1, fniii is a very demanding test case. Except for the first two sequences, which are almost identical, all the other members of this set have a very low level of identity with one another. This is especially true of the sequence 2hft_1, which illustrates well the advantages and limits of our method. This sequence is not the most remotely related in the set: it has an average identity of 14%, whereas two other members (3hhr_c and 2hft) are more distantly related, with 12% average identity. Despite this fact, 2hft_1 gets the lowest sequence score in the multiple alignment (0.29). This correlates well with the fact that it also has the lowest alignment accuracy of the set [18% overall, 20% for the (E+H) category]. Similarly, the only non-terminal stretch of this sequence that belongs to category 5 is one of the few portions to be correctly aligned (Figure 3a and b).
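The residue-score scaling and the category-5 selection described above amount to a simple bucketing, sketched here with invented scores:

```python
# Sketch of the residue-score bucketing described in the text: each residue's
# consistency score (fraction of its pairs found in the library, 0..1) is
# scaled to the 0..9 integer range, and "category 5" keeps residues with a
# scaled score >= 5. The raw scores below are invented.

def scale(score):
    """Map a 0..1 consistency score to the 0..9 integer scale."""
    return min(9, int(score * 10))

residue_scores = [0.95, 0.52, 0.12, 0.78, 1.0, 0.49]
scaled = [scale(s) for s in residue_scores]
category5 = [i for i, s in enumerate(scaled) if s >= 5]
print(scaled)      # [9, 5, 1, 7, 9, 4]
print(category5)   # positions of reliably aligned residues: [0, 1, 3, 4]
```

In Figure 3a each residue of the fniii alignment is replaced by exactly this kind of scaled score, and the boxed regions correspond to `category5`.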
The same type of analysis was carried out on the 10 other test cases (Table 5). Our measures indicate that, using category 5 of substitution, a substantial portion of correctly aligned residues can be identified. When comparing ClustalW and SAGA, we found that more correct residues can be identified with SAGA. This improvement is sometimes achieved at the cost of a slightly lower accuracy (more false positives) in the SAGA alignments. A global estimation was made to evaluate the accuracy that can be expected when using any of the 10 substitution categories on a SAGA alignment. The proportion of correct substitutions predicted that way was also measured. These results are presented in Figure 4a and b, respectively. Residues are grouped in three classes, depending on the score of the sequences in which they occur.

Fig. 3. Evaluation of the accuracy of the fniii test case (fibronectin type III family). (a) Sequences in the fniii test case were aligned by SAGA-COFFEE using a ClustalW library. The alignment obtained that way was evaluated locally. The sequence names are the PDB identifiers; the suffixes _1, _2, etc. indicate that several portions of the same sequence have been used (cf. 3D_ali for further details). In this alignment, each residue has been replaced by its score. The gray boxes indicate all the residues that belong to category 5 of substitution (i.e. having a score ≥5). The sequence score box lists the values measured on each sequence. (b) The accuracy in category 5 of substitution (boxes) was evaluated by comparison with the reference alignment. Residues shadowed in gray in the boxes are correctly aligned to one another. Boxed residues that are not shadowed are not correctly aligned with each other or with the rest of the category 5 residues. Residues outside the boxes are not taken into account in this evaluation.
Figure 4a confirms that high-scoring residues are usually correctly aligned (high accuracy). However, the higher the substitution category, the smaller the number of residues on which a prediction can be made, as shown in Figure 4b. These graphs confirm that the residue score can be used as a measure for predicting accuracy; they also indicate that the sequence score is informative when making a prediction on a residue.

Making a multiple structural alignment

The analysis carried out with the ClustalW libraries represents only one possible application of the COFFEE function. Generally speaking, the COFFEE scheme allows the combination of the information contained in any reference library, regardless of the method used for its construction. To illustrate this fact, we show that it is possible to build a structure-based multiple sequence alignment when a library of high-quality pairwise structural alignments is available. We used COFFEE on two sets of proteins (vjs and ceo) with the appropriate FSSP libraries. It was impossible to improve significantly over FSSP for the ceo test case, made of endoglucanases and other related carbohydrate-degradation enzymes. This can be explained by the fact that the FSSP alignment with the best DALI score (the one using 1ceo as a guide) already has a high level of consistency with the library (COFFEE score = 0.82); accordingly, this alignment is 88% similar to the SAGA-COFFEE one. The second set is made of amylases and other carbohydrate-degradation enzymes. Table 6 compares the SAGA-COFFEE alignment of these sequences with the corresponding FSSP pairwise-based multiple alignments. These results clearly indicate that the alignment produced by SAGA is better than any of the FSSP multiple alignments, regardless of the criterion used to evaluate this improvement (DALI score, consensus length or RMS).
This result was to be expected, since SAGA makes use of much more information than any of the FSSP alignments. In Table 6, entries are sorted according to the DALI score. This makes it possible to see that the DALI and COFFEE scores correlate well for the FSSP alignments, and supports the idea that the COFFEE score is also a good indicator of alignment quality when the library is based on structural alignments.

Fig. 4. Prediction of correctly aligned residues using the residue COFFEE score. (a) The accuracy of the alignments (number of correct substitutions in one of the categories divided by the total number of substitutions in this category) of each sequence was measured. To do so, sequences were divided into three groups, depending on their sequence score. The graph was made for each of the three groups. (b) For each sequence, the number of correct substitutions contained in each category was evaluated and divided by the overall number of substitutions involving that sequence. This value was plotted against the category of substitution.

Table 6. Comparison of FSSP and SAGA multiple alignments

Guide sequence   Avg. DALI score   Avg. consensus length   Avg. RMS (Å)   COFFEE score
2ebn             1152.6            186.5                   3.73           0.53
1cnv             1205.2            196.4                   3.63           0.59
1vjs             1258.4            198.8                   3.62           0.50
1ctn             1331.2            196.9                   3.53           0.60
1smd             1667.1            219.4                   3.40           0.65
2amg             1672.9            217.7                   3.42           0.67
2aaa             1766.8            224.9                   3.45           0.69
1pamA            1786.3            225.8                   3.30           0.70
SAGA-COFFEE      1860.0            230.9                   3.20           0.79

Guide sequence: sequence used as a guide in the FSSP multiple alignment (SAGA-COFFEE indicates the alignment obtained with SAGA-COFFEE). Avg. DALI score: average DALI score of each pair of sequences in the alignment; the table is sorted according to this column. Avg. consensus length: average number of residues superimposable by DALI in each pair of sequences.
Avg. RMS: average of the RMS values measured by DALI on each pair of the alignment, in Ångströms. COFFEE score: score given by SAGA to the multiple alignment using the same library.

In theory, we could have used the DALI score as an objective function and optimized it with SAGA. In such a context, DALI would be used to evaluate all the pairwise projections in order to give a score similar to the one shown in the 'DALI score' column of Table 6. In practice this is not possible, because the computation of a DALI score is much more expensive than the evaluation of a COFFEE score: the DALI score of a multiple alignment is quadratic in both the number of sequences and the length of the alignment, whereas the COFFEE score is quadratic in the number of sequences but only linear in the length of the alignment. In consequence, even if a global DALI score were assumed to be biologically more realistic than the FSSP library-based COFFEE score, COFFEE still appears to be a good trade-off between approximating DALI and saving on computational cost.

Discussion

In this work, we show that alignments can be evaluated for their maximum weight trace (MWT) score using the COFFEE function and subsequently optimized with the genetic algorithm package SAGA. Given a heterogeneous, non-consistent collection of pairwise alignments, one can extract the corresponding multiple alignment with COFFEE and SAGA. We have shown here that the SAGA-COFFEE scheme outperforms most of the commonly used alternative packages when applied to sequences having low levels of identity. The comparison with other global optimization techniques, such as SAGA-MSA and PRRP, indicates that the method is better not only because it performs a global optimization, but probably also because of the way it uses information, filtering some of the noise through the library of pairwise alignments. The weighting scheme also plays a role in this improvement.
It helps turn the relationships between the sequences into some of the constraints that define the optimal alignment. It is because all these constraints (library and weights) are unlikely to be fully consistent that the genetic algorithm strategy proves to be a very appropriate means of optimization. There is little doubt that the performance of our method will also depend on the relationships between the sequences. Sets with many intermediate sequences (i.e. a dense phylogenetic tree) are likely to lead to more accurate alignments. However, the fact that COFFEE proves able to deal with sequences having a very low level of identity is quite encouraging with regard to the robustness of the method. One of the main advantages of the COFFEE strategy is the flexibility given to the user in defining the library. Here, by using two completely different pairwise alignment methods, we managed to produce high-quality multiple alignments in both cases. This is interesting, but constitutes only a first step. The structure of the libraries we have been using is very simple: they rely only on an 'all-against-all' comparison using one type of pairwise alignment algorithm per library. In practice, this scheme can easily be extended to much more complex library structures. It is common sense to have higher confidence in results that can be reproduced using independent methods. Some prediction methods rely on this type of assumption, such as the block definition strategy described by Henikoff et al. (1995). These methods usually limit themselves to identifying highly conserved patterns among a set of solutions. With the COFFEE strategy, we go much further and make it possible to find a consensus solution whatever the number of constraints and whatever their relative compatibility. Of course, it is not enough for a solution to exist; one also needs to know how accurate this solution is.
In this work, we have shown that the level of consistency of a solution is a good indicator of such accuracy. This accuracy prediction constitutes the other main aspect of the COFFEE function. Several methods have been proposed that attempt to predict correct portions of a pairwise alignment given a set of sub-optimal alignments (Gotoh, 1990; Vingron and Argos, 1990; Mevissen and Vingron, 1996). Using these methods, libraries could be designed with large numbers of sub-optimal alignments. Here again, the difference between the COFFEE method and previously proposed approaches is that it is possible not only to predict the correct portions of an alignment, but also to optimize a multiple alignment for having as many reliable regions as possible. SAGA-COFFEE still needs to be improved in several respects. For instance, further work will involve the definition of more complex libraries that will hopefully lead to more meaningful consistency indices; the main source of inspiration here will be the work done on pairwise alignment stability (Mevissen and Vingron, 1996). The other direction we plan to take has to do with the combination of scoring schemes. We have seen here that there is no uniform solution to the multiple sequence alignment problem. For this reason, it would make sense to generate libraries containing alternative alignments made by all the available methods (PRRP, ClustalW, HMM, etc.). COFFEE could then be used to merge this information and hopefully extract the best of each alignment. This will require some improvement of the COFFEE function and its adaptation to highly redundant libraries. Another crucial aspect will be increasing the efficiency of the algorithm. At present, SAGA-COFFEE is an extremely slow method; however, we hope to improve on this by using a more appropriate type of seeding.
Finally, another important aspect of our approach will involve the refinement of the method used here for building multiple structural alignments. The project will be based on a procedure similar to the one described above: the design of more efficient weights and an attempt to use the alternative 420 COFFEE: an objective function for multiple sequence alignments structural alignments that can be produced by the DALI method, using a wider range of DALI test cases. Acknowledgements The authors wish to thank Miguel Andrade and Thure Etzold for very useful comments and corrections. They also wish to thank the referees for their useful remarks and interesting suggestions. References Altschul,S.F. (1989) Gap costs for multiple sequence alignment. J. Theor. Biol., 138, 297–309. Altschul,S.F. and Erickson,B.W. (1986) Optimal sequence alignment using affine gap costs. Bull. Math. Biol., 48, 603–616. Altschul,S.F., Carroll,R.J. and Lipman,D.J. (1989) Weights for data related by a tree. J. Mol. Biol., 207, 647–653. Bucher,P. and Hofmann,K. (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, St Louis, MO. Carrillo,H. and Lipman,D.J. (1988) The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48, 1073–1082. Davis,L. (1991) The Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York. Dayhoff,M.O. (1978) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC. Eddy,S.R. (1995) Multiple alignment using hidden Markov models. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England. AAAI Press, Menlo Park, CA. Feng,D.-F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360. Godzik,A. and Skolnik,J. 
(1994) Flexible algorithm for direct multiple alignment of protein structures and sequences. Comput. Applic. Biosci., 10, 587–596. Goldberg,D.E. (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, New York. Gotoh,O. (1990) Consistency of optimal sequence alignments. Bull. Math. Biol., 52, 509–525. Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinements as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838. Gribskov,M., McLachlan,M. and Eisenberg,D. (1987) Profile analysis: Detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919. Henikoff,S., Henikoff,J., Alford,W. and Pietrokovsky,S. (1995) Automated construction and graphical representation of blocks from unaligned sequences. Gene, 163, GC17–26. Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244. Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138. Holm,L. and Sander,C. (1996a) Alignment of three-dimensional protein structures: network server for database searching. Methods Enzymol., 266, 653–662. Holm,L. and Sander,C. (1996b) The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res., 24, 206–210. Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Applic. Biosci., 12, 95–107. Ishikawa,M., Toya,T., Hoshida,M., Nitta,K., Ogiwara,A. and Kanehisa,M. (1993a) Multiple sequence alignment by parallel simulated annealing. Comput. Applic. Biosci., 9, 267–273. Ishikawa,M., Toya,T. and Tokoti,Y. (1993b) Parallel iterative aligner with genetic algorithm. 
In Artificial Intelligence and Genome Workshop, 13th International Conference on Artificial Intelligence, Chambery, France. Kececioglu,J.D. (1993) The maximum weight trace problem in multiple sequence alignmnet. Lecture Notes Comput. Sci., 684, 106–119. Kim,J., Pramanik,S. and Chung,M.J. (1994) Multiple sequence alignment using simulated annealing. Comput. Applic. Biosci., 10, 419–426. Krogh,A. and Mitchison,G. (1995) Maximum entropy weighting of aligned sequences of proteins or DNA. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England. AAAI Press, Menlo Park, CA. Lipman,D.J., Altschul,S.F. and Kececioglu,J.D. (1989) A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA, 86, 4412–4415. Mevissen,H.T. and Vingron,M. (1996) Quantifying the local reliability of a sequence alignment. Protein Eng., 9, 127–132. Morgenstern,B., Dress,A. and Wener,T. (1996) Multiple DNA and protein sequence based on segment-to-segment comparison. Proc. Natl Acad. Sci. USA, 93, 12098–12103. Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453. Notredame,C. and Higgins,D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24, 1515–1524. Notredame,C., O’Brien,E.A. and Higgins,D.G. (1997) RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res., 25, 4570–4580. Pascarella,S. and Argos,P. (1992) A data bank merging related protein structures and sequences. Protein Eng., 5, 121–137. Rost,B. (1997) AQUA Server. http://www.ebi.ac.uk/∼rost/Aqua/ aqua.html Sibbald,P.R. and Argos,P. (1990) Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J. Mol. Biol., 216, 813–818. 421 C.Notredame, L.Holm and D.G.Higgins Sjolander,K., Karplus,K., Brown,M., Huguey,R., Krogh,A., Saira,M. and Haussler,D. 
(1996) Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Applic. Biosci., 12, 327–345. Taylor,W.R. (1988) A flexible method to align large numbers of biological sequences. J. Mol. Evol., 28, 161–169. Thompson,J., Higgins,D. and Gibson,T. (1994a) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994b) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Applic. Biosci., 10, 19–29. Vingron,M. and Argos,P. (1990) Determination of reliable regions in protein sequence alignment. Protein Eng., 3, 565–569. Vingron,M. and Argos,P. (1991) Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol., 218, 33–43. Vingron,M. and Sibbald,P. (1993) Weighting in sequence space: a comparison of methods in terms of generalized sequences. Proc. Natl Acad. Sci. USA, 90, 8777–8781. Vogt,G., Etzold,T. and Argos,P. (1995) An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol., 249, 816–831. Wang,L. and Jiang,T. (1994) On the complexity of multiple sequence alignment. J. Comput. Biol., 1, 337–348.

BIOINFORMATICS Vol. 22 no. 14 2006, pages e35–e39 doi:10.1093/bioinformatics/btl218

The iRMSD: a local measure of sequence alignment accuracy using structural information

Fabrice Armougom1, Sébastien Moretti1, Vladimir Keduas1 and Cédric Notredame1,*

1 Laboratoire Information Génomique et Structurale, CNRS UPR2589, Institute for Structural Biology and Microbiology (IBSM), Parc Scientifique de Luminy, case 934, 163 Avenue de Luminy, FR-13288, Marseille cedex 09
More than 20 structure alignment packages have been developed (Goldsmith-Fischman and Honig, 2003). All these packages tend to produce different alignments because of their different underlying optimization algorithms. Furthermore, the lack of a universally accepted criterion for describing the quality of a structural alignment makes it difficult to determine the relative merits of these packages (Kolodny et al., 2005). The most common procedure for evaluating structure superpositions is to use the root mean square deviation (RMSD) of the superposed atoms. This measure estimates the root mean square distance between the equivalent alpha-carbons of the two superposed structures. It can be ambiguous because of its dependence on two critical parameters: the minimization method and the procedure used to exclude structurally non-equivalent regions (loops, for instance). Having several methods that deliver structure-based sequence alignments and not knowing which one does best is a major issue in a context where structure-based alignments are routinely used to improve and guide the development of sequence alignment methods (Wallace et al., 2005). A direct consequence of this situation has been the development of at least five collections of reference structure-based sequence alignments (Edgar, 2004; Mizuguchi et al., 1998; O'Sullivan et al., 2004; Raghava et al., 2003; Thompson et al., 2005; Van Walle et al., 2005). These collections are all used for a similar purpose: the benchmarking of sequence alignment algorithms. Since it is virtually impossible to compare these datasets and decide whether some are more informative than others, the most common practice is to use them all and look for common trends in the global results (Katoh et al., 2005). While results measured on these reference collections tend to agree for datasets with more than 30% identity, variations appear when considering sets of remote homologues (Katoh et al., 2005).
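As a point of reference for the discussion above, the standard RMSD over already-superposed equivalent alpha-carbons reduces to a few lines (a minimal illustrative sketch: the function name and coordinate representation are our own, and the contentious steps, superposition and exclusion of non-equivalent regions, are assumed to have been done beforehand):

```python
import math

def rmsd(ca_a, ca_b):
    """RMSD between equivalent alpha-carbons of two superposed structures.

    ca_a, ca_b: equal-length sequences of (x, y, z) coordinates for the
    residues judged equivalent. The superposition itself, and the choice of
    which regions (e.g. loops) to exclude, must happen before this call;
    those two steps are precisely the ambiguous part discussed in the text.
    """
    assert len(ca_a) == len(ca_b) and ca_a
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(ca_a, ca_b))
    return math.sqrt(total / len(ca_a))
```

The formula itself is unambiguous; the disagreements between packages come entirely from the inputs fed to it.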
Aside from potential accuracy problems, the simplest explanation for these discrepancies is the possibility for alternative sequence alignments to be structurally equivalent, especially when considering remote homologues (Lackner et al., 2000). In this context, setting one specific alignment as a reference becomes an arbitrary choice and therefore a bias toward specific alignment methods. In practice, the authors try to minimize that effect by specifying the core regions that should be used for the comparison, but this choice is also difficult and somewhat arbitrary. We suggest in this paper that replacing the reference alignments with an RMSD-style measure would be more objective.

ABSTRACT
Motivation: We introduce the iRMSD, a new type of RMSD, independent of any structure superposition and suitable for evaluating sequence alignments of proteins with known structures.
Results: We demonstrate that the iRMSD is equivalent to the standard RMSD although much simpler to compute, and we show that it is suitable for comparing sequence alignments and benchmarking multiple sequence alignment methods. We tested the iRMSD score on six established multiple sequence alignment packages and found the results to be consistent with those obtained using an established reference alignment collection like Prefab.
Availability: The iRMSD is part of the T-Coffee package and is distributed as open source freeware (http://www.tcoffee.org/).
Contact: cedric.notredame@gmail.com; cedric.notredame@igs.cnrs-mrs.fr

1 INTRODUCTION
The computation of accurate sequence alignments constitutes a prerequisite for an ever-increasing number of biological analyses. These include phylogenetic reconstruction, structure prediction, domain-based analysis, function prediction and comparative genomics. In all these cases, the purpose of the alignment is to exploit evolutionary variations in order to reveal biologically meaningful patterns.
The discovery and the proper analysis of these patterns depend entirely on the alignment correctness. In many cases, an alignment is considered to be biologically correct when it accurately reflects the structural relationship between the considered sequences. This result is achieved by matching structurally equivalent residues. Assembling such an alignment is trivial when the sequences are highly similar but becomes harder for remote homologues. When considering alignments of sequences with less than 25% identity (the so-called twilight zone), standard scoring schemes like substitution matrices become uninformative, and it can be difficult to determine the alignment accuracy, or even whether the sequences are truly related at all. So far, the most satisfying way of aligning remote homologues has been to use structural information whenever possible (Huang and Bystroff, 2006; Lesk and Chothia, 1980). The use of structural information, however, carries its own perils, and while the sequence analysis community tends to consider structure-based alignments as unambiguous and unquestionable gold standards, a closer look reveals a much less clear-cut situation.

* To whom correspondence should be addressed.
© The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model.
aligned pair Z and W within a sphere of radius r centered on X and Y, i.e. every pair that verifies:

d(XZ) < r \quad \mathrm{AND} \quad d(YW) < r \qquad (1)

The ensemble of pairs ZW that verify Equation 1 is named the neighborhood and noted N(XY). The default value of r is 10 Å (O'Sullivan et al., 2003), which corresponds to a neighborhood size of 20-40 residues. The local iRMSD can be estimated as follows:

\mathrm{iRMSD}(XY) = \sqrt{ \frac{ \sum_{ZW} \left( d(XZ) - d(YW) \right)^2 }{ N(XY) } } \qquad (2)

The summation is made over all the aligned ZW pairs within the neighborhood (Equation 1). Pairs XY with an empty neighborhood have their local iRMSD undefined. The global measure is obtained by summing over every pair XY and dividing by the number N of pairs with a non-empty neighborhood:

\mathrm{iRMSD} = \frac{ \sum_{XY} \mathrm{iRMSD}(XY) }{ N } \qquad (3)

The iRMSD thus defined is not suitable for comparing alternative alignments, as it tends to give a better score to alignments with long gaps and few well-aligned residues. In order to simultaneously take into account the superposition accuracy and the extent of the alignment (i.e. the number of matched residues), we adapted the CI formula of Kleywegt and Jones (1994) to turn the iRMSD into a normalized iRMSD (NiRMSD):

\mathrm{NiRMSD} = \mathrm{iRMSD} \times \frac{ \min(L_1, L_2) }{ N } \qquad (4)

Fig 1. Basic principle of the iRMSD. Equivalences implied by the sequence alignment are tested on the structure. The assumption is that if XY and ZW are correctly aligned, then the distances d(XZ) and d(YW) must be similar. ZW pairs are only considered if they lie within a sphere of radius r centered on X and Y.

Such a measure would be a more objective way to evaluate the sequence alignments of proteins. The RMSD has two advantages over standard methods: no dependence on a reference alignment, and the possibility to quantify the structural correctness of any protein sequence alignment (provided the protein structures are known).
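Equations 1-4 above can be sketched in code as follows (an illustrative reimplementation, not the T-Coffee distribution's code; the function names, the list-of-index-pairs alignment representation, and the exclusion of a pair from its own neighborhood are assumptions of this sketch):

```python
import math

def dist(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def irmsd(alignment, ca_a, ca_b, r=10.0):
    """Global iRMSD (Equations 2 and 3).

    alignment: list of (i, j) pairs, residue i of A aligned with residue j of B.
    ca_a, ca_b: alpha-carbon coordinates of A and B, indexable by residue.
    Returns (iRMSD, N), where N is the number of pairs with a non-empty
    neighborhood. The pair (X, Y) itself is excluded from its own
    neighborhood, a detail the equations leave implicit.
    """
    local = []
    for x, y in alignment:
        # Neighborhood N(XY): aligned pairs (Z, W) with d(X,Z) < r and
        # d(Y,W) < r (Equation 1).
        nb = [(z, w) for z, w in alignment if (z, w) != (x, y)
              and dist(ca_a[x], ca_a[z]) < r and dist(ca_b[y], ca_b[w]) < r]
        if not nb:
            continue  # local iRMSD undefined for an empty neighborhood
        ss = sum((dist(ca_a[x], ca_a[z]) - dist(ca_b[y], ca_b[w])) ** 2
                 for z, w in nb)
        local.append(math.sqrt(ss / len(nb)))  # Equation 2
    return sum(local) / len(local), len(local)  # Equation 3

def nirmsd(alignment, ca_a, ca_b, len_a, len_b, r=10.0):
    """Normalized iRMSD (Equation 4): iRMSD * min(L1, L2) / N."""
    score, n = irmsd(alignment, ca_a, ca_b, r)
    return score * min(len_a, len_b) / n
```

Note that only intra-molecular distances are ever computed, which is why no superposition step is needed.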
The main drawback, however, is the reliance of the RMSD on a structure superposition strategy. This key step admits many alternative solutions whose relative merits are difficult to estimate (Kolodny et al., 2005). We redesigned the RMSD measure to make it independent of any structure superposition procedure. We named this measure iRMSD because it is an RMSD based on intra-molecular distance comparisons. The iRMSD is a follow-up of the APDB measure (O'Sullivan et al., 2003), designed to evaluate alignments for their compatibility with the structural superposition they imply. While APDB was a complex measure depending on three semi-arbitrary parameters, the new iRMSD algorithm requires only one parameter. We show here that the iRMSD behaves just like a standard RMSD both numerically (range of values) and structurally (similar structural meaning). We finally show that a straightforward normalization makes the iRMSD perfectly suitable for evaluating and comparing sequence alignment methods without the need for pre-established reference alignment collections.

L1 and L2 are the respective lengths of the two sequences, and N the number of residue pairs with a non-empty neighborhood. This formula amounts to incorporating a gap penalty that deals with indels and with aligned pairs whose neighborhood is empty.

2 METHODS

2.1 The iRMSD measure

2.2 Validation procedure using Prefab

We used the Prefab (Edgar, 2004) collection of reference alignments to analyze the iRMSD. Prefab is an extensive collection of 1682 pairwise structural alignments obtained by combining the output of two structure alignment programs: CE (Shindyalov and Bourne, 1998) and DALI (Holm and Sander, 1993). In each of these alignments the authors have defined core regions where the DALI and CE methods agree, and have used these regions for evaluation purposes.
Given one Prefab reference alignment and an alternative target alignment of the same sequences, the Qscore is defined as the fraction of core columns in the reference alignment found aligned identically in the target. In order to evaluate multiple sequence alignment packages, Prefab also includes in each dataset a collection of about 48 sequences homologous to the two structures. When evaluating an MSA package, the large dataset is aligned and the Qscore is measured on the core regions of the induced alignment of the two structures. We evaluated the RMSD and the iRMSD of the Prefab alignments. However, because of various inconsistencies between the ATOM and SEQRES fields of the PDB entries and the sequences of the Prefab alignments, LSQMAN could only handle 587 of the original Prefab entries. This sample had roughly the same identity distribution as the entire Prefab (243 datasets with less than 20% identity on the reference Prefab alignment, 172 between 20% and 40% identity and 171 with more than 40% identity). We believe it to be representative and large enough for the purpose of the present analysis.

The iRMSD measure follows the underlying principle of APDB: given a correct alignment of two protein sequences A and B (Figure 1), if X is aligned with Y and Z with W, then the XZ distance (d(XZ)) must be similar to d(YW). The better the alignment of A and B, the smaller the average difference between all possible pairs d(XZ) and d(YW). The iRMSD associated with the aligned pair X and Y is estimated by considering every aligned pair in its neighborhood (Equations 1 and 2).

2.3 Evaluation of the standard RMSD

We used the LSQMAN package (Kleywegt and Jones, 1999) to estimate the standard RMSD associated with the Prefab alignments. The local RMSD was estimated by superposing the residues contained in a window of size 21 (2×10+1) centered on a pair of aligned residues. The superposition was made using the Xalignment function of the LSQMAN package.
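For a pairwise alignment, the Qscore defined at the start of this section reduces to counting reproduced residue pairs (a minimal sketch with our own function name and pair-set representation, not Prefab's actual evaluation code):

```python
def qscore(reference_pairs, target_pairs, core=None):
    """Fraction of reference residue pairs reproduced in the target alignment.

    reference_pairs, target_pairs: iterables of (i, j) tuples, each meaning
    residue i of the first structure is aligned with residue j of the second.
    core: optional iterable restricting the evaluation to the residue pairs
    that fall inside the annotated core columns.
    """
    ref = set(reference_pairs)
    if core is not None:
        ref &= set(core)
    if not ref:
        return 0.0
    return len(ref & set(target_pairs)) / len(ref)
```

A Qscore of 1.0 means every (core) column of the reference is aligned identically in the target.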
The overall RMSD was obtained by sliding the window and averaging over all the windows.

2.4 Multiple sequence alignment methods

We benchmarked the iRMSD measure on the alignments produced using the public distributions of six multiple sequence alignment packages: ClustalW (Version 1.83) (Thompson et al., 1994), DialignII (Version 2.2.1) (Morgenstern, 1999), Muscle (Version 3.6) (Edgar, 2004), Mafft (Version 5.6) (Katoh et al., 2005), ProbCons (Version 1.10) (Do et al., 2005) and T-Coffee (Version 3.75) (Notredame et al., 2000).

2.5 Availability

The iRMSD package is part of the t_coffee package. It is open source freeware that can be downloaded from http://www.tcoffee.org/. It comes with extensive documentation.

3 RESULTS

We started by comparing the iRMSD with the standard RMSD. We did so by measuring the scores associated with the 587 Prefab alignments. The measurements were made either on core regions (Figure 2a) or on the entire Prefab alignments (Figure 2b). Both panels indicate a very strong correlation between the two measures. The core analysis gave an r² correlation coefficient of 0.92, while the measure on the entire alignments gave an r² of 0.93. As expected, the dispersion increases with the RMSD values. The Prefab alignments are high-quality structure-based alignments, but we also checked the behavior of the methods when analyzing alignments of lower quality (Figure 2c). We selected the Dialign method, whose alignments have an average Prefab Qscore of 0.65 on the entire dataset (0.32 in the [0-20] identity range). Figure 2c shows that the two measures remain correlated up to an RMSD of 2.5 Å (r² = 0.75), indicating a saturation of the iRMSD measure for values above 1.6 Å. This apparent saturation is a consequence of the different local substructures compared by each method (windows for the RMSD and spheres for the iRMSD), and it does not occur when measuring the standard RMSD on spheres of radius 10 Å rather than on windows.
When doing so the correlation is very good (r² = 0.91 over the full range, data not shown). We further checked the local aspect of the measures by plotting both the local iRMSD and the local RMSD along several Prefab alignments. The 1aoh_1anu example is displayed in Figure 3 and clearly shows that the two measures closely track each other all along the alignment. While the iRMSD indicates two narrow peaks not found in the RMSD, both methods agree on the final series of peaks. We used LSQMAN to superpose the two structures and were satisfied to find that the peaks in the iRMSD curve indeed correspond to poorly superposed regions. Although the iRMSD seems to reveal these locations more sharply, it is fair to say that the standard RMSD could probably be parameterized to yield similar results (for instance by lowering the window size). Having established that the iRMSD behaves like a standard RMSD measure, we then estimated whether that measure is suitable for evaluating the relative accuracy of multiple sequence alignment packages. For that purpose, we aligned the Prefab datasets with six MSA methods, and for each of these methods we evaluated the Qscore and the Normalized iRMSD (NiRMSD, Equation 4), and estimated the fraction of alignments having a NiRMSD better or

Fig 2. Correlation between the iRMSD and a standard LSQMAN RMSD. (a) RMSD versus iRMSD of 587 Prefab reference alignments. The (i)RMSDs were only measured on the regions annotated as core in Prefab. The iRMSD is on the vertical axis and the regular RMSD, as obtained from LSQMAN, is on the horizontal axis. Each dot corresponds to one dataset. (b) RMSD versus iRMSD on 587 Prefab reference alignments. The (i)RMSDs were measured on the entire alignments. (c) RMSD versus iRMSD on 587 Prefab datasets, aligned by Dialign. The dataset is the same as before and the (i)RMSDs were measured on the entire alignments.

Fig 3. Local comparison of the iRMSD against a standard LSQMAN RMSD.
The comparison was made on the Prefab reference alignment of 1aohA_1anu. The two structures were superposed by LSQMAN (1aohA: violet, 1anu: blue). The alignment was then evaluated locally using either LSQMAN to measure the RMSD (blue line) or T-Coffee/iRMSD to measure the local iRMSD. The (i)RMSD values were plotted on the vertical axis against the alignment positions. Portions of the superposition corresponding to the peaks were extracted and encapsulated.

Table 1a. Average Qscore

Range    N    Dialign  Clustal  Muscle  TCoffee  ProbC.  MAFFT  PREFAB
0-20     243  0.32     0.34     0.43    0.44     0.48    0.49   -
20-40    171  0.80     0.83     0.86    0.87     0.88    0.88   -
40-100   173  0.96     0.96     0.97    0.98     0.97    0.98   -
Total    587  0.65     0.67     0.71    0.73     0.74    0.75   -

a) Average Qscore: Range is the range of identity of the considered Prefab datasets, as measured on the reference alignments. N is the number of Prefab datasets in each range. Dialign, ClustalW, Muscle, TCoffee, ProbCons and Mafft are the average Qscores measured on the alignments produced by these packages. The entries corresponding to the best performance in each category are underlined and in bold. The best Qscores are the highest.

Table 2a. Consistency between the NiRMSD and the Qscore (core regions)

Range    Npair   Consistent  Inconsistent
0-20     7290    0.86*       0.14*
20-40    5130    0.90*       0.10*
40-100   5190    0.94*       0.06*
Total    17610   0.90*       0.10*

a) Core regions: Range is the range of identity of the considered Prefab datasets, as measured on the reference alignments. Npair is the number of pairs on which the comparison was carried out. Consistent is the fraction of pairs for which the Qscore and the NiRMSD score were consistent. For the purpose of this table, two pairs were considered consistent whenever their Qscore differed by less than 1 percentage point and their NiRMSD by less than 0.05 Å. A binomial test was carried out on the results; entries marked with * have a p-value lower than 0.000001.
Table 1b. Average NiRMSD (core regions)

Range    N    Dialign  Clustal  Muscle  TCoffee  ProbC.  MAFFT  PREFAB
0-20     243  3.46     2.10     1.82    2.16     1.85    1.76   0.85
20-40    171  0.91     0.82     0.80    0.79     0.77    0.77   0.67
40-100   173  0.44     0.58     0.44    0.44     0.44    0.43   0.43
Total    587  1.83     1.28     1.11    1.25     1.12    1.08   0.67

b) Average NiRMSD: the labels are the same as in Table 1a. The measure is the average NiRMSD as measured on the core regions of the alignments. The Prefab column corresponds to the evaluation of the Prefab reference alignments. The best NiRMSD scores are the lowest.

Table 2b. Consistency between the NiRMSD and the Qscore (entire alignments)

Range    Npair   Consistent  Inconsistent
0-20     7290    0.79*       0.21*
20-40    5130    0.84*       0.16*
40-100   5190    0.84*       0.16*
Total    17610   0.82*       0.18*

b) Same as Table 2a, but with the NiRMSDs measured on the entire alignments.

Table 1c. Best NiRMSD fraction

Range    N    Dialign  Clustal  Muscle  TCoffee  ProbC.  MAFFT  PREFAB
0-20     243  0.02     0.10     0.05    0.09     0.06    0.10   -
20-40    171  0.36     0.36     0.46    0.56     0.57    0.54   -
40-100   173  0.86     0.89     0.89    0.92     0.89    0.91   -
Total    587  0.36     0.40     0.42    0.47     0.45    0.47   -

c) Best NiRMSD fraction: fraction of alignments having a NiRMSD better than or equal to the Prefab reference, as measured on the core regions. The labels are the same as in Table 1a.

equal to the Prefab reference (Best NiRMSD fraction), as measured on the core regions. The results (Tables 1a, b and c) are unambiguous and clearly show a high correlation between the Qscore, the average NiRMSD and the Best NiRMSD fraction. As expected, the Prefab reference alignments outperform every other method (Table 1b, Prefab), with a NiRMSD always lower than the rest, especially in the distant-homologue category (Table 1b, Prefab, [0-20]). The rankings suggested by each score are in broad agreement when considering equivalent lines in each table. We looked at the statistical significance of all these analyses. To do so, we considered every dataset individually and estimated the consistency between the Qscore and the NiRMSD measured on two alternative alignments.
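This consistency check can be sketched as follows. The tie tolerances follow the footnote of Table 2, and the binomial test assumes an equal probability of 0.5 for consistency and inconsistency; the function names and the treatment of ties as consistent are our own reading of the procedure, not the authors' code:

```python
import math

def consistent(q1, q2, n1, n2, q_tol=0.01, n_tol=0.05):
    """Do the Qscore and the NiRMSD agree on the ranking of two alignments?

    q1, q2: Qscores of alignments 1 and 2 (higher is better).
    n1, n2: NiRMSDs of alignments 1 and 2 (lower is better).
    Differences below the tolerances are treated as ties and counted as
    consistent (cf. the footnote of Table 2a).
    """
    dq = q1 - q2
    dn = n2 - n1  # flipped sign so that positive means alignment 1 is better
    if abs(dq) < q_tol or abs(dn) < n_tol:
        return True  # a tie on either score counts as consistent
    return (dq > 0) == (dn > 0)

def binomial_p_value(k, n, p=0.5):
    """One-sided P(X >= k) for X ~ Binomial(n, p): the probability of seeing
    at least k consistent pairs by chance under the null of no correlation."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))
```

Applying `consistent` to every pair of methods on every dataset, then `binomial_p_value` to the resulting counts, reproduces the structure of the significance analysis described here.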
For instance, given a dataset and two alignments (aln1 and aln2) generated by two different methods, the Qscore and the NiRMSD are consistent if they indicate the same relationship between the two alignments (e.g. aln1 better than aln2 according to both the Qscore AND the NiRMSD). This measure was used to analyze every possible pair of methods (Table 2a,b). The results show that the Qscore and the NiRMSD are highly correlated, with 90% consistency between the two measures on core regions and 82% when considering entire alignments. The correlation is not affected by the level of identity between the considered sequences. These figures were measured on more than 17,000 pairs of alignments. We checked these results for statistical significance, using a binomial test and assuming an equal probability of 0.5 for consistency and inconsistency. The results are highly significant in each category, with P-values systematically lower than 10^-6. These results confirm that the NiRMSD measure is at least as discriminative as Prefab.

The authors also thank Dr Philipp Bucher, who provided many of the original ideas through useful discussions.

REFERENCES

Do,C.B., Mahabhashyam,M.S., Brudno,M. and Batzoglou,S. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340. Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797. Goldsmith-Fischman,S. and Honig,B. (2003) Structural genomics: computational methods for structure analysis. Protein Sci., 12, 1813–1821. Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138. Huang,Y.M. and Bystroff,C. (2006) Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions. Bioinformatics, 22, 413–422. Katoh,K., Kuma,K., Toh,H. and Miyata,T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518.
Kleywegt,G.J. and Jones,T.A. (1994) Superposition. CCP4/ESF-EACBM Newsletter Protein Crystallog., 31, 9–14. Kleywegt,G.J. and Jones,T.A. (1999) Software for handling macromolecular envelopes. Acta Crystallogr. D Biol. Crystallogr., 55, 941–944. Kolodny,R., Koehl,P. and Levitt,M. (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol., 346, 1173–1188. Lackner,P., Koppensteiner,W.A., Sippl,M.J. and Domingues,F.S. (2000) ProSup: a refined tool for protein structure alignment. Protein Eng., 13, 745–752. Lesk,A.M. and Chothia,C. (1980) How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol., 136, 225–270. Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469–2471. Morgenstern,B. (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211–218. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. O'Sullivan,O., Suhre,K., Abergel,C., Higgins,D.G. and Notredame,C. (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol., 340, 385–395. O'Sullivan,O., Zehnder,M., Higgins,D., Bucher,P., Grosdidier,A. and Notredame,C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19 (Suppl. 1), i215–i221. Raghava,G.P., Searle,S.M., Audley,P.C., Barber,J.D. and Barton,G.J. (2003) OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4, 47. Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Thompson,J., Higgins,D. and Gibson,T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690. Thompson,J.D., Koehl,P., Ripp,R. and Poch,O. (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins, 61, 127–136. Van Walle,I., Lasters,I. and Wyns,L. (2005) SABmark: a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21, 1267–1268. Wallace,I.M., Blackshields,G. and Higgins,D.G. (2005) Multiple sequence alignments. Curr. Opin. Struct. Biol., 15, 261–266.

CONCLUSION

We describe the iRMSD, a measure with all the advantages and properties of a standard RMSD that does not require any structure superposition. A simple normalization makes it possible to use the iRMSD for evaluating the accuracy of structure-based sequence alignments. This measure, named the NiRMSD, was applied to the alignments produced by six popular multiple sequence alignment packages. In 90% of the cases, the NiRMSD measure was in agreement with the Prefab ranking (Qscore). These findings, highly significant from a statistical point of view, suggest the suitability of this new measure for evaluating sequence alignment accuracy whenever structural information is available. We also expect that the method can easily be extended to sequences having a close homologue with a known structure. Future developments will involve applying the iRMSD to multiple structure alignment analysis. We are also planning to use the NiRMSD measure to compare structure alignment packages and check whether some methods clearly outperform the others or whether a structure alignment meta-method should be designed instead. Further refinements could also involve exploring the capacity of the iRMSD measure to automatically identify and exclude unalignable positions.
ACKNOWLEDGEMENTS

The development of this project was supported by CNRS (Centre National de la Recherche Scientifique), Sanofi-Aventis Pharma SA, Marseille-Nice Génopole and the French National Genomic Network (RNG). We thank Prof. Jean-Michel Claverie (head of IGS) for useful discussions and material support.

BIOINFORMATICS Vol. 17 no. 0 2001 Pages 1–3

Mocca: semi-automatic method for domain hunting

Cédric Notredame, Information Génétique et Structurale, CNRS-UMR 1889, 31 Ch. Joseph Aiguier, 13 402 Marseille, France

Received on Month xx, 2000; revised and accepted on Month xx, 2000

ABSTRACT
Motivation: Multiple OCCurrences Analysis (Mocca) is a new method for repeat extraction. It is based on the TCoffee package (Notredame et al., JMB, 302, 205–217, 2000). Given a sequence or a set of sequences, and a library of local alignments, Mocca extracts every segment of sequence homologous to a pre-specified master. The implementation is meant for domain hunting and makes it fast and easy to test for new boundaries or extend known repeats in an interactive manner. Mocca is designed to deal with highly divergent protein repeats (less than 30% amino acid identity) of more than 30 amino acids.
Availability: Mocca is available on request (cedric.notredame@gmail.com). The software is free of charge and comes along with complete documentation.

Given some approximate information concerning the whereabouts of one of the repeats (the master repeat), Mocca allows the user to tune the parameters describing the repeat family (i.e. start position, length of the master repeat and stringency of the search) and to extract the other occurrences of that repeat within the dataset. The procedure is fast and simple.

INTRODUCTION
Many proteins consist of separately evolved, independent structural units called modules or domains.
The great diversity of protein functions is partly due to the vast number of possibilities to arrange a finite number of those basic units (Chothia, 1992). It is generally agreed that a domain is a self-folding unit made of a minimum of 25 amino acids (Bairoch et al., 1997; Corpet et al., 1998). Many of these domains appear as homologous subsequences repeated within a sequence or within a set of sequences, hence the importance of repeat identification in the course of domain hunting. Many tools exist for discovering and extracting these repeats; without being exhaustive, one can cite PSI-BLAST (Altschul et al., 1997), dot matrices (Junier and Pagni, 2000), Repro (Heringa and Argos, 1993) and the Gibbs Sampler (Lawrence et al., 1993). More recently, Heger and Holm developed a method meant to scan databases for repeats without manual intervention (Heger and Holm, 2000). These automatic methods all share the same drawback: while none of them is 100% accurate, they give the user little scope for testing their own hypotheses in a seamless manner. Multiple OCCurrences Alignment (Mocca) addresses that specific problem.

© Oxford University Press 2001

METHODS
Mocca uses a pair-wise sequence alignment algorithm (Durbin et al., 1998). The cost associated with the alignment of each pair of residues uses the 'library extension' developed for T-Coffee (Notredame et al., 1998, 2000). Figure 1 outlines the strategy used to generate the T-Coffee scoring scheme. Firstly, a primary library is compiled; it contains a series of local alignments obtained using Lalign, an implementation of the Sim algorithm (Huang and Miller, 1991). Given two sequences, Lalign extracts the N top-scoring non-overlapping local alignments. We used a modified version that compares two sequences (or a sequence with itself) and extracts every top-scoring alignment having a length of more than ten residues and an average level of identity higher than 30%.
Lalign reports each alignment along with a score that indicates its statistical significance. In our primary library, such local alignments appear as a series of pairs of residues, where each pair receives a weight equal to the score of the alignment it comes from. Given a set of N sequences, the library contains the result of all the possible pair-wise comparisons (including the self-comparisons). This library is fed into T-Coffee to generate the position-specific scoring scheme using the 'library extension' algorithm (Notredame et al., 2000). In Mocca, a prerequisite to repeat extraction is the estimation of at least one basic repeat unit among the sequences being analysed (the master repeat). In the context of this work, we made the estimation using Dotlet, a Java-based dot matrix method (Junier and Pagni, 2000). The master repeat is a sub-string selected within the sequence(s) used to build the library. Mocca extracts every sub-string homologous to the master in a single pass over the target sequences. It is the library extension that makes it possible for a single repeat to 'recognize' each of its homologues, even the distant ones. The main computational requirement is the construction of the Lalign library, O(N²L²); the motif extraction itself requires little time (12 s on an IRIX O2 station for 20 sequences totalling 5000 residues). If the position of one of the repeats is known, the procedure can also be run automatically from the command line. It is recommended to use Mocca in conjunction with other means for the initial estimation of the repeat boundaries (PSI-BLAST, Altschul et al., 1997; Dotlet, Junier and Pagni, 2000; Dotter, Sonnhammer and Durbin, 1995; …). Our tests show that Mocca can properly deal with sets of repeats whose multiple alignment indicates less than 15% average identity. While we currently use Lalign as a source of local information, any other sensible source could be considered.
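The weighted primary library described above can be sketched as follows (a simplified illustration: Lalign itself is not reimplemented, and the input format, a list of (score, residue-pair-list) tuples, is our own; the length and identity filtering is assumed to have happened upstream):

```python
from collections import defaultdict

def build_primary_library(local_alignments):
    """Turn local alignments into a weighted residue-pair library.

    local_alignments: list of (score, pairs) tuples, where pairs is a list
    of (seq1_residue, seq2_residue) index pairs coming from one Lalign-style
    local alignment. Each residue pair receives the score of the alignment
    it comes from; weights accumulate when the same pair occurs in several
    local alignments.
    """
    library = defaultdict(float)
    for score, pairs in local_alignments:
        for pair in pairs:
            library[pair] += score
    return dict(library)
```

A library of this shape, one weight per residue pair, is exactly the kind of input that could come from sequence comparison or from any other source of weighted local correspondences.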
For instance, structural information could easily be added to our procedure, using off the shelf libraries of local structural similarities such as the Dali Domain Dictionary (Holm and Sander, 1998). The input format of Mocca is straightforward and well documented. Mocca is a refinement tool for the discovery and the establishment of new domains. If the master repeat is replaced with a profile or a collection of known characterized repeats, Mocca could also be used to improve the model of a given repeat family and extend the predictive power of its profiles. Fig. 1. Layout of the Mocca strategy. The main steps required to extract a repeat with Mocca method are shown. Square blocks designate procedures while rounded blocks indicate data structures. ACKNOWLEDGEMENTS The author wishes to thank the following people: Des Higgins for very helpful comments. Jaap Heringa, Philipp Bucher and Kay Hoffmann for useful discussions and advice at an early stage of the project, Hiroyuki Ogata for helpful comments on the program. REFERENCES Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. Bairoch,A., Bucher,P. and Hofmann,K. (1997) The PROSITE database, its status in 1997. Nucleic Acids Res., 25, 217–221. Chothia,C. (1992) Proteins: 1000 families for the molecular biologist. Nature, 357, 543–544. Corpet,F., Gouzy,J. and Kahn,D. (1998) The ProDom database of protein domain families. Nucleic Acids Res., 26, 323–326. Durbin,R., Eddy,S., Krogh,A. and Mitchinson,G. (1998) Biological Sequence Analysis. 1 vols, Cambridge University Press, Cambridge. Heger,A. and Holm,L. (2000) Rapid automatic detection and alignment of repeats in protein sequences. Proteins, 41, 224–237. Heringa,J. and Argos,P. (1993) A method to recognise distant repeats in protein sequences. Proteins: Struct. Funct. Genet., 17, 391–411. Holm,L. and Sander,C. 
(1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96. makes it possible for a single repeat to ‘recognize’ each of its homologues (even the distant ones). The extraction process relies on a very efficient dynamic programming procedure known as repeated matches (Durbin et al., 1998). This algorithm reports a series of non-overlapping sub-strings each of them having an alignment to the master associated with a score higher than some pre-specified threshold T h. T h is empirically set to be a function of the maser repeat length (L): Th = S ∗ L S has a value between 0 and 1. By default, S = 0.05, but its value can be modified interactively. Two other parameters can also be modified to increase sensitivity and accuracy: the gap opening penalty and the gap extension. Mocca is part of the T-Coffee package. It is written in Perl and ANSI C. It runs on any UNIX or LINUX platform. It is available free of charge along with documentation. Copies can be obtained on request by sending an e-mail to cedric.notredame@gmail.com. The main com2 Mocca and domain hunting Huang,X. and Miller,W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math., 12, 337–357. Junier,T. and Pagni,M. (2000) Dotlet: diagonal plots in a web browser. Bioinformatics, 16, 178–179. Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S., Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214. Notredame,C., Holm,L. and Higgins,D.G. (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14, 407–422. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: A novel algorithm for multiple sequence alignment. JMB, 302, 205–217. Sonnhammer,E.L. and Durbin,R. (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene, 167, GC1-10. 
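The repeated-matches extraction described in the Methods above can be sketched as follows. This is an illustrative re-implementation of the recursion given by Durbin et al. (1998), not Mocca's own C code: the scoring parameters are arbitrary toy values, and in Mocca the threshold would be derived as Th = S × L (S = 0.05 by default) rather than passed in directly.

```python
# Sketch of the "repeated matches" dynamic programming of Durbin et al. (1998):
# report non-overlapping segments of a target x, each aligning to (part of)
# the master y with a score above the threshold T. Scores are toy values.
def repeated_matches(x, y, match=2.0, mismatch=-1.0, gap=2.0, T=1.0):
    """Return (total_score, segments), segments being 0-based (start, end)
    slices of x, one per reported match."""
    n, m = len(x), len(y)
    NEG = float("-inf")
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    P = [[None] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0.0
    for i in range(1, n + 1):
        # column 0 is the "unmatched" state: carry it over, or close a match
        # (closing subtracts T, so only matches scoring above T contribute)
        F[i][0], P[i][0] = F[i - 1][0], ("stay",)
        for j in range(1, m + 1):
            if F[i - 1][j] - T > F[i][0]:
                F[i][0], P[i][0] = F[i - 1][j] - T, ("close", j)
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j], P[i][j] = max(
                (F[i][0], ("open",)),              # a new match may start here
                (F[i - 1][j - 1] + s, ("diag",)),  # align x[i-1] with y[j-1]
                (F[i - 1][j] - gap, ("up",)),      # gap in the master
                (F[i][j - 1] - gap, ("left",)),    # gap in the target
            )
    # virtual end cell: stay unmatched, or close one final match
    i, j, end, segments = n, 0, None, []
    total = F[n][0]
    for jj in range(1, m + 1):
        if F[n][jj] - T > total:
            total, j, end = F[n][jj] - T, jj, n
    while i > 0 or j > 0:
        move = P[i][j]
        if move[0] == "stay":
            i -= 1
        elif move[0] == "close":
            i -= 1
            j, end = move[1], i
        elif move[0] == "open":
            j = 0
        elif move[0] == "diag":
            i, j = i - 1, j - 1
        elif move[0] == "up":
            i -= 1
        else:  # "left"
            j -= 1
        if j == 0 and end is not None:   # we just left a matched segment
            segments.append((i, end))
            end = None
    segments.reverse()
    return total, segments
```

With two exact copies of a master "AB" in the target "ABXAB", the sketch reports the two non-overlapping segments, each paying the threshold once.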
Optimization of ribosomal RNA profile alignments

Motivation: Large alignments of ribosomal RNA sequences are maintained at various sites. New sequences are added to these alignments using a combination of manual and automatic methods. We examine the use of profile alignment methods for rRNA alignment and try to optimize the choice of parameters and sequence weights.

Results: Using a large alignment of eukaryotic SSU rRNA sequences as a test case, we empirically compared the performance of various sequence weighting schemes over a range of gap penalties. We developed a new weighting scheme which gives most weight to the sequences in the profile that are most similar to the new sequence. We show that it gives the most accurate alignments when combined with a more traditional sequence weighting scheme.

Availability: The source code of all software is freely available by anonymous ftp from chah.ucc.ie in the directory /home/ftp/pub/emmet, in the compressed file PRNAA.tar.

Contact: emmet@chah.ucc.ie, des@chah.ucc.ie

Introduction

Ribosomal RNA (rRNA) sequences are widely used to estimate the phylogenetic relatedness of groups of organisms (e.g. Sogin et al., 1986; Pawlowski et al., 1996), especially those of the small subunit (SSU rRNA). The SSU rRNA has been sequenced from thousands of different species, and large alignments are maintained at several sites (Maidak et al., 1997; Van de Peer et al., 1997). The alignments are large and complex, and the addition of new sequences is a demanding task, either for the alignment curators or for individuals who wish to align new sequences with existing aligned sequences. In simple cases, automatic alignment programs such as Clustal W (Thompson et al., 1994a) may be used to align groups of closely related sequences or as a prelude to manual refinement.
There may be large stretches of unambiguous alignment with high sequence identity which may be useful for phylogenetic purposes. The fully automated, accurate alignment of rRNA sequences remains a difficult problem, however. In principle, one can use profile alignment methods (Gribskov et al., 1987), which use dynamic programming algorithms (Needleman and Wunsch, 1970; Gotoh, 1982), to align a new sequence against an existing 'expert' alignment. For example, one could take an alignment of all SSU rRNA sequences from one of the rRNA collections and use it as a guide, aligning each new sequence in turn and treating the large alignment as a profile. This approach has the advantage of simplicity and speed, but the final accuracy may be limited by the lack of any ability to use secondary-structure information. The RNALIGN approach (Corpet and Michot, 1994) and the stochastic context-free grammar approach (Eddy and Durbin, 1994; Sakakibara et al., 1994) provide elegant methods for the alignment of rRNA sequences, taking both primary sequence and secondary structure into account. These methods, however, are very demanding in computer resources and cannot deal easily with pseudoknots, so their immediate application to the alignment of SSU rRNA sequences is not trivial. In this paper, we examine, empirically, the effectiveness of profile alignment methods for the alignment of rRNA sequences. We remove test sequences from existing 'expert' alignments and measure the extent to which they can be realigned with the original alignment automatically. We use the eukaryotic SSU rRNA sequences from Van de Peer et al. (1997) as a test case. For a range of test sequences, we measure the number of positions that can be correctly realigned over a range of different parameters (gap opening and gap extension penalties). Sequence weighting has been shown to increase the reliability of profile alignments using amino acid sequences (Thompson et al., 1994b).
This can be used to give less weight to clusters of closely related sequences and increased weight to sequences with no close relatives, in order to counteract the effect of unequal sampling across a phylogenetic tree of possible sequences. We examine the effectiveness of one commonly used scheme (Thompson et al., 1994b). We also propose a new weighting scheme which is designed to give increased weight to those sequences in the profile (reference alignment) which are closest (highest sequence identity) to the new sequence being aligned. If a new mammalian sequence is being aligned, for example, it makes most sense to give a high weight to other mammalian sequences and decreasing weights to sequences that are more and more distantly related.

BIOINFORMATICS © Oxford University Press

Some sections of SSU rRNA sequences are from regions whose secondary structure is conserved across many species. These conserved, 'core', regions are relatively easy to align with high accuracy, but are interspersed with less conserved regions that may be very difficult to align. We empirically determine which regions of the eukaryotic reference alignment can be aligned with high accuracy by a simple jack-knife experiment. We remove each sequence, one at a time, and try to realign it with the rest. It is then a simple matter to count how often each nucleotide of each sequence is correctly realigned.
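The per-residue bookkeeping used in this jack-knife can be sketched as follows. This is a minimal sketch with invented function names (the authors' C implementation is not shown in the paper): a residue counts as correctly realigned when it lands in the same column it occupied in the reference alignment.

```python
# Sketch only: scoring a realignment against the reference alignment.
def residue_columns(aligned_row, gap="-"):
    """Map each residue of one aligned row, in sequence order, to the
    alignment column it occupies."""
    return [col for col, ch in enumerate(aligned_row) if ch != gap]

def realignment_accuracy(reference_row, realigned_row, gap="-"):
    """Fraction of residues placed in the same column as in the reference.
    Both rows carry the same sequence, so the residue counts agree."""
    ref = residue_columns(reference_row, gap)
    new = residue_columns(realigned_row, gap)
    assert len(ref) == len(new), "rows must contain the same sequence"
    correct = sum(1 for a, b in zip(ref, new) if a == b)
    return correct / len(ref)
```

Summing the same per-residue comparison over columns instead of rows gives the per-column reliability used later to delimit the conserved core regions.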
This gives a definition of conserved core regions that is purely empirical and that can be used to delimit regions of the alignment which can safely be used in phylogenetic research. Finally, we examine the effect of the G+C content of each sequence on the accuracy of alignment. Sequences of high or low G+C may be expected to be more difficult to align than those with more balanced nucleotide compositions.

System and methods

Small subunit ribosomal RNA

An alignment of eukaryotic, nuclear SSU rRNA sequences (dated May 6, 1997) was obtained from the World Wide Web server at http://www-rrna.uia.ac.be/ssu/index.html (Van de Peer et al., 1997). After removal of the columns consisting only of gaps, the two incomplete sequences of Butomus umbellatus and the unaligned sequence of Babesia bovis, the alignment contains 1517 sequences and is 5370 characters long. Individual sequences vary widely in length, from <1300 nucleotides to >2500. Sixteen test sequences were removed from, and realigned with, the reference alignment in order to measure the accuracy with which it was possible to recreate their original alignment. The sequences used were Drosophila melanogaster, Xenopus laevis, Homo sapiens, Caenorhabditis elegans, Saccharomyces cerevisiae, Oryza sativa, Dictyostelium discoideum, Euglena gracilis, Ammonia beccarii, Physarum polycephalum, Entamoeba histolytica, Vahlkampfia lobospinosa, Giardia sp., Naegleria gruberi, Hexamita sp. and Trypanosoma brucei. These sequences were chosen based on a phylogenetic tree of all the sequences in the alignment, in order to give a spread of test cases over a wide range of different positions in the tree. Re-alignment was carried out over a range of gap penalties and using a number of sequence weighting schemes, as described below.

Dynamic programming

The reference alignment was converted into a profile (Gribskov et al., 1987) which contains information on the frequency of each residue, and of gaps, at each position. The test sequences were aligned with this profile using a dynamic programming algorithm (Needleman and Wunsch, 1970). We used Gotoh's algorithm (Gotoh, 1982) and maximized the similarity between the sequence and the profile. A homogeneous column in the profile (just one of the four residues, with no gaps) will get a score of 1.0 when aligned with the same residue in the test sequence and a score of 0 otherwise. Other columns score in proportion to the frequency of each of the four residue types. At positions in the profile where one or more of the sequences has a gap, gaps were treated as a class of residue for the frequency calculations. Other methods have been proposed for generating profiles using the natural logarithms of residue frequencies, which may be normalized by overall residue frequencies to give log-odds scores (see Henikoff and Henikoff, 1996 for a review). We carried out some tests using the latter scheme and found that its performance was comparable although slightly inferior to that obtained using simple frequencies. Therefore we only present results obtained using the frequencies.

Gap penalties

A range of gap opening and extension penalties was used in alignment generation. For each test sequence and each weighting scheme, a total of 81 alignments was carried out: gap opening penalties ranging from 1 to 9 in increments of 1, and gap extension penalties ranging from 0.1 to 0.9 in increments of 0.1. This range of ratios between gap penalties and residue match scores was chosen as it encompasses values empirically shown to give alignments of biological relevance. Terminal gaps were penalized solely with an extension penalty. Position-specific gap opening penalties were derived from the frequency of gaps at each position along the alignment. At each position, a value equal to the number of residues (non-gap characters) in the column divided by the number of sequences in the alignment was derived. This value was then multiplied by the gap opening penalty, taken from the range above, to give a specific gap opening penalty at each position. This gives gap opening penalties which are higher at positions mostly occupied by residues than at positions mostly occupied by gaps.

Sequence weighting

By default, each sequence in the existing alignment will have an equal effect on the alignment of new sequences with the profile. If additional information is available concerning the relationships of the sequences within the alignment to each other and to the sequence being aligned, this may not be optimal. For example, if a new sequence is identical to a sequence already in the alignment, the correctly aligned position of each residue in the new sequence could be deduced solely from that one identical sequence, and no information concerning the other sequences is necessary. Further, sampling bias can lead to an unequal representation of taxa within the alignment (e.g. there might be very many sequences from some taxa and very few from others), and it is possible to use sequence weighting to correct for this also.

E.A.O'Brien, C.Notredame and D.G.Higgins
Fig. 1. Tree of the sequences that were used as test cases. The weights for these sequences under different weighting schemes are given in Table 1.

Three different weighting schemes were applied to the sequences in the SSU rRNA alignment, and compared with the default of equal weights. The first weighting scheme, referred to as tree-based weights, is based on a phylogenetic tree of the sequences in the alignment. A neighbour-joining tree (Saitou and Nei, 1987) of all the sequences in the profile was generated using the DNADIST and NEIGHBOR programs of the PHYLIP package (Felsenstein, 1989). Weights were then derived from the branch lengths as described by Thompson et al. (1994b). These weights are then normalized to have a mean of 1.0.
This gives a total weight for the profile equal to that where each sequence is weighted equally, which is necessary in order to keep the effects of changing gap penalties congruent across the different schemes. The general effect of these tree-based weights is to downweight sequences with many close relatives, in order to prevent the more densely populated regions of the tree exerting a disproportionate effect on the alignment of sequences from other regions of the tree.

The second weighting scheme is based on the level of similarity between the sequence being aligned and each individual sequence in the alignment, and is referred to as identity-based weighting. The new sequence is first aligned with the profile using equal weights. A distance is then calculated between the new sequence and each other sequence in the alignment, equal to the mean number of differences per site in this initial approximate alignment. This is the percent difference divided by 100; there is no correction for multiple hits or for unequal rates of transition and transversion. The reciprocal of this distance is used as a weight for each sequence, and these weights are again normalized to give a mean of 1.0. This weighting scheme has the effect of upweighting sequences more similar to the sequence being added relative to those that are more distantly related. The upweighting effect increases as the sequences become more similar to the sequence being aligned.

The third scheme is a combination of these weighting schemes, in which the weight derived for each sequence based on branch lengths is multiplied by the weight derived from sequence identities, and the values are again renormalized. This scheme is referred to as combination weights. Table 1 shows the values given by the various weighting schemes for the case shown in the example tree in Figure 1. The tree-based weights are independent of the new sequence that is to be added, being derived wholly from the structure of the existing data.
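The identity-based and combination weighting described above can be sketched as follows. The function names are ours, and the sketch assumes every distance is non-zero (a sequence identical to one already in the alignment would need special handling):

```python
# Sketch only: the renormalization and the identity/combination weights
# described in the text; names are assumed, not from the authors' code.
def normalize(weights):
    """Rescale so the mean weight is 1.0, keeping the total weight of the
    profile equal to that obtained with equal weighting."""
    mean = sum(weights.values()) / len(weights)
    return {name: w / mean for name, w in weights.items()}

def identity_weights(distances):
    """distances: mean differences per site (0..1, assumed non-zero) between
    the new sequence and each profile sequence; weight = 1/distance,
    renormalized to a mean of 1.0."""
    return normalize({name: 1.0 / d for name, d in distances.items()})

def combination_weights(tree_w, ident_w):
    """Product of tree-based and identity-based weights, renormalized."""
    return normalize({name: tree_w[name] * ident_w[name] for name in tree_w})
```

With these definitions, close relatives of the new sequence are upweighted by the identity term, while the tree term still damps densely sampled taxa.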
Weights are calculated using the method of Thompson et al. (1994b) and then renormalized to give a mean of 1, leaving the values shown. The identity-based weights are derived by taking the distance of each sequence in the tree from the new sequence, defined as the mean number of differences per aligned pair of residues, ignoring any pairs with a gap in either sequence. The reciprocals of these values are renormalized around 1 to give the figures shown. For the final set of combination weights, the product is taken of the weights in each of the preceding columns and again renormalized to give a mean of 1.

Table 1. The weights assigned to the sequences in the test tree shown in Figure 1 when the sequences Mus musculus and Plasmodium gallinaceae were added

Sequence                    (a)     (b)     (c)      (d)     (e)     (f)
Ammonia beccarii            1.000   0.746   0.273    1.256   0.379   0.991
Caenorhabditis elegans      1.000   0.974   0.289    1.008   0.522   1.038
Dictyostelium discoideum    1.000   0.875   0.250    1.049   0.406   0.968
Drosophila melanogaster     1.000   0.727   0.349    1.054   0.470   0.809
Entamoeba histolytica       1.000   1.194   0.225    0.984   0.500   1.241
Euglena gracilis            1.000   1.519   0.198    0.809   0.557   1.298
Giardia sp.                 1.000   1.340   0.193    0.773   0.481   1.094
Hexamita sp.                1.000   1.266   0.206    0.854   0.484   1.141
Homo sapiens                1.000   0.411   10.628   1.053   8.088   0.456
Naegleria gruberi           1.000   1.212   0.204    0.942   0.459   1.205
Oryza sativa                1.000   0.511   0.390    1.235   0.370   0.667
Physarum polycephalum       1.000   1.435   0.205    0.856   0.547   1.298
Saccharomyces cerevisiae    1.000   0.516   0.377    1.302   0.361   0.708
Trypanosoma brucei          1.000   1.488   0.211    0.846   0.583   1.329
Vahlkampfia lobospinosa     1.000   1.398   0.196    0.889   0.508   1.313
Xenopus laevis              1.000   0.383   1.798    1.082   1.278   0.438

Columns represent the following schemes: (a) equal sequence weights, (b) tree-based sequence weights, (c) identity-derived weights for each sequence for the alignment of Mus musculus, (d) identity-derived sequence weights for each sequence for the alignment of Plasmodium gallinaceae, (e) combination of tree and
identity-derived weights for Mus musculus, (f) combination of tree and identity-derived weights for Plasmodium gallinaceae.

For each of the three defined weighting schemes and the default of equal weights, alignments were generated using position-specific gap-opening penalties across the range of gap extension penalties and base gap opening penalties described above. This procedure was repeated for each of the test sequences. The number of residues correctly placed in each alignment was determined by comparison with the sequence as originally aligned in the reference alignment, and this was then divided by the total number of residues in the sequence to give a percentage score for the alignment. From the scores for the alignments across the range of gap opening and gap extension penalties for each test case, the gap penalties giving the best performance across all or most of the test cases were obtained.

Results

The performance of a set of weights was judged by its efficacy across the range of gap opening and gap extension penalties used. The peak score and the range of gap penalties giving a comparable score were taken into account in making this judgement (Table 2). For scoring purposes, each residue is counted as distinct, and is only considered correctly aligned if it is in the same position as the same residue in the reference sequence. The score for a sequence is counted as the percentage of the total number of residues in the sequence that have been correctly realigned. The main results are presented in Table 2. In the first column, the percentage alignment accuracy scores are given for each of the 16 test cases. These scores are the best obtained across the range of gap opening and extension penalties with no sequence weights. The scores are low, ranging from 43% (Euglena) up to 88% (Oryza). The addition of position-specific gap penalties has a dramatic effect.
The scores all increase by about 10–15%, which represents several hundred additional residues of the original sequences being correctly aligned. The use of sequence weights yields further improvements, although not as dramatic. It should be noted that an improvement in score of just 1% is the equivalent of 20 residues in a molecule of 2000 nucleotides. We only give the peak scores from across the full range of gap opening and extension penalties. These were all obtained with a gap opening penalty of between 5.0 and 7.0 and a gap extension penalty of either 0.1 or 0.2.

Implementation

Programs were developed and/or run on DEC Alpha workstations running DEC UNIX. All new code was written in the C programming language and is freely available by anonymous FTP (log in as anonymous to chah.ucc.ie and transfer the compressed tar archive PRNAA.tar). The code is not designed for portability, and users will have to download their own rRNA alignments and build their own profiles; a JAVA version of the programs is being developed which will be used to provide future access to all the methods via the Internet.

Table 2. The highest % identity between the reference alignment and the realigned sequence obtained using each of the weighting schemes

Sequence          (a)     (b)     (c)     (d)     (e)
A.beccarii        71.65   84.19   83.66   84.05   83.96
C.elegans         69.26   83.98   83.98   86.99   87.84
D.discoideum      64.42   78.95   78.95   79.59   79.06
D.melanogaster    70.14   82.72   82.97   81.11   84.02
E.histolytica     55.68   73.50   74.83   75.04   78.17
E.gracilis        43.12   60.22   60.22   60.22   61.08
Giardia sp.       55.00   73.89   73.96   76.81   77.29
Hexamita sp.      56.13   73.10   73.61   78.39   77.16
H.sapiens         79.88   91.01   92.88   91.49   92.30
N.gruberi         50.37   63.60   63.74   67.81   67.86
O.sativa          88.85   97.08   97.13   96.69   97.35
P.polycephalum    53.62   65.02   64.66   68.64   67.52
S.cerevisiae      86.71   93.94   94.55   93.38   94.10
T.brucei          47.62   62.86   63.39   64.77   65.04
V.lobospinosa     46.23   56.20   55.69   56.20   58.96
X.laevis          82.47   93.59   95.18   94.25   95.07

(a) Fixed gap penalties and equal sequence weights, (b) position-specific gap penalties and equal sequence weights, (c) position-specific gap penalties and identity-based weights, (d) position-specific gap penalties and tree-based weights, (e) position-specific gap penalties and combination weights. The underlined values are the absolute maximum scores obtained for each sequence.

Table 3. Alignment percentage accuracy scores for various weighting schemes and gap penalties

Gap extension penalty (a) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (b) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

58 58 59 58 58 58 58 57 57
59 59 60 59 59 59 60 60 60
60 61 62 60 60 61 61 61 61
62 62 63 61 61 62 62 62 62
62 62 63 63 63 63 62 62 61
61 62 63 63 63 63 62 62 61
61 62 63 63 63 63 62 62 61
61 62 63 63 63 63 62 62 61
61 62 63 63 63 63 62 62 61

51 50 51 51 51 51 52 51 51
53 53 53 54 53 53 53 54 54
55 54 55 54 54 54 54 54 54
56 55 55 55 55 55 55 55 55
56 55 55 55 55 55 55 55 55
56 55 55 55 55 55 55 55 55
56 55 55 55 55 55 55 55 55
56 55 55 55 55 55 55 55 55
56 55 55 55 55 55 55 55 55

Cont....
46 32 16 13 4 2 2 1 1 47 32 17 12 4 2 2 1 1 47 33 17 12 5 2 2 1 1 47 33 17 12 5 2 2 1 1 47 33 16 12 5 2 2 1 1 47 33 16 12 5 2 2 1 1 47 33 16 12 5 2 2 1 1 47 33 16 12 5 2 2 1 1 47 33 16 12 5 2 2 1 1 45 31 17 10 6 5 4 3 3 46 33 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 46 32 16 12 6 5 4 3 3 Trypanosoma brucei gap opening penalty 1 2 3 4 5 6 7 8 9 Vahlkampfia lobospinosa gap opening penalty 1 2 3 4 5 6 7 8 9 336 rRNA profiles Table 3. Continued Gap extension penalty (c) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (d) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (e) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 62 60 61 60 60 59 59 59 59 64 62 62 62 60 61 61 61 62 64 63 63 62 62 63 62 62 62 65 64 64 64 63 63 63 62 62 65 64 64 64 64 63 63 62 62 65 64 64 64 64 62 63 62 62 65 64 64 64 64 62 63 62 62 65 63 64 64 64 63 63 62 62 65 63 64 64 64 63 63 62 62 56 56 55 55 55 55 55 55 55 58 57 57 57 57 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 58 58 57 57 58 57 58 57 57 62 61 60 61 60 59 59 59 59 63 63 63 62 61 61 61 61 61 65 64 63 63 62 62 61 61 61 65 64 64 64 64 63 62 62 61 64 64 64 64 64 63 61 61 61 64 64 64 64 64 63 61 61 61 64 64 64 64 64 63 61 61 61 64 64 64 64 64 63 61 61 61 64 64 64 64 64 63 61 61 61 51 50 51 51 51 51 52 51 51 53 53 53 54 53 53 53 54 54 55 54 55 55 54 54 54 54 54 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 58 59 59 58 58 58 58 57 57 59 59 60 58 59 59 60 60 60 60 61 62 60 60 61 61 61 61 62 62 63 61 61 61 62 62 62 62 62 63 63 63 63 62 62 62 61 62 63 63 63 63 62 62 62 61 62 62 62 62 63 62 61 61 61 62 62 62 62 63 62 61 61 61 62 62 62 62 63 62 61 61 51 50 51 51 51 51 52 51 51 53 53 53 54 54 53 53 54 54 55 54 54 54 54 54 54 54 54 56 55 55 
55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55 56 55 55 55 55 55 55 55 55

Trypanosoma brucei gap opening penalty; Vahlkampfia lobospinosa gap opening penalty

Italics represent those regions at or above the highest score attainable with equal sequence weights. Underlining represents the highest score attained across all the different parameters. Parameter sets are: (a) fixed gap penalties and equal sequence weights, (b) position-specific gap penalties and equal sequence weights, (c) position-specific gap penalties and identity-based sequence weights, (d) position-specific gap penalties and tree-derived sequence weights, (e) position-specific gap penalties and weights derived from a combination of tree-based and identity-based weights.

In nine out of the 16 test cases, the single best alignment score generated across the ranges of gap penalties was obtained using the combined weights (the last column of Table 2). In three of the remaining cases, tree-based weights give the best performance (column d). The identity-based weights give the highest score in three cases, and Ammonia beccarii is aligned most accurately with equal weights. Both identity-based and tree-based methods of sequence weighting are shown to improve over equal weights in most cases, with the combination of both giving the best overall performance.

Two examples are shown in detail in Table 3, where the scores for all values of gap opening and gap extension penalties are given for each weighting scheme for just two of the test cases: Vahlkampfia lobospinosa and Trypanosoma brucei. In both cases, the results with uniform gap penalties, shown in row (a), are very poor and depend strongly on the exact values of the parameters. There is a huge improvement in row (b), where the values for position-specific gap penalties are shown.
Here, the values are much higher than in row (a) and there is almost no dependence on the exact values chosen for the gap penalties. In the case of Vahlkampfia there is no noticeable difference between the use of tree-based or identity-based weights [the results are shown in rows (c), (d) and (b)]. Use of the combined weighting scheme, as seen in row (e), gives a consistent improvement, showing increase of 2% across the entire range of gap penalties. In the case of Trypanosoma the relative performance of each weighting scheme is more dis- 337 E.A.O’Brien, C.Notredame and D.G.Higgins tinct. In comparing identity weights to equal weights in this case, there is improvement for some values of gap penalty. The effect of using tree-based weights is to produce improvement across a larger range of gap penalties, particularly for gap extension penalties <0.3. The combination of the two weighting schemes again shows a synergistic effect, with a further increase visible across the range of gap penalties. The values of gap opening and gap extension penalties giving the maximum scores for each test case are given in Table 4. These are the optimum parameters when using the combined weighting scheme with position specific gap penalties. They all fall in a very narrow range. Table 4. Gap opening and extension penalties giving optimum alignment scores for each test case using combined weights Gap opening A.beccarii C.elegans D.discoideum D. melanogaster E.histolytica E.gracilis Giardia sp. Hexamita sp. H.sapiens N.gruberii O.sativa P.polycephalum S.cerevisiae T.brucei V.lobospinosa X.laevis 6.0 6.0 6.0 5.0 6.0 6.0 7.0 5.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 Gap extension 0.2 0.1 0.1 0.2 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.2 0.2 0.1 0.1 0.2 sequences with average G+C contents (∼50%). As expected, sequences with extreme nucleotide compositions (very high or very low G+C content) tend to be less easy to align accurately. 
High levels of a particular nucleotide increase the chance that a residue in the sequence being aligned may align with the wrong column in the profile. The test cases cover a range of G+C content from 38.4% (Entamoeba histolytica) to 68.5% (Giardia sp.). Discussion The generation of alignments under various parameters shows that position-specific gap opening penalties have a very strong positive effect on the accuracy with which alignments can be generated. Fixed gap penalties perform extremely poorly, particularly at high values of gap extension penalty. This corresponds to situations in which the long gaps that occur in virtually all sequences in certain regions of the alignment, which correspond to long insertions in a few sequences, are penalized very heavily and do not occur in an alignment giving an optimum score. Experimentation with position-specific gap extension penalties did not give any further improvement. Sequence weighting can have a further positive effect on alignment quality. Both weighting schemes based on sequence identity and those based on the tree structure and branch lengths are seen to have generally positive effects. As expected, the tree-based weights are seen to perform at their best in the case of sequences which are quite distant from the main taxa, with few or no close relatives, such as Hexamita, and to be of least benefit to alignment quality with sequences which have many close relatives such as O.sativa. With identity-based weights the greatest positive effects are seen in sequences within highly represented taxa such as S.cerevisiae. These two weighting schemes have opposing effects on the values of the sequence weights in the case of sequences aligning into densely populated regions of the tree, and so the net result of combining them, in cases such as S.cerevisiae, may not perform any better than either of the weighting schemes used individually. 
The examples given (Table 3) indicate that there are cases where tree-based and identity-based weights show a synergistic effect when combined, the combination outperforming either of the schemes applied individually. The combined weights give the best result in more than half of the test cases, and the average difference between the score generated with the combined weights and the overall best score is substantially less than the corresponding difference for any of the other weighting schemes. This synergy is seen to occur most strongly in sequences which are distant from the main bulk of the alignment and therefore more difficult to align correctly. Sequences located in highly represented taxa do not show such strong effects from any of the weighting schemes, but these tend to be the sequences which have the best alignments initially.

In order to tell which sections of the reference alignment may be reliably aligned, each of the 1517 sequences in turn was removed from the alignment and re-aligned with the remaining sequences. Each column of the original, reference alignment was scored according to the percentage of its residues that can be re-aligned in the correct positions. Figure 2 shows the estimated secondary structure of the Saccharomyces cerevisiae nuclear SSU rRNA with those positions from the full alignment which can be re-aligned with ≥95% accuracy marked in black and those which re-align with <95% accuracy in grey. Stems forming pseudoknots are not displayed in this representation. This is a conservative estimate of the regions that may be reliably aligned, as there are some positions that are not found in this molecule, and sequences from some taxonomic groupings may be aligned almost perfectly. Figure 3 shows the accuracy with which each sequence can be re-aligned compared to its original alignment as a function of G+C content.

Fig. 2. Secondary structure of Saccharomyces cerevisiae SSU rRNA with stable regions indicated in black, generated using the ESSA program (Chetouani et al., 1997).

Fig. 3. Graph of percentage of sequence correctly re-aligned against G+C content for each of the 1517 sequences in the reference alignment.

We have shown how to improve the accuracy of alignment of rRNA sequences using some simple methods. It is quite possible that alignments of 100% accuracy cannot be achieved, owing to errors introduced manually into the reference alignment. Nonetheless, we can already see that some sequences may be aligned with >95% accuracy (Oryza and Xenopus), and across the entirety of the alignment 89.84% of all residues can be re-aligned correctly. Some sequences are still disappointing, and this can partially be explained by very biased G+C content (e.g. Giardia). Others come from poorly sampled parts of the overall eukaryote phylogenetic tree, and these will become easier to align as new sequences are added. Nonetheless, it may be difficult for users to evaluate the quality of a new alignment. We provide one extremely simple method for choosing regions of the overall alignment that can be reliably aligned in almost all cases. This covers about half of the positions in any given molecule and provides a selection of sites which can be reliably chosen for phylogenetic purposes. This site selection can be fine-tuned by looking at regions which may be reliably aligned in specific taxa. Finally, it is very obvious that these methods could benefit from some consideration of secondary structure, which could be used for evaluation of alignments or as part of the alignment process. We are investigating the use of genetic algorithms to optimize the quality of profile alignments where secondary structure is considered (Notredame et al., 1997).
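The column-reliability measure described above — remove each sequence in turn, re-align it against the remaining sequences, and score each reference column by the fraction of its residues that are placed back in the same column — can be sketched as follows. The position-list representation (one column index per residue of each sequence) is an assumption made for illustration:

```python
def column_scores(ref_cols, realigned_cols, n_columns):
    """ref_cols / realigned_cols: {seq_name: [column of each residue]}.
    Returns, per reference column, the fraction of its residues that
    the leave-one-out re-alignment placed back in the same column."""
    placed = [0] * n_columns
    total = [0] * n_columns
    for name, cols in ref_cols.items():
        for residue, col in enumerate(cols):
            total[col] += 1
            if realigned_cols[name][residue] == col:
                placed[col] += 1
    return [p / t if t else 0.0 for p, t in zip(placed, total)]

# Toy example: two sequences, four columns; s1's third residue drifted.
ref = {"s1": [0, 1, 2], "s2": [0, 2, 3]}
rea = {"s1": [0, 1, 3], "s2": [0, 2, 3]}
scores = column_scores(ref, rea, 4)
reliable = [c for c, s in enumerate(scores) if s >= 0.95]  # the 95% cut-off
```

Columns passing the 95% cut-off correspond to the positions marked in black in Figure 2.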
We will use a genetic algorithm to optimize the quality function of Corpet and Michot (1994), but based on profiles rather than pairs of sequences.

Acknowledgements

The authors thank Richard Durbin for suggesting the use of the 1/d weights. We also thank Manolo Gouy for his help with rRNA sequences in general. This work was supported by a grant (BIO4-CT95-0130) from the EU Biotechnology programme.

References

Chetouani, F., Monestie, P., Thebault, P., Gaspin, C. and Michot, B. (1997) ESSA: an integrated and interactive computer tool for analysing RNA secondary structure. Nucleic Acids Res., 25, 3514–3522.
Corpet, F. and Michot, B. (1994) RNAlign program: alignment of RNA sequences using both primary and secondary structures. Comput. Applic. Biosci., 10, 389–399.
Eddy, S. and Durbin, R. (1994) RNA sequence analysis using covariance models. Nucleic Acids Res., 22, 2079–2088.
Felsenstein, J. (1989) Cladistics, 5, 164–166.
Gotoh, O. (1982) J. Mol. Biol., 162, 705.
Gotoh, O. (1995) A weighting system and algorithm for aligning many phylogenetically related sequences. Comput. Applic. Biosci., 11, 543–551.
Gribskov, M., McLachlan, A. and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Henikoff, J. and Henikoff, S. (1996) Using substitution probabilities to improve position-specific scoring matrices. Comput. Applic. Biosci., 12, 135–143.
Luthy, R., Xenarios, I. and Bucher, P. (1994) Improving the sensitivity of the sequence profile method. Protein Sci., 3, 139–146.
Maidak, B., Olsen, G., Larsen, N., Overbeek, R., McCaughey, M. and Woese, C. (1997) The Ribosomal Database Project (RDP). Nucleic Acids Res., 25, 109–111.
Needleman, S. and Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Neefs, J.-M., Van de Peer, Y., Hendriks, L. and De Wachter, R. (1990) Database on the structure of small subunit ribosomal RNA. Nucleic Acids Res., 18, 2237–2317.
Notredame, C., O'Brien, E.A. and Higgins, D.G. (1997) RAGA: RNA alignment by genetic algorithm. Nucleic Acids Res., 25, 4570–4580.
Pawlowski, J., Bolivar, I., Fahrni, J.F., Cavalier-Smith, T. and Gouy, M. (1996) Early origin of Foraminifera suggested by SSU rRNA gene sequences. Mol. Biol. Evol., 13, 445–450.
Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.
Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjolander, K., Underwood, R.C. and Haussler, D. (1994) Stochastic context-free grammars for tRNA modelling. Nucleic Acids Res., 22, 5112–5120.
Sogin, M., Elwood, H. and Gunderson, J. (1986) Evolutionary diversity of eukaryotic small-subunit rRNA genes. Proc. Natl Acad. Sci. USA, 83, 1383–1387.
Thompson, J., Higgins, D. and Gibson, T. (1994a) CLUSTAL W: improving the sensitivity of progressive multiple alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Thompson, J., Higgins, D. and Gibson, T. (1994b) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Applic. Biosci., 10, 19–29.
Van de Peer, Y., Jansen, J., De Rijk, P. and De Wachter, R. (1997) Database on the structure of small ribosomal subunit RNA. Nucleic Acids Res., 25, 111–116.
Long Noncoding RNAs with Enhancer-like Function in Human Cells

Ulf Andersson Ørom,1 Thomas Derrien,2 Malte Beringer,1 Kiranmai Gumireddy,1 Alessandro Gardini,1 Giovanni Bussotti,2 Fan Lai,1 Matthias Zytnicki,2 Cedric Notredame,2 Qihong Huang,1 Roderic Guigo,2 and Ramin Shiekhattar1,2,3,*

1The Wistar Institute, 3601 Spruce Street, Philadelphia, PA 19104, USA
2Centre for Genomic Regulation (CRG), UPF, Barcelona, Spain
3Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
*Correspondence: shiekhattar@wistar.org
DOI 10.1016/j.cell.2010.09.001
Cell 143, 46–58, October 1, 2010 ©2010 Elsevier Inc.

SUMMARY

While the long noncoding RNAs (ncRNAs) constitute a large portion of the mammalian transcriptome, their biological functions have remained elusive. A few long ncRNAs that have been studied in any detail silence gene expression in processes such as X-inactivation and imprinting. We used a GENCODE annotation of the human genome to characterize over a thousand long ncRNAs that are expressed in multiple cell lines. Unexpectedly, we found an enhancer-like function for a set of these long ncRNAs in human cell lines. Depletion of a number of ncRNAs led to decreased expression of their neighboring protein-coding genes, including the master regulator of hematopoiesis, SCL (also called TAL1), Snai1 and Snai2. Using heterologous transcription assays we demonstrated a requirement for the ncRNAs in activation of gene expression. These results reveal an unanticipated role for a class of long ncRNAs in activation of critical regulators of development and differentiation.

INTRODUCTION

Recent technological advances have allowed the analysis of the human and mouse transcriptomes with an unprecedented resolution. These experiments indicate that a major portion of the genome is being transcribed and that protein-coding sequences account for only a minority of cellular transcriptional output (Bertone et al., 2004; Birney et al., 2007; Cheng et al., 2005; Kapranov et al., 2007). The discovery of RNA interference (RNAi) in C. elegans (Fire et al., 1998) and the identification of a new class of small RNAs known as microRNAs (Lee et al., 1993; Wightman et al., 1993) led to a greater appreciation of RNA's role in the regulation of gene expression. MicroRNAs are endogenously expressed noncoding transcripts that silence gene expression by targeting specific mRNAs on the basis of sequence recognition (Carthew and Sontheimer, 2009). Over 1000 microRNA loci are estimated to be functional in humans, modulating roughly 30% of protein-coding genes (Berezikov and Plasterk, 2005). While microRNAs represent a minority of the noncoding transcriptome, the tangle of long and short noncoding transcripts is much more intricate, and is likely to contain as yet unidentified classes of molecules forming transcriptional regulatory networks (Efroni et al., 2008; Kapranov et al., 2007).

Long ncRNAs are transcripts longer than 100 nts which in most cases mirror the features of protein-coding genes without containing a functional open reading frame (ORF). Long ncRNAs have been implicated as principal players in imprinting and X-inactivation. The imprinting phenomenon dictates the repression of a particular allele, depending on its paternal or maternal origin. Many clusters of imprinted genes contain ncRNAs, and some of them have been implicated in the transcriptional silencing (Yang and Kuroda, 2007). Similarly, X chromosome inactivation relies on the expression of a long ncRNA named Xist, which is thought to recruit, in a cis-specific manner, protein complexes establishing repressive epigenetic marks that encompass the chromosome (Heard and Disteche, 2006). There is also a report indicating that a long ncRNA expressed from the HOXC locus may affect the expression of genes in the HOXD locus, which is located on a different chromosome (Rinn et al., 2007).
More recently, a set of long ncRNAs has been identified in mouse through the analysis of chromatin signatures (Guttman et al., 2009). There have also been reports of divergent transcription of short RNAs flanking the transcriptional start sites of active promoters (Core et al., 2008; Preker et al., 2008; Seila et al., 2008).

In search of a function for long ncRNAs, we used the GENCODE annotation (Harrow et al., 2006) of the human genome. To simplify our search we subtracted transcripts overlapping the protein-coding genes. Moreover, we filtered out the transcripts that may correspond to promoters of protein-coding genes and the transcripts that belong to known classes of ncRNAs. We identified 3019 putative long ncRNAs that display differential patterns of expression. Functional knockdown of multiple ncRNAs revealed their positive influence on the neighboring protein-coding genes. Furthermore, detailed functional analysis of a long ncRNA adjacent to the Snai1 locus using reporter assays demonstrated a role for this ncRNA in an RNA-dependent potentiation of gene expression. Our studies suggest a role for a class of long ncRNAs in positive regulation of protein-coding genes.

Figure 1. Identification of Novel Long ncRNAs in Human Annotated by GENCODE
(A) Analysis of coding potential using GeneID for ancestral repeats (AR), long ncRNAs annotated by GENCODE and protein-coding genes. (B) Conservation of the genomic transcript sequences for AR, long ncRNAs, protein-coding genes, and (C) of their promoters. (D) Expression analysis of 3019 long ncRNAs in human fibroblasts, HeLa cells and primary human keratinocytes, showing numbers for transcripts detected in each cell line and the overlaps between cell lines. All microarray experiments have been done in four replicates. See also Figure S1 and Table S1 and Table S2.

RESULTS

Noncoding RNAs Are Expressed and Respond to Cellular Differentiating Signals

To assign a function to uncharacterized human long ncRNAs, we identified unique long noncoding transcripts using the annotation of the human genome provided by GENCODE (Harrow et al., 2006) and performed by the human and vertebrate analysis and annotation (HAVANA) group at the Sanger Institute. Such genomic annotation is being produced in the framework of the ENCODE project (Birney et al., 2007). At the time of our analysis, the GENCODE annotation encompassed about one third of the human genome. Such an annotation relies on the human expert curation of all available experimental data on transcriptional evidence, such as cloned cDNA sequences, spliced RNAs and ESTs mapped onto the human genome.

We focused on ncRNAs that do not overlap the protein-coding genes in order to simplify the interpretation of our functional analysis of ncRNAs. This included the subtraction of all transcripts mapping to exons, introns and the antisense transcripts overlapping the protein-coding genes. We also excluded transcripts within 1 kb of the first and the last exons so as to avoid promoter- and 3′-associated transcripts (Fejes-Toth et al., 2009; Kapranov et al., 2007), which display a complicated pattern of short transcripts (Core et al., 2008; Preker et al., 2008; Seila et al., 2008). Furthermore, we excluded all known noncoding transcripts from our list of putative long ncRNAs. This analysis resulted in 3019 ncRNAs, which are annotated by HAVANA to have no coding potential, expressed from 2286 unique loci (some loci display multiple alternatively spliced transcripts) of the human genome (Experimental Procedures, Table S1 available online). The average size of the noncoding transcripts is about 800 nts, with a range from 100 nts to 9100 nts. Interestingly, the long ncRNAs display a simpler transcription unit than that of protein-coding genes (Figure S1A). Nearly 50% of our long ncRNAs contain a single intron in their primary transcript (Figure S1A). Moreover, analysis of their chromatin signatures indicated similarities with protein-coding genes. Transcriptionally active ncRNAs display histone H3K4 trimethylation at their 5′-end (Figure S1B) and histone H3K36 trimethylation in the body of the gene (Figure S1C).

Analysis of the protein-coding potential of the ncRNAs using GeneID (Blanco et al., 2007; Parra et al., 2000) shows ncRNA coding potential comparable to that of ancestral repeats (Lunter et al., 2006), supporting the HAVANA annotation of these transcripts as noncoding (Figure 1A). Moreover, comparison of ncRNAs with protein-coding genes and control sequences corresponding to ancestral repeats (Lunter et al., 2006) reveals that ncRNA sequence conservation is lower than that of protein-coding genes, but higher than that of ancestral repeats (Figure 1B). A similar case is seen with the promoter regions (Figure 1C). These results are in concordance with previous observations in the mouse genome (Guttman et al., 2009; Ponting et al., 2009).

Next we used custom-made microarrays (Experimental Procedures) which were designed to include an average of six probes (nonrepetitive sequences) against each ncRNA transcript to detect their expression. We analyzed the expression pattern of ncRNAs using three different human cell lines (Figure 1D). Overall, we detected 1167 ncRNAs expressed in at least one of the three cell types and 576 transcripts common among the three cell types (Figure 1D). We validated the expression of 16 ncRNAs that mapped to the 1% of the human genome investigated by the original ENCODE study (Birney et al., 2007) using quantitative polymerase chain reaction (qPCR) in three different cell lines (Table S2). Furthermore, we could find evidence for expression of 80% of our noncoding transcripts in at least one human tissue in a recent high-throughput sequencing of the human transcriptome (Wang et al., 2008).

To assess whether ncRNAs respond to cellular differentiating signals, we induced the differentiation of human primary keratinocytes using 12-O-tetradecanoylphorbol 13-acetate (TPA). We monitored the expression of ncRNAs using custom microarrays. Expression of protein-coding genes was monitored using conventional Agilent arrays containing nearly all human mRNAs. We prepared RNA from human primary keratinocytes before and following treatment with TPA. As shown in Figure 2A and Table S3, we could detect 687 ncRNAs in keratinocytes, of which 104 (or 15.1%) respond to TPA treatment by over 1.5-fold. Similarly, 21.3% of protein-coding genes display a change in expression of over 1.5-fold (Figure 2B). While around half of the TPA-regulated protein-coding genes increase and a similar proportion decrease their expression following differentiation, 70% of the TPA-regulated ncRNAs increase their expression whereas only 30% show a decrease (Figures 2A and 2B). Furthermore, analysis of the protein-coding genes in the 500 kb window surrounding the TPA-regulated ncRNAs indicates a significant enrichment in genes involved in differentiation and morphogenesis (Figure 2C). An example of such a change in expression of an important gene involved in the extracellular matrix is shown in Figure 2D.
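The fold-change bookkeeping used above (a transcript counts as responsive when it changes by more than 1.5-fold in either direction) is a simple symmetric classification; a minimal sketch with made-up expression ratios:

```python
def classify(fold_changes, cutoff=1.5):
    """Split expression ratios (treated / control) into induced and
    repressed sets at a symmetric fold-change cut-off."""
    induced = [f for f in fold_changes if f >= cutoff]
    repressed = [f for f in fold_changes if f <= 1 / cutoff]
    return induced, repressed

# Hypothetical ratios for six transcripts before/after TPA treatment.
ratios = [2.1, 1.6, 1.1, 0.9, 0.5, 0.65]
ind, rep = classify(ratios)
frac_responding = (len(ind) + len(rep)) / len(ratios)
```

Applying the same cut-off at 2-fold instead of 1.5-fold gives the stricter counts shown in the bar-plots of Figures 2A and 2B.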
The Extracellular Matrix Protein 1 (ECM1) gene and an ncRNA adjacent to it displayed a 5- and 1.7-fold induction following TPA treatment, respectively (Figure 2D, upper panel). qPCR analysis shows the TPA-mediated induction of ECM1 and the ncRNA as 14- and 4-fold, respectively (Figure 2D, bottom panel). Taken together, we found that many of the GENCODE-annotated transcripts are expressed in multiple cell lines and that they display gene expression responsiveness to differentiation signals.

Noncoding RNAs Display a Transcriptional Activator Function

To assess the function of our set of long ncRNAs, we reasoned that, similar to the function of long ncRNAs at the imprinting loci, our collection of ncRNAs may act to regulate their neighboring genes. To test this hypothesis, we used RNA interference to deplete a set of ncRNAs. We initially chose ncRNAs that showed a differential expression following keratinocyte differentiation. However, to obtain a reproducible knockdown we had to use cell lines that are permissive to transfection by siRNAs. We used five different cell lines for our analyses in which the candidate ncRNAs display a detectable expression (Figure 3). We validated the expression of our experimental set of ncRNAs and the absence of protein-coding potential using rapid amplification of 5′ and 3′ complementary DNA ends (5′ and 3′ RACE), PCR and in vitro translation (Figure S3). These experiments confirmed the expression of ncRNAs and showed that they do not yield a product in an in vitro translation assay (Figures S3A and S3B), supporting the noncoding annotation of our set of ncRNAs. In two cases, the ncRNAs adjacent to the Snai2 and TAL1 loci, we found evidence of a longer ncRNA transcript than that annotated by HAVANA (Figure S3).
We began by examining small interfering RNAs (siRNAs) against the ncRNA next to ECM1 in order to assess its functional role following its depletion (for reasons that will follow, this class of RNA is designated noncoding RNA-activating 1 through 7, ncRNA-a1–7). HEK293 cells were used for these experiments because of the ease of functional knockdown and the detectable amounts of ncRNA-a1 and ECM1 in this cell line. We compared the results obtained using two siRNAs against ncRNA-a1 to data obtained following the transfection of two control siRNAs (for visual simplicity only one siRNA is shown in Figure 3A; the values for both siRNAs can be seen in Table S4). The two siRNAs produced comparable results. We interrogated a 300 kb window around ncRNA-a1 containing six protein-coding genes using qPCR. Surprisingly, unlike the silencing action of long ncRNAs in imprinting and X-inactivation, depletion of ncRNA-a1 adjacent to ECM1 resulted in a concomitant decrease in expression of the neighboring ECM1 gene (Figure 3A). This effect was specific, as we did not detect any change in the other protein-coding genes surrounding ncRNA-a1 (Figure 3A). To ascertain that ncRNA-a1 is not a component of the ECM1 3′ untranslated region, we used primer pairs spanning the ECM1 and ncRNA-a1 genes. We were not able to detect a transcript comprised of the two genes in HEK293 cells, supporting the contention that the two transcripts are independent transcriptional units (Figure S2A). Furthermore, published ChIP experiments (Euskirchen et al., 2007) show the presence of RNA polymerase II and trimethyl H3K4 peaks at the transcription start site of ncRNA-a1 in several cell lines, further attesting to an independent transcriptional start site for ncRNA-a1.
Moreover, knocking down the ECM1 gene did not affect the expression level of ncRNA-a1 or any of the other protein-coding genes analyzed in the locus, further supporting the independence of the ECM1 transcript from that of ncRNA-a1 (Figure S2B). Next we analyzed ncRNA-a2, flanking the histone demethylase JARID1B/KDM5B, which also shows increased expression following keratinocyte differentiation. These experiments were performed in HeLa cells as they showed detectable expression of ncRNA-a2. Interestingly, while depletion of ncRNA-a2 did not change JARID1B/KDM5B levels, KLHL12, a gene on the opposite strand known for its negative regulation of the Wnt-beta-catenin pathway, displayed a significant reduction (Figure 3B). Although the decrease in KLHL12 was small (about 20%), no other protein-coding gene in the locus displayed a difference in expression (Figure 3B). To extend our findings and to determine whether regulation of neighboring protein-coding genes is a common function of ncRNAs, we interrogated ncRNA-a3, flanking the stem cell leukemia gene (SCL, also called TAL1). TAL1 is a basic helix-loop-helix protein which serves as the master regulator of hematopoiesis (Lecuyer and Hoang, 2004). This locus contains two ncRNAs on different strands of DNA. We used MCF-7 cells to assess the depletion of ncRNA-a3, since the expression of ncRNA-a3 and TAL1 could be readily detected in these cells. However, neither PDZK1IP1 nor ncRNA-a4 could be detected by qPCR in MCF-7 cells. Depletion of ncRNA-a3 resulted in a specific and potent reduction of TAL1 expression (Figure 3C). While depletion of ncRNA-a3 did not affect either the STIL or CMPK1 genes, a significant reduction in the CYP4A11 gene on the opposite strand of the DNA was detected (Figure 3C). We next turned our attention to ncRNA-a4, which was not expressed at a detectable level in MCF-7 cells.
We could reliably detect ncRNA-a4 in Jurkat cells. While we could not efficiently knock down ncRNA-a3 in Jurkat cells, siRNAs specific to ncRNA-a4 reproducibly reduced its levels by about 50% (Figure 3D).

Figure 2. Long ncRNAs Display Responsiveness to Differentiation Signals in Human Primary Keratinocytes
(A and B) Distribution of differentially expressed transcripts (dark colors) following TPA treatment for long ncRNAs (A) and mRNAs (B). Lighter colors show the total number of transcripts; darker colors and percentages show the number of differentially expressed transcripts. Bar-plots show the number and fraction of transcripts induced (red) or repressed (green) at different fold-change cut-offs. (C) Gene ontology analysis of genes flanking the differentially expressed long ncRNAs (red) compared to genes flanking random positions (black). (D) Graphic representation of a locus with induction of the long ncRNA ncRNA-a1 and the adjacent ECM1 gene, with expression values from microarrays (upper panel) and qPCR quantification of transcripts (lower panel). Microarray experiments and qPCR validation are done in four replicates. Data shown are mean ± SD. See also Figure S2 and Table S3.
Figure 3. Stimulation of Gene Expression by Activating RNAs
The thick black line representing each gene shows the span of the genomic region including exons and introns. The targeted activating RNAs are shown in red. Bar-plots show RNA levels as determined by qPCR. (A) ncRNA-a1 locus in HEK293 cells. (B) ncRNA-a2 locus in HeLa cells. (C) ncRNA-a3 locus in MCF-7 cells. (D) ncRNA-a4 locus in Jurkat cells. (E) ncRNA-a5 locus in HeLa cells. (F) ncRNA-a6 locus in A549 cells. All values are relative to GAPDH expression and relative to control siRNA transfected cells set to an average value of 1. The scale bar represents 100 kb and applies to all figure panels. Error bars show mean ± SEM of at least three independent experiments. *p < 0.05, **p < 0.01, ***p < 0.001 by two-tailed Student's t test. See also Figure S3 and Table S4.

Importantly, reduced levels of ncRNA-a4 resulted in a consistent and significant decrease in the level of the CMPK1 gene, which is over 150 kb downstream of ncRNA-a4 (Figure 3D). We do not detect any changes in the other protein-coding genes surrounding ncRNA-a4. Next we depleted ncRNA-a5, which is adjacent to the E2F6 gene, an important component of a polycomb-like complex (Ogawa et al., 2002). Knockdown of ncRNA-a5 did not affect the E2F6 gene.
However, depletion of ncRNA-a5 resulted in a specific reduction in the expression levels of ROCK2, which is located upstream of ncRNA-a5, in HeLa cells (Figure 3E). Finally, we examined the Snai1 and Snai2 loci in A549 cells (Figure 3F and Figure 4). The Snail family of transcription factors is implicated in the differentiation of epithelial cells into mesenchymal cells (epithelial-mesenchymal transition) during embryonic development (Barrallo-Gimeno and Nieto, 2005; Savagner, 2001). Snai2 shows a significant reduction in expression when the adjacent ncRNA-a6 is depleted, an effect that is not seen on EFCAB1, the only other protein-coding gene within 300 kb of ncRNA-a6 (Figure 3F).

In total, we have examined 12 loci where we were able to efficiently knock down the ncRNAs using siRNAs (Table S5). We were able to show that in 7 cases, the ncRNA acts to potentiate the expression of a protein-coding gene within 300 kb of the ncRNA. It is possible that the remaining ncRNAs, which did not display a positive effect on the neighboring genes within the 300 kb window, exert their action over longer distances, which was not assessed in our analysis. Taken together, our results indicate that a subset of ncRNAs has activating functions, and therefore we have named them ncRNA-activator (ncRNA-a), followed by a number to distinguish each activating long ncRNA.

Figure 4. Knockdown of ncRNA-a7 Specifically Targets Snai1 Expression
(A) As in Figure 3, the ncRNA-a7 locus is depicted showing effects on RNA levels for the surrounding genes with and without knockdown of ncRNA-a7. The results represent mean ± SEM of at least six independent experiments. **p < 0.01 by one-tailed Student's t test. (B) Migration assay of A549 cells with control (right panel) or ncRNA-a7 (left panel) siRNA transfections. (C) Quantification of the data shown in (B). Experiments in (B) and (C) are done in three replicates and are shown as mean ± SEM. ***p < 0.001 by two-tailed Student's t test. See also Figure S4 and Table S5.

ncRNA-a7 Is a Regulator of Snai1

As mentioned above, Snai1 is a member of the Snail zinc-finger family, which comprises transcription factors with diverse functions in development and disease (Barrallo-Gimeno and Nieto, 2005; Nieto, 2002). The Snail gene family is conserved among species from Drosophila to human and has been shown to function as mesodermal determinant genes (Barrallo-Gimeno and Nieto, 2005; Nieto, 2002). Snail genes are regulators of cell adhesion, migration and epithelial-mesenchymal transition (EMT) (Barrallo-Gimeno and Nieto, 2005; Nieto, 2002). Analysis of the ncRNA close to the Snai1 gene provided us with an opportunity to combine our gene expression analysis with analysis of changes in cellular migration. Knockdown of ncRNA-a7 resulted in a specific reduction in Snai1 levels (Figure 4A). The expression of the four other protein-coding genes in this locus does not change following the depletion of ncRNA-a7. Concomitantly, knockdown of ncRNA-a7 has a significant phenotypic effect in cell migration assays, reducing the number of migrating cells to about 10% of that of the control (Figures 4B and 4C), consistent with the phenotypic changes following the depletion of Snai1 (Figures 4B and 4C). Since the knockdown of ncRNA-a7 or Snai1 had similar consequences on cellular migration, we assessed the effect of their depletion on gene expression in A549 cells using Agilent arrays. We could not detect the basal level of Snai1 on the array, while Snai1 was readily detectable using quantitative PCR.
Interestingly, depletion of Snai1 or ncRNA-a7 resulted in similar changes in gene expression profiles (Figure 5A and Table S6). Not only did we observe a similar trend in genes that were affected upon the knockdown of either gene, but a significant number of the upregulated genes were common to both treatments (Figures 5A and 5B). Since Snai1 is a known transcriptional repressor, depletion of Snai1 or ncRNA-a7 should result in an upregulation of Snai1 target genes. Indeed, a number of the commonly upregulated genes were direct targets of Snai1 (Figure 5C, upper panel) (De Craene et al., 2005). Depletion of either ncRNA-a7 or Snai1 also resulted in downregulation of a set of genes, with a partial overlap between the genes downregulated following the two treatments (Figure 5B). Interestingly, Aurora kinase A (AURKA), a gene that is 6 Mb downstream of ncRNA-a7, was specifically downregulated following the depletion of ncRNA-a7, suggesting a long-range effect for ncRNA-a7 (Figure 5C). Taken together, these results indicate that while the depletion of ncRNA-a7 partially mimics the gene expression profile observed following Snai1 depletion, a number of gene expression changes resulting from the ncRNA-a7 depletion occur independently of changes in Snai1. Therefore, it is likely that depletion of ncRNA-a7 may have other effects on gene expression which may be mediated through other targets in trans.

Cell 143, 46–58, October 1, 2010 ©2010 Elsevier Inc. 51

Figure 5. Microarray Analysis of Snai1 and ncRNA-a7 Knockdown
Snai1 or ncRNA-a7 were knocked down using siRNA in A549 cells and the isolated RNA analyzed on microarrays in duplicate experiments.
(A) All genes differentially expressed (>1.5-fold or <0.6-fold compared to control) in either Snai1 or ncRNA-a7 knockdown, or both, are shown clustered in a heat map according to expression profile. Numbers are log2-transformed and the color scale is shown below the heat map.
(B) Analysis of genes showing upregulation (>1.5-fold) or downregulation (<0.6-fold) in both Snai1 and ncRNA-a7 knockdowns. Numbers represent the number of genes regulated in the indicated condition.
(C and D) (C) Validation of microarray data by qPCR and (D) analysis of the Snai1 locus and targets of Snai1 upon overexpression of ncRNA-a7. ncRNA-a7 was overexpressed from a vector in A549 cells and the expression of select genes was measured by qPCR. Y axes show expression of the indicated gene relative to GAPDH. Values are normalized to those of control siRNA-transfected cells, set to 1. **p < 0.01, ***p < 0.001 by one-tailed Student’s t test. See also Table S6.

To specifically address whether ncRNA-a7 may exert its effects in trans, we assessed the gene expression changes in the Snai1 locus, as well as some of the targets that were changed by depletion of ncRNA-a7 or Snai1, following the overexpression of ncRNA-a7 (Figure 5D). Overall, we did not observe changes in gene expression for any of the ncRNA-a7 targets following its overexpression (Figure 5D; ncRNA-a7 was overexpressed 150-fold). While these results suggest that ncRNA-a7 exerts its local gene expression changes in cis, it is likely that other targets may be influenced in trans. Taken together, these experiments reveal a role for ncRNA-a in the positive regulation of expression of neighboring protein-coding genes and show that this effect is not specific to any one locus and may represent a general function for ncRNAs in mammalian cells.

ncRNA Activation of Gene Expression of a Heterologous Reporter
Previous studies have shown that distal activating sequences/enhancers can stimulate transcription when placed adjacent to a heterologous promoter, a methodology widely used to validate potential enhancers (Banerji et al., 1983, 1981; Gillies et al., 1983; Heintzman et al., 2009; Kong et al., 1997). To functionally dissect the influence of ncRNA activation on the expression of an adjacent gene, we constructed vectors with inserts containing either ncRNA-a3 and -a4 from a bidirectional promoter, ncRNA-a5, or ncRNA-a7, and placed them downstream of Firefly luciferase driven by a thymidine kinase (TK) promoter in a reporter vector (pGL3-TK-ncRNA-a) (Figure 6A). We included 1–1.5 kb upstream of the ncRNA-as to contain their endogenous promoters, and 500 bp downstream, in the reporter vector. We also produced a control vector (pGL3-TK-control) in which 4 kb of DNA without transcriptional potential was cloned downstream of Firefly luciferase, similar to the ncRNA activation reporters (Figure 6B). A vector containing Renilla luciferase was used to control for transfection efficiency. Importantly, inclusion of any of the three ncRNA-a inserts resulted in an enhancement of transcription ranging from 2- to 7-fold (Figures 6C–6E). This effect is specific, as the pGL3-TK-control vector did not enhance basal TK promoter activity (Figures 6C–6E). To demonstrate that the observed potentiation of gene expression is mediated through the action of ncRNA-a, we knocked down the ncRNA-a in question for each reporter construct using specific siRNAs (Figures 6C–6E). Interestingly, while depletion of ncRNA-a7 and ncRNA-a5 completely abolished the increased transcription, depletion of ncRNA-a3 and/or ncRNA-a4 resulted in a partial decrease in transcriptional enhancement (Figures 6C–6E). These results suggest that while the ncRNA-a play a major role in transcriptional activation, other DNA elements in the cloned ncRNA-a3/4 region may also contribute to increased transcription.

Figure 6. ncRNA-Activators Potentiate Transcription of a Reporter Gene
(A) ncRNA-a3/4, -a5 and -a7 were cloned and inserted downstream of luciferase driven by a TK promoter in a reporter plasmid as shown.
(B) Graphical representation of the inserts in the various vectors used. The pGL3-TK-Control vector contains an insert of approximately 4 kb containing no annotated evidence of transcription. The depicted inserts show exons and the transcriptional direction of the ncRNA-a.
(C–E) Luciferase reporter assays. The Firefly luciferase vectors were cotransfected with a Renilla luciferase vector (pRL-TK) as a transfection control. (C) The vector containing ncRNA-a3 and ncRNA-a4 from a bidirectional promoter, with control siRNA or siRNAs toward either of the two ncRNA-a, or both. (D) Reporter with ncRNA-a5, and (E) reporter with ncRNA-a7 inserted downstream of luciferase. X axes show relative Firefly (FL) to Renilla (RL) luciferase activity. Cotransfected siRNAs are indicated to the right of the bars. All data shown are mean ± SE from six independent experiments. *p < 0.05, **p < 0.01, ***p < 0.001 by one-tailed Student’s t test.

Dissection of the ncRNA-a7 in a Reporter Construct
An important property of enhancing sequences is their orientation independence (Imperiale and Nevins, 1984; Khoury and Gruss, 1983; Kong et al., 1997). We designed reporter constructs (Figure 7A) in which the ncRNA-a7 sequence is reversed (pGL3-TK-ncRNA-a7-RV) in order to assess its orientation independence. The ncRNA-a7-RV construct displayed transcriptional enhancing activity similar to that of the construct containing the ncRNA-a7 insert in its endogenous orientation with respect to the regulated gene (Figure 7B). To show that luciferase expression requires a promoter and that ncRNA-a7 cannot act as a proximal promoter, we deleted the TK promoter from the reporter vectors. As shown in Figure 7C, ncRNA-a7 cannot drive transcription of the Firefly luciferase in the absence of a proximal TK promoter.

Figure 7. RNA-Dependent Activation of a Reporter Gene by ncRNA-a7
(A) Properties of the ncRNA-a7-containing luciferase reporter vector.
(B, C, E, and F) Luciferase reporter assays. The Firefly luciferase vectors were cotransfected with a Renilla luciferase vector (pRL-TK) as a transfection control.
(D) Semiquantitative PCR of ncRNA-a7.
(B) Reporter experiments with the ncRNA-a7 insert reversed, as indicated in the left panel.
(C) The TK promoter driving luciferase expression was deleted from the construct; expression values are shown relative to the pGL3-TK control plasmid as a reference.
(E) Truncated reporter constructs containing the ncRNA-a7 promoter and downstream sequences, but not the ncRNA-a7 sequence [pGL3-TK-delta(ncRNA-a7)], or one with a poly(A) signal at the beginning of ncRNA-a7 to induce premature polyadenylation [pGL3-TK-ncRNA-a7-p(A)]. See also (D) for analysis of expression from these plasmids.
(F) Protein-coding sequences were inserted in place of ncRNA-a7, downstream of the ncRNA-a7 promoter. Full-length GTSF1L or ID1 sequences were used. X axes show relative Firefly (FL) to Renilla (RL) luciferase activity. All data shown are mean ± SE from six independent experiments. ***p < 0.001 by one-tailed Student’s t test.
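The differential expression analysis for Figure 5 reduces to classifying genes by fold change (>1.5-fold up, <0.6-fold down, relative to control) and intersecting the calls from the two knockdowns. A minimal sketch of that logic; the gene names come from the figure, but the fold-change values below are illustrative, not the published data:

```python
# Sketch of the Figure 5-style classification: fold changes >1.5 are called
# upregulated, <0.6 downregulated, and the two knockdown gene sets intersected.
# Fold-change values are illustrative placeholders, not the published data.

def classify(fold_changes, up=1.5, down=0.6):
    upregulated = {g for g, fc in fold_changes.items() if fc > up}
    downregulated = {g for g, fc in fold_changes.items() if fc < down}
    return upregulated, downregulated

snai1_kd = {"CDH1": 2.8, "PKP2": 2.1, "PLOD2": 0.4, "AURKA": 1.0}
ncrna_a7_kd = {"CDH1": 2.5, "PKP2": 1.9, "PLOD2": 0.5, "AURKA": 0.3}

up_a, down_a = classify(snai1_kd)
up_b, down_b = classify(ncrna_a7_kd)

print(sorted(up_a & up_b))      # upregulated in both knockdowns
print(sorted(down_b - down_a))  # down only after ncRNA-a7 depletion (cf. AURKA)
```

With these placeholder numbers, CDH1 and PKP2 come out as commonly upregulated, while AURKA is called down only in the ncRNA-a7 knockdown, mirroring the long-range effect described in the text.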
These experiments demonstrate that sequences corresponding to the ncRNA-a7 transcription unit can activate expression from a heterologous promoter in an orientation-independent manner, but cannot themselves act as a promoter. To further verify that ncRNA-a7 is the active component of the transcriptional enhancement, we constructed two reporters in which the ncRNA-a7 sequences are either deleted or shortened by placing a strong polyadenylation signal within the ncRNA-a7 genomic sequence, close to the transcriptional start site, to induce premature polyadenylation (Figures 7D and 7E). Both modifications resulted in loss of the increased gene expression compared to constructs where ncRNA-a7 is expressed (Figure 7E). Finally, to assess whether the RNA corresponding to ncRNA-a7 is critical for increased gene expression, we developed constructs in which DNA sequences corresponding to two different protein-coding genes were positioned in place of ncRNA-a7 (Figure 7F), keeping the endogenous ncRNA-a7 promoter. Neither of these constructs displayed increased gene expression compared to the control constructs (Figure 7F). Taken together, these experiments demonstrate that the potentiation of gene expression is mediated by the ncRNA-a itself and is not merely a result of transcription of the ncRNA.

DISCUSSION

We used the annotation of the human genome performed by GENCODE to arrive at a collection of long ncRNAs that are expressed from loci independent of those of protein-coding genes or previously described ncRNAs. The GENCODE annotation encompasses both protein-coding and noncoding transcripts and relies on experimental data obtained through the analysis of cDNAs, ESTs and spliced RNAs. Our collection of 3,000 transcripts corresponds to the manual curation of about a third of the human genome. Analysis of the GENCODE data indicates that nearly all of its annotated noncoding transcripts are spliced (Figure S1A).
Importantly, the median distance of an ncRNA transcript to a protein-coding gene is over 100 kb, making it unlikely that these ncRNAs are extensions of protein-coding transcripts (Figures S2C and S2D). Moreover, transcriptionally active ncRNAs display chromatin modifications similar to those seen at expressed protein-coding genes (Figures S1B and S1C). Furthermore, the analyzed ncRNAs display RNA Pol II, p300 and CBP occupancy at levels similar to those of the surrounding protein-coding genes, consistent with their transcriptional independence (Figure S4). Although our analysis focused on understanding the function of a set of ncRNAs annotated by GENCODE, the human transcriptome includes other forms of ncRNAs with important regulatory functions that were not included in our study. These include antisense transcripts arising from protein-coding genes, precursors of microRNAs, and a wealth of unspliced transcripts described in multiple studies (Guttman et al., 2009; Kapranov et al., 2007; Rinn et al., 2007). Taken together, the novelty of our work lies in the following. First, we show that at multiple loci of the human genome, depletion of a long ncRNA leads to a specific decrease in the expression of neighboring protein-coding genes. Previous studies analyzing the function of long ncRNAs in X-inactivation or imprinting point to a role in silencing of gene expression (Mattick, 2009). Second, we show that the enhancement of gene expression by ncRNAs is not cell specific, as we observe the effect in five different cell lines. Third, this enhancement of gene expression is mediated through RNA, as depletion of such activating ncRNAs abrogates the increased transcription of the neighboring genes. Fourth, through the use of heterologous reporter assays, we suggest that activating ncRNAs mediate this RNA-dependent transcriptional responsiveness in cis.
Fifth, we show that, similar to classically defined distal activating sequences, ncRNA-mediated activation of gene expression is orientation independent. Sixth, we present evidence that, like defined activating sequences, ncRNAs cannot drive transcription in the absence of a proximal promoter. Finally, we demonstrate that the activation of gene expression in the heterologous reporter system is mediated through RNA, as multiple approaches that deplete the RNA lead to abrogation of the stimulatory response. Therefore, we have uncovered a new biological function for a class of ncRNAs in human cells: the positive regulation of gene expression. There are previous reports of individual ncRNAs having a positive effect on gene expression. The 3.8 kb Evf-2 ncRNA was shown to form a complex with the homeodomain-containing protein Dlx2, leading to transcriptional enhancement (Feng et al., 2006). Similarly, the ncRNA HSR1 (heat-shock RNA-1) forms a complex with HSF1 (heat-shock transcription factor 1), resulting in induction of heat-shock proteins during the cellular heat-shock response (Shamovsky et al., 2006), and an isoform of the ncRNA SRA (steroid receptor RNA activator) functions to coactivate steroid receptor responsiveness (Lanz et al., 1999). Our finding that activating ncRNAs positively regulate gene expression extends these previous studies and demonstrates that the activation of gene expression by long ncRNAs may be a general function of this class. Whether the ncRNA effects seen in our study are mediated through association with specific transcriptional activators is not known; however, this is a likely scenario given previous examples of RNA-mediated responsiveness. Other possibilities include the formation of an RNA–DNA hybrid at the locus of the ncRNA or of the protein-coding gene, which may result in enhanced binding of sequence-specific DNA-binding proteins or chromatin-modifying complexes.
A recent study uncovered a set of bidirectional transcripts (termed eRNAs) that are derived from sites in the human genome showing occupancy by CBP and RNA polymerase II and decorated by monomethylated histone H3 lysine 4 (H3K4) (Kim et al., 2010). Moreover, the authors show that the expression of such transcripts is correlated with that of their nearest protein-coding genes. There are fundamental differences between their collection of 2,000 transcripts and our GENCODE set of transcripts. First, while all their eRNAs are bidirectional, only about 1% of our ncRNAs show evidence of bidirectionality (see the example shown in the TAL1 locus). Second, our analysis of the histone modifications of a subset of ncRNAs that are expressed in lymphocytes (Barski et al., 2007) indicates the presence of H3K4 trimethylation at the transcriptional start sites and H3K36 trimethylation over the body of the gene (Figures S1B and S1C). This is in stark contrast to eRNA loci, where H3K4 trimethyl marks are absent and the predominant chromatin signature is monomethyl H3K4. Third, eRNAs are reported to be predominantly not polyadenylated. The majority of our collection of ncRNAs show evidence of polyadenylation, as they were amplified using oligo-dT-primed reactions, and furthermore 41% display a canonical polyadenylation site. Analysis of protein-coding transcripts revealed that a similar proportion (52%) contain the canonical polyadenylation sites. Finally, while we show that a set of our ncRNAs function to enhance gene expression, there is no evidence yet for eRNAs exerting a biological function. While we believe that eRNAs designate a different class of ncRNAs than the ncRNA-a described in our study, it is tempting to speculate that many of the ncRNA-a and their promoters may correspond to mammalian enhancers or polycomb/trithorax response elements (PRE/TREs).
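The polyadenylation comparison above (41% of the ncRNAs versus 52% of protein-coding transcripts carrying a canonical signal) can be approximated by scanning the 3′ end of each transcript for the canonical hexamer. A minimal sketch; the AATAAA/ATTAAA motifs, the 50-nt search window, and the toy sequences are illustrative assumptions, not the paper's exact procedure:

```python
# Sketch: fraction of transcripts carrying a canonical poly(A) signal
# (AATAAA, or its common ATTAAA variant) near the 3' end.
# Motif set, 50-nt window, and sequences are illustrative assumptions.
CANONICAL = ("AATAAA", "ATTAAA")

def has_polya_signal(seq, window=50):
    """True if a canonical hexamer occurs within the last `window` nt."""
    tail = seq[-window:].upper()
    return any(hexamer in tail for hexamer in CANONICAL)

transcripts = {
    "tx1": "GCGC" * 30 + "AATAAA" + "GCTTGC",  # signal near the 3' end
    "tx2": "ATGC" * 40,                         # no signal
}
fraction = sum(has_polya_signal(s) for s in transcripts.values()) / len(transcripts)
print(f"{fraction:.0%} carry a canonical signal")
```

A real analysis would of course operate on annotated 3′-end coordinates rather than raw sequence tails, but the counting logic is the same.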
In such a scenario, binding of polycomb or trithorax proteins to the proximal promoters of ncRNA-a would regulate the expression of the ncRNA-a, which in turn would impact the expression of the protein-coding gene at a distance. Another set of recently published ncRNAs has been termed long intervening noncoding RNAs, or lincRNAs (Guttman et al., 2009). A comparison of our ncRNAs and the lincRNAs shows that about 13% of the ncRNAs defined by GENCODE overlap the broad regions encoding a set of recently identified human lincRNAs (Khalil et al., 2009). The overlap between our ncRNAs and lincRNAs is even smaller (4%) if one considers only the exons corresponding to lincRNAs. Importantly, the studies of lincRNAs did not reveal any transcriptional effects on neighboring genes (Khalil et al., 2009). Therefore, it is likely that lincRNAs describe a distinct set of ncRNAs compared to those annotated by GENCODE. Similar to the diverse functions of proteins, ncRNAs such as lincRNAs may play other roles in regulating gene expression. The GENCODE annotation used in this study encompasses only a third of the human genome. Therefore, the number of ncRNAs in human cells is likely to grow and may equal or even surpass the number of protein-coding genes. Our selection criteria excluded all ncRNAs associated with protein-coding genes and their promoters, as well as known ncRNAs. Therefore, the repertoire of noncoding transcripts in human cells contains many more transcripts than those cataloged in this study. Specifically, there have been reports of pervasive antisense transcription, as well as transcription mapping to the promoter regions of protein-coding genes (Core et al., 2008; Denoeud et al., 2007; Kapranov et al., 2007; Preker et al., 2008; Seila et al., 2008). Whether such transcripts have biological functions similar to those described for activating ncRNAs in our study is not known.
However, it is clear that future genome-wide genetic analysis of ncRNAs in mammalian cells will begin to shed light on the different classes of ncRNAs. The precise mechanism by which our ncRNAs function to enhance gene expression is not known. We envision a mechanism by which ncRNAs, by virtue of sequence or structural homology, target the neighboring protein-coding genes to bring about increased expression. Our experimental evidence using a heterologous promoter points to a mechanism of action for activating ncRNAs operating in cis. However, genome-wide analysis following depletion of ncRNA-a7 suggested changes in gene expression that may not be related to the action of ncRNA-a7 on its local environment and may be a result of wider trans-mediated effects of ncRNA-a7. Such regulatory functions of ncRNAs could be achieved through RNA-mediated recruitment of a transcriptional activator, displacement of a transcriptional repressor, or recruitment of a basal transcription factor or a chromatin-remodeling factor. While we favor a transcription-based mechanism for ncRNA activation, effects on RNA stability cannot be excluded. Taken together, the next few years will bring new prospects for long ncRNAs as central players in gene expression.

EXPERIMENTAL PROCEDURES

Extracting Long ncRNA Data
The HAVANA annotation was downloaded using the DAS server provided by the Sanger Institute (version of July 16, 2008). We removed all annotated biotypes or functional elements belonging to specific categories such as pseudogenes or protein-coding genes. We excluded all transcripts overlapping known protein-coding loci annotated by HAVANA, RefSeq or UCSC. Transcripts falling within a 1 kb window of any protein-coding gene were also removed. Finally, we excluded all transcripts covered by known noncoding RNAs such as miRNAs (miRBase version 11.0, April 2008).
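The selection steps above (drop protein-coding and pseudogene biotypes, drop transcripts overlapping or within 1 kb of a protein-coding gene, drop already-known ncRNAs) amount to interval filtering. A simplified sketch, with hypothetical (chrom, start, end) tuples standing in for the real HAVANA/RefSeq/UCSC records:

```python
# Simplified sketch of the ncRNA selection described above.  Intervals are
# (chrom, start, end) tuples standing in for real annotation records; the
# brute-force overlap test is for illustration, not a production pipeline.

def near(tx, gene, margin=1_000):
    """True if tx overlaps gene or lies within `margin` bp of it."""
    c1, s1, e1 = tx
    c2, s2, e2 = gene
    return c1 == c2 and s1 <= e2 + margin and e1 >= s2 - margin

def select_ncrnas(transcripts, coding_genes, known_ncrnas):
    kept = []
    for name, biotype, tx in transcripts:
        if biotype in {"protein_coding", "pseudogene"}:
            continue  # excluded biotypes
        if any(near(tx, g) for g in coding_genes):
            continue  # overlapping, or within 1 kb of, a coding gene
        if any(near(tx, n, margin=0) for n in known_ncrnas):
            continue  # covered by a known ncRNA (e.g. a miRNA)
        kept.append(name)
    return kept

coding = [("chr1", 10_000, 20_000)]
known = [("chr1", 50_000, 50_100)]
txs = [
    ("candidate", "noncoding", ("chr1", 120_000, 125_000)),
    ("too_close", "noncoding", ("chr1", 20_500, 22_000)),
    ("pseudo", "pseudogene", ("chr1", 200_000, 201_000)),
]
print(select_ncrnas(txs, coding, known))  # only the distant noncoding transcript survives
```

At genome scale the same filters would be run with an interval tree or a tool such as bedtools rather than the quadratic scan shown here.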
To estimate the evolutionary constraints among mammalian sequences, we constructed the cumulative distribution of PhastCons scores for ancestral repeats (ARs), RefSeq genes and long ncRNAs. The cumulative distributions of these transcripts or repeats are plotted using a log scale on the y axis.

Cell Culture and siRNA Transfections
Human primary keratinocytes from four different biological donors were grown in keratinocyte medium (KFSM, Invitrogen). Differentiation was induced with 2.5 ng/ml 12-O-tetradecanoylphorbol-13-acetate (TPA) for 48 hr. HEK293, A549, HeLa, and MCF-7 cells were cultured in complete DMEM (GIBCO) containing 10% FBS and 1× Anti/Anti (GIBCO). Jurkat cells were cultured in complete RPMI (GIBCO) containing 10% FBS and 1× Anti/Anti (GIBCO). Migration assays were performed as previously described (Gumireddy et al., 2009). For transfections of 293, HeLa, A549, and MCF-7 cells we used Lipofectamine 2000 (Invitrogen) according to the manufacturer’s recommendations and an siRNA concentration of 50 nM. Jurkat cells were transfected using HiPerFect (QIAGEN) according to the manufacturer’s recommendations and an siRNA concentration of 100 nM.

RNA Purification, cDNA Synthesis, and Quantitative PCR
Cells were harvested and resuspended in TRIzol (Invitrogen), and RNA was extracted according to the manufacturer’s protocol. cDNA synthesis was done using MultiScribe reverse transcriptase and random primers (Applied Biosystems). Quantitative PCR was done using SYBR Green reaction mix (Applied Biosystems) and an HT7900 sequence detection system (Applied Biosystems). For all quantitative PCR reactions, Gapdh was measured as an internal control and used to normalize the data.

Cloning of pGL3-TK Reporters and Luciferase Assay
pGL3-Basic was digested with BglII and HindIII, and the TK promoter from pRL-TK was inserted into these sites. Inserts were amplified from genomic DNA and cloned into the BamHI and SalI sites 5′ to the luciferase gene.
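The text states only that Gapdh served as the internal control for qPCR normalization; the standard way to implement this is the comparative Ct (2^(-ΔΔCt)) method, sketched below. The formula choice and the Ct values are assumptions for illustration, not data from the paper:

```python
# Sketch of GAPDH normalization via the standard comparative Ct method.
# The paper says only that Gapdh was the internal control; the 2**(-ddCt)
# formula is the conventional choice, and the Ct values are illustrative.

def relative_expression(ct_gene, ct_gapdh, ct_gene_ctrl, ct_gapdh_ctrl):
    delta_treated = ct_gene - ct_gapdh            # dCt in treated cells
    delta_control = ct_gene_ctrl - ct_gapdh_ctrl  # dCt in control cells
    # Fold change relative to control (control itself evaluates to 1.0).
    return 2 ** -(delta_treated - delta_control)

# e.g. a target after knockdown: 2 cycles later than control -> 4-fold lower
print(relative_expression(ct_gene=28.0, ct_gapdh=18.0,
                          ct_gene_ctrl=26.0, ct_gapdh_ctrl=18.0))  # → 0.25
```

This matches the figure legends' convention of setting control siRNA-transfected cells to 1 and reporting other conditions relative to that baseline.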
Luciferase assays were performed in 96-well white plates using Dual-Glo (Promega) according to the manufacturer’s protocol.

Microarrays
Custom-made microarrays (Agilent) were designed based on the library of 3,019 long ncRNA sequences, with on average six probes targeting each transcript. Human whole-genome mRNA arrays were from Agilent (G4112F). Total RNA samples were converted to cDNA using oligo-dT primers. Labeling of the cDNA and hybridization to the microarrays were performed according to Agilent standard dye-swap protocols. Data analysis was done using the AFM 4.0 software. All microarrays were done in four biological replicates.

SUPPLEMENTAL INFORMATION
Supplemental Information includes Extended Experimental Procedures, four figures, and six tables and can be found with this article online at doi:10.1016/j.cell.2010.09.001.

ACKNOWLEDGMENTS
Thanks to the HAVANA team for use of their genome annotation. We also thank the CRG Genomic Facility and the Functional Genomics Core Facility at Wistar and UPenn for expertise in microarray analysis. We thank Dr. Ken Zaret for helpful discussions. U.A.O. is supported by a grant from the Danish Research Council; M.B. is supported by an HFSPO fellowship; A.G. was supported by a fellowship from the American Italian Cancer Foundation; R.G. was supported through the Spanish ministry, GENCODE U54 HG004555-01, and NIH; and R.S. was supported by a grant from NIH, GM 079091.

Received: April 23, 2010
Revised: July 1, 2010
Accepted: August 13, 2010
Published: September 30, 2010

REFERENCES
Banerji, J., Olson, L., and Schaffner, W. (1983). A lymphocyte-specific cellular enhancer is located downstream of the joining region in immunoglobulin heavy chain genes. Cell 33, 729–740.
Banerji, J., Rusconi, S., and Schaffner, W. (1981). Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell 27, 299–308.
Barrallo-Gimeno, A., and Nieto, M.A. (2005).
The Snail genes as inducers of cell movement and survival: implications in development and cancer. Development 132, 3151–3161.
Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Schones, D.E., Wang, Z., Wei, G., Chepelev, I., and Zhao, K. (2007). High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837.
Berezikov, E., and Plasterk, R.H. (2005). Camels and zebrafish, viruses and cancer: a microRNA update. Hum. Mol. Genet. 14, Spec. No. 2, R183–R190.
Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., Samanta, M., Weissman, S., et al. (2004). Global identification of human transcribed sequences with genome tiling arrays. Science 306, 2242–2246.
Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E., et al. (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816.
Blanco, E., Parra, G., and Guigo, R. (2007). Using geneid to identify genes. Curr. Protoc. Bioinformatics, Chapter 4, Unit 4.3.
Carthew, R.W., and Sontheimer, E.J. (2009). Origins and mechanisms of miRNAs and siRNAs. Cell 136, 642–655.
Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H., Helt, G., et al. (2005). Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308, 1149–1154.
Core, L.J., Waterfall, J.J., and Lis, J.T. (2008). Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science 322, 1845–1848.
De Craene, B., Gilbert, B., Stove, C., Bruyneel, E., van Roy, F., and Berx, G. (2005). The transcription factor snail induces tumor cell invasion through modulation of the epithelial cell differentiation program. Cancer Res. 65, 6237–6244.
Denoeud, F., Kapranov, P., Ucla, C., Frankish, A., Castelo, R., Drenkow, J., Lagarde, J., Alioto, T., Manzano, C., Chrast, J., et al. (2007). Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 17, 746–759.
Efroni, S., Duttagupta, R., Cheng, J., Dehghani, H., Hoeppner, D.J., Dash, C., Bazett-Jones, D.P., Le Grice, S., McKay, R.D., Buetow, K.H., et al. (2008). Global transcription in pluripotent embryonic stem cells. Cell Stem Cell 2, 437–447.
Euskirchen, G.M., Rozowsky, J.S., Wei, C.L., Lee, W.H., Zhang, Z.D., Hartman, S., Emanuelsson, O., Stolc, V., Weissman, S., Gerstein, M.B., et al. (2007). Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies. Genome Res. 17, 898–909.
Fejes-Toth, K., Sotirova, V., Sachidanandam, R., Assaf, G., Hannon, G.J., Kapranov, P., Foissac, S., Willingham, A.T., Duttagupta, R., Dumais, E., and Gingeras, T.R. (2009). Post-transcriptional processing generates a diversity of 5′-modified long and short RNAs. Nature 457, 1028–1032.
Feng, J., Bi, C., Clark, B.S., Mady, R., Shah, P., and Kohtz, J.D. (2006). The Evf-2 noncoding RNA is transcribed from the Dlx-5/6 ultraconserved region and functions as a Dlx-2 transcriptional coactivator. Genes Dev. 20, 1470–1484.
Fire, A., Xu, S., Montgomery, M.K., Kostas, S.A., Driver, S.E., and Mello, C.C. (1998). Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391, 806–811.
Gillies, S.D., Morrison, S.L., Oi, V.T., and Tonegawa, S. (1983). A tissue-specific transcription enhancer element is located in the major intron of a rearranged immunoglobulin heavy chain gene. Cell 33, 717–728.
Gumireddy, K., Li, A., Gimotty, P.A., Klein-Szanto, A.J., Showe, L.C., Katsaros, D., Coukos, G., Zhang, L., and Huang, Q. (2009).
KLF17 is a negative regulator of epithelial-mesenchymal transition and metastasis in breast cancer. Nat. Cell Biol. 11, 1297–1304.
Guttman, M., Amit, I., Garber, M., French, C., Lin, M.F., Feldser, D., Huarte, M., Zuk, O., Carey, B.W., Cassady, J.P., et al. (2009). Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223–227.
Harrow, J., Denoeud, F., Frankish, A., Reymond, A., Chen, C.K., Chrast, J., Lagarde, J., Gilbert, J.G., Storey, R., Swarbreck, D., et al. (2006). GENCODE: producing a reference annotation for ENCODE. Genome Biol. 7 (Suppl. 1), S4.1–S4.9.
Heard, E., and Disteche, C.M. (2006). Dosage compensation in mammals: fine-tuning the expression of the X chromosome. Genes Dev. 20, 1848–1867.
Heintzman, N.D., Hon, G.C., Hawkins, R.D., Kheradpour, P., Stark, A., Harp, L.F., Ye, Z., Lee, L.K., Stuart, R.K., Ching, C.W., et al. (2009). Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459, 108–112.
Imperiale, M.J., and Nevins, J.R. (1984). Adenovirus 5 E2 transcription unit: an E1A-inducible promoter with an essential element that functions independently of position or orientation. Mol. Cell. Biol. 4, 875–882.
Kapranov, P., Cheng, J., Dike, S., Nix, D.A., Duttagupta, R., Willingham, A.T., Stadler, P.F., Hertel, J., Hackermuller, J., Hofacker, I.L., et al. (2007). RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316, 1484–1488.
Khalil, A.M., Guttman, M., Huarte, M., Garber, M., Raj, A., Rivea Morales, D., Thomas, K., Presser, A., Bernstein, B.E., van Oudenaarden, A., et al. (2009). Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc. Natl. Acad. Sci. USA.
Khoury, G., and Gruss, P. (1983). Enhancer elements. Cell 33, 313–314.
Kim, T.K., Hemberg, M., Gray, J.M., Costa, A.M., Bear, D.M., Wu, J., Harmin, D.A., Laptewicz, M., Barbara-Haley, K., Kuersten, S., et al. (2010). Widespread transcription at neuronal activity-regulated enhancers. Nature.
Kong, S., Bohl, D., Li, C., and Tuan, D. (1997). Transcription of the HS2 enhancer toward a cis-linked gene is independent of the orientation, position, and distance of the enhancer relative to the gene. Mol. Cell. Biol. 17, 3955–3965.
Lanz, R.B., McKenna, N.J., Onate, S.A., Albrecht, U., Wong, J., Tsai, S.Y., Tsai, M.J., and O’Malley, B.W. (1999). A steroid receptor coactivator, SRA, functions as an RNA and is present in an SRC-1 complex. Cell 97, 17–27.
Lecuyer, E., and Hoang, T. (2004). SCL: from the origin of hematopoiesis to stem cells and leukemia. Exp. Hematol. 32, 11–24.
Lee, R.C., Feinbaum, R.L., and Ambros, V. (1993). The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75, 843–854.
Lunter, G., Ponting, C.P., and Hein, J. (2006). Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput. Biol. 2, e5.
Mattick, J.S. (2009). The genetic signatures of noncoding RNAs. PLoS Genet. 5, e1000459.
Nieto, M.A. (2002). The snail superfamily of zinc-finger transcription factors. Nat. Rev. Mol. Cell Biol. 3, 155–166.
Ogawa, H., Ishiguro, K., Gaubatz, S., Livingston, D.M., and Nakatani, Y. (2002). A complex with chromatin modifiers that occupies E2F- and Myc-responsive genes in G0 cells. Science 296, 1132–1136.
Parra, G., Blanco, E., and Guigo, R. (2000). GeneID in Drosophila. Genome Res. 10, 511–515.
Ponting, C.P., Oliver, P.L., and Reik, W. (2009). Evolution and functions of long noncoding RNAs. Cell 136, 629–641.
Preker, P., Nielsen, J., Kammler, S., Lykke-Andersen, S., Christensen, M.S., Mapendano, C.K., Schierup, M.H., and Jensen, T.H. (2008). RNA exosome depletion reveals transcription upstream of active human promoters. Science 322, 1851–1854.
Using multiple alignment methods to assess the quality of genomic data analysis

Cédric Notredame and Chantal Abergel
Information Génétique et Structurale, UMR 1889, 31 Chemin Joseph Aiguier, 13006 Marseille
Email: cedric.notredame@igs.cnrs-mrs.fr, chantal.abergel@igs.cnrs-mrs.fr

ABSTRACT

The analysis of multiple sequence alignments can generate essential clues in genomic data analysis. Yet, to be informative, such analyses require some means of estimating the reliability of a multiple alignment. In this chapter we describe a novel method allowing the unambiguous identification of the residues correctly aligned within a multiple alignment.
This method uses an index named CORE (Consistency of the Overall Residue Evaluation) based on the T-Coffee multiple sequence alignment algorithm. We provide two examples of application: one where the CORE index is used to identify correct blocks within a difficult multiple alignment, and another where the CORE index is used on genomic data to identify the proper start codon and a frameshift within one of the sequences.

Introduction

Biological analysis largely relies on the assembly of elaborate models meant to summarize our knowledge of life's complex mechanisms. For that purpose, vast amounts of data are collected, analyzed, validated and then integrated within a model. In an ideal world, an existing model would be available to explain every bit of experimental data. In the real world, this is rarely the case, and every day existing models need to be modified to accommodate new findings. Sometimes, data that cannot be explained is kept at bay until the accumulation of new evidence prompts the design of an entirely new model. Unexplained data can be viewed as the stuff inflating an inconsistency bubble. Eventually, the bubble bursts and a new model is designed. A multiple alignment is nothing less than such a model. Given a series of sequences and an alignment criterion (structural similarity, common phylogenetic origin), the multiple alignment contains a series of hypotheses regarding the relationships between the sequences it is made of. This alignment can accommodate data generated experimentally (e.g. the alignment of two homologous catalytic residues) or combine the results of various sequence analysis methods. The importance of multiple sequence alignments in the context of sequence analysis has been recognized for a long time and is so well established that most bioinformatics protocols make use of them.
Multiple alignments have been turned into profiles (Gribskov et al., 1987) and hidden Markov models (Krogh et al., 1994) to enhance the sensitivity and the specificity of database searches (Altschul et al., 1997). State-of-the-art methods for protein structure prediction depend on the proper assembly of a multiple sequence alignment (Jones, 1999), as do phylogenetic analyses (Duret et al., 1994). Over recent years, multiple sequence alignment techniques have been instrumental to improvements made in almost every key area of sequence analysis. Yet, despite its importance, the accurate assembly of a multiple sequence alignment remains a complex process: the biological knowledge and the computational abilities it requires are far beyond our current capacities. As a consequence, biologists are left to use approximate programs that attempt to assemble proper alignments without providing any guarantee that they will do so. The lack of a 'perfect', or at least reasonably robust, method explains why so many multiple sequence alignment packages exist. The variations among these packages are not only cosmetic; they include the use of very different algorithms, different parameters and, generally speaking, different paradigms. For a recent review of state-of-the-art techniques, see (Duret and Abdeddaim, 2000). Database searches, structure predictions and phylogenetic analyses are enough on their own to make multiple alignments compulsory in a genome analysis task. Yet, thanks to the sanity checks they provide, multiple alignments can also be instrumental in tackling the plague of genomic analysis: faulty data. When dealing with genomes, faulty data arise from two major sources: sequencing errors and wrong predictions. The consequence is that a predicted protein sequence may have accumulated errors both at the DNA level and when its frame was predicted (this is especially true in eukaryotic genes, where exons may be missed, added or improperly predicted).
In the worst cases, the effect of such errors will be amplified in the high-level analysis, leading to an improper analysis of the available data. On the other hand, once they have been identified, these errors are usually easily corrected, either by extra sequencing or by data extrapolation. Therefore, any method providing a reasonable sanity check that earmarks areas of a genome likely to be problematic would be a major improvement. In this chapter we will show how multiple sequence alignments can be used to carry out part of this task. For that purpose we will focus on the applications of T-Coffee, a recently described method (Notredame et al., 2000).

Generating Multiple Alignments With T-Coffee

Despite the large variety of multiple sequence alignment methods publicly available, the number of packages effectively used for data analysis is surprisingly small, and a vast majority of the alignments found in the literature are produced using only two programs: ClustalW (Thompson et al., 1994) and its X-Window implementation ClustalX. ClustalW uses the progressive alignment strategy described by Taylor (Taylor, 1988) and by Feng and Doolittle (Feng and Doolittle, 1987), refined in order to incorporate sequence weights and a local gap penalty scheme. Recently, the ClustalW algorithm was further modified in order to improve the accuracy of the produced alignments by making the evaluation of the substitution costs position-dependent. This improved algorithm is implemented in the T-Coffee package (Notredame et al., 2000). The aim of T-Coffee is to build a multiple alignment that has a high level of consistency with a library of pre-computed pair-wise alignments. This library may contain as many alignments as one wishes, and it may also be redundant and inconsistent with itself. For instance, it may contain several alternative alignments of the same sequences aligned using various gap penalties.
It may also contain alternative alignments obtained by applying different methods to the sequences. Overall, the library is a collection of alignments believed to be correct. Within this library, each alignment receives a weight that is an estimation of its biological likeliness (i.e. how much one trusts this alignment to be correct). For that purpose, one may use any suitable criterion such as percent identity, P-value estimation or any other appropriate measure. The T-Coffee algorithm uses this library in order to compute the score for aligning two residues with one another in the multiple alignment. This score is named the extended weight, because it requires an extension of the library. The extended weight takes into account the compatibility of the alignment of two residues with the rest of the alignments observed within the library; its derivation is extensively described in (Notredame et al., 2000). The principle is straightforward: in order to compute the extended weight associated with two residues R and S of two different sequences, one considers whether, when R is found aligned in the library with some residue X of a third sequence, S is also found aligned with that same residue X in another entry of the library. If that is the case, then the weight associated with R and S is increased by the minimum of the two weights RX and SX. The final extended weight is obtained once every possible X has been considered and the resulting contributions summed up. Although this operation seems very expensive from a computational point of view, its effective cost is kept low thanks to the sparseness of the primary library (i.e. for most pairs of residues RS, very few Xs need to be considered). In the end, a pair of residues is highly consistent (and has a high extended weight) if most of the other sequences contain at least one residue that is found aligned both to R and to S in two different pair-wise alignments.
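The extension step described above can be sketched in a few lines. This is a simplified illustration, not the package's actual data structures: residues are assumed to be (sequence, position) tuples, and the library a plain dict mapping residue pairs to primary weights.

```python
def extended_weight(library, r, s):
    """Extended weight of aligning residues r and s (illustrative sketch).

    Start from the direct library weight of the pair (if any), then add,
    for every third residue x found aligned to both r and s in the
    library, the minimum of the two supporting weights min(w(r,x), w(s,x)).
    `library` maps frozenset({residue, residue}) keys to weights."""
    total = library.get(frozenset((r, s)), 0)
    # Partners of r and of s found anywhere in the library.
    partners_r = {x: w for pair, w in library.items() if r in pair
                  for x in pair if x != r}
    partners_s = {x: w for pair, w in library.items() if s in pair
                  for x in pair if x != s}
    for x, w_rx in partners_r.items():
        if x == s:          # the direct pair was already counted
            continue
        if x in partners_s:
            total += min(w_rx, partners_s[x])
    return total
```

Because the primary library is sparse, only the few residues actually paired with both r and s contribute, which is what keeps the extension cheap in practice.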
A key property of this weight extension procedure is to concentrate information: the extended score of RS incorporates information coming from all the sequences in the set, and not only from the two sequences contributing R and S. The main advantage of the extended weights is that they can be used in place of a substitution matrix. While standard substitution matrices do not discriminate between two identical residues (e.g. all cysteines are the same for a PAM (Dayhoff et al., 1979) or a BLOSUM (Henikoff and Henikoff, 1992) matrix), the extended weights are truly position-specific and make it possible to discriminate between two identical residues that only differ by their positions. Once the library has been assembled (potential ways of assembling that library are described later) and the extended weights computed, T-Coffee closely follows the ClustalW procedure, using the extended weights instead of a substitution matrix. The overall T-Coffee strategy is outlined in Figure 1. All the sequences are first aligned two by two, using dynamic programming (Needleman and Wunsch, 1970) and the extended library in place of a substitution matrix. The distance matrix thus obtained is then used to compute a neighbor-joining tree (Saitou and Nei, 1987). This tree guides the progressive assembly of a multiple sequence alignment: the two closest sequences are first aligned by normal dynamic programming, using the extended weights to align the residues in the two sequences; no gap penalty is applied (because it has already been applied to generate the alignments contained in the library). This pair of sequences is then fixed, and any gaps that have been introduced cannot be shifted later. The program then aligns the next closest two sequences, or adds a sequence to the existing alignment of the first two, depending on which is suggested by the guide tree. The procedure always joins the next two closest sequences or pre-aligned groups of sequences.
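The pair-wise dynamic programming step just described can be illustrated with a bare-bones Needleman and Wunsch routine in which the substitution matrix is replaced by a table of extended weights and no gap penalty is applied. This is a sketch only; the `ew` score table and the tuple-based output are assumptions for illustration, not the package's API.

```python
def align_with_extended_weights(len_a, len_b, ew):
    """Global alignment of two sequences of lengths len_a and len_b.

    `ew[i][j]` is the extended weight for pairing residue i of the first
    sequence with residue j of the second; the gap penalty is zero, as
    in the T-Coffee progressive stage (the library already accounts for
    gaps).  Returns the optimal score and the aligned residue pairs."""
    # score[i][j]: best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            score[i][j] = max(score[i - 1][j - 1] + ew[i - 1][j - 1],
                              score[i - 1][j],   # gap in second sequence
                              score[i][j - 1])   # gap in first sequence
    # Trace back to recover the aligned pairs; ties favour pairing.
    pairs, i, j = [], len_a, len_b
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + ew[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return score[len_a][len_b], list(reversed(pairs))
```

The same recursion is reused during the progressive stage, with column-averaged extended weights standing in for the single-residue scores.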
This continues until all the sequences have been aligned. To align two groups of pre-aligned sequences, one uses the extended weights as before, but taking the average library scores in each column of the existing alignments. The key feature of T-Coffee is the freedom given to the user to build his own library following whatever protocol may seem appropriate. For this purpose, one may mix structural information with database results, knowledge-based information or pre-established collections of multiple alignments. It may also be used to explore a wide range of parameters for a given computer package. A simple library format was designed for that purpose; it is shown in Figure 2. A library is a straightforward ASCII file that contains a listing of every pair of aligned residues that needs to be described. Any knowledge-based information can easily be added manually to an automatically generated library, or the other way round. This figure also shows clearly that the library can contain ambiguities and inconsistencies (i.e. two possible alignments for the first residue of Seq1 with Seq2). These ambiguities are resolved while the alignment is being assembled, on the basis of the score given by the extended weights. The library does not need to contain a weight for each possible pair of residues. On the contrary, an ideal library only contains pairs that will effectively occur in the correct multiple alignment (i.e. N²L pairs rather than N²L² pairs). While this flexibility to design and assemble one's own library is a very desirable property, in practice it is also convenient to have a standard automatic protocol available. Such a protocol exists and is fully integrated within the T-Coffee package. It is run in the default mode and does not require the user to be aware of T-Coffee's underlying concepts (library, extension, progressive alignment).
This default protocol, extensively described and validated in (Notredame et al., 2000), requires two distinct libraries to be compiled and combined within the primary library before the extension. The first one contains a ClustalW pair-wise alignment of each possible pair of sequences within the dataset. For that purpose, ClustalW (Thompson et al., 1994) is run using default parameters. This library is global because it is generated by aligning the sequences over their whole length (global alignments) using a linear-space version of the Needleman and Wunsch algorithm (Needleman and Wunsch, 1970). The second library is local: for each possible pair of sequences, it contains the ten best non-overlapping local alignments, as reported by the Lalign program (Huang and Miller, 1991) run with default parameters. In both the local and the global libraries, each pair of residues found aligned is associated with a weight equal to the average level of identity within the alignment it came from. When a specific pair is found more than once, the weights associated with each occurrence are added. The main strength of this protocol is to combine local and global information within a multiple alignment. The level of consistency within the library will depend on the nature of the sequences. For instance, if the sequences are very diverse, the requirement for long insertions/deletions will often cause the global alignments to be incorrect and inconsistent, while the local alignments will be less sensitive to that type of problem. In such a situation, the measure of consistency will enhance the local alignment signal and let it drive the multiple alignment assembly. Conversely, if the global alignments are good enough, they will help remove the noise associated with the collection of local alignments (local alignments do not have any positional constraints).
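The weighting scheme just described, in which each aligned pair is weighted by the average percent identity of the alignment it came from and recurring pairs have their weights summed, can be sketched as follows (illustrative data layout only, not the library file format itself):

```python
def add_alignment_to_library(library, aligned_pairs, percent_identity):
    """Add one pair-wise alignment to a primary library (sketch).

    `aligned_pairs` lists the residue pairs of a single global or local
    alignment; each pair receives the alignment's average percent
    identity as its weight, and the weights of pairs that recur across
    alignments (e.g. global and local) are summed."""
    for pair in aligned_pairs:
        key = frozenset(pair)
        library[key] = library.get(key, 0) + percent_identity
    return library
```

A pair supported by both the global and the local library therefore accumulates a higher weight, which is exactly how agreement between the two signals is rewarded before the extension.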
Overall, the current default T-Coffee protocol contains three distinct elements that lead to the collection of extended weights: the global library, the local library, and the library extension that turns the sum of the two libraries into an extended library. Earlier work demonstrated that each of these components plays a significant part in improving the overall accuracy of the program. Table 1 shows that the current version of T-Coffee (version 1.29) outperforms other popular multiple sequence alignment methods, as judged by comparison on BaliBase (Thompson et al., 1999), a database of hand-made reference alignments based on structural comparison (see the Table 1 legend for a description of BaliBase and the comparison protocol). These results illustrate well the good performance of T-Coffee across the wide range of situations that occur in BaliBase. It is especially interesting to point out that T-Coffee is the only method equally well suited to situations that require a global alignment strategy (categories 1, 2 and 3) and to situations that are better served by a local alignment strategy (categories 4 and 5, with long internal and terminal insertions/deletions). The other methods are either good for global alignments (like ClustalW) or for local alignments (like Dialign2 (Morgenstern et al., 1998)). It should be noted that T-Coffee still uses ClustalW 1.69 to construct the primary global library, because this was the last 'naïve' version of ClustalW, not tuned on BaliBase. The latest version (1.81) has been tuned on the BaliBase references (hence its improved performance over the results originally reported for ClustalW). Using this ClustalW 1.81 version when benchmarking T-Coffee would make the process circular. Nonetheless, as good as it may seem, T-Coffee still suffers from the same shortcoming as any other package available today: it is not always the best method.
Even if on average it does better than any of its counterparts, one cannot guarantee that T-Coffee will always generate the best alignment. For instance, although Dialign2 is significantly less good overall, it outperforms T-Coffee on 17 test sets (11%). ClustalW is better than T-Coffee in 24% of the cases. We may conclude from this that, in practice, there will always be situations where some alternative method beats T-Coffee. Furthermore, even in cases where the T-Coffee improvement over any alternative method is very significant, it may still produce an alignment much less than 100% correct. Such an alignment is of limited use on its own: for practical purposes, it would be much more helpful to know where the correctly aligned portions lie. Indeed, a method that is only 20% correct but comes with a proper estimation of its reliability would be much more useful than a method that is merely more accurate 'on average'. Several situations exist in which a biologist can make use of this reliability information. For instance, if the purpose of the alignment is to extrapolate some experimental data onto an otherwise uncharacterized genomic sequence, one will need to be very careful not to deduce anything from an unreliable portion of the alignment. More generally, unreliable positions within a multiple sequence alignment should not be used for predictive purposes. For instance, when turning a multiple alignment into a profile in order to scan databases for remote homologues, it is essential to exclude regions whose alignment cannot be trusted and that may obscure some otherwise highly conserved position. Used in this fashion, reliability information allows a significant decrease of the noise induced by locally spurious alignments. The other important application of a reliability measure is the identification of regions within a multiple alignment that are properly aligned without being highly conserved.
These regions are extremely important when the alignment is used in conjunction with a predictive method that bases its analysis on mutation patterns. For instance, structure and phylogeny prediction methods require the presence of non-conserved positions to yield informative results. Any scheme that allows discriminating between positions that are degenerate but correctly aligned and positions that are simply misaligned may induce a dramatic improvement in the accuracy of these prediction methods. Furthermore, a reliability measure will help identify faulty data and provide some clues on how to correct it. In the next section, we show how consistency can be measured on a T-Coffee alignment and how that measure provides a fairly accurate reliability estimator.

Measuring The Consistency On A Multiple Sequence Alignment

T-Coffee is a heuristic algorithm that attempts to optimize the consistency between a multiple alignment and a list of pre-computed pair-wise alignments known as a library (Figure 2). By consistency we mean that a pair of residues described as aligned in the library will also be found aligned in the multiple alignment. While the theoretical maximum for the consistency is 100%, the score of an optimal alignment will only be equal to the level of self-consistency within the library. Figure 2 shows the example of a library that is not self-consistent, because it is ambiguous regarding the alignment of some of the residues it contains. Of course, the more ambiguous the library, the less consistency it will yield. For instance, given two residues R1 and R2 taken from two different sequences S1 and S2, one can easily measure the consistency CS(R1, R2) between the alignment of these two residues and all the other alignments contained in the library, by comparing ES(R1, R2), the extended score of the pair, with the sum of the extended scores of all the other potential pairs that involve S1 and S2 and either R1 or R2.
If we want to use it as a quality factor, this measure suffers from two major drawbacks. Firstly, it is expensive to compute: given a multiple alignment of N sequences and of length L, each pair of residues found in the multiple alignment needs O(L) extension operations, each requiring a minimum of O(N) operations. "O(L)" is the standard big-O notation, meaning that the computation time is proportional to L, up to a constant factor. Since there are L·N² pairs of residues in a multiple alignment, this leads to O(L²N³) operations for an estimate of the CS of every pair. This cubic complexity becomes problematic with large numbers of sequences. The second limitation of this measure is that with sequences rich in repeats, the summation factor can become artificially high and cause a dramatic decrease of the consistency score. In practice, we found it much more effective to use the extended score of the best scoring pair contained in the alignment as a normalization factor. This defines the aCS (approximate Consistency Score):

    CS(R1, R2) = ES(R1, R2) / [ Σx ES(R1, x) + Σy ES(y, R2) ]    (1)

    aCS(R1, R2) = ES(R1, R2) / Max{ ES(Ri, Rj) }    (2)

where the sums in (1) run over all the other potential pairs that involve S1 and S2 and either R1 or R2, and the maximum in (2) is taken over every pair of residues Ri, Rj found aligned in the multiple alignment. Our measurements on the BaliBase dataset indicate that the CS and the aCS are well correlated. An important criterion, when using the aCS as a reliability measure, is its ability to discriminate between correct and incorrect alignments within the so-called twilight zone (Sander and Schneider, 1991). Given two sequences, the twilight zone is a range of percent identity (between 0 and 30%) that has been shown to be non-informative regarding the relationship that exists between two sequences. Two sequences whose alignment yields less than 30% identity can either be unrelated, or related and incorrectly aligned, or related and perfectly aligned.
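Equation (2) can be sketched as follows. This is a toy illustration: the `es` table of extended scores and the list of aligned pairs are assumed data layouts, not T-Coffee's internals.

```python
def acs(es, pair, aligned_pairs):
    """Approximate consistency score aCS of one residue pair, eq. (2).

    The extended score of `pair` is normalised by the best extended
    score found among all the residue pairs of the multiple alignment,
    which avoids both the O(L^2 N^3) summation of eq. (1) and the
    repeat-induced inflation of its denominator.  `es` maps frozenset
    residue pairs to extended scores (sketch)."""
    best = max(es.get(frozenset(p), 0) for p in aligned_pairs)
    return es.get(frozenset(pair), 0) / best if best else 0.0
```

Normalising against a single maximum makes the measure cheap to compute once the extended scores of the alignment's pairs are known.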
To check how good the aCS is when used as an accuracy measure, each of the 142 BaliBase datasets was aligned using T-Coffee 1.29 and the similarity of each pair of sequences was measured within the obtained alignments. Pairs of sequences with less than 30% identity (5088 pairs) were extracted, and the accuracy of their alignment was assessed by comparison with their counterparts in the reference BaliBase alignment; the aCS score was also computed on each pair of aligned residues and averaged along the sequences. Figure 3a shows the scatter graph of identity versus accuracy (see the Figure legend for the definitions). There is only a weak correlation between these two measurements, and the percent identity is a poor predictor of the alignment accuracy. For 75% of the sequence pairs (identity lower than 25%), the accuracy indication given by the percent identity falls within a 40% range (i.e. the average identity indicates the average accuracy +/- 20%). On the other hand, when the accuracy is plotted against the aCS score (Figure 3b), the correlation is improved, and for pairs of sequences having an aCS higher than 20 (which is true for 60% of the 5088 pairs) this measure is a much better predictor of alignment accuracy than the percent identity. While they do not solve the twilight zone problem, these results indicate that the aCS measure provides us with a powerful means of assessing an alignment's reliability within the twilight zone. Nonetheless, from a practical point of view, the aCS measure by itself is of limited use, since one is often more concerned with the overall quality (i.e. is residue r of sequence S correctly aligned with the rest of the sequences?) than with pair-wise relationships. In order to answer this type of question, the aCS measure was used to derive three very useful non-pair-wise indexes.
The Consistency of the Overall Residue Evaluation (CORE) is obtained by averaging the scores of each of the aligned pairs involving a residue within a column:

    CORE(Rx) = Σ y=1..N, y≠x aCS(Rx, Ry) / (N − 1)    (3)

where Rx and Ry are two residues found aligned in the same column of an alignment of N sequences. The CORE index and equivalent approaches have been shown in the literature to be good indicators of the local quality of a multiple sequence alignment (Heringa, 1999; Notredame et al., 1998), as judged by comparison with reference biological alignments. In the T-Coffee package, an option makes it possible to output multiple alignments with the CORE index (a rounded value between 0 and 9) replacing each residue. It is also possible to produce a colorized version (pdf, postscript or html) of that same multiple alignment, where residues receive a background coloration proportional to their CORE index (blue/green for low-scoring residues and orange/red for high-scoring ones). Such outputs are shown in Figures 5 and 6. The CORE index described in equation (3) is merely an average aCS measure, and whether that measure provides some indication of the multiple alignment quality is a key question. We tested that hypothesis on the complete BaliBase dataset. Given each T-Coffee alignment, residues were divided into four categories: (i) true positives (TP) are correctly aligned residues rightfully predicted to be so; (ii) true negatives (TN) are incorrectly aligned residues rightfully predicted to be so; (iii) false positives (FP) are residues predicted to be well aligned when they are not; (iv) false negatives (FN) are residues wrongly predicted to be misaligned. Following previously described definitions (Notredame et al., 1998), a residue is said to be correctly aligned if at least 50% of the residues to which it is aligned in the reference alignment are found in the same column in the T-Coffee alignment.
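A minimal sketch of equation (3) follows. The rescaling of the average aCS onto the 0-9 display range is an assumption made for illustration; the package's exact rounding rule may differ.

```python
def core_index(acs_scores):
    """CORE index of one residue, eq. (3) (sketch).

    `acs_scores` lists the aCS values (assumed to lie in [0, 1]) of the
    N-1 aligned pairs the residue forms with the other residues of its
    column.  The average is rescaled to the 0-9 range used in
    T-Coffee's colourised output (rescaling rule assumed here)."""
    if not acs_scores:
        return 0
    mean = sum(acs_scores) / len(acs_scores)
    return min(9, max(0, round(mean * 9)))
```

Averaging over the whole column is what turns the pair-wise aCS into a per-residue reliability index.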
Each of the 10 CORE indexes (0 to 9) was used in turn as a threshold to discriminate between correctly and incorrectly aligned residues in the T-Coffee alignments. The BaliBase reference alignments were then used to evaluate the TP, TN, FP and FN counts. Sensitivity and specificity were then computed according to Sneath and Sokal (Sneath and Sokal, 1973) and plotted on a graph (Figure 4). Our results indicate that the best trade-off between sensitivity and specificity is obtained when CORE=3 is used as the threshold (i.e. every residue with a score higher than or equal to 3 is considered to be properly aligned). In that case the specificity is 84% and the sensitivity is 82%. These high figures partly reflect the overall quality of the T-Coffee alignments, in which 80.5% of the residues are correctly aligned according to the criterion used here. It is therefore more interesting to note that when the CORE index reaches 7, the specificity is 97.7% and the sensitivity is close to 50%. This means that, thanks to the CORE index, half of the residues properly aligned in a multiple alignment can be unambiguously identified (i.e. more than 40% of all the residues contained in BaliBase). In the next section we will see that this proper identification sometimes occurs in cases that are far from trivial, even for an expert eye. Similar results were observed when applying the CORE index to multiple alignments obtained using another method (i.e. ClustalW alignments evaluated with a standard T-Coffee library). This suggests that the CORE measure may be used to evaluate the local quality of a multiple alignment produced by any source. However, one should be well aware that the relevance of the CORE measure regarding the reliability of an alignment is entirely dependent on the way in which the library was derived. All the conclusions drawn here only apply to libraries derived using the standard T-Coffee protocol.
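The threshold scan described above can be sketched as below, using one common definition of sensitivity and specificity; the chapter's exact Sneath and Sokal formulas may differ slightly.

```python
def sensitivity_specificity(scored_residues, threshold):
    """Sensitivity and specificity of the CORE index as a predictor of
    residue correctness at a given threshold (sketch).

    `scored_residues` holds (core_index, is_correct) tuples, where
    is_correct comes from comparison with a reference alignment; a
    residue with core_index >= threshold is predicted correctly aligned."""
    tp = sum(1 for c, ok in scored_residues if c >= threshold and ok)
    fn = sum(1 for c, ok in scored_residues if c < threshold and ok)
    tn = sum(1 for c, ok in scored_residues if c < threshold and not ok)
    fp = sum(1 for c, ok in scored_residues if c >= threshold and not ok)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```

Scanning the ten possible thresholds with this function is enough to reproduce a trade-off curve of the kind shown in Figure 4.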
The sequence CORE (sCORE) is obtained by averaging the CORE scores over all the residues contained within one sequence:

    sCORE(Sx) = Σ i=1..L CORE(Ri) / L    (4)

where the sum runs over the L residues Ri of sequence Sx. That measure can be helpful for identifying an outlier among the sequences: a sequence that should not be part of the set, either because it is not homologous or because it is too distantly related to the other members to yield an informative alignment. The alignment CORE (alCORE) may be obtained by averaging the sCOREs over all the sequences. Our analysis suggests that this index may be a reasonable indicator of the overall alignment accuracy. Yet, to be fully informative, it requires the sequence set to be homogeneous (i.e. the standard deviation of the sCOREs should be as low as possible).

Using The CORE Measure To Assess Local Alignment Quality

The driving force behind the development of the CORE index is the identification of correctly aligned blocks of residues within a multiple sequence alignment. It is common practice to identify these blocks by scanning the multiple alignment and marking highly conserved regions as potentially meaningful. ClustalW and ClustalX provide a measure of conservation that may help the user when carrying out this task. Unfortunately, situations exist where it is difficult to make a decision regarding the correct alignment of some residues within an alignment. Such an example is provided in Figure 5 with the BaliBase alignment known as 1pamA_ref1, made of 6 alpha-amylases. This set is difficult to align because it contains highly divergent sequences. Not only have these sequences accumulated mutations while they were diverging, but they have also undergone many insertions/deletions that make it difficult to reconstruct their relationships with accuracy. The average level of identity measured on the BaliBase reference is 18%, the two closest sequences being less than 20% identical.
As such, 1pamA_ref1 constitutes a classic example of a test set deceptive for most multiple sequence alignment methods. The fact that less than one third of the 1pamA_ref1 reference alignment is annotated as trustable in BaliBase confirms that suspicion. When run on this test set, existing alignment programs generate different results: Prrp finds 37% of the columns annotated as trustable in BaliBase, ClustalW (1.81) 40%, T-Coffee 54% and Dialign2 56%. Regardless of the method used, such an alignment is completely useless unless the correctly aligned portions can be identified. This is exactly the information that the CORE index provides us with. An alignment colorized according to its CORE indexes is shown in Figure 5. The results are in good agreement with those reported in Figure 4. Out of the 905 correctly aligned residues (42% of the total), 267 have a score higher than 7. No incorrectly aligned residue has a score higher than or equal to 7. Using 7 as a prediction threshold gives a sensitivity of 29% and a specificity of 100%. Residues with a CORE index of 3 or higher (pale yellow) yield a sensitivity of 65% and a specificity of 79%. In this alignment, the main features are the red/dark-orange blocks: they are 100% correct. These blocks could be fed as they are to any suitable method (structure prediction, phylogeny, etc.). They are not very well conserved at the sequence level and are therefore very informative for structural and phylogenetic analysis. For instance, block II in Figure 5 is perfectly aligned although, within that block, the average pair-wise identity is lower than 30% (41% for the two most closely related sequences). The measure of consistency can also help question positions that may seem unambiguous from a sequence point of view.
In the column annotated as I, the position marked with a “*” could easily be mistaken for a correct one: it is within a block, and aromatic positions are usually fairly well conserved and, owing to their relative rarity, unlikely to occur by chance. Yet the green color code indicates that this position may be incorrectly aligned (the green tyrosine has a CORE index of 1). This is confirmed by comparison with the reference, which shows that the correct alignment incorporates another tyrosine at this position. When analyzing these patterns, one should always keep in mind that the consistency information only has a positive value. In other words, inconsistent regions are those where the library does not support the alignment. This does not mean they are incorrectly aligned, but rather that no information is at hand to support or disprove the observed alignment.

Identifying Faulty Gene Predictions

Another possible application of the T-Coffee CORE index is to reveal and help resolve sequence ambiguities in predicted genes. In the structural genomics era, many projects involve hypothetical proteins, for which an accurate prediction of the start and stop codons is needed to properly express the gene product. Since over-predicted N- or C-termini are rarely conserved at the amino acid level, sequence comparison provides us with a very powerful means of identifying this type of problem. A simple procedure consists of multiply aligning the most conserved members of a protein family and then measuring the T-Coffee CORE index on the resulting alignment. Inspection of the CORE patterns offers a diagnostic regarding the correctness of the data. This approach can also be applied to frame-shift detection, where the identification of abnormally low-scoring segments may lead to their correction. Such an alignment will make it possible to decide whether the abnormal length of a coding region could be due to a sequencing error (and the resulting frame-shift).
At the very least, the CORE measure will indicate that a thorough examination is needed. Of course, one could also detect these frame-shifts using standard pair-wise comparison methods such as GeneWise (Birney and Durbin, 2000), but the advantage of using a multiple sequence alignment is that the simultaneous comparison of several sequences can strengthen the evidence that the frame-shift is real. Furthermore, thanks to the multiple alignment, one may be able to detect mistakes in sequences that lack a very close homologue. To illustrate this potential usage of T-Coffee, we chose the example of an Escherichia coli K-12 gene (Accession # U00096) predicted to encode a protein of unknown function, yifB. Orthologous genes were found in complete genomes using BLAST (Altschul et al., 1997) and the four most conserved sequences (identity >70% relative to the Escherichia coli K-12 gene, see figure for ID numbers) were retrieved along with their flanking regions (80 nucleotides on the N-terminus side), in order to check whether these supposedly non-coding regions contained any coding information. The ‘elongated’ sequences were translated in the same frame as their core coding region, their multiple alignment was carried out using T-Coffee and the CORE indexes were measured. The resulting alignment is displayed in Figure 6 with the CORE indexes color-coded (low CORE in blue and green, high CORE in orange and red). The main feature on the N-terminus is an abrupt transition (II) from low to high CORE indexes. This position is also a conserved methionine. The combination of these two observations suggests that the starting point of these five sequences is probably where the transition occurs, ruling out other methionines as potential starting points in the first sequence (I). Another discrepancy in this alignment is also emphasized by the CORE analysis: the sequence yifB_SALTY_1 yields a very low N-terminal CORE index relative to the other family members.
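The screening step described above, flagging abnormally low-scoring segments as frame-shift or terminus candidates, can be sketched with a simple sliding window; the window size and cutoff below are arbitrary choices, not values from the text:

```python
def low_core_segments(core_indexes, window=10, cutoff=3.0):
    """Return start positions of windows whose mean CORE index falls below
    the cutoff; a run of flagged windows in an otherwise well-scoring
    sequence is worth re-examining for a frame-shift or a bad terminus."""
    flagged = []
    for i in range(len(core_indexes) - window + 1):
        if sum(core_indexes[i:i + window]) / window < cutoff:
            flagged.append(i)
    return flagged

# A made-up sequence: well-aligned stretch (CORE 8), a low-scoring stretch
# (CORE 1), then well-aligned again; flagged windows bracket the bad region.
print(low_core_segments([8] * 20 + [1] * 10 + [8] * 20))   # [18, 19, 20, 21, 22]
```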
The CORE score of this sequence abruptly comes into phase with the other sequences at the position marked III. This pattern is a clear indication of a frame-shift: a protein highly similar to the other members of its family but locally unrelated. To verify that hypothesis, we used data provided by SwissProt (Bairoch and Boeckmann, 1992) and found that in the corresponding entry, the nucleotide sequence has been corrected to remove the frame-shift we observed (entry P57015). The corrected sequence has been added to the bottom of the alignment in Figure 6 (non-colored sequence). The position where yifB_SALTY_1 and its corrected version start agreeing is also the position where the CORE score changes abruptly from a value of 2 (yellow) to a value of 7 (orange). That position also turns out to be the one where the frame-shift occurs in the genomic sequence.

Conclusion

In this chapter, we introduced an extension of the T-Coffee multiple sequence alignment method: the CORE index. The CORE index is a means of assessing the local reliability of a multiple sequence alignment. Using the CORE index, correct blocks within a multiple sequence alignment can be identified. This measure also makes it possible to detect potential errors in genomic data, and to correct them. The CORE index is a relatively ad hoc measure and, even if it may prove extremely useful from a practical point of view, it still needs to be attached to a more theoretical framework. One would really need to be able to turn the consistency estimation into some sort of P-value. For instance, to assess efficiently the local value of an alignment, one would like to ask questions of the following kind: what is the probability that library X was generated using dataset Y? What is the probability that alignment A yields p% consistency with library X? Altogether, these questions may open more avenues for the automatic processing of multiple alignments.
That issue may prove crucial for the maintenance of resources that rely on a large-scale usage of multiple sequence alignments, such as HOBACGEN (Perriere et al., 2000), HOVERGEN (Duret et al., 1994) or ProDom (Corpet et al., 2000).

Figure Legends

Figure 1: Layout of the T-Coffee algorithm. This figure indicates the chain of events that leads from unaligned sequences to a multiple sequence alignment using the T-Coffee algorithm. Data processing steps are boxed while data structures are indicated by rounded boxes.

Figure 2: Library Format. An example of a library used by T-Coffee. The header contains the sequences and their names. ‘# 1 2’ indicates that the following pairs of residues will come from sequences 1 and 2. Each pair of aligned residues contains three values: the index of residue 1, the index of residue 2 and the weight associated with the alignment of these two residues. No order or consistency is expected within the library.

Figure 3: a) Percentage identity vs accuracy in the twilight zone: the 5088 pairs of sequences that have less than 30% identity in the BaliBase reference alignments were extracted. The accuracy of their alignment was measured by comparison with the reference, and the resulting graph was plotted. b) Approximate Consistency Score (aCS) vs accuracy in the twilight zone: the aCS was measured on the 5088 pairs of sequences previously considered and was plotted against the average accuracy previously reported. The vertical line indicates aCS=25 and separates the pairs for which the aCS is informative from those whose aCS seems to be non-informative.

Figure 4: Specificity and sensitivity of the CORE measure. The sensitivity and the specificity of the CORE index used as an alignment quality predictor were evaluated on the BaliBase test sets. Measures were carried out on the entire BaliBase dataset.
The sensitivity and the specificity (shown as the two curves) were measured on the T-Coffee alignments after considering that every residue with a CORE index higher than x was properly aligned (see text for details).

Figure 5: Identifying correct blocks with the CORE measure. An example of the T-Coffee output on a BaliBase test set (1pamA_ref1) that contains five alpha-amylases. This alignment was produced using T-Coffee 1.29 with default parameters and requesting the score_pdf output option. The color scale goes from blue (CORE=0, bad) to red (CORE=9, good). The residues in capitals are correctly aligned (as judged by comparison with the BaliBase reference). Those in lower case are improperly aligned. Box I indicates a conserved position that is not properly aligned; box II indicates a block of distantly related segments that is correctly aligned by T-Coffee.

Figure 6: Identifying frame shifts and start codons. The chosen sequences are YifB_ECOLIA (Escherichia coli, Accession # AE005174), YifB_SALTY_1 (Salmonella typhi, C18 chromosome, Sanger Center), YifB_HAIN (Haemophilus influenzae, Accession # L42023), YifB_PASMU (Pasteurella multocida, Accession # AE004439) and YifB_PSEAE (Pseudomonas aeruginosa, Accession # AE004091). They were aligned using the standard T-Coffee alignment procedure, requesting the score_pdf output option. The corrected Salmonella typhi YifB protein sequence was later added for further reference (YifB_SALTY, SP: P57015), but it was not used for coloring the residues (non-colored sequence) or for improving the multiple alignment. The figure only shows the N-terminal portion of the alignment, and the arrows indicate the positions annotated as start codons in SwissProt (except for Salmonella typhi). Box I indicates a putative start codon in YifB_ECOLIA, box II indicates the true start codon in most sequences, and box III indicates the position where the frame-shift occurs in YifB_SALTY_1.
Table 1

            cat 1   cat 2   cat 3   cat 4   cat 5   avg 1   avg 2
cw          79.53   32.91   48.72   74.02   67.84   67.89   61.82
prrp        78.62   32.45   50.14   51.12   82.72   66.45   60.25
dialign2    70.99   25.21   35.12   74.66   80.38   61.54   57.99
T-Coffee

To produce this table, each dataset contained in BaliBase was aligned using one of the following methods: cw: ClustalW 1.81 (Thompson et al., 1994), Prrp (Gotoh, 1996), Dialign2 (Morgenstern et al., 1998) and T-Coffee 1.29 (Notredame et al., 2000). In BaliBase, reference alignments are classified into 5 categories: category 1 contains closely related sequences; category 2 contains a group of closely related sequences and an outlier; category 3 contains two groups of distantly related sequences; category 4 contains families with long internal indels; category 5 contains sequences with long terminal indels. The resulting alignments were then compared to their reference counterparts in BaliBase, using only the regions annotated as trustable. Under this scheme, we define the accuracy of an alignment to be the number of reference columns it reproduces exactly, divided by the total number of columns within that reference. The comparison is restricted to the portions annotated as trustworthy in the reference alignment. avg 1 is the average of the results obtained on each of the 142 test cases; avg 2 is the average of the results obtained in each category. Bold numbers indicate the best performing method.

Bibliography

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Bairoch, A. and Boeckmann, B., 1992. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 20: 2019-2022.
Birney, E. and Durbin, R., 2000.
Using GeneWise in the Drosophila annotation experiment. Genome Res. 10: 547-548.
Corpet, F., Servant, F., Gouzy, J. and Kahn, D., 2000. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28: 267-269.
Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C., 1979. A model of evolutionary change in proteins. Detecting distant relationships: computer methods and results. In: M.O. Dayhoff (Editor), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, D.C., pp. 353-358.
Duret, L. and Abdeddaim, S., 2000. Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. In: D. Higgins and W. Taylor (Editors), Bioinformatics: Sequence, Structure and Databanks. Oxford University Press, Oxford.
Duret, L., Mouchiroud, D. and Gouy, M., 1994. HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res. 22: 2360-2365.
Feng, D.-F. and Doolittle, R.F., 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25: 351-360.
Gotoh, O., 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264: 823-838.
Gribskov, M., McLachlan, M. and Eisenberg, D., 1987. Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. 84: 4355-4358.
Henikoff, S. and Henikoff, J.G., 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915-10919.
Heringa, J., 1999. Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Computers and Chemistry 23: 341-364.
Huang, X. and Miller, W., 1991. A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. 12: 337-357.
Jones, D.T., 1999.
Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195-202.
Krogh, A., Brown, M., Mian, I.S., Sjölander, K. and Haussler, D., 1994. Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235: 1501-1531.
Morgenstern, B., Frech, K., Dress, A. and Werner, T., 1998. DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14: 290-294.
Needleman, S.B. and Wunsch, C.D., 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-453.
Notredame, C., Higgins, D.G. and Heringa, J., 2000. T-Coffee: a novel algorithm for multiple sequence alignment. J. Mol. Biol. 302: 205-217.
Notredame, C., Holm, L. and Higgins, D.G., 1998. COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
Perriere, G., Duret, L. and Gouy, M., 2000. HOBACGEN: database system for comparative genomics in bacteria. Genome Research 10: 379-385.
Saitou, N. and Nei, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406-425.
Sander, C. and Schneider, R., 1991. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics 9: 56-68.
Sneath, P.H.A. and Sokal, R.R., 1973. Numerical Taxonomy. W.H. Freeman, San Francisco.
Taylor, W.R., 1988. A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28: 161-169.
Thompson, J., Higgins, D. and Gibson, T., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4690.
Thompson, J.D., Plewniak, F. and Poch, O., 1999. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27: 2682-2690.
Using Genetic Algorithms for Pairwise and Multiple Sequence Alignments

Cédric Notredame
Information Génétique et Structurale, CNRS-UMR 1889
31 Chemin Joseph Aiguier, 13 006 Marseille
Email: cedric.notredame@igs.cnrs-mrs.fr

1 Introduction

1.1 Importance of Multiple Sequence Alignment

The simultaneous alignment of many nucleic acid or amino acid sequences is one of the most commonly used techniques in sequence analysis. Given a set of homologous sequences, multiple alignments are used to help predict the secondary or tertiary structure of new sequences [51]; to help demonstrate homology between new sequences and existing families; to help find diagnostic patterns for families [6]; to suggest primers for the polymerase chain reaction (PCR); and as an essential prelude to phylogenetic reconstruction [19]. These alignments may be turned into profiles [25] or hidden Markov models (HMMs) [27, 9] that can be used to scour databases for distantly related members of the family. Multiple alignment techniques can be divided into two categories: global and local techniques. When making a global alignment, the algorithm attempts to align the sequences chosen by the user over their entire length. Local alignment algorithms automatically discard portions of sequences that do not share any homology with the rest of the set. They constitute a greater challenge since they increase the number of decisions made by the algorithm. Most multiple alignment methods are global, leaving it to the user to decide on the portions of sequences to be incorporated in the multiple alignment. To aid that decision, one often uses local pairwise alignment programs such as Blast [3] or Smith and Waterman [56]. In this chapter, we will focus on global alignment methods, with a special emphasis on the alignment of protein and RNA sequences. Despite its importance, the automatic generation of an accurate multiple sequence alignment remains one of the most challenging problems in bioinformatics.
The reason behind that complexity can easily be explained. A multiple alignment is meant to reconstitute the relationships (evolutionary, structural and functional) within a set of sequences that may have been diverging for millions and sometimes billions of years. To be accurate, this reconstitution would require an in-depth knowledge of the evolutionary history and structural properties of these sequences. For obvious reasons, this information is rarely available, and generic empirical models of protein evolution [18, 28, 8] based on sequence similarity must be used instead. Unfortunately, these can prove difficult to apply when the sequences are less than 30% identical and lie within the so-called “twilight zone” [52]. Further, accurate optimization methods that use these models can be extremely demanding in computer resources for more than a handful of sequences [12, 62]. This is why most multiple alignment methods rely on approximate heuristic algorithms. These heuristics are usually a complex combination of ad hoc procedures mixed with some elements of dynamic programming. Overall, two key properties characterize them: the optimization algorithm and the criterion (objective function) this algorithm attempts to optimize.

1.2 Standard Optimization Algorithms

Optimization algorithms roughly fall into three categories: exact, progressive and iterative algorithms. Exact algorithms attempt to deliver an optimal or a sub-optimal alignment within some well-defined bounds [40, 57]. Unfortunately, these algorithms have very serious limitations with regard to the number of sequences they can handle and the type of objective function they can optimize. Progressive alignment methods are by far the most widely used [30, 14, 45].
They depend on a progressive assembly of the multiple alignment [31, 20, 58] in which sequences or alignments are added one by one, so that never more than two sequences (or multiple alignments) are aligned simultaneously using dynamic programming [43]. This approach has the great advantage of speed and simplicity combined with reasonable sensitivity, even if it is by nature a greedy heuristic that does not guarantee any level of optimization. Iterative alignment methods depend on algorithms able to produce an alignment and to refine it through a series of cycles (iterations) until no more improvement can be made. Iterative methods can be deterministic or stochastic, depending on the strategy used to improve the alignment. The simplest iterative strategies are deterministic. They involve extracting sequences one by one from a multiple alignment and realigning them to the remaining sequences [7, 24, 29]. The procedure is terminated when no more improvement can be made (convergence). Stochastic iterative methods include HMM training [39], simulated annealing (SA) [37, 36, 33] and evolutionary computation such as genetic algorithms (GAs) [44, 47, 34, 64, 5, 23] and evolutionary programming [11, 13]. Their main advantage is that they allow a clean separation between the optimization process and the evaluation criterion (objective function). It is the objective function that defines the aim of any optimization procedure and, in our case, it is also the objective function that contains the biological knowledge one tries to project into the alignment.

1.3 The Objective Function

In an evolutionary algorithm, the objective function is the criterion used to evaluate the quality (fitness) of a solution (individual). To be of any use, the value that this function associates with an alignment must reflect its biological relevance and indicate the structural or the evolutionary relation that exists among the aligned sequences.
In theory, a multiple alignment is correct if, in each column, the aligned residues have the same evolutionary history or play similar roles in the three-dimensional fold of the RNA or protein. Since evolutionary or structural information is rarely at hand, it is common practice to replace it with a measure of sequence similarity. The rationale behind this is that similar sequences can be assumed to share the same fold and the same evolutionary origin [52], as long as their level of identity is above the so-called "twilight zone" (more than 30% identity over more than 100 residues). Accurate measures of similarity are obtained using substitution matrices [18, 28]. A substitution matrix is a pre-computed table of numbers (for proteins, this matrix is 20×20, covering all possible substitutions among the 20 naturally occurring amino acids) where each possible substitution/conservation receives a weight indicative of its likelihood, as estimated from data analysis. In these matrices, substitutions (conservations) observed more often than one would expect by chance receive positive values, while under-represented mutations are associated with negative values. Given such a matrix, the correct alignment is defined as the one that maximizes the sum of the substitution (conservation) scores. An extra factor is also applied to penalize insertions and deletions (gap penalty). The most commonly used model for that purpose is known as 'affine gap penalties'. It penalizes an insertion/deletion once for its opening (gap opening penalty, abbreviated GOP) and then with a factor proportional to its length (gap extension penalty, abbreviated GEP). Since a gap of any length can be explained by a single mutational event, the aim of that scheme is to make sure that the best-scoring evolutionary scenario involves only a small number of insertions or deletions (indels). This will result in an alignment with few long gaps rather than many short ones.
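As an illustration of the scheme just described, the sketch below scores one pairwise alignment with a toy substitution table and affine gap penalties. The two-letter "matrix" and the GOP/GEP values are invented, and the GOP is charged together with a first GEP at each gap opening, which is one of several common conventions:

```python
# Invented toy substitution table and penalties (not BLOSUM/PAM values).
SUBST = {('A', 'A'): 2, ('G', 'G'): 2, ('A', 'G'): -1, ('G', 'A'): -1}
GOP, GEP = -4, -1   # gap-opening and gap-extension penalties

def affine_score(a, b):
    """Score two equal-length aligned strings (gaps written as '-')."""
    score, in_gap = 0, False
    for x, y in zip(a, b):
        if x == '-' or y == '-':
            # opening a gap costs GOP + GEP; extending it costs GEP only
            score += GEP if in_gap else GOP + GEP
            in_gap = True
        else:
            score += SUBST[(x, y)]
            in_gap = False
    return score

# One gap of length two costs GOP + 2*GEP = -6; three A/A matches give +6.
print(affine_score("AAGGA", "AA--A"))   # 0
```

Because the opening cost dominates, the scheme prefers one long gap over several short ones, as stated above.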
The resulting score can be viewed as a measure of similarity between two sequences (pairwise). This measure can be extended to the alignment of multiple sequences in many ways. For instance, it is common practice to set the score of the multiple alignment to be the sum of the scores of every pairwise alignment it contains (sum of pairs) [1]. While that scoring scheme is the most widely used, its main drawback stems from the lack of an underlying evolutionary scenario. It assumes that every sequence is independent, and this results in an overestimation of the number of substitutions. It is to counterbalance that effect that probability-based schemes were introduced in the context of HMMs. Their purpose is to associate each column of an alignment with a generation probability [39]. Estimations are carried out in a Bayesian context where the model (alignment) probability is evaluated simultaneously with the probability of the data (the aligned sequences). In the end, the score of the complete alignment is set to be the probability that the aligned sequences were generated by the trained HMM. The major drawbacks of this model are its high dependency on the number of sequences being aligned (i.e. many sequences are needed to generate an accurate model) and the difficulty of the training. More recently, new methods based on consistency were described for the evaluation of multiple sequence alignments. Under these schemes, the score of a multiple alignment is a measure of its consistency with a list of pre-defined constraints [42, 46, 10, 45]. It is common practice for these pre-defined constraints to be sets of pairwise, multiple or local alignments. Quite naturally, the main limitation of consistency-based schemes is that they make the quality of the alignment greatly dependent on the quality of the constraints it is evaluated against.
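The sum-of-pairs extension described at the start of this section can be sketched as follows, with an invented per-column scorer standing in for a real substitution matrix and gap model:

```python
from itertools import combinations

def pair_score(x, y):
    """Toy per-position scorer (invented values, not a real matrix)."""
    if x == '-' and y == '-':
        return 0        # a shared gap column carries no signal
    if x == '-' or y == '-':
        return -2       # simple per-position gap penalty
    return 1 if x == y else -1

def sum_of_pairs(msa):
    """Sum the induced pairwise scores over every pair of aligned rows."""
    return sum(
        pair_score(x, y)
        for a, b in combinations(msa, 2)
        for x, y in zip(a, b)
    )

print(sum_of_pairs(["ACG-T", "ACGGT", "A-GGT"]))   # 3
```

Note how every sequence pair contributes independently, which is exactly the independence assumption criticized above.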
An objective function always defines a mathematical optimum, that is to say, an alignment in which the sequences are arranged in such a manner that they yield a score that cannot be improved. The mathematically optimal alignment should never be confused with the correct alignment, the biological optimum. While the biological optimum is by definition correct, a mathematically optimal alignment is biologically only as good as it is similar to the biological optimum. This depends entirely on the quality of the objective function that was used to generate it. There is no limit to the complexity of the objective functions one may design, even if in practice the lack of appropriate optimization engines constitutes a major limitation. What is the use of an objective function if one cannot optimize it, and how is it possible to tell whether an objective function is biologically relevant or not? Evolutionary algorithms come in very handy to answer these questions. They make it possible to design new scoring schemes without having to worry, at least in a first stage, about optimization issues. In the next section, we introduce one of these evolutionary techniques, known as genetic algorithms (GAs). GAs are described along with another closely related stochastic optimization algorithm: simulated annealing. Three examples are reviewed in detail, in which GAs were successfully applied to sequence alignment problems.

2 Evolutionary Algorithms and Simulated Annealing

An evolutionary algorithm is a way of finding a solution to a problem by forcing sub-optimal solutions to evolve through perturbations (mutations and recombination). Most evolutionary algorithms are stochastic in the sense that the solution space is explored in a random rather than ordered manner.
In this context, randomness provides a non-null probability of sampling any potential solution, regardless of the size of the solution space, provided that the mutations allow such an exploration. The drawback of randomness is that not all potential solutions may be visited during the search (including the global optimum). In order to correct for this problem, a large number of heuristics have been designed that attempt to bias the way in which the solution space is sampled. They aim at improving the chances of sampling an optimal solution. For that reason, most stochastic strategies (including evolutionary computation) can be regarded as a tradeoff between greediness and randomness. Two stochastic strategies have been widely used for sequence analysis: simulated annealing and genetic algorithms. Strictly speaking, SA does not belong to the field of evolutionary computation; yet, in practice, it has been one of the major sources of inspiration for the elaboration of the genetic algorithms used in sequence analysis.

2.1 Simulated Annealing

Simulated annealing (SA) [38] was the first stochastic algorithm used to attempt to solve the multiple sequence alignment problem [33, 37]. SA relies on an analogy with physics. The idea is to compare the solving of an optimization problem to a crystallization process (the cooling of a metal). In practice, given a set of sequences, a first alignment is randomly generated. A perturbation is then applied (shifting an existing gap or introducing a new one) and the resulting alignment is evaluated with the objective function. If the new alignment is better than the previous one, it replaces it; otherwise, it replaces it only with a probability that depends on the score difference and on the current temperature. The higher the temperature, the more likely it is that a large unfavorable score difference will be accepted. At every cycle, the temperature decreases slightly until it reaches zero.
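The acceptance rule just described is the classic Metropolis criterion. A minimal sketch, assuming higher scores are better and writing delta for the score difference between the new and the old alignment:

```python
import math
import random

def accept(new_score, old_score, temperature):
    """Metropolis-style acceptance: keep improvements unconditionally;
    accept a worse solution with probability exp(delta / T)."""
    delta = new_score - old_score
    if delta >= 0:
        return True
    if temperature <= 0:
        return False        # frozen system: no worsening moves accepted
    return random.random() < math.exp(delta / temperature)

# A typical cooling schedule multiplies the temperature by a constant
# factor below 1 at every cycle, e.g. T = 0.95 * T, until it is near zero.
```

At high temperature almost any perturbation is kept; as T approaches zero the search degenerates into pure hill climbing.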
From the perspective of an evolutionary algorithm, SA can be regarded as a population with only one individual. Perturbations are similar to the mutations used in evolutionary algorithms. Apart from the population size of one, the main difference between SA and any true evolutionary algorithm is the extrinsic annealing schedule. While the principle is very sound, and its adequacy for multiple alignment optimization and objective function evaluation is obvious, SA suffers from a very serious drawback: it is extremely slow. Most of the studies conducted on simulated annealing and multiple alignments concluded that although it does reasonably well, SA is too slow to be used for ab initio multiple alignments and must be restricted to being used as an alignment improver (i.e. for the improvement of a seed alignment). This serious limitation makes it much harder to use SA as the black box one needs to evaluate newly designed objective functions.

2.2 Genetic Algorithms

It is in an attempt to overcome the limits of SA that evolutionary algorithms were adapted to the multiple sequence alignment problem. Evolutionary algorithms are parallel stochastic search tools. Unlike SA, which maintains a single line of descent from parent to offspring, evolutionary algorithms maintain a population of trials for a given objective function. Evolutionary algorithms are among the most interesting stochastic optimization tools available today. One of the reasons why these algorithms have received so little attention in the context of multiple sequence alignment is probably that implementing an evolutionary algorithm dedicated to multiple alignment is much less straightforward than implementing simulated annealing. In other areas of computational biology, evolutionary algorithms have already been established as powerful tools. These include RNA [26, 55, 48] and protein structure analysis [53, 60, 41].
Among all the existing evolutionary algorithms (genetic algorithms, genetic programming, evolution strategies and evolutionary programming), genetic algorithms have been by far the most popular in the field of computational biology. Although one could argue about who exactly invented GAs, the algorithms we use today were formally introduced by Holland in 1975 [32] and later refined by Goldberg to give the Simple Genetic Algorithm [22]. GAs are based on a loose analogy with the phenomenon of natural selection. Their principle is relatively straightforward. Given a problem, potential solutions (individuals within a population) compete with one another (selection) for survival. These solutions can also evolve: they can be modified (mutations) or combined with one another (crossovers). The idea is that, acting together, variation and selection will lead to an overall improvement of the population via evolution. Most of the concepts developed here about GAs are taken from [22, 16]. Two ingredients are essential to the GA strategy: the selection method and the operators. Selection is established in order to lead the search toward improvement. It means that the best individuals (as judged using the objective function) must be the most likely to survive. To serve the GA's purpose, this selection strategy cannot be too strict. It must allow some variety to be maintained throughout the search, in order to prevent the GA population from converging toward the first local optimum it encounters. Evolution toward the optimal solution also requires the use of operators that modify existing solutions and create diversity (mutations), or that optimize the use of the existing diversity (crossovers) by combining existing motifs into an optimal solution. Given such a crude layout, the potential for variation is infinite, and the study of new GA models is a very active field in its own right.
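To make the interplay of selection and operators concrete, here is a deliberately tiny Simple-GA skeleton in the spirit of [22]. A bit-string toy problem (maximize the number of 1s) stands in for an alignment so that the loop stays short; fitness-proportional selection, one-point crossover and point mutation are all shown, and every parameter value is an arbitrary illustration:

```python
import random

def evolve(pop_size=20, length=16, generations=60, seed=1):
    """Toy Simple GA: maximize the number of 1s in a bit string."""
    rng = random.Random(seed)
    fitness = lambda ind: sum(ind)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # selection: fitness-proportional (roulette-wheel) sampling of parents
        weights = [fitness(ind) + 1 for ind in pop]   # +1 keeps weights positive
        parents = rng.choices(pop, weights=weights, k=pop_size)
        next_pop = []
        for i in range(0, pop_size, 2):
            # one-point crossover between two selected parents
            cut = rng.randrange(1, length)
            a, b = parents[i], parents[i + 1]
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                if rng.random() < 0.5:                # point mutation
                    j = rng.randrange(length)
                    child[j] = 1 - child[j]
                next_pop.append(child)
        pop = next_pop
    return max(fitness(ind) for ind in pop)
```

For sequence alignment, the individuals become gapped alignments and the crossover/mutation steps become gap-shuffling operators, but the selection loop keeps exactly this shape.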
This being said, the main difficulty to overcome when adapting a GA to a problem like multiple sequence alignment is not the choice of a proper model, but rather the design of a well-suited set of operators. This is a well-known problem that has also received some attention in the field of structure prediction, both for proteins [50] and RNA [54]. A simple justification is that the operators (and the problem representation) largely control the manner in which the solution landscape is explored. For instance, the neighborhood of a solution is mostly defined by the exploration capabilities of the operators. Well chosen, they can smooth very rugged landscapes. On the other hand, if they are too 'smart' and too greedy, they may prevent a proper exploration from being carried out. Finding the right trade-off can prove a rather complex task. When applying GAs to sequence alignments, previous work on SA proved instrumental: it provided researchers with a well-tested set of operators perfectly suitable for integration within most evolutionary algorithms.

Attempts to apply evolutionary algorithms to the multiple sequence alignment problem started in 1993, when Ishikawa et al. published a hybrid GA [34] that does not try to optimize the alignment directly, but rather the order in which the sequences should be aligned using dynamic programming. Of course, this limits the algorithm to objective functions that can be used with dynamic programming. Even so, the results obtained that way were convincing enough to prompt the further development of GAs in sequence analysis. The first GA able to deal with sequences in a more general manner was described a few years later by Notredame and Higgins [44], shortly before similar work by Zhang [64]. In these two GAs, the population is made of complete multiple sequence alignments, and the operators have direct access to the aligned sequences: they insert and shift gaps in a random or semi-random manner.
In 1997, SAGA was applied to RNA analysis [47] and parallelized for that purpose using an island model. This work was later reproduced by Anabarasu et al. [5], who carried out an extensive evaluation of this model, using ClustalW as a reference. Over the following years, at least three new multiple sequence alignment strategies based on evolutionary algorithms were introduced [23, 13, 11]. Each of these relies on a principle similar to SAGA's: a population of multiple alignments evolves by selection, combination and mutation. The population is made of alignments, and the mutations are string-processing programs that shuffle the gaps using complex models. The main difference between SAGA and these recent algorithms has been the design of better mutation operators that improve the efficiency and the accuracy of the algorithms. These new results have strengthened the idea that the essence of the adaptation of GAs to multiple sequence alignments is the design of proper operators, reflecting as well as possible the true mechanisms of molecular evolution. In order to expose each of the many ingredients that constitute a GA specialized in sequence alignment, the example of SAGA will now be reviewed in detail.

3 SAGA: A GA Dedicated to Sequence Alignment

3.1 The Algorithm

SAGA is a genetic algorithm dedicated to multiple sequence alignment [44]. It follows the general principles of the simple genetic algorithm (sGA) described by Goldberg [22] and later refined by Davis [17]. In SAGA, each individual is a multiple alignment. The data structure chosen for the internal representation of an individual is a straightforward two-dimensional array where each line represents an aligned sequence and each cell is either a residue or a gap. The population has a constant size and does not contain any duplicates (i.e. identical individuals). The pseudo-code of the algorithm is given in Figure 1. Each of these steps is developed over the next sections.
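As a concrete illustration of this representation (a sketch with names of our own choosing, not SAGA's actual code), an individual can be held as a list of equal-length gapped rows, and the population can enforce the no-duplicate rule:

```python
GAP = "-"

def is_valid_alignment(rows):
    """An individual is well-formed if all rows share the same length."""
    return len({len(r) for r in rows}) == 1

def sequences_of(rows):
    """Recover the ungapped sequences encoded by an individual."""
    return [r.replace(GAP, "") for r in rows]

class Population:
    """Constant-size population that rejects duplicate alignments,
    mirroring SAGA's no-duplicate rule (a toy sketch)."""
    def __init__(self, size):
        self.size = size
        self.members = []
        self._seen = set()

    def try_add(self, rows):
        key = tuple(rows)
        if key in self._seen or len(self.members) >= self.size:
            return False      # duplicate, or population already full
        self._seen.add(key)
        self.members.append(rows)
        return True

pop = Population(size=2)
aln = ["ACG-T", "A-GGT"]
```

Storing the gapped rows directly makes every operator a simple string manipulation, at the price of having to re-check row lengths after each modification.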
Initialization

The challenge of the initialization (also known as seeding) is to generate a population as diverse as possible in terms of 'genotype' and as uniform as possible in terms of scores. In SAGA, generation 0 consists of 100 randomly generated multiple alignments that contain only terminal gaps. These initial alignments are less than twice the length of the longest sequence of the set (longer alignments can be generated later). To create one of these individuals, a random offset is chosen for each sequence (between 0 and the length of the longest sequence); each sequence is shifted to the right according to its offset, and empty spaces are padded with null signs (gaps) in order to give the same length L to all the sequences. Seeding can also be carried out by generating sub-optimal alignments using an implementation of dynamic programming that incorporates some randomness. This is the case in RAGA [47], an implementation of SAGA specialized in RNA alignment.

Evaluation

Fitness is measured by scoring each alignment according to the chosen objective function. The better the alignment, the better its score and the higher its fitness (scores are inverted if the OF is meant to be minimized). To minimize sampling errors, raw scores are turned into a normalized value known as the expected offspring (EO). The EO indicates how many children an alignment is likely to have. In SAGA, EOs are stochastically derived using a predefined recipe, 'remainder stochastic sampling without replacement' [22]. This gives values that are typically between 0 and 2. Only the weakest half of the population is replaced with the new offspring, while the other half is carried over unchanged to the next generation. This practice is known as overlapping generations [16].

Breeding

It is during breeding that new individuals (children) are generated. The EO is used as a probability for each individual to be chosen as a parent.
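Generation-0 seeding by random offsets can be sketched as follows (a simplified illustration; `seed_alignment` is our name, not SAGA's):

```python
import random

def seed_alignment(seqs, rng):
    """SAGA-style generation-0 individual: shift each sequence right by a
    random offset in [0, max_len], then pad on the right so all rows share
    one length. The result contains only terminal gaps and is at most
    twice the length of the longest sequence."""
    max_len = max(len(s) for s in seqs)
    shifted = ["-" * rng.randint(0, max_len) + s for s in seqs]
    width = max(len(r) for r in shifted)
    return [r.ljust(width, "-") for r in shifted]

rng = random.Random(7)
population0 = [seed_alignment(["ACGT", "AGT", "ACGGT"], rng)
               for _ in range(100)]
```

Each call yields a different random individual, giving the diverse generation 0 described above.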
This selection is carried out by weighted-wheel selection without replacement [22], and an individual's EO is decreased by one unit each time it is chosen to be a parent. An operator is also chosen and applied to the parent(s) to create the newborn child. Twenty-two operators are available in SAGA. They all have their own usage probability and can be divided into two categories: mutations, which require only one parent, and crossovers, which require two parents. Since no duplicates are allowed in the population, a newborn child is only accepted if it differs from all the other members of the generation already created. When a duplicate arises, the whole series of operations that led to its creation is canceled. Breeding is over when the new generation is complete, and SAGA proceeds to produce the next generation unless the finishing criterion is met.

Termination

Conditions that could guarantee optimality are not met in SAGA, and there is no valid proof that it may reach a global optimum, even in an infinite amount of time (as opposed to SA). For that reason, an empirical criterion is used for termination: the algorithm terminates when the search has been unable to improve for more than 100 generations. That type of stabilization is one of the most commonly used conditions to stop a GA when working on a population with no duplicates (i.e. a population where all the individuals are different from one another) [17].

3.2 Designing the Operators

As mentioned earlier, the design of an adequate set of operators has been the main point of focus in the work that led to SAGA. According to the traditional nomenclature of genetic algorithms [22], two types of operators coexist in SAGA: crossover and mutation. An operator is designed as an independent program that inputs one or two alignments (the parents) and outputs one alignment (the child). Each operator requires one or more parameters that specify how the operation is to be carried out.
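The weighted-wheel selection without replacement described above can be sketched as follows (our own simplified version; the EO bookkeeping is reduced to a dict mapping individuals to their remaining expected offspring):

```python
import random

def pick_parent(expected_offspring, rng):
    """Weighted-wheel (roulette) choice proportional to each individual's
    remaining expected offspring. The winner's EO is then decreased by one
    unit (clamped at zero), so no individual can be chosen indefinitely:
    selection 'without replacement'."""
    total = sum(expected_offspring.values())
    spin = rng.uniform(0, total)
    acc = 0.0
    for ind, eo_value in expected_offspring.items():
        acc += eo_value
        if spin <= acc:
            expected_offspring[ind] = max(0.0, eo_value - 1.0)
            return ind
    return ind  # float rounding edge case: fall back to the last one

rng = random.Random(1)
eo = {"aln_a": 1.8, "aln_b": 0.9, "aln_c": 0.3}
parents = [pick_parent(eo, rng) for _ in range(3)]
```

Individuals with a high EO dominate the wheel at first, but each selection shrinks their slice, spreading parenthood across the population.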
For instance, an operator that inserts a new gap requires three parameters: the position of the insertion, the index of the sequence to modify and the length of the insertion. These parameters may be chosen completely at random (in some pre-defined range); in that case, the operator is said to be used in a stochastic manner [44]. Alternatively, all but one of the parameters may be chosen randomly, leaving the value of the remaining parameter to be fixed by exhaustive examination of all possible values; the value that yields the best fitness is kept. An operator applied this way is said to be used in semi-hill-climbing mode. Most operators may be used either way (stochastic or semi-hill climbing). For the robustness of the GA, it is also important to make sure that the operators are completely independent from any characteristic of the objective function, unless one is interested in creating a very specific operator for the sake of efficiency.

The Crossovers

Crossovers are meant to generate a new alignment by combining two existing ones. Two types of crossover coexist in SAGA: the one-point crossover, which combines two parents through a single point of exchange (Figure 2a), and the uniform crossover, which promotes multiple exchanges between two parents by swapping blocks between consistent bits (Figure 2b). The uniform crossover is much less disruptive than its one-point counterpart, but it can only be applied if the two parents share some consistency, a condition rarely met in the early stages of the search. Of the two children produced by a crossover, only the fittest is kept and inserted into the new population (if it is not a duplicate). Crossovers are essential for promoting the exchange of high-quality blocks within the population. They make it possible to use existing diversity efficiently. However, the blocks present in the original population represent only a tiny proportion of all the possibilities.
They may not be sufficient to reconstruct an optimal alignment, and since crossovers cannot create new blocks, another class of operators is needed: mutations.

Mutations: Example of the Gap Insertion Operator

SAGA's mutation operators are extensively described in [44]. We will only review here the gap insertion operator, a crude attempt to reconstruct, backward, some of the insertion/deletion events through which a set of sequences might have evolved. When that operator is applied, alignments are modified following the mechanism shown in Figure 3. The aligned sequences are split into two groups. Within each group, every sequence receives a gap insertion at the same position. Groups are chosen by randomly splitting an estimated phylogenetic tree (as given by ClustalW [59]). Both the stochastic and the semi-hill-climbing versions of this operator are implemented. In the stochastic version, the length of the inserted gaps and the two insertion positions are randomly chosen, while in the semi-hill-climbing mode the second insertion position is chosen by exhaustively trying all the possible positions and comparing the scores of the resulting alignments.

Dynamic Scheduling of the Operators

When creating a child, the choice of the operator is just as important as the choice of the parents. Therefore, it makes sense to allow operators to compete for usage, just as the parents do for survival, in order to make sure that useful operators are more likely to be used. Since one cannot tell in advance the good operators from the bad ones, they all initially receive the same usage probability. Later during the run, these probabilities are dynamically reassessed to reflect each operator's individual performance. The recipe used in SAGA is the dynamic scheduling method described by Davis [16]. It allows operators to be added and removed easily, without any need for retuning.
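A stochastic version of the gap insertion operator can be sketched as follows (a simplified illustration: we split the sequences into two groups at random, whereas SAGA splits an estimated phylogenetic tree):

```python
import random

def gap_insertion(rows, rng, gap_len=1):
    """Stochastic gap-insertion mutation: split the sequences into two
    groups, then insert a gap of the same length at one random position
    per group. Since every row gains exactly `gap_len` gap characters,
    all rows keep a common length."""
    n = len(rows)
    group1 = set(rng.sample(range(n), rng.randint(1, n - 1)))
    width = len(rows[0])
    p1, p2 = rng.randint(0, width), rng.randint(0, width)
    child = []
    for i, row in enumerate(rows):
        pos = p1 if i in group1 else p2
        child.append(row[:pos] + "-" * gap_len + row[pos:])
    return child

rng = random.Random(3)
child = gap_insertion(["ACG-T", "A-GGT", "ACGGT"], rng)
```

A semi-hill-climbing variant would fix `p1` at random and try every value of `p2`, keeping the child with the best objective-function score.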
Under this model, an operator has a probability of being used that is a function of its recent efficiency (i.e. the improvement generated over the last 10 generations). The credit an operator gets when performing an improvement is also shared with the operators that came before and may have played a role in this improvement. Thus, each time a new individual is generated, if it yields some improvement over its parents, the operator that was directly responsible for its creation gets the largest part of the credit (e.g. 50%); then the operator(s) responsible for the creation of the parents get their share of the remaining credit (50% of the remaining credit); and so on. This propagation of credit goes on for some specified number of generations (e.g. 4). Every 10 generations, results are summarized for each operator and the usage probabilities are reassessed on the basis of the accumulated credit. To avoid the early loss of some operators, each of them has a minimum usage probability higher than 0. It is common practice to set these minimal usage probabilities so that they sum to 0.5; to that effect, one can use for each operator a minimum probability of 1/(2 × number of operators). A very interesting property of this scheme is that it ensures operators are used only when they are needed. For instance, uniform crossovers are generally more efficient than their one-point counterparts. Unfortunately, they cannot be properly used in the early stages of the optimization because at that point there is not enough consistency within the population. The dynamic scheduling adapts very well to that situation by initially giving a high usage probability to the one-point crossover, and by shifting that credit to the uniform crossover once the population has become consistent enough to support its usage. It is interesting to notice that these two operators compete with one another even though the GA does not explicitly know that they both belong to the crossover category.
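The reassessment step can be sketched as follows (our own simplified version: the propagation of credit to ancestor operators is omitted, and the 0.5 floor is spread evenly, as described above):

```python
def reassess_probabilities(credit, floor_total=0.5):
    """Turn accumulated per-operator credit into usage probabilities.

    A total probability mass of `floor_total` is reserved and spread
    evenly as per-operator minima (i.e. 1/(2 * n_operators) each when
    floor_total is 0.5), so no operator's probability ever drops to 0;
    the remaining mass is split proportionally to recent credit."""
    ops = list(credit)
    n = len(ops)
    floor = floor_total / n
    total = sum(credit.values())
    probs = {}
    for op in ops:
        share = credit[op] / total if total > 0 else 1.0 / n
        probs[op] = floor + (1.0 - floor_total) * share
    return probs

# Hypothetical credit accumulated over the last scheduling window:
probs = reassess_probabilities({"one_point_xover": 8.0,
                                "uniform_xover": 2.0,
                                "gap_insertion": 0.0})
```

An operator that earned no credit (here `gap_insertion`) keeps its minimum probability and can still prove itself in later windows.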
3.3 Parallelisation of SAGA

Long running times were SAGA's main limitation. This became especially acute when aligning very long sequences such as ribosomal RNAs (>1000 nucleotides). It is common practice to use parallelisation in order to alleviate such problems. The technique applied to SAGA is specific to GAs and is known as an island parallelisation model [21]. Instead of having a single GA running, several identically configured GAs run in parallel on separate processors. Every 5 generations they exchange some of their individuals. The GAs are arranged on the leaves and the nodes of an N-branched tree, and the population exchange is unidirectional, from the leaves to the root of the tree (Figure 4). By default, the individuals migrating from one GA to another are those having the best score. The GA node they come from keeps a copy of them, but they replace low-scoring individuals in the accepting GA [44]. Initially implemented in RAGA, the RNA version of SAGA, this model was later extended to SAGA, using a 3-branched tree with a depth of 3 that requires 13 GAs. These processes are synchronous and wait for each other to reach the same generation number before exchanging populations. This distributed model benefits from the explicit parallelisation and is about 10 times faster than a non-parallel version (i.e. about 80% of the maximum speedup expected when distributing the computation over 13 processors). It also benefits from the new constraints imposed by the tree topology on the structure of the population. It seems to be the lack of feedback that makes it possible to retain within the population a much higher degree of diversity than a single unified population could afford. The terminal leaves behave as a diversity reservoir and give the parallel GA a much higher accuracy than a non-parallel version with the same overall population.
Nonetheless, these preliminary observations remain to be firmly established through more thorough benchmarking.

4 Applications: Choosing an Objective Function

The main motivation behind SAGA's design was the creation of a robust platform, or black box, on which any OF one could think of could be tested in a seamless manner. Such a black box allows discriminating between the functions that are biologically relevant and those that are not. For instance, let us consider the weighted sums of pairs. This function is one of the most widely used. It owes its popularity to the fact that algorithmic methods exist that allow its approximate optimization [43, 40]. Yet we know this function is not very meaningful from a biological point of view [4]. The three main limitations that have caught biologists' attention are the crude modeling of insertions/deletions (gaps), the assumed independence of each position, and the fact that the evaluation cannot be made position dependent. Thanks to SAGA, it was possible to design new objective functions that make use of more complex gap penalties, take into account non-local dependencies or use position-specific scoring schemes, and to ask whether this increased sophistication results in an improvement of the alignments' biological quality. The following sections review three classes of objective functions that were successfully optimized using SAGA [44, 47, 46].

4.1 The Weighted Sums of Pairs

MSA [40] is an algorithm that makes it possible to deliver an optimal (or a very close suboptimal) multiple sequence alignment using the sums-of-pairs measure. This sophisticated heuristic performs multi-dimensional dynamic programming in a bounded hyper-space. It is possible to assess the level of optimization reached by SAGA by comparing it to MSA while using exactly the same objective function.
The sums-of-pairs principle is to associate a cost with each pair of aligned residues in each column of an alignment (substitution cost), and another, similar cost with the gaps (gap cost). The sum of these costs yields the global cost of the alignment. Major variations involve: (i) using different sets of costs for the substitutions (PAM matrices [18] or BLOSUM tables [28]); (ii) different schemes for the scoring of gaps [1]; (iii) different sets of weights associated with each pair of sequences [2]. Formally, one can define the cost of a multiple alignment A as:

    ALIGNMENT COST(A) = Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} W_{i,j} × COST(A_i, A_j)    (1)

where N is the number of sequences, Ai is the aligned sequence i, COST(Ai, Aj) is the alignment score between the two aligned sequences Ai and Aj, and Wi,j is the weight associated with that pair of sequences. The COST includes the sum of the substitution costs, as given by a substitution matrix, and the cost of the insertions/deletions, using a model with affine gap penalties (a gap-opening penalty and a gap-extension penalty). Two schemes exist for scoring gaps: natural affine gap penalties and quasi-natural affine gap penalties [1]. Quasi-natural gap penalties are the only scheme that the MSA program can efficiently optimize. This is unfortunate, since these penalties are known to be biologically less accurate than their natural counterparts [1] because of a tendency to over-estimate the number of gaps. Under both schemes, terminal gaps are penalized for extension but not for opening.

It is common practice to validate a new method by comparing the alignments it produces with references assembled by experts. In the case of multiple alignments, one often uses structure-based sequence alignments, which are regarded as the best standard of truth available [24]. For SAGA, validation was carried out using 3Dali [49]. Biological validation should not be confused with the mathematical validation also required for an optimization method.
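Equation (1) can be computed directly from an alignment. The sketch below uses a toy match/mismatch substitution scheme and a simple affine gap model on each pairwise projection (terminal gaps and the quasi-natural subtleties are ignored; all names and scores are illustrative, not MSA's actual parameters):

```python
from itertools import combinations

def pairwise_cost(a, b, match=0, mismatch=1, gop=3, gep=1):
    """Cost of one pair of aligned rows: substitution costs plus affine
    gap penalties. Columns where both rows hold a gap are skipped, as in
    a pairwise projection."""
    cost, in_gap = 0, False
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            cost += gep + (0 if in_gap else gop)   # open + extend
            in_gap = True
        else:
            cost += match if x == y else mismatch
            in_gap = False
    return cost

def sp_cost(rows, weights=None):
    """Weighted sums-of-pairs cost of an alignment (lower is better)."""
    n = len(rows)
    w = weights or {(i, j): 1.0 for i, j in combinations(range(n), 2)}
    return sum(w[i, j] * pairwise_cost(rows[i], rows[j])
               for i, j in combinations(range(n), 2))

total = sp_cost(["ACG-T", "A-GGT", "ACGGT"])  # 8 + 4 + 4 = 16
```

With all weights set to 1.0, each single-residue gap costs gop + gep = 4 in each projection where it appears against a residue.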
In the case of SAGA, both validations were carried out simultaneously, and a summary of the results obtained when optimizing the sums of pairs is shown in Table 1. First, SAGA was used to optimize the sums of pairs with quasi-natural gap penalties, using MSA-derived alignments as a reference. In two thirds of the cases, SAGA reached the same level of optimization as MSA. In the remaining test sets, SAGA outperformed MSA, and in every case that improvement correlated with an improvement of the alignment's biological quality, as judged by comparison with a reference alignment. Although they fall short of a demonstration, these figures suggest that SAGA is an adequate optimization tool that competes well with the most sophisticated heuristics. In a second aspect of that validation, SAGA was used to align test cases too large to be handled by MSA, using as an objective function the weighted sums of pairs with natural gap penalties. ClustalW was the non-stochastic heuristic used as a reference. As expected, the use of natural penalties led to some improvement over the optimization reached by ClustalW, and that mathematical improvement was also correlated with a biological improvement. Altogether, these results are indicative of the versatility of SAGA as an optimizer and of its ability to optimize functions that are beyond the scope of standard dynamic-programming-based algorithmic methods.

4.2 Consistency-Based Objective Functions: The COFFEE Score

Ultimately, a multiple alignment aims at combining within a single unifying model every piece of information known about the sequences it contains. However, it may be the case that a part of this information is not as reliable as one may expect. It may also be the case that some elements of information are not compatible with one another. The model will reveal these inconsistencies and require decisions to be made in a way that takes into account the overall quality of the alignment.
A new objective function can be defined that measures the fit between a multiple alignment and a list of weighted elements of information. Of course, the relevance of that objective function will depend greatly on the quality of the pre-defined list. This list can take whatever form one wishes. For instance, a convenient source is a list of pairwise alignments [46, 45] that, given a set of N sequences, contains all the N² possible pairwise alignments. In the context of COFFEE (Consistency-based Objective Function For alignmEnt Evaluation), that list of alignments is named a library, and the COFFEE function measures the level of consistency between a multiple alignment and its corresponding library. Evaluation is made by comparing each pair of aligned residues observed in the multiple alignment to the list of residue pairs that constitute the library. During the comparison, residues are identified only by their index within the sequences. The consistency score is equal to the number of pairs of residues that are simultaneously found in the multiple alignment and in the library, divided by the total number of pairs observed in the multiple sequence alignment. The maximum is 1, but the real optimum depends on the level of consistency found within the library. To increase the biological relevance of this function, each pair of residues is associated with a weight indicative of the quality of the pairwise alignment it comes from (a measure of the percentage identity between the two sequences). The COFFEE function can be formalized as follows. Given N aligned sequences S1...SN in a multiple alignment, Ai,j is the pairwise projection (obtained from the multiple alignment) of the sequences Si and Sj, LEN(Ai,j) is the number of ungapped columns in this alignment, SCORE(Ai,j) is the overall consistency between Ai,j and the corresponding pairwise alignment in the library, and Wi,j is the weight associated with this pairwise alignment.
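The consistency computation just described can be sketched as follows (an illustration with our own names; in practice the library would come from an independent set of pairwise alignments rather than, as in this toy usage, from the alignment itself):

```python
from itertools import combinations

def residue_pairs(row_i, row_j):
    """Aligned residue pairs in the pairwise projection of two MSA rows,
    with residues identified only by their index within each sequence."""
    pairs, ri, rj = set(), 0, 0
    for x, y in zip(row_i, row_j):
        if x != "-" and y != "-":
            pairs.add((ri, rj))
        ri += x != "-"
        rj += y != "-"
    return pairs

def coffee_score(rows, library, weights):
    """COFFEE-style consistency: weighted fraction of the residue pairs
    aligned in the MSA that are also found in the library. `library` and
    `weights` map a sequence-index pair (i, j) to a pair set / a float."""
    num = den = 0.0
    for i, j in combinations(range(len(rows)), 2):
        msa_pairs = residue_pairs(rows[i], rows[j])
        num += weights[i, j] * len(msa_pairs & library[i, j])
        den += weights[i, j] * len(msa_pairs)
    return num / den if den else 0.0

rows = ["ACG-T", "A-GGT"]
library = {(0, 1): residue_pairs(rows[0], rows[1])}   # self-consistent toy
score = coffee_score(rows, library, {(0, 1): 1.0})    # -> 1.0
```

Because the denominator counts the ungapped columns of each projection, an alignment fully consistent with its library scores exactly 1.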
    COFFEE score = [Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} W_{i,j} × SCORE(A_{i,j})] / [Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} W_{i,j} × LEN(A_{i,j})]

If we compare this function to the weighted sums of pairs described earlier, we find that the main difference is the library, which replaces the substitution matrix and provides a position-dependent means of evaluation. It is also interesting to note that, under this formulation, an alignment having an optimal COFFEE score is equivalent to a Maximum Weight Trace alignment using a 'pairwise alignment graph' [35]. Table 2 shows some of the results obtained using SAGA/COFFEE on 3Dali. For that experiment, the library of pairwise alignments had been generated using ClustalW alignments, and the resulting alignments proved to be of a higher biological quality than those obtained with the alternative methods available at the time. Eventually, these results were convincing enough to prompt the development of a fast non-GA-based method for the optimization of the COFFEE function. That new algorithm, named T-Coffee, was recently made available to the public [45].

4.3 Taking Non-Local Interactions Into Account: RAGA

So far, we have reviewed the use of SAGA for sequence analysis problems that consider every position as independent from the others. While that approximation is acceptable when the sequence signal is strong enough to drive the alignment, this is not always the case when dealing with sequences that have a lower information content than proteins but carry explicit structural information, such as RNA or DNA. To illustrate one more usage of GAs, it is interesting to examine a case where SAGA was used to optimize an RNA structure superimposition in which the OF takes into account local and non-local interactions altogether. RNA was chosen because its fold, largely based on Watson and Crick base pairings [63], generates characteristic structures (stem-loops) that are easy to predict and analyze [65].
Since the pairing potential of two RNA bases can be predicted with reasonable accuracy, the evaluation of an alignment can easily take into account structure (Se) and sequence (Pr) similarities altogether. The version of SAGA in which that new function is implemented is named RAGA (RNA Alignment by Genetic Algorithm) [47]. In RAGA, the OF evaluates the alignment of two RNA sequences, one with a known secondary structure (the master) and one that is homologous to the master but whose exact secondary structure is unknown (the slave). It can be formalized as follows:

    Alignment score = Pr + (λ × Se) − Gap penalty    (2)

where λ is a constant (1-3) and the gap penalty is the sum of the affine gap penalties within the alignment. Pr is simply the number of identities. Se uses the secondary structure of the master sequence and evaluates the stability of the folding it induces onto the slave sequence. If two bases form a base pair (part of a stem) in the master, then the two 'slave' bases they are aligned to should be able to form a Watson and Crick base pair as well. Se is the sum of the scores of these induced pairs. The energetic model used in RAGA is very simplified: it assigns 3 to GC pairs and 2 to UA and UG pairs.

Assessing the accuracy and the efficiency of RAGA is a problem very similar to the one encountered when analyzing SAGA. In this case, the reference alignments were chosen from mitochondrial ribosomal small-subunit RNA sequence alignments established by experts [61]. The human sequence was used as a master and realigned by RAGA to seven other homologous mitochondrial sequences used as slaves. Evaluation was made by comparing the optimized pairwise alignments to those contained in the reference alignment. The results in Table 3 indicate very clearly that a proper optimization took place and that the secondary structure information was efficiently used to enhance the alignment quality.
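Equation (2) can be sketched for a single master/slave pair as follows (an illustrative toy with our own names; RAGA's actual implementation is more involved). `stems` lists the paired master residue indices, and the induced slave pairs are scored with the simplified model above (GC = 3, AU = 2, GU = 2):

```python
def raga_score(master, slave, stems, lam=2.0, gop=3.0, gep=1.0):
    """Pr (identities) + lambda * Se (base pairs induced on the slave by
    the master's known stems) - affine gap penalties."""
    PAIR = {frozenset("GC"): 3, frozenset("AU"): 2, frozenset("GU"): 2}
    induced = {}          # master residue index -> aligned slave base
    mi = 0                # current master residue index
    pr, gap_pen, in_gap = 0.0, 0.0, False
    for x, y in zip(master, slave):
        if x != "-" and y != "-":
            pr += x == y                  # identity contributes to Pr
            induced[mi] = y
            in_gap = False
        elif x != y:                      # exactly one side is a gap
            gap_pen += gep + (0.0 if in_gap else gop)
            in_gap = True
        if x != "-":
            mi += 1
    se = sum(PAIR.get(frozenset((induced.get(i, "-"),
                                 induced.get(j, "-"))), 0)
             for i, j in stems)
    return pr + lam * se - gap_pen

# Toy hairpin: master positions (0,6) and (1,5) form GC stems.
score = raga_score("GGAAUCC", "GGA-UCC", stems=[(0, 6), (1, 5)])
```

Here Pr = 6 identities, Se = 3 + 3 from the two induced GC pairs, and one single-base gap costs 4, giving 6 + 2 × 6 − 4 = 14.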
This is especially noticeable for very divergent sequences that do not contain enough information at the primary level for an accurate alignment to be determined on that basis alone. It is also interesting to point out that RAGA can take into account elements of the tertiary structure known as pseudo-knots, which were successfully added to the objective function. These elements, which are beyond the scope of most dynamic-programming-based methods, lead to even more accurate alignment optimization [47].

5 Conclusion: GA versus Heuristic Methods

Section 4 of this chapter illustrates three situations in which GAs proved able to solve very complex optimization problems with a reasonable level of accuracy. On its own, this clearly indicates the importance and the interest of these methods in the field of sequence analysis. Yet, when applied to this type of problem, GAs suffer from two major drawbacks: they are very slow and unreliable. By unreliable, we mean that, given a set of sequences, a GA may not deliver the same answer twice, owing to the stochastic nature of the optimization process and to the difficulty of the optimization. This may be a great cause of concern to the average biologist, who expects to use a multiple alignment as a prediction tool and possibly as a decision aid for the design of expensive wet-lab experiments. How severe is this problem? If we consider the protein test cases analyzed here, SAGA reaches its best score in half of the runs on average. For RAGA, maybe because the solution space is more complex, this proportion goes down to 20%. If one is only interested in validating a new objective function, this is not a major source of concern, since even in the worst cases the sub-optimal solutions are within a few percent of the best solution found. However, this instability is not unique to GAs and is not as severe as the second major drawback: efficiency.
Although much more practical than SA, GAs' slowness means that they cannot really be expected to become part of any of the very large projects that require millions of alignments to be routinely made over a few days [15]. More robust, if less accurate, techniques are required for that purpose. Is the situation hopeless, then? The answer is definitely no, since two important fields of application exist for which GAs are uniquely suited. The first is the analysis of rare and very complex problems for which no other alternative is available, such as the folding of very long RNAs. The second is more general: GAs provide us with a unique way of probing very complex problems with little concern, at least in the first stages, for the algorithmic issues involved. It is quite remarkable that even with a very simple GA one can quickly ask very important questions and decide whether a thread of investigation is worth being pursued or should simply be abandoned. The COFFEE project is a good example of such a cycle of analysis. It followed a three-step process: (i) an objective function was first designed without any concern for the complexity of its optimization or the algorithmic issues; (ii) SAGA was used to evaluate the biological relevance of that function; (iii) this validation was convincing enough to prompt the conception of a new dynamic programming algorithm, much faster and appropriate for this function [45]. This non-GA-based algorithm was named T-Coffee (Tree-based COFFEE). The respective development times of these two projects make a good case for the use of SAGA: the COFFEE project took four months to carry out, while completion of the T-Coffee project required more than a year and a half of algorithm development and software engineering.
$YDLODELOLW\ SAGA, RAGA, COFFEE and T-Coffee are all available free of charge from the author either via Email (cedric.notredame@igs.cnrs-mrs.fr) or via the WWW (http://igs-server.cnrsmrs.fr/~cnotred).  $FNQRZOHGJHPHQWV The author wishes to thank Dr Hiroyuki Ogata and Dr Gary Fogel for very helpful comments and for an in-depth review of the manuscript.  5HIHUHQFHV [1] S. F. Altschul, *DS FRVWV IRU PXOWLSOH VHTXHQFH DOLJQPHQW, J. Theor. Biol., 138 (1989), pp. 297-309. [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] S. F. Altschul, R. J. Carroll and D. J. Lipman, :HLJKWV IRU GDWD UHODWHG E\ D WUHH, Journal of Molecular Biology, 207 (1989), pp. 647-653. S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, %DVLF ORFDO DOLJQPHQW VHDUFK WRRO, Journal of Molecular Biology, 215 (1990), pp. 403-410. S. F. Altschul and D. J. Lipman, 7UHHV VWDUV DQG PXOWLSOH ELRORJLFDO VHTXHQFH DOLJQPHQW, SIAM J. Appl. Math., 49 (1989), pp. 197-209. L. A. Anabarasu, 0XOWLSOH VHTXHQFH DOLJQPHQW XVLQJ SDUDOOHO JHQHWLF DOJRULWKPV, , 7KH 6HFRQG $VLD3DFLILF &RQIHUHQFH RQ 6LPXODWHG (YROXWLRQ 6($/ , Canberra,australia, 1998. A. Bairoch, P. Bucher and K. Hofmann, 7KH 3526,7( GDWDEDVH LWV VWDWXV LQ , Nucleic Acids Research, 25 (1997), pp. 217-221. G. J. Barton and M. J. E. Sternberg, $ VWUDWHJ\ IRU WKH UDSLG PXOWLSOH DOLJQPHQW RI SURWHLQ VHTXHQFHV FRQILGHQFH OHYHOV IURP WHUWLDU\ VWUXFWXUH FRPSDULVRQV, Journal of Molecular Biology, 198 (1987), pp. 327-337. S. A. Benner, M. A. Cohen and G. H. Gonnet, 5HVSRQVH WR %DUWRQ V OHWWHU &RPSXWHU VSHHG DQG VHTXHQFH FRPSDULVRQ, Science, 257 (1992), pp. 1609-1610. P. Bucher, K. Karplus, N. Moeri and K. Hofmann, $ IOH[LEOH PRWLI VHDUFK WHFKQLTXH EDVHG RQ JHQHUDOL]HG SURILOHV, Comput Chem, 20 (1996), pp. 3-23. K. Bucka-Lassen, O. Caprani and J. Hein, &RPELQLQJ PDQ\ PXOWLSOH DOLJQPHQWV LQ RQH LPSURYHG DOLJQPHQW, Bioinformatics, 15 (1999), pp. 122-30. L. Cai, D. Juedes and E. 
Liakhovitch, Evolutionary computation techniques for multiple sequence alignment, Congress on Evolutionary Computation, 2000, pp. 829-835.
[12] H. Carrillo and D. J. Lipman, The multiple sequence alignment problem in biology, SIAM J. Appl. Math., 48 (1988), pp. 1073-1082.
[13] K. Chellapilla and G. B. Fogel, Multiple sequence alignment using evolutionary programming, Congress on Evolutionary Computation, 1999, pp. 445-452.
[14] F. Corpet, Multiple sequence alignment with hierarchical clustering, Nucleic Acids Res., 16 (1988), pp. 10881-10890.
[15] F. Corpet, F. Servant, J. Gouzy and D. Kahn, ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons, Nucleic Acids Res., 28 (2000), pp. 267-269.
[16] L. Davis, The Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, 1991.
[17] P. J. Davis and R. Hersh, The Mathematical Experience, Birkhauser, Boston, 1980.
[18] M. O. Dayhoff, Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, DC, USA, 1978.
[19] J. Felsenstein, PH…

…1400 MSAs altogether) and on 6 out of 10 BaliBase categories. The total results confirm M-Coffee to be, on average, the best performer on the three datasets. Further analysis of individual datasets (Table 3) also reveals that, on average, M-Coffee is about twice as likely as any of the individual methods to deliver the most accurate MSA (1104 versus 614). In terms of CPU time, M-Coffee is very similar to the standard T-Coffee, with the difference that it does not require the estimation of the pairwise library. For instance, if we consider 1bxkA-1he2A, a standard Prefab dataset of 50 sequences, 200 amino acids long and of 47% average identity, the default T-Coffee requires 270 s to align that dataset on a standard PC (Pentium 2 GHz, 500 MB RAM), while M-Coffee8 requires 180 s on a similar machine.

1698 Nucleic Acids Research, 2006, Vol. 34, No. 6

Table 2.
The CS accuracy performance of M-Coffee8 and various individual methods on the HOMSTRAD, Prefab and BaliBase references

Dataset               M-Coffee8  ClustalW  Dialign-T  FINSI  Muscle 6  PCMA   POA    Probcons  T-Coffee
Homstrad              67.75*     61.15     57.92      64.22  66.04     63.73  51.9   66.41     65.37
Prefab <10%           27.19      18.25     15.51      24.86  24.14     25.53  9.09   24.81     23.41
Prefab 10 to <20%     59.80*     43.27     44.11      58.76  54.76     55.96  32.26  56.21     55.28
Prefab 20 to <30%     84.58*     74.79     75.28      83.76  82.09     81.47  64.42  82.85     82.39
Prefab 30 to <40%     92.54*     87.27     85.62      91.81  90.42     89.84  79.96  91.68     91.51
Prefab 40 to <100%    97.05*     94.91     96.07      96.92  96.17     95.03  94.30  96.20     96.68
Prefab total          72.91*     61.68     62.05      72.01  69.56     69.76  52.61  70.54     69.97
BaliBase Set: 11      43.18*     22.68     25.32      38.95  34.37     37.45  11.18  39.55     32.68
BaliBase Set: 12      85.91*     71.43     72.57      82.68  84.80     82.61  51.05  84.80     83.00
BaliBase Set: 20      43.12      21.68     29.20      45.85  36.49     44.83  13.37  37.78     39.68
BaliBase Set: 30      59.19*     25.48     35.19      57.59  41.04     58.15  7.89   47.26     47.48
BaliBase Set: 40      58.17      39.04     44.75      60.02  48.42     53.83  14.42  51.25     55.58
BaliBase Set: 50      59.81      33.69     44.25      57.69  50.56     59.88  21.63  55.25     57.31
BaliBase Set: S11     59.50      40.76     33.34      50.63  59.37     44.76  31.37  58.45     47.61
BaliBase Set: S12     86.59      79.05     76.20      84.02  86.95     82.91  68.14  87.05     83.75
BaliBase Set: S2      56.76      44.37     36.90      53.85  55.78     51.85  35.24  54.46     49.78
BaliBase Set: S3      69.41*     49.69     47.31      63.83  63.14     64.10  36.14  65.03     64.45
BaliBase Set: S5      60.60      43.27     45.47      57.73  60.33     56.73  28.47  59.80     55.67
BaliBase total        62.02      42.83     44.59      59.34  56.47     57.92  29.00  58.24     56.10

HOMSTRAD was evaluated with aln_compare, Prefab with Qscore and BaliBase with BaliScore. Methods significantly better (P < 0.05) than the next best are marked with an asterisk.

Table 3.
Individual dataset analysis

Subset                 M-Coffee8 better  M-Coffee8 worse  P (Wilcoxon signed)  Best single method
Homstrad               139               65               0.000                ProbCons
Prefab <10%            49                37               0.16                 PCMA
Prefab 10 to <20%      326               226              0.000                Finsi
Prefab 20 to <30%      278               132              0.000                Finsi
Prefab 30 to <40%      64                35               0.003                ProbCons
Prefab 40 to <100%     62                25               0.002                Finsi
Prefab total           779               455              0.000                /
BaliBase Set: 11       19                5                0.002                ProbCons
BaliBase Set: 12       26                7                0.008                Finsi
BaliBase Set: 20       16                14               0.967                PCMA
BaliBase Set: 30       16                5                0.013                Finsi
BaliBase Set: 40       24                10               0.333                PCMA
BaliBase Set: 50       12                4                0.078                Muscle 6
BaliBase Set: S11      12                15               0.793                ProbCons
BaliBase Set: S12      13                11               0.437                Muscle 6
BaliBase Set: S2       21                13               0.397                ProbCons
BaliBase Set: S3       19                6                0.024                Muscle 6
BaliBase Set: S5       8                 5                0.623                /
BaliBase total         186               95               0.002                /
Total                  1104              615
Total versus ProbCons  1249              615                                   ProbCons

The data are the same as in Table 2. On each subset, M-Coffee8 is compared with the best performing single method. Columns 2 and 3 indicate the number of times M-Coffee8 is better/worse than the best single method on that subset. The last two lines give the total for the table (Total) and the result of a comparison against ProbCons, the best individual method.

… remains more accurate than most individual methods (including the duplicated one). These results suggest that the combination procedure is a rather robust process, able to cope with a significant amount of noise. A potential problem with meta-methods is their tendency to homogenize a field of research by competing unfairly against the individual methods they are made of. In the case of M-Coffee, it is interesting to stress the importance of original and independent individual methods, as illustrated by the method tree. It is also worth pointing out that our analysis reveals several method convergences (Figure 1) that may not be entirely obvious to a non-specialist basing their judgement on the methods' technical descriptions. Overall, M-Coffee will keep performing well, and improving, as long as independent methods continue to be produced.
Such a concept resonates strongly with the notions of 'crowds' and 'mobs', and with the observation that a group of non-expert people can arrive at more accurate decisions than a small number of 'experts' (37). Crowds are described as having the potential to be wise, but only as long as the crowd members are independent and do not form a mob. Mobs are consistent but easily led to the wrong decision.

ACKNOWLEDGEMENTS

We are especially grateful to Martin Vingron for his advice in using the variance/covariance weighting system. We thank Prof. Jean-Michel Claverie (head of IGS) for useful discussions and material support, and Fabrice Armougom, Sebastien Moretti, Olivier Poirot and Vladimir Saudek for their help in maintaining and debugging the T-Coffee package. C.N. was supported by CNRS (Centre National de la Recherche Scientifique), Sanofi-Aventis Pharma SA, Marseille-Nice Génopole and the French National Genomic Network (RNG). Part of this work is funded by Science Foundation Ireland. Funding to pay the Open Access publication charges for this article was provided by the Centre National de la Recherche Scientifique.

Conflict of interest statement. None declared.

REFERENCES

1. Notredame,C. (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics, 3, 131–144.
2. Wallace,I.M., Blackshields,G. and Higgins,D.G. (2005) Multiple sequence alignments. Curr. Opin. Struct. Biol., 15, 261–266.
3. Hogeweg,P. and Hesper,B. (1984) The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J. Mol. Evol., 20, 175–186.
4. Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838.
5. Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
6. Wallace,I.M., O'Sullivan,O. and Higgins,D.G.
(2005) Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics, 21, 1408–1414.
7. Notredame,C. and Higgins,D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24, 1515–1524.
8. Gotoh,O. (1990) Consistency of optimal sequence alignments. Bull. Math. Biol., 52, 509–525.
9. Kececioglu,J.D. (1993) Lecture Notes in Computer Science. Springer-Verlag, Vol. 684, pp. 106–119.
10. Vingron,M. and Argos,P. (1991) Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol., 218, 33–43.
11. Morgenstern,B., Frech,K., Dress,A. and Werner,T. (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14, 290–294.
12. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217.
13. Lee,C., Grasso,C. and Sharlow,M.F. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics, 18, 452–464.
14. Katoh,K., Kuma,K.I., Toh,H. and Miyata,T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518.
15. Do,C.B., Mahabhashyam,M.S., Brudno,M. and Batzoglou,S. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340.
16. Pei,J., Sadreyev,R. and Grishin,N.V. (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 19, 427–428.
17. Thompson,J.D., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88.
18. Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469–2471.
19. Cuff,J.A., Clamp,M.E., Siddiqui,A.S., Finlay,M. and Barton,G.J. (1998) JPred: a consensus secondary structure prediction server.
Bioinformatics, 14, 892–893.
20. Allen,J.E. and Salzberg,S.L. (2005) JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics, 21, 3596–3603.
21. Bucka-Lassen,K., Caprani,O. and Hein,J. (1999) Combining many multiple alignments in one improved alignment. Bioinformatics, 15, 122–130.
22. Karplus,K. and Hu,B. (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 17, 713–720.
23. Thompson,J.D., Koehl,P., Ripp,R. and Poch,O. (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins, 61, 127–136.
24. Lassmann,T. and Sonnhammer,E.L. (2002) Quality assessment of multiple alignment programs. FEBS Lett., 529, 126–130.
25. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
26. Chenna,R., Sugawara,H., Koike,T., Lopez,R., Gibson,T.J., Higgins,D.G. and Thompson,J.D. (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res., 31, 3497–3500.
27. Notredame,C., Holm,L. and Higgins,D.G. (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14, 407–422.
28. Morgenstern,B. (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211–218.
29. Lenhof,H.P., Morgenstern,B. and Reinert,K. (1999) An exact solution for the segment-to-segment multiple sequence alignment problem. Bioinformatics, 15, 203–210.
30. Subramanian,A.R., Weyer-Menkhoff,J., Kaufmann,M. and Morgenstern,B. (2005) DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics, 6, 66.
31. Katoh,K., Misawa,K., Kuma,K. and Miyata,T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.
Nucleic Acids Res., 30, 3059–3066.
32. Grasso,C. and Lee,C. (2004) Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics, 20, 1546–1556.
33. O'Sullivan,O., Suhre,K., Abergel,C., Higgins,D.G. and Notredame,C. (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol., 340, 385–395.
34. Altschul,S.F., Carroll,R.J. and Lipman,D.J. (1989) Weights for data related by a tree. J. Mol. Biol., 207, 647–653.
35. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Appl. Biosci., 10, 19–29.
36. Eyrich,V.A., Marti-Renom,M.A., Przybylski,D., Madhusudhan,M.S., Fiser,A., Pazos,F., Valencia,A., Sali,A. and Rost,B. (2001) EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242–1243.
37. Surowiecki,J. (2004) The Wisdom of Crowds. Abacus, London.

Published online 17 April 2008  Nucleic Acids Research, 2008, Vol. 36, No. 9 e52  doi:10.1093/nar/gkn174

R-Coffee: a method for multiple alignment of non-coding RNA

Andreas Wilm1, Desmond G. Higgins1 and Cédric Notredame2,*

1The Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Ireland and 2Centre for Genomic Regulation (CRG), Dr Aiguader, 88, 08003 Barcelona, Spain

Received December 20, 2007; Revised March 14, 2008; Accepted March 25, 2008

ABSTRACT

R-Coffee is a multiple RNA alignment package, derived from T-Coffee, designed to align RNA sequences while exploiting secondary structure information. R-Coffee uses an alignment-scoring scheme that incorporates secondary structure information within the alignment. It works particularly well as an alignment improver and can be combined with any existing sequence alignment method.
In this work, we used R-Coffee to compute multiple sequence alignments combining the pairwise output of sequence aligners and structural aligners. We show that R-Coffee can improve the accuracy of all the sequence aligners. We also show that the consistency-based component of T-Coffee can improve the accuracy of several structural aligners. R-Coffee was tested on 388 BRAliBase reference datasets and on 11 longer Cmfinder datasets. Altogether, our results suggest that the best protocol for aligning short sequences (less than 200 nt) is the combination of R-Coffee with the pairwise RNA structural aligner Consan. We also show that the simultaneous combination of the four best sequence alignment programs with R-Coffee produces alignments almost as accurate as those obtained with R-Coffee/Consan. Finally, we show that R-Coffee can also be used to align longer datasets beyond the usual scope of structural aligners. R-Coffee is freely available for download, along with documentation, from the T-Coffee web site (www.tcoffee.org).

INTRODUCTION

A number of recent discoveries have cast new light on the importance of RNA, revealing a functional scope much broader than realized only a few years ago. Small non-coding RNAs (ncRNAs) are actively involved in a wide range of cell processes, including gene regulation, cell differentiation, genome maintenance, RNA maturation and protein synthesis (1,2). The ncRNA big picture could change even further in the coming years, as suggested by a recent report of the ENCODE consortium (3) showing an unexpected level of ncRNA transcription across the entire human genome. While the problem of aligning sequences has been regularly addressed over the last 40 years (4), delivering accurate alignments for ncRNAs remains a challenging task for at least two main reasons. First of all, RNA molecules have a low chemical complexity compared to proteins: just a four-letter alphabet.
This limited information content makes it hard to use sequence similarity as an estimator of the biological relevance of RNA alignments. The most notable consequence is the limited sensitivity of RNA alignments, and it is generally accepted that the RNA twilight zone (i.e. the level of identity below which pairwise alignments become uninformative) is close to 70%, as opposed to 25% for proteins (5–7). The second reason for the difficulty in aligning ncRNAs comes from their rate and pattern of evolution. In general, functional ncRNAs have a well-defined structure and their evolution seems to be mainly constrained to retain a specific folding, mostly based on Watson-Crick base pairs. Maintaining such a structure can be achieved through compensatory mutations, a phenomenon that explains why sequences can diverge considerably while still coding for the same structure (8). Therefore, sequence identity alone is a very crude measure of biological similarity, as it does not reflect very well the level of structural conservation. Because of these problems, it is highly desirable to take RNA secondary structure into account while aligning ncRNA sequences, in order to ensure optimal usage of the positional interdependence. Sankoff's algorithm, published 20 years ago (9), does exactly this, but suffers from enormous runtime and memory requirements. Given two sequences of length L, the pairwise alignment requires O(L^6) time and O(L^4) space, while its extension to N sequences is exponential: O(L^(3N)) in time and O(L^(2N)) in space. Only a few simplified implementations exist, usually constrained to pairwise alignment (10–13). Recently, a number of multiple alignment versions

*To whom correspondence should be addressed.
Tel: +34 93 316 02 71; Fax: +34 93 316 00 99; Email: cedric.notredame@crg.es

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

have been published (11,14–18), which employ various techniques to reduce run-time and memory consumption, for example by limiting the length or types of structure motifs, or by using banding techniques during the dynamic programming stage [for review, see (19)]. The most accurate of these heuristics are restricted to sequences shorter than approximately 200 residues, a limitation that explains why it is often more practical to use regular sequence aligners when dealing with larger sequences such as ribosomal RNA, or RNA motifs embedded in long genomic sequences. Most of these aligners treat RNA sequences like regular DNA and rely on an identity-based scoring scheme only suitable for closely related sequences. For instance, ClustalW has long been used for establishing reference collections of ribosomal RNA alignments (20). Following manual curation and visual inspection of the conserved secondary structures, these alignments have been widely used to infer phylogenetic relationships between most living organisms. Taking secondary structures into account may not, however, improve alignment accuracy across the entire spectrum of known ncRNAs. For instance, secondary structure-based alignments will not improve the comparison of mature miRNAs or of mRNAs that are not structurally conserved. In this work, we address the problem of RNA multiple sequence alignment by taking advantage of the T-Coffee framework (21).
T-Coffee is a multiple sequence alignment method able to combine the output of different sequence alignment packages. It takes as input a collection of alignments (pairwise or multiple) and outputs a multiple sequence alignment containing all the sequences. The input, which is referred to as a 'library', can consist of alternative and possibly inconsistent alignments of the same sequences. The purpose of the algorithm is to generate a final alignment that is as consistent as possible with the original input alignments. The main advantage of this procedure is its flexibility. For instance, in the original T-Coffee, the library was compiled from pairwise local and global alignments of the sequences. In M-Coffee (22) the compilation was made using alternative multiple sequence alignments, while in 3D-Coffee (23) or Expresso (24) the library is derived from structure-based pairwise alignments. This simple protocol can easily be built on top of any existing method, as illustrated by two RNA alignment packages: Marna (25) and T-Lara (19). Both packages focused on the development of a novel pairwise RNA alignment algorithm, which was then used to generate an alignment library fed to T-Coffee in order to produce a multiple alignment. In the present work we decided to go further and modify the library processing/extension algorithm so that it could take advantage of known and predicted secondary structures. This is done when compiling the library and while evaluating the matching score of two residues. This novel strategy forms the core of R-Coffee. Our primary goal was not to produce a stand-alone method, but rather a novel component that can seamlessly be added on top of any existing alignment method. We demonstrate here that it is possible to improve the alignment quality of most existing methods by means of R-Coffee, with only minor computational overhead.
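The library compilation step described above can be sketched as follows. This is a simplified illustration of the idea rather than the actual T-Coffee data structures: the library is reduced to a bare dictionary, and sequence weighting and multiple-alignment input are omitted.

```python
# Sketch of compiling a T-Coffee-style primary library from a set of
# (possibly inconsistent) pairwise alignments produced by different
# aligners. Data structures are illustrative, not T-Coffee's own.

def percent_identity(a, b):
    """Percent identity over positions where both gapped strings
    carry a residue (integer, as a simple alignment score proxy)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != '-' and y != '-']
    if not pairs:
        return 0
    return 100 * sum(x == y for x, y in pairs) // len(pairs)

def add_alignment(library, seq_i, seq_j, gapped_i, gapped_j):
    """Add every aligned residue pair to the library, weighted by the
    percent identity of the alignment it came from."""
    weight = percent_identity(gapped_i, gapped_j)
    ri = rj = 0  # ungapped residue indices
    for x, y in zip(gapped_i, gapped_j):
        if x != '-' and y != '-':
            key = ((seq_i, ri), (seq_j, rj))
            # Alternative alignments of the same pair accumulate weight.
            library[key] = library.get(key, 0) + weight
        ri += x != '-'
        rj += y != '-'
    return library
```

Feeding the same sequence pair several times, once per input aligner, is precisely what lets consistent residue pairings accumulate weight while disputed ones stay weak.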
SYSTEMS AND METHODS

Reference alignments and evaluation

We used two different benchmark sets: BRAliBase 2.0 (5), the standard RNA reference alignment dataset, and Cmfinder (26), a smaller dataset specifically designed for testing local analysis of long sequences. BRAliBase is a collection of RNA reference alignments especially designed for benchmarking RNA alignment methods. We only used its multiple alignment component, made of 388 multiple sequence alignments. These datasets are evenly distributed between 35% and 95% average sequence identity. Each of these datasets was originally produced by extracting sub-alignments from larger seed alignments coming from four RNA families (tRNA, group II intron, 5S rRNA and U5 RNA). Two of these were seed alignments obtained from the Rfam database (27). This procedure may appear slightly circular, as it involves comparing sequence-based reference alignments with other sequence-based alignments. In order to address this issue, BRAliScore, the benchmarking scoring scheme, was designed in such a way that it depends not only on the similarity between the reference and the evaluated alignment but also on the intrinsic structural conservation of the target alignment [see also (6)]. This tradeoff illustrates the difficulty of establishing a gold standard for RNA analysis. The main problem comes from the lack of sufficient experimentally validated RNA structures, in contrast to protein sequence analysis, where hundreds of accurate 3D structures exist. The BRAliScore combines two measures: the Sum of Pairs Score (SPS) and the Structural Conservation Index (SCI) (28). The SPS is the number of residue pairs identically aligned in the target and the reference, divided by the number of pairs in the reference. It was measured using a variant of compalignp [based on Sean Eddy's compalign; see also (6)] adapted to restrict the evaluation to a pre-defined core region. The SCI is a reference-independent measure.
It is defined as the ratio between the free energy of the MSA consensus structure [as calculated by RNAalifold (30)] and the average free energy of all single sequences of the MSA [as calculated by RNAfold (29)]. A value of 0 indicates the lack of a conserved structure, 1 corresponds to a perfect agreement between the energies of the single sequences and the consensus energy, while values higher than 1 indicate a very good agreement supported by additional co-variation. The BRAliScore is the product of the SCI and the SPS score. This combination can lead to problems when either the SPS or the SCI is close to 0. In practice, however, this situation rarely arises, and the combination of these two scores provides a very robust measure, less sensitive than the SPS to the actual accuracy of the reference alignment. To test for statistical differences between pairs of methods, we applied the Wilcoxon signed rank test, as in (6). All analyses were carried out using tools provided at http://www.biophys.uni-duesseldorf.de/bralibase/. Our second dataset is named after the RNA motif finder program Cmfinder (26). It contains Rfam sequences embedded in 200 nt of their original flanking genomic

Table 1.
Programs used for benchmarking and as input for T/R-Coffee

Program     Reference  Version     Structure  Sankoff  Alignment mode
ClustalW    (33)       1.83        N          N        Multiple
Consan      (10)       1.2         Y          Y        Pairwise
Dynalign    (12)       Dec-06      Y          Y        Pairwise
Foldalign   (13)       2.0.3       Y          Y        Pairwise
FoldalignM  (15)       1.0.1       Y          Y        Multiple
Mafft       (35)       5.861       Y          N        Multiple
Marna       (25)       Jan-07      Y          N        Multiple (T-Coffee)
M-Locarna   (17)       0.99        Y          Y        Multiple
Murlet      (14)       Mar-06      Y          Y        Multiple
Muscle      (32)       3.6         N          N        Multiple
Pcma        (45)       2           N          N        Multiple
Pmcomp      (11)       Jun-04      Y          Y        Pairwise
Pmmulti     (11)       Jun-04      Y          Y        Multiple
Poa         (46)       2           N          N        Multiple
Proalign    (47)       0.5.a3      N          N        Multiple
Probcons    (34)       1.1         N          N        Multiple
Prrn        (48)       SCC 3.0.a   N          N        Multiple
Rnasampler  (44)       1.3         Y          Y        Multiple
Stemloc     (16)       Dart 0.19b  Y          Y        Multiple
Stral       (49)       0.5.4       Y          N        Multiple
T-Lara      (19)       1.31        Y          N        Multiple (T-Coffee)
T-Coffee    (21)       5.19        N          N        Multiple

This table lists all the packages evaluated, along with their version numbers (or download date). The Structure column indicates whether the considered package uses predicted secondary structures (Y for yes, N for no). The Sankoff column indicates whether the package is a heuristic implementation of the original Sankoff algorithm. The Alignment mode column indicates whether the package performs pairwise or multiple alignment, or whether it is based on the T-Coffee package. Command-line parameters, in the order listed in the original table: -type = dna; -m mixed80.mod; Len2-len1 + 5; 0.4 5 20 2 1 0; -global; -score_matrix global.fmat; ginsi/fftns; mlocarna-p; -seqtype rna; blosum80.mat; -multiple –slow; -global; -o lara.params; -dp_mode myers_miller_pair_wise. Most programs were used as in the BRAliBase alignment benchmark publications (5) and (6).

regions, randomly distributed between the 5'- and the 3'-end of the ncRNA sequence (i.e. x nucleotides on the 5'-end, y nucleotides on the 3'-end, with x and y randomly chosen so that x + y = 200). The unaligned datasets were kindly provided by the Cmfinder authors.
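The SPS/SCI combination behind the BRAliScore, described in the evaluation section above, can be sketched as follows. This is an illustrative Python version, not the official BRAliBase scripts: the SCI is taken as a precomputed number (from RNAfold/RNAalifold), and core-region masking, as done by the adapted compalignp, is omitted.

```python
# Sketch of the BRAliScore combination: SPS (agreement with the
# reference alignment) multiplied by a precomputed SCI value.

def aligned_pairs(msa):
    """Return the set of residue pairs aligned in an MSA
    (sequences given as equal-length gapped strings)."""
    pairs = set()
    nseq = len(msa)
    pos = [0] * nseq  # ungapped residue index per sequence
    for col in range(len(msa[0])):
        col_res = []
        for s in range(nseq):
            if msa[s][col] != '-':
                col_res.append((s, pos[s]))
                pos[s] += 1
        for a in range(len(col_res)):
            for b in range(a + 1, len(col_res)):
                pairs.add((col_res[a], col_res[b]))
    return pairs

def sps(target, reference):
    """Fraction of the reference's aligned residue pairs
    reproduced in the target alignment."""
    ref = aligned_pairs(reference)
    return len(aligned_pairs(target) & ref) / len(ref) if ref else 0.0

def braliscore(target, reference, sci):
    """sci: Structural Conservation Index, precomputed externally."""
    return sps(target, reference) * sci
```

Because the SCI factor is reference-independent, a target alignment that matches the reference poorly but preserves a conserved consensus structure is penalized less harshly than by the SPS alone.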
We limited our choice to datasets having fewer than 40 sequences, thus generating 11 reference alignments (9 to 35 sequences, lengths between 167 and 368 nt). The average level of identity within the core regions of these alignments ranges from 31% to 73%. These characteristics make the Cmfinder dataset a difficult target, especially because of the sequence length and the inclusion of flanking regions. These datasets are also closer to 'real life' situations, which often involve discovering an RNA motif within poorly characterized sequences. The presence of flanking genomic regions potentially embedded in the Cmfinder datasets made it impossible to systematically use the SCI component of the BRAliScore. Instead, we used the Sum of Pairs score (SPS) and restricted the scoring to the Rfam core region of the alignment. Note that most available RNA alignment benchmark sets are based on Rfam alignments. These alignments are by no means a gold standard (especially not the 'full' Rfam alignments) and, unlike most protein alignment benchmarks, are not based on 3D superposition. For example, the BRAliBase benchmark set was created from four RNA families, two of which were full Rfam alignments (U5, g2intron) and two of which were Rfam seed alignments (tRNA, 5S), i.e. hand-curated and thus more likely to be of high quality. The Cmfinder datasets are exclusively based on Rfam seed alignments. The low number of quality alignments especially suited for benchmarking multiple RNA alignment programs (i.e. equally distributed over a wide sequence identity range, etc.) is a notorious problem. New RNA alignment benchmarks with a high number of RNAs, using expert hand-curated alignments based on structural superposition [e.g. from the Comparative RNA web site (31)], would constitute a useful advance in this area.

Alignment programs

To test and benchmark R-Coffee, we compared a variety of programs with different features (Table 1).
These programs can be roughly divided into three categories: pairwise structural aligners, multiple structural aligners and regular multiple sequence aligners. Pairwise structural aligners like Consan (10) are heuristic approximations of the original Sankoff algorithm. Their heavy computational requirements limit them to short sequences. The second category includes structural aligners extended to deal with multiple sequences, like FoldalignM (15) or Stemloc (16). Like their pairwise counterparts, they use structure and sequence information during the alignment and are therefore restricted to short datasets. The third category is made of regular multiple sequence alignment programs like Muscle (32) or ClustalW (33). These programs do not rely on any kind of structural modeling, although some of them [like Probcons (34) and Mafft (35)] have parameters optimized for BRAliBase, i.e. program parameters trained on BRAliBase alignments by the respective programs' authors. These last two categories of packages can be used to align either multiple sequence datasets or pairs of sequences. Most programs were used as described in (5) and (6). Marna (25), Pmmulti (11) and Stemloc (16) were not able to align all test sets of BRAliBase. In particular, Marna cannot align sequences that contain IUPAC characters, and the ability of Stemloc to align sequences seems to depend on the size of the main memory. We therefore had to exclude these packages from the comparison, although it should be noted that they produced accurate alignments on the datasets they could align (data not shown). In the case of T-Lara we did not use the pairwise alignments, as T-Lara already uses T-Coffee. Instead, we used R-Coffee as a drop-in replacement for T-Coffee. We used the standalone versions of all packages to compute multiple alignments for all the reference datasets. We also used them in combination with either T-Coffee or R-Coffee.
All programs were run on a Quad-Xeon 3 GHz machine with 6 GB RAM running Red Hat Enterprise Linux.

Original T-Coffee strategy

T-Coffee is a versatile MSA package that allows the combination of many pairwise (or multiple) sequence alignments into one unique final model. The principle is fairly straightforward. Given a set of sequences, a collection of pairwise alignments is computed. This collection can be redundant (several alternative alignments for each pair of sequences) or not, and is compiled into a data structure called the primary library. The primary library contains the list of all the pairs of aligned residues observed in the alignment collection. Each of these pairs receives a weight equal to the score of the alignment it came from (in practice, the percent identity is used). These weights are then re-estimated in a process named library extension. The purpose of the new weights (extended weights) is to reflect the level of consistency between each pair of aligned residues and the rest of the library. High-scoring pairs are those in very good agreement with the rest of the pairs, and their high score ensures that they easily find their way into the final alignment. R-Coffee uses the Myers and Miller algorithm (command-line option: -dp_mode = myers_miller_pair_wise) to align pairs of sequences or profiles, rather than the current T-Coffee default (-dp_mode = cfasta_pair_wise), which uses a banded dynamic programming implementation extensively tuned for proteins. The Myers and Miller setting corresponds to the T-Coffee algorithm as described in the original publication (21).

Adaptation of T-Coffee to use RNA structural information

The novel RNA-specific mode of T-Coffee described here has been designed to use secondary structure predictions. The current design supports an arbitrary number of structural predictions, and each sequence can be associated with one or more secondary structure predictions that do not need to be in agreement.
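The library extension step described above can be sketched as follows. This is a deliberate simplification, assuming the library is a plain dictionary of weighted residue pairs; the real T-Coffee extension differs in its weighting and normalization details.

```python
# Sketch of T-Coffee-style library extension: the weight of each
# residue pair grows with the support it receives through residues of
# a third sequence (if A-B and B-C are in the library, A-C gains
# weight). Illustrative only, not the actual T-Coffee algorithm.

def extend(library):
    """library: dict ((seq, res), (seq, res)) -> weight, with keys
    ordered by sequence index. Returns the extended library."""
    # Index library entries by residue for fast lookup.
    partners = {}
    for (r1, r2), w in library.items():
        partners.setdefault(r1, {})[r2] = w
        partners.setdefault(r2, {})[r1] = w
    extended = dict(library)
    for (r1, r2), w in library.items():
        for mid, w1 in partners[r1].items():
            # 'mid' must belong to a third sequence and align with
            # both r1 and r2; the support added is the weaker of the
            # two links through the intermediate residue.
            if mid[0] not in (r1[0], r2[0]) and mid in partners[r2]:
                extended[(r1, r2)] += min(w1, partners[r2][mid])
    return extended
```

Pairs confirmed by many intermediate sequences end up with high extended weights, which is what makes them likely to survive into the final alignment.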
It is also possible to associate no structural information with some sequences. In practice, however, we expect the best results to be obtained when using at least one secondary structure prediction for each sequence in the dataset. These structural predictions are passed to R-Coffee using a data structure similar to the T-Coffee primary library, named a structural library. In this structural library, each line indicates a pair of nucleotides predicted to be base-paired. Like its primary sequence counterpart, this structural library can be redundant, contain conflicting pairs or lack data for some pairs. RNA structures were computed using either a global or a local prediction method. Global structure predictions were obtained with RNAfold (29), which finds a structure with Minimal Folding Energy (MFE). When using an MFE structure as input, each predicted base pair was directly added to the structural library without any further filtering. This global MFE-based prediction has two major limitations: the algorithm is computationally demanding when applied to very long sequences, and its prediction accuracy decreases with sequence length (36,37). When dealing with long sequences, a sensible alternative is to use local RNA structure prediction methods such as RNAplfold (38). RNAplfold predicts local pair probabilities for base pairs within a certain span (the default is 100 nt). The program outputs base-pair probabilities rather than one precise structure and, in order to reduce noise, we excluded pairs exhibiting a low thermodynamic probability. We determined a suitable probability threshold by varying the filtering threshold between 0.0 and 0.8 (in steps of 0.1) and estimating the accuracy of the resulting R-Coffee alignments (Figure 2). We found 0.3 to be the optimal threshold, although our results indicate a relative stability of the system around this value (flat peak).
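The probability filtering step described above amounts to a simple cutoff over predicted base pairs. A minimal sketch, assuming a hypothetical (i, j, probability) input format (RNAplfold's real output format differs):

```python
# Sketch: build a structural library from local base-pair probability
# predictions, keeping only pairs at or above the 0.3 cutoff used in the
# paper. The (i, j, probability) triples are an assumed input format.
PROB_CUTOFF = 0.3

def structural_library(base_pair_probabilities):
    """Return the set of (i, j) base pairs kept after filtering."""
    return {(i, j) for i, j, p in base_pair_probabilities if p >= PROB_CUTOFF}
```

Pairs below the cutoff are simply never entered into the library; redundant or conflicting pairs from several predictions can coexist, as the text allows.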
The structural pairs thus gathered are then fed to R-Coffee, the version of T-Coffee using the R-score (see below). The structural libraries used here contain unweighted structure pairs, although it is, in principle, possible to apply a weighting scheme to these pairs, possibly reflecting the thermodynamic stability or the likelihood of each considered pair. For testing purposes we also used random structures as input. These structures were computed by shuffling the input sequences before predicting the structures with RNAfold/RNAplfold as described earlier. For shuffling we used the program shuffle from Sean Eddy's squid package.

The R-score: a novel T-Coffee scoring scheme

The original T-Coffee algorithm was modified in order to incorporate structural information within the library compilation process. This novel evaluation procedure is named the R-score and gives its name to R-Coffee, with the letter R standing for RNA. The R-score requires the secondary structures of the considered sequences to be pre-computed, and it involves two modifications of the original T-Coffee algorithm: one when compiling the pairwise alignment library and the other when evaluating the score for aligning two residues.

The new library compilation procedure involves extending the original T-Coffee library with any residue pair not observed within the pairwise alignments but whose relevance is suggested by the secondary structure predictions (Figure 1). For instance, let A↔X be two nucleotides predicted to form a secondary structure element in sequence 1 and B↔Y two other paired nucleotides in sequence 2. In the standard T-Coffee procedure, if the primary alignment of sequences 1 and 2 contains the aligned pair A–B, this pair will be added as an entry to the library and associated with a weight equal to the average identity of the alignment of sequences 1 and 2 (the weights are added if several alternative alignments contribute the same pair). The R-Coffee library procedure goes further and incorporates the pair X–Y into the library (with the weight of A–B) even if it was not aligned in any of the input library alignments. The rationale is that if the alignment of A–B is correct and if the predicted structures are correct as well, then the pair X–Y should be aligned, and it therefore makes sense to add it to the library (if X–Y is already part of the primary library, its weight is increased by the A–B weight). Whenever more than one structure has been provided for a sequence, the secondary structure library may be ambiguous and provide several alternative base pairings for one or both residues (e.g. A↔X and A↔W in one sequence and B↔Y and B↔Z in the other). In this case, the update considers a combination of all the potential structure-induced aligned pairs (i.e. X–Y, X–Z, W–Y, W–Z) and increases their primary weight with that of A–B.

Figure 1. R-Coffee's RNA-extension. The two Gs correspond to a pair of matched residues observed in the input pairwise alignment. This gets incorporated in the library as a constraint. Both of these nucleotides are predicted to be base-paired (Bp) with two Cs that have not been found aligned. The RNA extension involves incorporating the associated constraint (C matched to C), based on the information contained in the provided structures. This structure-based extension is one of the two main ingredients of the R-Coffee scoring scheme.

The R-score also uses a new evaluation procedure. The regular T-Coffee scoring scheme computes the matching score of a given residue pair A–B by summing over the scores of all the residue triplets including A, B and a third residue x from a third sequence. This can be formalized as follows:

    Tscore(A,B) = Σ_x MIN(Weight(A,x), Weight(B,x))    (1)

where Weight(A,x) is a primary weight and x is any residue reported aligned both to A and B within the primary library. The R-score of that same pair is then defined as:

    Rscore(A,B | A↔X, B↔Y) = MAX(Tscore(A,B), Tscore(X,Y))    (2)

where X pairs with A and Y pairs with B, as indicated by the structural library (A↔X, B↔Y). Whenever the structural library is ambiguous (i.e. A↔X, A↔Z, B↔Y, B↔W), the final score is estimated by considering all the resulting combinations:

    Rscore(A,B | A↔X, A↔Z, B↔Y, B↔W) = MAX(Tscore(A,B), Tscore(X,Y), Tscore(X,W), Tscore(Z,Y), Tscore(Z,W))    (3)

The R-score, like the regular T-Coffee scoring scheme, is then used as a position-specific substitution matrix while computing an alignment. R-Coffee uses the progressive alignment strategy described in the original T-Coffee publication and inspired by the ClustalW implementation. Sequences are all aligned two by two, using a standard identity-based matrix and the Myers and Miller implementation of dynamic programming. These alignments are then used to derive a distance matrix that is turned into a Neighbor-Joining tree (39). This tree is used as a guide tree to define the order in which the sequences are aligned to create the multiple alignment. These alignments are made using the R-score as a position-specific scoring scheme and the Myers and Miller pairwise algorithm. Apart from the use of the Myers and Miller pairwise alignment option, all the other T-Coffee parameters have been left at their original default values.

Availability

R-Coffee is part of the T-Coffee package, an open-source freeware distributed under the GPL license and available at no cost, along with documentation, from www.tcoffee.org. R-Coffee can be compiled on most platforms (UNIX, Mac OS X and Windows).

RESULTS AND DISCUSSION

R-Coffee is an RNA multiple sequence alignment method able to use RNA secondary structure information while computing a multiple sequence alignment. One of the key properties of R-Coffee is its low computational complexity. Given predicted structures, R-Coffee can compute structure-based sequence alignments with a time and space complexity similar to that reported for T-Coffee or Probcons [in the order of O(N^3 L^2) for N sequences of length L]. Nonetheless, the computation of the predicted structures can be a limiting factor. For example, global Minimal Folding Energy methods like RNAfold (29) can be quite demanding on long sequences, and their prediction quality decreases with sequence length (36,37). Our first concern was therefore to check whether the replacement of RNAfold with the less demanding RNAplfold (38) could prove useful. RNAplfold is able to predict the fold of long sequences thanks to its local structure prediction algorithm. In practice, this is achieved by restricting the computation to the local neighborhood of each nucleotide (the default is a span of 100 nt). RNAplfold outputs base-pairing probabilities rather than a single secondary structure. We therefore determined an optimal threshold for filtering out unreliable base pairs. We did so by extensive testing on the BRAliBase dataset (see 'Material and methods' section and Figure 2). The cutoff value thus determined (0.3) was used throughout the remaining experiments.

Figure 2. R-Coffee/RNAplfold base-pair probability threshold optimization. Base pairs predicted by RNAplfold above a certain probability threshold were used as input for R-Coffee. Then all BRAliBase sets were aligned and the average alignment accuracy (BRAliscore) calculated. The optimal threshold was determined to be 0.3.

Figure 3. Effect of the RNA-extension on T-Coffee's performance on BRAliBase 2.0. The plot shows the alignment accuracy as a function of the sequence identity. Scores are averaged over 5% sequence identity bins. Standard T-Coffee is compared to R-Coffee using structure input from RNAfold and RNAplfold, as well as random structures.
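The R-score defined above reduces to a MAX over the consistency scores of the aligned pair and of every structure-induced pair. A minimal sketch, in which `tscore` stands in for the consistency score and `struct_partners` maps each residue to its predicted pairing partners (the names and data layout are our assumptions, not R-Coffee's internals):

```python
# Sketch of the R-score evaluation. `tscore(a, b)` is assumed to return
# the consistency (T-Coffee) score of aligning residues a and b;
# `struct_partners` maps a residue to the residues the structural
# library predicts it to pair with (possibly several, possibly none).
def rscore(tscore, struct_partners, res_a, res_b):
    """MAX over Tscore(A,B) and Tscore(X,Y) for every structure-induced
    pair, X being a predicted partner of A and Y a partner of B."""
    candidates = [tscore(res_a, res_b)]
    for x in struct_partners.get(res_a, ()):
        for y in struct_partners.get(res_b, ()):
            candidates.append(tscore(x, y))
    return max(candidates)
```

When a residue has no predicted partner, the score falls back to the plain T-Coffee score; when the library is ambiguous, all partner combinations are considered, exactly as in the ambiguous case described in the text.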
Given this cutoff value, we systematically compared the BRAliscore obtained by R-Coffee when using RNAfold and RNAplfold structural libraries. Both structural libraries give similar results. Interestingly, this correlation is high regardless of whether the considered sequences are closely or distantly related (Figure 3). The mean BRAliscore for the two methods is the same (0.64), with 53% of the 388 BRAliBase datasets having their BRAliscore within 5% of each other when using the RNAfold or the RNAplfold structural library. We therefore decided to use RNAplfold as the default provider of secondary structure predictions for the rest of the analysis. This allows R-Coffee to deal with sequences of arbitrary size. In order to check the effect of the accuracy of the predicted structures, we also tested R-Coffee using random structure predictions as input. The performance then returns to the default T-Coffee accuracy (Figure 3), i.e. alignment quality does not get worse compared with default T-Coffee, but clearly decreases compared with genuine structure predictions. Altogether, these results suggest that it makes little difference in accuracy whether we use RNAfold or RNAplfold for secondary structure prediction in R-Coffee. They also confirm the effectiveness of incorporating structural information within the alignment procedure. We wish to note here that, although we limited our analysis to these two approaches, the flexibility of R-Coffee's RNA extension allows incorporating and combining any kind of structure prediction. Alternatives include using RNAfold's partition function with an applied threshold (as done with RNAplfold here) or using methods with a higher selectivity like Contrafold (40). One could also include, for example, sub-optimal structure or pseudoknot predictions (41). Next, we examined the merits of R-Coffee in comparison with other methods.
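The randomized control described above (shuffle each input sequence, then fold the shuffled copy) preserves nucleotide composition while destroying the structural signal. A minimal stand-in sketch; the paper used `shuffle` from Sean Eddy's squid package, whereas `random.shuffle` here is only an illustration of the same idea:

```python
# Sketch of the shuffled-sequence control: a random permutation keeps the
# nucleotide composition but destroys any genuine structure signal.
# random.shuffle is a stand-in for the squid `shuffle` program.
import random

def shuffled_control(sequence, seed=None):
    """Return a random permutation of `sequence`."""
    rng = random.Random(seed)
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)
```

Folding such shuffled copies yields the "random structures" whose effect on accuracy is shown in Figure 3.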
It should be stressed here that our primary goal was not to produce a stand-alone method, but rather to use R-Coffee as a novel component that can seamlessly be combined with any existing RNA alignment method. We therefore focused our efforts on the evaluation of the combination between R-Coffee and other established methods. In order to determine the baseline of our analysis, we ran common sequence alignment methods on the 388 BRAliBase datasets (top part of Table 2). Our results are relatively consistent with previous reports (42,43) of accuracy on protein sequence alignments: Mafft (35), Probcons (34) and Muscle (32) deliver the best alignments. The default T-Coffee is notably inaccurate with RNA (5), most likely because it uses, by default, a banded dynamic programming heavily tuned on protein sequences. The second part of Table 2 (structural aligners) is also consistent with previous reports and confirms that RNA alignment methods making use of structural information have a higher accuracy than sequence aligners. Our results show that FoldalignM (15), Rnasampler (44), T-Lara (19) and Murlet (14) clearly outperform all the regular sequence alignment methods, with more than five points difference between the best structure-based alignment methods (FoldalignM/Rnasampler) and their best non-structure-based counterpart (Mafft ginsi).

Table 2. BRAliBase evaluations

| Method      | BRAliscore: Default | +T-Coffee | +R-Coffee | Net improvement: +T-Coffee | +R-Coffee |
|-------------|------|------|------|------|------|
| T-Coffee    | 0.59 | /    | 0.63 | /    | 125  |
| Poa         | 0.62 | 0.65 | 0.70 | 48   | 154  |
| Pcma        | 0.62 | 0.64 | 0.67 | 34   | 120  |
| Prrn        | 0.64 | 0.61 | 0.66 | −63  | 45   |
| ClustalW    | 0.65 | 0.65 | 0.69 | −7   | 83   |
| Proalign    | 0.66 | 0.68 | 0.71 | 30   | 128  |
| Mafft fftns | 0.68 | 0.68 | 0.72 | 17   | 68   |
| Probcons    | 0.69 | 0.67 | 0.71 | −74  | 51   |
| Muscle      | 0.69 | 0.69 | 0.73 | −17  | 42   |
| Mafft ginsi | 0.70 | 0.68 | 0.72 | −49  | 39   |
| M-Coffee4   | 0.71 | /    | 0.74 | /    | 84   |
| M-Locarna   | 0.66 | 0.69 | 0.71 | 101  | 133  |
| Stral       | 0.71 | 0.70 | 0.72 | −4   | 19   |
| Murlet      | 0.73 | 0.70 | 0.72 | −132 | −73  |
| Rnasampler  | 0.75 | 0.70 | 0.71 | −101 | −95  |
| FoldalignM  | 0.75 | 0.76 | 0.76 | 72   | 76   |
| Dynalign    | /    | 0.62 | 0.62 | /    | /    |
| Foldalign   | /    | 0.62 | 0.77 | /    | /    |
| T-Lara      | /    | 0.74 | 0.73 | /    | /    |
| Consan      | /    | 0.79 | 0.79 | /    | /    |

Each line in the table corresponds to the evaluation of the package listed in the Method column. The BRAliscore section indicates the average BRAliscore performance of the package. The Default column indicates the score obtained by the considered package on its own. The +T-Coffee column indicates the average BRAliscore of the corresponding package combined with T-Coffee, and the +R-Coffee column that of the same package combined with R-Coffee. The slash (/) indicates values that could not be computed, either because the method only produces pairwise alignments (Dynalign, Foldalign and Consan), or because the method is a derivative of or uses T-Coffee (e.g. T-Lara). The Net Improvement section indicates the net improvement over the stand-alone methods.

The most straightforward way to embed these methods within R/T-Coffee is to use each individual method to generate libraries of pairwise alignments. This protocol merely requires computing a pairwise alignment for each pair of sequences within a dataset and using the resulting alignments as a primary library for either T-Coffee or R-Coffee. The structural libraries were computed once on the entire dataset and then re-used. This protocol was used on all the aligners with the exception of T-Lara, for which we followed the combination protocol described by T-Lara's authors. It involves compiling partial T-Coffee libraries with Lara (i.e. libraries restricted to aligned stems) and combining them with the default T-Coffee libraries made of global and local pairwise alignments. That same protocol was used when combining Lara with R-Coffee.

We first evaluated the effect of using the regular T-Coffee to compute an MSA with pairwise libraries generated either with regular sequence or structural aligners. The results are displayed in the +T-Coffee column of Table 2. For each T-Coffee/method X combination (X being any of the tested methods), we calculated the average BRAliscore and the Net Improvement (NI), the absolute improvement induced by combining that method with T-Coffee. It is defined as the number of test cases where the T-Coffee/X combination outperforms method X alone, minus the number of test cases where method X outperforms the combination:

    NI = #(T-Coffee/X outperforms X) − #(X outperforms T-Coffee/X)    (4)

The NI provides a guide as to whether one of the methods outperforms another. Results in Table 2 are easier to interpret when the regular sequence aligners and the structural aligners are considered separately. The regular aligners show little benefit from the T-Coffee combination of their pairwise output (+T-Coffee column), probably because these methods already make an efficient use of their sequence information, or at least use it as efficiently as T-Coffee could. This is not a surprising result, since most of these methods either use a T-Coffee-inspired consistency-based scoring scheme (Mafft g/linsi, Probcons) or a sophisticated iterative method (Muscle, Prrn) to improve the original progressive MSA. R-Coffee, on the other hand, provides a clear improvement to all the regular sequence alignment methods tested here (Table 2, +R-Coffee column). This improvement holds regardless of the metric used (BRAliscore or Net Improvement). The results obtained when combining R/T-Coffee with structural aligners follow a similar, albeit less marked, pattern.
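The NI measure defined above is a simple win/loss tally over test cases; ties contribute nothing to either side. A minimal sketch:

```python
# Sketch of the Net Improvement: the number of test cases won by the
# combination minus the number won by the stand-alone method.
# Scores are per-test-case benchmark scores (e.g. BRAliscores).
def net_improvement(combo_scores, standalone_scores):
    wins = sum(c > s for c, s in zip(combo_scores, standalone_scores))
    losses = sum(s > c for c, s in zip(combo_scores, standalone_scores))
    return wins - losses
```

A positive NI means the combination wins on more test cases than it loses, independently of how large each individual score difference is.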
When added on top of the structural aligners, T-Coffee improves two methods out of five and R-Coffee improves three out of five. These observations are fairly consistent with the underlying principles of the alignment programs (sequence and structural aligners). They suggest that the potential benefits of using R-Coffee come as much from the T-Coffee consistency-based scoring scheme as from the R-extension. The relatively small benefit coming from the R-extension in this case also makes sense if one considers that the structural aligners already use structural information and are therefore less likely to benefit from the incorporation of RNAplfold predictions than their sequence-based counterparts. This is especially true when combining T-Coffee with Consan. It is worth mentioning, however, that the R-scoring scheme outperforms similar T-Coffee combinations in most cases, with five methods out of nine being improved when switching from the T-Coffee to the R-Coffee combination and four methods remaining unchanged. Altogether, the data collected in Table 2 strongly suggest that consistency-based scoring schemes provide an efficient framework for making the best out of pairwise alignment methods. T/R-Coffee/Foldalign and T/R-Coffee/Consan provide the best illustration of this concept (bottom of Table 2). Consan is computationally too expensive to be easily extended to MSAs; yet, a straightforward combination with R-Coffee results in a method that outperforms all the other methods analyzed in this work (Tables 2 and 3). Figure 4 shows a detailed performance plot on BRAliBase and compares R-Coffee/Consan with the best sequence alignment method (Mafft ginsi) and FoldalignM. This plot shows that R-Coffee/Consan performs better than FoldalignM across the full range of sequence identities, even if the difference is not statistically significant (Table 3).
It is important to point out that the shape of this curve is a side effect of the two components that make up the BRAliscore (the SCI, its structural component, and the SPS, its sequence component). High levels of sequence identity naturally result in high-scoring alignments. At the other end of the spectrum, at low identity levels, numerous compensating base-pair mutations can result in high scores, because they are taken into account by the SCI (see also Reference alignments and Evaluation). Nonetheless, and across the whole identity spectrum, our data support well the idea that R-Coffee/Consan is probably the most accurate RNA MSA method currently available for the kind of datasets found in BRAliBase (i.e. less than 150 nt).

Table 3. Net Improvement of R-Coffee/Consan and RM-Coffee4 over other programs on BRAliBase

| Method      | versus R-Coffee/Consan | versus RM-Coffee4 |
|-------------|--------|--------|
| Poa         | 241*** | 217*** |
| T-Coffee    | 241*** | 199*** |
| Prrn        | 232*** | 198*** |
| Pcma        | 218*** | 151*** |
| Proalign    | 216*** | 150**  |
| Mafft fftns | 206*** | 148*   |
| ClustalW    | 203*** | 136*** |
| Probcons    | 192*** | 128*   |
| Mafft ginsi | 170*** | 115    |
| Muscle      | 169*** | 111    |
| M-Locarna   | 234*** | 183**  |
| Stral       | 169*** | 62     |
| FoldalignM  | 146    | 61     |
| Murlet      | 130*   | −12    |
| Rnasampler  | 129*   | −27    |
| T-Lara      | 125*   | −30    |

This table indicates the relative performance of the methods listed in the Method column in comparison with the R-Coffee/Consan and RM-Coffee4 combinations, expressed as net improvement. Asterisks indicate statistically significant differences according to Wilcoxon tests (* P = 0.05; ** P = 0.01; *** P = 0.001). The upper part of the table contains sequence aligners only, the lower part structural alignment programs. Within these sections, programs are sorted by net improvement.

Table 4. Cmfinder dataset comparison

| Method       | SPS: Default | +T-Coffee | +R-Coffee | Net improvement: +T-Coffee | +R-Coffee |
|--------------|------|------|------|----|----|
| ClustalW     | 0.54 | 0.57 | 0.58 | 5  | 5  |
| Mafft ginsi  | 0.64 | 0.64 | 0.64 | −1 | 2  |
| Mafft fftns  | 0.60 | 0.64 | 0.64 | 6  | 6  |
| Muscle       | 0.32 | 0.40 | 0.42 | 4  | 8  |
| Pcma         | 0.49 | 0.55 | 0.58 | 8  | 8  |
| Poa          | 0.31 | 0.38 | 0.42 | 4  | 8  |
| Proalign     | 0.40 | 0.39 | 0.41 | −4 | −2 |
| Probcons     | 0.50 | 0.45 | 0.51 | −3 | 2  |
| Prrn         | 0.43 | 0.54 | 0.56 | 3  | 4  |
| M-Locarnap   | 0.53 | 0.63 | 0.63 | 6  | 5  |
| T-Coffee     | 0.54 | /    | 0.53 | /  | 2  |
| R/M-Coffee4  | /    | 0.63 | 0.65 | /  | 0  |

Each line in the table corresponds to the evaluation of the package listed in the Method column. The SPS section indicates the averaged sum-of-pairs scores (applied to the Rfam core alignment) measured on the considered package; the +T-Coffee column is the same score measured on the package combined with T-Coffee, and the +R-Coffee column corresponds to that same package combined with R-Coffee. The slash (/) indicates values that could not be computed because the method is a derivative of T-Coffee (T-Coffee and M-Coffee). The Net Improvement section indicates the net improvement for the corresponding combinations.

Figure 4. Comparison of R-Coffee/Consan and RM-Coffee with other programs. The plot shows the alignment accuracy on BRAliBase 2.0 as a function of the sequence identity. Scores are averaged over 5% sequence identity bins. We included the best stand-alone sequence aligner (MAFFT ginsi), one of the two best structural aligners (FoldalignM), the best R-Coffee combination (R-Coffee/Consan) and RM-Coffee4, which combines the pairwise alignments of Probcons, MAFFT ginsi/fftns and Muscle by means of R-Coffee.

We next assessed whether R-Coffee is also useful for aligning long sequences. We analyzed the Cmfinder dataset, made of Rfam alignments embedded within surrounding genomic sequences of varying lengths. None of the structural aligners except M-Locarna (17) was able to run on all the 11 datasets, and the analysis was therefore restricted to regular sequence aligners (Table 4). With the notable exception of Muscle (32), the ranking in this table is not dramatically different from that in Table 2. The behavior of these methods when combined with T- or R-Coffee is also similar.
When considering the 10 sequence aligners combined with T-Coffee, we observed an improvement for 7 methods out of 10. This figure rises to 9 out of 10 for the combination with R-Coffee. Although these results are based on too small a dataset (11 alignments) to be considered statistically significant, they are in very good agreement with those reported on BRAliBase in Table 2 and confirm R-Coffee's ability to improve over most sequence alignment methods. The main practical problem with using R-Coffee is that, to reach its highest level of accuracy, it requires the installation of RNA alignment packages, which may be extremely greedy with memory and CPU usage. We therefore checked whether a simpler alternative could be better suited for more modest computational configurations, or for high-throughput applications. In a previous paper, Wallace et al. reported and characterized a novel mode of T-Coffee named M-Coffee (22). M-Coffee is a meta-aligner that combines alternative multiple sequence alignment methods into one consensus alignment. This combination usually results in an improvement over the constituent methods. We used the M-Coffee approach to combine the four best regular (i.e. non-structure-based) alignment methods and tested them on BRAliBase. Following the strategy outlined in the original M-Coffee paper, we incorporated the sequence aligners in order of decreasing performance and kept the combination with the highest average score. This protocol resulted in RM-Coffee4, a combination of Muscle, Probcons, Mafft ginsi and Mafft fftns fed to T-Coffee (M-Coffee4) or R-Coffee (RM-Coffee4). The results (Tables 2 and 3, Figure 4) are unambiguous and indicate that RM-Coffee4 clearly outperforms all the sequence alignment methods while delivering the best BRAliBase alignments one may obtain without using a structural aligner.
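The greedy selection protocol described above (rank the aligners, then keep the best-scoring prefix combination) can be sketched as follows. `evaluate_combo` stands in for the expensive step of running the meta-aligner on the merged libraries and scoring the result on the benchmark; it, and all names here, are assumptions of this sketch rather than the M-Coffee implementation.

```python
# Sketch of the M-Coffee-style selection used to build RM-Coffee4:
# add aligners in order of decreasing stand-alone score and keep the
# prefix combination whose benchmark average is highest.
def best_combination(aligners, standalone_score, evaluate_combo):
    ranked = sorted(aligners, key=standalone_score, reverse=True)
    best_combo, best_score = None, float("-inf")
    for k in range(1, len(ranked) + 1):
        candidate = ranked[:k]           # best k aligners so far
        score = evaluate_combo(candidate)
        if score > best_score:
            best_combo, best_score = candidate, score
    return best_combo, best_score
```

This explores only prefix combinations of the ranked list (n candidates instead of 2^n subsets), which is the trade-off that keeps the protocol cheap.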
These results were not confirmed on the 11 Cmfinder datasets (Table 4), either because this dataset is too small to reveal the trend or because of the negative effect of Muscle on RM-Coffee4 on this specific dataset.

CONCLUSION

We have presented a modified version of the T-Coffee (21) multiple sequence alignment method, named R-Coffee, designed for delivering highly accurate multiple ncRNA alignments. R-Coffee is a heuristic able to take advantage of secondary structure predictions carried out beforehand. It is best described as an alignment improver, and we show in this work that it can effectively improve all sequence alignment packages, taken off the shelf and without tuning. Among all the combinations tested here, one clearly outperformed the alternatives: the combination of R-Coffee and Consan (10). Most of these tests were carried out on the BRAliBase reference datasets (5). We also checked whether R-Coffee was able to deal with datasets of longer sequences, combining a mixture of related and unrelated segments. For that purpose, we used a dataset designed for the Cmfinder algorithm (26). We found that the R-Coffee combination improved, to a greater or lesser extent, all the tested alignment methods. The combined observations made on the BRAliBase and Cmfinder datasets suggest that the R-Coffee scoring scheme is able to make effective use of predicted RNA secondary structures in order to improve accuracy over most regular sequence aligners. This strategy also works when applied to structural aligners, although less dramatically than with regular sequence aligners. These results confirm the strength of consistency-based scoring schemes over regular alignment methods. They suggest that most pairwise alignment methods can usefully be incorporated in a consistency-based framework such as T-Coffee.
Our results also indicate that the meta-method approach originally described for M-Coffee (22) can be applied to R-Coffee, and that whenever the computation of highly accurate structure-based RNA pairwise alignments is not feasible, one may obtain alignments of reasonable quality by combining purely sequence-based alignments via R-Coffee. Further progress will also require the assembly of more demanding reference datasets, especially for long sequences. Such datasets are hard to assemble because RNA structural information is scarce compared with protein structural information. RNA alignment remains a rapidly developing field. With an increasing number of novel biological functions associated with still poorly characterized RNA genes, there is an ever-growing need for methods allowing the accurate comparison of RNA sequences and the identification of distant homologues. Any improvement in alignment accuracy is likely to have a large impact. In this context, R-Coffee can easily be improved further. The flexible way in which secondary structures are fed to the program allows a seamless combination of data from heterogeneous sources. It is important to point out that not all the possibilities supported by the current software implementation have yet been explored. Most notably, we have not yet fully exploited the possibility of associating more than one predicted structure with each sequence. These alternative structures could either be suboptimal structures, or the output of alternative structure prediction programs, such as ContraFold or Rfold. One could also combine structure predictions of any kind, including local, global or even tertiary interactions like pseudoknots, with experimentally verified structures. The possibility of combining data from various sources is, perhaps, the major strength of R-Coffee.

ACKNOWLEDGEMENTS

We thank Iain M. Wallace for useful discussions and all authors for their assistance with using their programs.
This work was partly supported by funding from the Science Foundation Ireland. C.N. thanks the Centre for Genomic Regulation for support and funding. Funding to pay the Open Access publication charges for this article was provided by the Centre de Regulació Genòmica (CRG). Conflict of interest statement. None declared.

REFERENCES

1. Zamore,P.D. and Haley,B. (2005) Ribo-gnome: The Big World of Small RNAs. Science, 309, 1519–1524.
2. Costa,F.F. (2007) Non-coding RNAs: lost in translation? Gene, 386, 1–10.
3. Birney,E., Stamatoyannopoulos,J.A., Dutta,A., Guigo,R., Gingeras,T.R., Margulies,E.H., Weng,Z., Snyder,M., Dermitzakis,E.T., Thurman,R.E. et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447, 799–816.
4. Levenshtein,V.I. (1966) Binary codes capable of correcting deletions, insertions, and reversals. Cybern. Control Theory, 10, 707–710.
5. Gardner,P.P., Wilm,A. and Washietl,S. (2005) A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res., 33, 2433–2439.
6. Wilm,A., Mainz,I. and Steger,G. (2006) An enhanced RNA alignment benchmark for sequence alignment programs. Algorithms Mol. Biol., 1, [Epub ahead of print].
7. Thompson,J.D., Plewniak,F. and Poch,O. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27, 2682–2690.
8. van Nimwegen,E., Crutchfield,J.P. and Huynen,M. (1999) Neutral evolution of mutational robustness. Proc. Natl Acad. Sci. USA, 96, 9716–9720.
9. Sankoff,D. (1985) Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 45, 810–825.
10. Dowell,R. and Eddy,S. (2006) Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7, 400.
11. Hofacker,I.L., Bernhart,S.H.F. and Stadler,P.F. (2004) Alignment of RNA base pairing probability matrices. Bioinformatics, 20, 2222–2227.
12. Mathews,D.H.
(2005) Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics, 21, 2246–2253.
information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3, 2.
32. Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
33. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
34. Do,C.B., Mahabhashyam,M.S.P., Brudno,M. and Batzoglou,S. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340.
35. Katoh,K., Kuma,K.-i., Toh,H. and Miyata,T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518.
36. Doshi,K., Cannone,J., Cobaugh,C. and Gutell,R. (2004) Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics, 5, 105.
37. Dowell,R. and Eddy,S.R. (2004) Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5, 71.
38. Bernhart,S.H., Hofacker,I.L. and Stadler,P.F. (2006) Local RNA base pairing probabilities in large sequences. Bioinformatics, 22, 614–615.
39. Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.
40. Do,C.B., Woods,D.A. and Batzoglou,S. (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22, e90–e98.
41. Reeder,J., Hochsmann,M., Rehmsmeier,M., Voss,B. and Giegerich,R. (2006) Beyond Mfold: recent advances in RNA bioinformatics. J. Biotechnol., 124, 41–55.
42.
Carroll,H., Beckstead,W., O’Connor,T., Ebbert,M., Clement,M., Snell,Q. and McClellan,D. (2007) DNA reference alignment benchmarks based on tertiary structure of encoded proteins. Bioinformatics, 23, 2648–2649. 43. Edgar,R.C. and Batzoglou,S. (2006) Multiple sequence alignment. Curr. Opin. Struct. Biol., 16, 368–373. 44. Xu,X., Ji,Y. and Stormo,G.D. (2007) RNA sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics, 23, 1883–1891. 45. Pei,J., Sadreyev,R. and Grishin,N.V. (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 19, 427–428. 46. Lee,C., Grasso,C. and Sharlow,M.F. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics, 18, 452–464. 47. Loytynoja,A. and Milinkovitch,M.C. (2003) A hidden Markov ¨ model for progressive multiple alignment. Bioinformatics, 19, 1505–1513. 48. Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol, 264, 823–838. 49. Dalli,D., Wilm,A., Mainz,I. and Steger,G. (2006) STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics, 22, 1593–1599. 13. Havgaard,J.H., Lyngso,R.B., Stormo,G.D. and Gorodkin,J. (2005) Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics, 21, 1815–1824. 14. Kiryu,H., Tabei,Y., Kin,T. and Asai,K. (2007) Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics, 23, 1588–1598. 15. Torarinsson,E., Havgaard,J.H. and Gorodkin,J. (2007) Multiple structural alignment and clustering of RNA sequences. Bioinformatics, 23, 926–932. 16. Holmes,I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics, 6, 73. 17. Will,S., Reiche,K., Hofacker,I.L., Stadler,P.F. and Backofen,R. 
(2007) Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol., 3, e65. 18. Meyer,I.M. and Miklos,I. (2007) SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework. PLoS Comput. Biol., 3, e149. 19. Bauer,M., Klau,G.W. and Reinert,K. (2007) Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics, 8, 271. 20. Wuyts,J., Perriere,G. and Van de Peer,Y. (2004) The European ribosomal RNA database. Nucleic Acids Res., 32, D101–D103. 21. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. 22. Wallace,I.M., O’Sullivan,O., Higgins,D.G. and Notredame,C. (2006) M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res., 34, 1692–1699. 23. O’Sullivan,O., Suhre,K., Abergel,C., Higgins,D.G. and Notredame,C. (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol., 340, 385–395. 24. Armougom,F., Moretti,S., Poirot,O., Audic,S., Dumas,P., Schaeli,B., Keduas,V. and Notredame,C. (2006) Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee. Nucleic Acids Res., 34, W604–W608. 25. Siebert,S. and Backofen,R. (2005) MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics, 21, 3352–3359. 26. Yao,Z., Weinberg,Z. and Ruzzo,W.L. (2006) CMfinder-a covariance model based RNA motif finding algorithm. Bioinformatics, 22, 445–452. 27. Griffiths-Jones,S., Moxon,S., Marshall,M., Khanna,A., Eddy,S.R. and Bateman,A. (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res., 33, D121–D124. 28. Washietl,S., Hofacker,I.L. and Stadler,P.F. 
© 2001 Oxford University Press, Nucleic Acids Research, 2001, Vol. 29, No. 1, 55–57

A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3

Sabine Dietmann, Jong Park, Cédric Notredame1, Andreas Heger, Michael Lappe and Liisa Holm*

Structural Genomics Group, EMBL-EBI, Cambridge CB10 1SD, UK and 1Structural and Genetic Information, CNRS UMR 1889, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France

Received October 6, 2000; Accepted October 17, 2000

ABSTRACT
The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank (PDB). The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities. Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to: (i) supersecondary structural motifs (attractors in fold space), (ii) the topology of globular domains (fold types), (iii) remote homologues (functional families) and (iv) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new.
In September 2000, the Dali classification contained 10 531 PDB entries comprising 17 101 chains, which were partitioned into five attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99 582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures several-fold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and a comprehensive library of explicit multiple alignments of distantly related protein families.

INTRODUCTION
Improved methods of protein engineering, crystallography and NMR spectroscopy have led to a surge of new protein structures deposited in the Protein Data Bank (PDB), and a number of derived databases that organise this data into hierarchical classification schemes or in terms of structural neighbourhoods have appeared on the World Wide Web (1–4). We maintain the Dali Domain Dictionary and FSSP database with continuous weekly updates. Because many structural similarities are between substructures (domains), i.e. parts of structures, protein chains are decomposed into domains using the criteria of recurrence and compactness (5). Each domain is assigned a Domain Classification number D.C.l.m.n.p representing fold space attractor region (l), globular folding topology (m), functional family (n) and sequence family (p). The discrete classification presents views that are free of redundancy and simplify navigation in protein space. The structural classification is explicitly linked to sequence families with associated functional annotation, resulting in a rich network of biologically interesting relationships that can be browsed online. In particular, structure-based alignments increase our understanding of the more distant evolutionary relationships (Fig. 1).
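The four-level D.C.l.m.n.p numbering described above maps naturally onto a small record type. As a minimal sketch (the class, parser and example identifier below are ours, not part of the Dali software), one could represent and compare classification numbers like this:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainClassification:
    """Four-level Dali Domain Classification number D.C.l.m.n.p."""
    attractor: int          # l: fold-space attractor region
    fold_type: int          # m: globular folding topology
    functional_family: int  # n: remote homologues (functional family)
    sequence_family: int    # p: homologues with >25% sequence identity

def parse_dc(dc: str) -> DomainClassification:
    """Parse a string such as 'D.C.1.2.3.4' (hypothetical example)."""
    prefix, c, l, m, n, p = dc.split(".")
    assert (prefix, c) == ("D", "C"), "expected a D.C.l.m.n.p identifier"
    return DomainClassification(int(l), int(m), int(n), int(p))

# Two domains share a fold type exactly when levels l and m agree:
same_fold = lambda a, b: (a.attractor, a.fold_type) == (b.attractor, b.fold_type)
```

This makes the nesting of the hierarchy explicit: agreement on a prefix of the fields corresponds to membership in the same node at the matching level of the classification.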
A MAP OF FOLD SPACE
The central concept underlying the classification is a 'map of fold space'. This map is based on exhaustive neighbouring of all protein structures in the PDB. The all-against-all structure comparison is carried out using the Dali program. As a result of the exhaustive comparisons, each structure in the PDB is positioned in an abstract, high-dimensional space according to its structural similarity score to all other structures. The graph of structural similarities (between domains) is partitioned into clusters at four different levels of granularity. Coarse-grained overviews yield few clusters with many members that share broad architectural similarities, while fine-grained clustering yields many clusters within which structural similarities between members can extend to atomic detail due to functional constraints, for example, in binding sites. Continuing the practice from the FSSP database, fold types are defined by agglomerative clustering so that the members of a fold type have average pairwise Z-scores above 2. The threshold has been chosen empirically to group together structures with topological similarity. Dali Domain Dictionary version 3 introduces two new levels to the fold classification, one above and one below the fold type abstraction. The top level of the fold classification corresponds to secondary structure composition and supersecondary structural motifs. We have previously identified five attractor regions in fold space (1). We partition fold space so that each domain is assigned to one of attractors I–V, which are represented by archetype structures, using a shortest-path criterion. Structures which are disconnected from other structures are assigned to class X. Domains which are not clearly closer to one attractor than another are assigned to the mixed class Y. Currently, class Y comprises about one-sixth of the representative domain set. In the future, some of these may be assigned to emerging new attractors.

Figure 1. Unification of the histone deacetylase and arginase families. Reuse and adaptation of existing structural frameworks for new cellular functions is widespread in protein evolution. Histone deacetylase and arginase are unified at the functional family level of the classification despite very little overall sequence similarity. The supporting evidence comes from structural and functional similarity. (A) Structure comparison of arginase (left, 1rlaA) (10) and histone deacetylase (right, 1c3pA) (11) yields a high Z-score of 12. Superimposition by Dali, drawing by Molscript (12). (B) Joint structural, evolutionary and functional information for two segments around the active site. Structurally aligned positions are shaded. Arginase has a binuclear metal centre where residues D124, H126 and D234 bind one manganese ion and residues H101, H128 and H232 the other. The former site is structurally equivalent to the zinc binding site of histone deacetylase made up of residues D168, H170 and D258. Sequence variability from multiply-aligned sequence neighbours in HSSP (asterisk, values 10 or larger; 0, invariant) is shown above and the secondary structure summary from DSSP (E, B: beta-sheet; S: bend; T, G: hydrogen-bonded turns) is shown below the amino acid sequences.

AN EVOLUTIONARY CLASSIFICATION
The other new level of the classification infers plausible evolutionary relationships from strong structural similarities that are accompanied by functional or sequence similarities. Conceptually, this functional family level is equivalent to the 'superfamily' level of scop (2). The computational discrimination between physically convergent (analogous) and evolutionarily related, divergent (homologous) proteins has received much attention recently (6–8). Structural similarity alone is insufficient to draw a line between the two classes. For example, lysozymes exhibit extreme structural divergence in regions supporting the active site, while coiled coils and beta-barrels are simple, geometrically constrained topologies which are believed to have emerged several times in protein evolution. To address the evolutionary classification problem, we have chosen to analyse functional and sequence-motif attributes on top of structural similarity in a numerical taxonomy. The more functional features two proteins have in common, the more likely it is that they do so due to common descent rather than by chance. Currently, our feature set includes common sequence neighbours (overlap of PSI-BLAST families), analysis of 3D clusters of identically conserved residues, enzyme classification (E.C. numbers) and keyword analysis of biological function. A neural network assigns weights to these qualitatively different features. The neural network was trained against the superfamily to fold transition in a manual fold classification (2). To unify families, we exploit the empirical observation that Dali's intramolecular distance comparison measure gives higher scores to pairs of homologues than to analogues. In practice, we require that functional families are nested within fold families in the fold dendrogram: functional families are branches of the fold dendrogram where all pairs have a high average neural network prediction for being homologous. The threshold for unification was chosen empirically and is conservative. Five hundred and four functional families unify two or more sequence families. Unified families have functional residues or sequence motifs that map to common sites in the 3D context of a fold. The strongest evidence is usually obtained for unifying enzyme catalytic domains. In some cases the expert system fails to capture enough evidence for unification of domains which are believed to be homologous, such as within the varied set of helix–turn–helix motif containing DNA binding domains where several functional families are defined at the same fold type level.

A LIBRARY OF STRUCTURE-BASED MULTIPLE ALIGNMENTS OF REMOTE HOMOLOGUES
The Dali Domain Classification can be browsed interactively at http://www2.ebi.ac.uk/dali. The server is implemented on top of a MySQL database. The classification may be entered from the top of the hierarchy, or the user may make a query about a protein identifier or a node in the classification hierarchy. Multiple structural alignments including attributes of the proteins are generated on the fly for any user selection of structural neighbours. Precomputed alignments are available for each functional family. The T-Coffee program (9) is used to generate genuine consensus alignments of multiple structures from the library of pairwise Dali alignments. A reliability score is computed to indicate well defined regions (the structural core) and regions where structural equivalences are ambiguous. Technically, T-Coffee improves alignment quality in a few known cases of functional families where active site residues were inconsistently aligned in some of the pairwise Dali comparisons. Scientifically, the definition of functional families and reliable multiple structure alignments for each opens the door to sensitive sequence database searches using position-specific profiles, and to benchmarking the alignment accuracy of threading predictions.

*To whom correspondence should be addressed. Tel: +44 1223 494454; Fax: +44 1223 494470; Email: holm@ebi.ac.uk

ACKNOWLEDGEMENT
S.D. and J.P. were supported by EU contract BIO4-CT96-0166.

REFERENCES
1. Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–603. 2. Hubbard,T.J., Ailey,B., Brenner,S.E., Murzin,A.G. and Chothia,C.
(1999) SCOP: a Structural Classification of Proteins database. Nucleic Acids Res., 27, 254–256. 3. Orengo,C.A., Pearl,F.M., Bray,J.E., Todd,A.E., Martin,A.C., Lo Conte,L. and Thornton,J.M. (1999) The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res., 27, 275–279. Updated article in this issue: Nucleic Acids Res. (2001), 29, 223–227. 4. Marchler-Bauer,A., Addess,K.J., Chappey,C., Geer,L., Madej,T., Matsuo,Y., Wang,Y. and Bryant,S.H. (1999) MMDB: Entrez’s 3D structure database. Nucleic Acids Res., 27, 240–243. 5. Holm,L. and Sander,C. (1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96. 6. Russell,R.B., Saqi,M.A., Bates,P.A., Sayle,R.A. and Sternberg,M.J. (1998) Recognition of analogous and homologous protein folds–assessment of prediction success and associated alignment accuracy using empirical substitution matrices. Protein Eng., 11, 1–9. 7. Kawabata,T. and Nishikawa,K. (2000) Protein structure comparison using the Markov transition model of evolution. Proteins, 41, 108–122. 8. Wood,T.C. and Pearson,W.R. (1999) Evolution of protein sequences and structures. J. Mol. Biol., 291, 977–995. 9. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. 10. Bewley,M.C., Jeffrey,P.D., Patchett,M.L., Kanyo,Z.F. and Baker,E.N. (1999) Crystal structures of Bacillus caldevelox arginase in complex with substrate and inhibitors reveal new insights into activation, inhibition and catalysis in the arginase superfamily. Structure, 7, 435–438. 11. Finnin,M.S., Donigian,J.R., Cohen,A., Richon,V.M., Rifkind,R.A., Marks,P.A., Breslow,R. and Pavletich,N.P. (1999) Structure of a histone deacetylase homologue bound to the TSA and SAHA inhibitors. Nature, 401, 188–193. 12. Kraulis,P. (1991) MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. Appl. Crystallogr., 24, 946–950. 
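The fold-type definition used in the Dali classification above (agglomerative clustering of the structure-similarity graph, merging only while the average pairwise Z-score of a merge stays above 2) can be sketched as average-linkage clustering. The function and the toy scores below are our own illustration, not the Dali implementation:

```python
import itertools

def fold_types(z, names, threshold=2.0):
    """Average-linkage agglomerative clustering of pairwise Z-scores.

    z: dict mapping frozenset({a, b}) -> Dali-style Z-score (toy values here).
    Merging stops when no pair of clusters has an average
    inter-cluster Z-score above the threshold.
    """
    clusters = [{n} for n in names]

    def avg_z(c1, c2):
        pairs = [z.get(frozenset({a, b}), 0.0) for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the best pair of clusters to merge next.
        (i, j), best = max(
            ((pair, avg_z(clusters[pair[0]], clusters[pair[1]]))
             for pair in itertools.combinations(range(len(clusters)), 2)),
            key=lambda t: t[1])
        if best <= threshold:
            break  # no merge would keep the average Z above the threshold
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

On a toy score table where domains A and B superimpose well (Z = 8) and C matches neither (Z = 0.5), this groups A with B and leaves C on its own, mirroring how fold types keep topologically similar structures together.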
W604–W608 Nucleic Acids Research, 2006, Vol. 34, Web Server issue doi:10.1093/nar/gkl092

Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee

Fabrice Armougom, Sébastien Moretti, Olivier Poirot, Stéphane Audic, Pierre Dumas1, Basile Schaeli1, Vladimir Keduas and Cédric Notredame*

Laboratoire Information Génomique et Structurale, CNRS UPR2589, Institute for Structural Biology and Microbiology (IBSM), Parc Scientifique de Luminy, 163 Avenue de Luminy, FR-13288 Marseille cedex 09, France and 1Laboratoire de systèmes périphériques, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland

Received February 14, 2006; Revised March 1, 2006; Accepted March 7, 2006

ABSTRACT
Expresso is a multiple sequence alignment server that aligns sequences using structural information. The user only needs to provide sequences. The server runs BLAST to identify close homologues of the sequences within the PDB database. These PDB structures are used as templates to guide the alignment of the original sequences using structure-based sequence alignment methods like SAP or Fugue. The final result is a multiple sequence alignment of the original sequences based on the structural information of the templates. An advanced mode makes it possible to either upload private structures or specify which PDB templates should be used to model each sequence. Providing the suitable structural information is available, Expresso delivers sequence alignments with accuracy comparable with structure-based alignments. The server is available on http://www.tcoffee.org/.

INTRODUCTION
Over the past years, multiple sequence alignments (MSAs) have become one of the most widely used tools in biology along with database search methods. MSAs are needed for profile analysis, phylogenetic reconstruction, structure prediction and a wealth of minor but important applications such as PCR primer design or sequence reconciliation.
The ever-growing reliance on MSAs is even more pronounced now that hundreds of complete genomes are being made available. This newly opened window on evolution provides an ideal context for MSAs to fulfill their potential as key tools in functional genomics. Unfortunately, MSA packages are not yet accurate enough to deliver on all their promises, and the sharp increase in the number of methods recently published (25 novel programs over the last 5 years) well illustrates the community's expectation for improvement. MSAs are not always good enough for large-scale analysis: while immense progress has been made in accurately aligning sets of sequences with >40% average identity, recent benchmarks published with the MAFFT 5 package (1) reveal that state-of-the-art methods still fail to reliably align distantly related sequences. In the so-called 'Twilight zone' (2), sequences with <20% identity cannot be aligned with >30% average accuracy (as judged by comparison with reference alignments). So far, the most convincing solution to this problem has been to supplement sequences with structural information (3). The reason why structure-based MSAs are more accurate is not so much a consequence of better algorithms as an effect of the evolutionary stability of structures. Structures evolve more slowly than sequences (4), and even when sequences have diverged beyond recognition it is often possible to establish homology (i.e. common ancestry) on the basis of 3D fold comparisons (3). The increasing availability of structural data (5) means that relying on structure-based methods for sequence analysis has become much more realistic than it used to be. However, sequences are still being determined much faster than structures, thus creating a context where methods able to efficiently combine sequences and structures into accurate MSAs are needed.
To the best of our knowledge, only six algorithms have been designed that are able to make use of secondary (6,7) or tertiary (8–10) structure information. In the context of this work, we used 3D-Coffee (11) for its ability to combine the output of several methods into one unique model. 3D-Coffee is based on the T-Coffee algorithm (12), a heuristic method that uses a progressive algorithm to compute an MSA having a high consistency with a collection of pre-computed pairwise alignments (the library).

*To whom correspondence should be addressed. Tel: +33 491 825 427; Fax: +33 491 825 420; Email: cedric.notredame@gmail.com

© The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

In 3D-Coffee, the principle is the same except that the library's pairwise alignments are derived from structural superpositions using methods like Sap (13), Lsqman (14) or possibly any alternative structure alignment package [for a review see (15)]. When using combinations of structures and sequences, 3DCoffee can also incorporate structure-sequence (threading) alignment methods like Fugue (16) to ease the diffusion of structural information onto the sequences. 3D-Coffee has been available via the web server 3DCoffee@igs for >2 years (17).
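The library and consistency ideas behind T-Coffee and 3D-Coffee can be caricatured in a few lines. In this simplified sketch (our own, not the T-Coffee code), the library records a weight for every residue pair proposed by a pairwise method, whether sequence- or structure-based, and the consistency score of a candidate pair also credits support routed through a third sequence:

```python
from collections import defaultdict

# Library: frozenset-free canonical key ((seq, pos), (seq, pos)) -> weight.
# In 3D-Coffee such entries come from sequence aligners (e.g. lalign) and
# structure superposition methods (e.g. SAP); here they are made-up toy values.
library = defaultdict(float)

def add_pair(a, i, b, j, weight):
    key = tuple(sorted(((a, i), (b, j))))
    library[key] += weight  # methods agreeing on a pair reinforce it

def direct(a, i, b, j):
    return library.get(tuple(sorted(((a, i), (b, j)))), 0.0)

def extended(a, i, b, j, sequences, lengths):
    """Consistency score: direct weight plus support through third sequences."""
    score = direct(a, i, b, j)
    for c in sequences:
        if c in (a, b):
            continue
        for k in range(lengths[c]):
            # Route a->c->b: both links must exist for the triplet to count.
            score += min(direct(a, i, c, k), direct(c, k, b, j))
    return score
```

In T-Coffee proper it is this library extension that lets a structural superposition between two sequences influence where a third, structure-less sequence is placed, which is how structural information diffuses through the MSA.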
The original implementation made it possible to combine sequences and structures using the most advanced T-Coffee options through a simple web interface. Although it provides access to most of the T-Coffee inline functions, this server requires the user to explicitly specify which structural template is to be associated with each sequence. This specification, made through a cumbersome procedure of sequence renaming, was complicated and impractical for non-specialists. The novel version of 3D-Coffee@igs is named Expresso because it makes it possible for non-specialists to rapidly and automatically benefit from the strength of 3D-Coffee. The term Expresso also conveys the notion of aroma extraction and concentration, a notion that resonates with the way structures are 'expressed' within the MSA. In Expresso, we implemented an automated identification of suitable structural templates via a BLAST search against the PDB database. 3D-Coffee uses the selected structures to assemble a genuine structure-based MSA during a process that merely looks like a standard sequence alignment procedure from the user's point of view. Provided the appropriate structural information is available, Expresso is significantly more accurate than regular homology-based methods and its alignments are often indistinguishable from reference structure-based alignments (11).

METHODS
Selection of the Structural Template
The core idea of Expresso is to reliably identify structures that can be used as templates for the sequences (source) one wishes to align. The rationale is that any alignment carried out on the templates can easily be transposed onto the source sequences as long as the source and the template are highly homologous. The most basic and important step in Expresso is a BLAST search of the source sequences against PDB, in order to identify suitable templates.
A BLAST match is considered a suitable template if it displays a minimum of 60% sequence identity with the source sequence and a minimum coverage of 70% (i.e. 70% of the source sequence residues matched). These rather conservative criteria were chosen to limit the template selection to close homologues whose alignment with the source is entirely non-ambiguous. No effort is made to identify structures with special conformations or resolutions, although this could easily be added to the pipeline. However, whenever the automatic procedure appears inappropriate, the user can explicitly declare the source-template association using the advanced mode of the server.

Integration of the Structural Template
Once every sequence with a structural homologue has been assigned its template, 3DCoffee undertakes the library computation step. It applies a collection of pre-defined pairwise alignment methods to every pair of sequences. The methods are either sequence-based (e.g. lalign) or structure-based (e.g. SAP). When using structural methods, a structure-based alignment of the templates is first computed. The two source sequences are then aligned to their respective templates, and the induced pairwise alignment of the two sources is integrated within the library (Figure 1). The accuracy of this delicate process relies on a high level of identity between the source and the template sequence, hence the stringency of the original BLAST search.

Alignment computation
Once the library assembly step is finished, the MSA is assembled in a progressive fashion, using the standard T-Coffee algorithm. The default mode of the server for running T-Coffee is t_coffee -in Mslow_pair,Msap_pair,Mlalign_id_pair -template_file SCRIPT_blast.pl, where SCRIPT_blast.pl is a stand-alone script that BLASTs every source sequence against PDB in order to identify suitable structural templates (if they exist).
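The template filter described above (at least 60% identity and 70% coverage of the source sequence) is easy to reproduce over any parsed BLAST output. The record type and function names below are hypothetical; this sketches the selection criteria, not the server's actual SCRIPT_blast.pl:

```python
from typing import List, NamedTuple, Optional

class Hit(NamedTuple):
    pdb_id: str       # candidate template chain, e.g. "1abcA" (made-up id)
    identity: float   # percent identity of the BLAST match
    aligned: int      # number of source residues matched by the hit

def select_template(hits: List[Hit], source_len: int,
                    min_identity: float = 60.0,
                    min_coverage: float = 0.70) -> Optional[str]:
    """Return the best PDB template passing Expresso-style thresholds, or None."""
    suitable = [h for h in hits
                if h.identity >= min_identity
                and h.aligned / source_len >= min_coverage]
    if not suitable:
        return None  # the sequence will be aligned from sequence alone
    # Prefer the most identical, then the best-covering hit.
    return max(suitable, key=lambda h: (h.identity, h.aligned)).pdb_id
```

A hit with 95% identity covering 90 of 100 residues passes, while a 55% identity hit is rejected regardless of coverage, which is exactly the conservative behaviour the server aims for.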
USING EXPRESSO
Default mode
The server can be accessed at http://www.tcoffee.org/, by clicking on the Expresso link, either advanced or regular. To use the regular mode, one simply needs to cut and paste FASTA sequences. No special precaution is needed to name the sequences.

Advanced mode
The advanced mode of the server offers many more possibilities and guides the user with a series of bulleted points:
- Cut and paste your sequences.
- Upload your PDB structures. Should be used when some of the structures are not in the public domain. When uploading a PDB template, the associated source sequence is automatically generated using the SEQRES field. PDB files must follow the standard PDB format and the server requires a TITLE, a HEADER, an ATOM and a SEQRES section.
- Select the methods. The default selection corresponds to 3DCoffee. Further structure alignment methods will soon be added, along with new multiple sequence alignment packages. Users are welcome to suggest the incorporation of any public domain method.
- PDB template selection. By default no template is used in the advanced mode. Users should check the SCRIPT box to automatically fetch the templates with BLAST, or specify the source to template correspondences in the box below. The format for doing so is indicated in the corresponding section.

Figure 1. Computation of a template-based library. Structural templates are assigned to each original source sequence and these templates are used to generate a structure-based sequence alignment. The final library alignment is generated by aligning each source sequence with its template, thus generating a template-based alignment of the two sources.

Figure 2 shows a typical output, computed on the HOMSTRAD thioredoxin family (18). The first alignment (Figure 2a) was computed using the standard T-Coffee protocol, while the other (Figure 2b) is an Expresso MSA computed using the regular mode.
In the T-Coffee alignment, 15% of the columns are correctly aligned (as judged by comparison with the HOMSTRAD reference alignment), while in the Expresso MSA, 49% of the columns appear to be correct. Figure 2c shows which template was selected for each sequence. When selecting the template, no attempt is made to match the source sequence name with the template name, which sometimes results in apparent discrepancies (1aaza modelled with 1de2A). While in most cases these arbitrary choices should not affect the output, better control can be achieved by specifying the template/sequence correspondence in the advanced mode.

Figure 2. Computation of an Expresso alignment. (a) Default T-Coffee alignment of the thioredoxin HOMSTRAD dataset. Red portions have a high reliability and are expected to be more accurate than the rest. Blue and green portions are the least consistent. Consistency is estimated from a sequence-based T-Coffee library. In this MSA, 15% of the columns are similar to the reference HOMSTRAD MSA. (b) Expresso alignment. Consistency is now estimated from a library computed using template-based alignments. In this alignment, 49% of the columns are similar to the HOMSTRAD reference MSA. (c) Automatic template assignment.

CONCLUSION AND FUTURE DEVELOPMENTS
Expresso is an improved version of the original 3DCoffee@igs server. Structures are now fetched automatically and used to guide the alignment. This procedure can result in a dramatic improvement of the sequence alignment when homologous PDB structures are available. From the user's point of view, Expresso is a regular multiple sequence alignment server that seamlessly includes structural information in MSAs, allowing non-specialists to benefit from the power of structure-based sequence alignment without having to address all the technical issues it implies.
Future developments will involve a gradual extension of the methods available for combination in the advanced section. We strongly encourage users to send us their feedback.

ACKNOWLEDGEMENTS
We thank Prof. Jean-Michel Claverie (head of IGS) for stimulating scientific discussions and material support. We also thank Prof. Roger Hersch (EPFL) for useful advice on code optimization. The development of the server was supported by CNRS (Centre National de la Recherche Scientifique), Sanofi-Aventis Pharma SA, Marseille–Nice Génopole and the French National Genomic Network (RNG). Funding to pay the Open Access publication charges for this article was provided by CNRS. Conflict of interest statement. None declared.

REFERENCES
1. Katoh,K., Kuma,K., Toh,H. and Miyata,T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518. 2. Sander,C. and Schneider,R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68. 3. Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–603. 4. Lesk,A.M. and Chothia,C. (1980) How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol., 136, 225–270. 5. Kouranov,A., Xie,L., de la Cruz,J., Chen,L., Westbrook,J., Bourne,P.E. and Berman,H.M. (2006) The RCSB PDB information portal for structural genomics. Nucleic Acids Res., 34, D302–D305. 6. Heringa,J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341–364. 7. Simossis,V.A. and Heringa,J. (2005) PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res., 33, W289–W294. 8. Zhang,Z., Lindstam,M., Unge,J., Peterson,C. and Lu,G.
(2003) Potential for dramatic improvement in sequence alignment against structures of remote homologous proteins by extracting structural information from multiple structure alignment. J. Mol. Biol., 332, 127–142. 9. Ren,T., Veeramalai,M., Tan,A.C. and Gilbert,D. (2004) MSAT: a multiple sequence alignment tool based on TOPS. Appl. Bioinformatics, 3, 149–158. 10. Kleinjung,J., Romein,J., Lin,K. and Heringa,J. (2004) Contact-based sequence alignment. Nucleic Acids Res., 32, 2464–2473. 11. O’Sullivan,O., Suhre,K., Abergel,C., Higgins,D.G. and Notredame,C. (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol., 340, 385–395. 12. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. 13. Taylor,W.R. and Orengo,C.A. (1989) Protein structure alignment. J. Mol. Biol., 208, 1–22. 14. Kleywegt,G.J. and Jones,T.A. (1999) Software for handling macromolecular envelopes. Acta. Crystallogr. D Biol. Crystallogr., 55, 941–944. 15. Kolodny,R., Koehl,P. and Levitt,M. (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol., 346, 1173–1188. 16. Shi,J., Blundell,T.L. and Mizuguchi,K. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol., 310, 243–257. 17. Poirot,O., Suhre,K., Abergel,C., O’Toole,E. and Notredame,C. (2004) 3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment. Nucleic Acids Res., 32, W37–W40. 18. de Bakker,P.I., Bateman,A., Burke,D.F., Miguel,R.N., Mizuguchi,K., Shi,J., Shirai,H. and Blundell,T.L. (2001) HOMSTRAD: adding sequence information to structure-based alignments of homologous protein families. Bioinformatics, 17, 748–749. 
Education

Recent Evolutions of Multiple Sequence Alignment Algorithms

Cédric Notredame

An ever-increasing number of biological modeling methods depend on the assembly of an accurate multiple sequence alignment (MSA). These include phylogenetic trees, profiles, and structure prediction. Assembling a suitable MSA is not, however, a trivial task, and none of the existing methods have yet managed to deliver biologically perfect MSAs. Many of the algorithms published in recent years have been extensively described [1-3], and this review focuses only on the latest developments, including meta-methods and template-based alignment techniques.

The purpose of an MSA algorithm is to assemble alignments reflecting the biological relationship between several sequences. Computing exact MSAs is computationally almost impossible, and in practice approximate algorithms (heuristics) are used to align sequences by maximizing their similarity. The biological relevance of these MSAs is usually assessed by systematic comparison with established collections of structure-based MSAs ("gold standards"; for review see [4]). Since only a few sequences have known structures, the accuracy measured on the references is merely an estimation of how well a package may fare on standard datasets.

Gold standards have had a considerable effect on the evolution of MSA algorithms, refocusing the entire methodological development toward the production of structurally correct alignments. Their use has also coincided with a notable algorithmic harmonization, most MSA packages being now based on the "progressive algorithm" [5]. This greedy heuristic assembly algorithm involves estimating a guide tree (a rooted binary tree) from the unaligned sequences, and then incorporating the sequences into the MSA with a pairwise alignment algorithm while following the tree topology. The progressive algorithm is often embedded in an iterative loop in which the guide tree and the MSA are re-estimated until convergence.
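The progressive heuristic just described can be illustrated with a deliberately minimal sketch: toy match/mismatch/linear-gap scoring, and a greedy nearest-neighbour joining order standing in for a real guide tree. Every name and parameter below is invented for the example; production tools use substitution matrices, affine gaps, and proper tree estimation.

```python
# Minimal sketch of progressive alignment: toy scoring, a greedy guide
# order instead of a real guide tree, and no iterative refinement.

MATCH, MISMATCH, GAP = 1.0, -1.0, -2.0

def col_score(col_a, col_b):
    """Average substitution score between two profile columns (gaps ignored)."""
    pairs = [(x, y) for x in col_a for y in col_b if x != '-' and y != '-']
    if not pairs:
        return 0.0
    return sum(MATCH if x == y else MISMATCH for x, y in pairs) / len(pairs)

def align_profiles(A, B):
    """Needleman-Wunsch on two profiles (lists of equal-length row strings);
    returns the merged profile, rows of A first."""
    ca, cb = list(zip(*A)), list(zip(*B))
    n, m = len(ca), len(cb)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + col_score(ca[i - 1], cb[j - 1]),
                           dp[i - 1][j] + GAP, dp[i][j - 1] + GAP)
    cols, i, j = [], n, m            # traceback, merging columns
    while i or j:
        if i and j and dp[i][j] == dp[i - 1][j - 1] + col_score(ca[i - 1], cb[j - 1]):
            cols.append(ca[i - 1] + cb[j - 1]); i -= 1; j -= 1
        elif i and dp[i][j] == dp[i - 1][j] + GAP:
            cols.append(ca[i - 1] + ('-',) * len(B)); i -= 1
        else:
            cols.append(('-',) * len(A) + cb[j - 1]); j -= 1
    cols.reverse()
    return [''.join(row) for row in zip(*cols)]

def pct_id(s, t):
    a, b = align_profiles([s], [t])
    return sum(x == y != '-' for x, y in zip(a, b)) / len(a)

def progressive_align(seqs):
    """Greedy 'guide tree': seed with the closest pair, then repeatedly
    add the sequence most similar to anything already aligned."""
    n = len(seqs)
    ident = {(i, j): pct_id(seqs[i], seqs[j])
             for i in range(n) for j in range(i + 1, n)}
    i, j = max(ident, key=ident.get)
    msa, order = align_profiles([seqs[i]], [seqs[j]]), [i, j]
    left = [k for k in range(n) if k not in order]
    while left:
        k = max(left, key=lambda k: max(ident[tuple(sorted((k, o)))] for o in order))
        msa, order = align_profiles(msa, [seqs[k]]), order + [k]
        left.remove(k)
    rows = [None] * n                # restore the input order of the rows
    for row, idx in zip(msa, order):
        rows[idx] = row
    return rows
```

The greediness is the point: once a gap is introduced at an early step it is never revisited ("once a gap, always a gap"), which is why real packages wrap this core in the iterative refinement loop mentioned above.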
Most MSA packages reviewed here [6-18] follow this canvas, albeit more or less extensively adapted for improved performance [1-3]. The scoring schemes used by the pairwise alignment algorithm are arguably the most influential component of the progressive algorithm. They can be divided into two categories: matrix- and consistency-based. Matrix-based algorithms such as ClustalW [14], MUSCLE [6], and Kalign [19] use a substitution matrix to assess the cost of matching two symbols or two profiled columns. Although profile statistics can be more or less sophisticated, the score for matching two positions depends only on the considered columns or their immediate surroundings.

By contrast, consistency-based schemes incorporate a larger share of information into the evaluation. This result is achieved by using a recipe initially developed for T-Coffee [10] and inspired by Dialign overlapping weights [20]. Its principle is to compile a collection of pairwise global and local alignments (the primary library) and to use this collection as a position-specific substitution matrix during a regular progressive alignment. The aim is to deliver a final MSA as consistent as possible with the alignments contained in the library. Many recent packages have built upon this initial framework. For instance, PCMA [15] decreases T-Coffee's computational requirements by prealigning closely related sequences. ProbCons [7] uses Bayesian consistency and fills the primary library using the posterior decoding of a pair hidden Markov model; the substitution costs are estimated from this library using Bayesian statistics. MUMMALS [17] combines the ProbCons scoring scheme with the PCMA strategy, while including secondary structure predictions in its pair hidden Markov model. The most accurate flavors of MAFFT [8] (i.e., the G-INS-i and L-INS-i modes) use a T-Coffee-like evaluation.
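The library-extension idea behind consistency can be shown with a toy example: the support for pairing residue a = (sequence, position) with residue b combines the direct pairwise evidence with indirect evidence relayed through every third sequence. All weights, pairs, and names below are invented for the illustration and are not T-Coffee's actual data structures.

```python
# Toy illustration of consistency scoring (library extension): direct
# evidence for a residue pair plus the weakest-link support relayed
# through third sequences. All values here are made up.

from collections import defaultdict

def make_library(pairwise):
    """Primary library: weight for each aligned residue pair.
    Input: iterable of ((seqA, seqB), [(i, j), ...], weight)."""
    lib = defaultdict(float)
    for (sa, sb), pairs, weight in pairwise:
        for i, j in pairs:
            lib[(sa, i), (sb, j)] += weight
            lib[(sb, j), (sa, i)] += weight
    return lib

def extended_score(lib, a, b, sequences, max_len=50):
    """Direct weight of (a, b) plus, for each third sequence, the weaker
    of the two links connecting a and b through one of its residues."""
    total = lib.get((a, b), 0.0)
    for c in sequences:
        if c in (a[0], b[0]):
            continue
        for k in range(max_len):     # toy scan over positions of c
            total += min(lib.get((a, (c, k)), 0.0),
                         lib.get(((c, k), b), 0.0))
    return total
```

For example, if residue 0 of A pairs with residue 0 of B with weight 0.8, with residue 0 of C with weight 0.9, and (B, 0) pairs with (C, 0) with weight 0.7, the extended score of (A, 0)-(B, 0) becomes 0.8 + min(0.9, 0.7) = 1.5: the pairing is reinforced because a third sequence agrees with it.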
A majority of studies indicate that consistency-based methods are more accurate than their matrix-based counterparts [4], although they typically require an amount of CPU time N times higher than simpler methods (N being the number of sequences). Most of these methods are available either as downloadable packages or as online Web services (Table 1). The wealth of available methods and their increasingly similar accuracies make it harder than ever to objectively choose one over the others. Consensus methods such as M-Coffee [12] provide an interesting framework to address this problem. M-Coffee is a consensus meta-method based on T-Coffee. Given a sequence dataset, it fills up the library by using various MSA methods to compute alternative alignments. T-Coffee then uses this library to compute a final MSA consistent with the original alignments. When combining eight of the most accurate and distinct MSA packages, M-Coffee produces a better MSA than ProbCons, the best individual method, 67% of the time [12]. Aside from its ease of extension, M-Coffee's main advantage is its ability to estimate the local consistency between the final alignment and the combined MSAs (CORE index [21]; Figure 1). This useful index has been shown to be well correlated with the MSAs' structural correctness [21,22]. M-Coffee is not, however, the ultimate answer to the MSA problem, and its limited performance on remote homologs suggests that

Editor: Fran Lewitter, Whitehead Institute, United States of America
Citation: Notredame C (2007) Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol 3(8): e123. doi:10.1371/journal.pcbi.0030123
Copyright: © 2007 Cédric Notredame. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abbreviations: MSA, multiple sequence alignment

Cédric Notredame is with Information Génomique et Structurale, CNRS UPR2589, Institute for Structural Biology and Microbiology, Parc Scientifique de Luminy, Marseille, France. E-mail: cedric.notredame@gmail.com

PLoS Computational Biology | www.ploscompbiol.org | August 2007 | Volume 3 | Issue 8 | e123

Table 1. Summary of the Methods Described in the Review

Method         | Score       | Templates  | PreFab     | HOMSTRAD   | Server
ClustalW [14]  | Matrix      | —          | 61.80 [12] | —          | http://www.ebi.ac.uk/clustalw/
Kalign [18]    | Matrix      | —          | 63.00 [18] | —          | http://msa.cgb.ki.se/
MUSCLE [6]     | Matrix      | —          | 68.00 [16] | 45.0 [9]   | http://www.drive5.com/muscle/
T-Coffee [10]  | Consistency | —          | 69.97 [12] | 44.0 [9]   | http://www.tcoffee.org/
ProbCons [7]   | Consistency | —          | 70.54 [12] | —          | http://probcons.stanford.edu/
MAFFT [8]      | Consistency | —          | 72.20 [12] | —          | http://align.genome.jp/mafft/
M-Coffee [12]  | Consistency | —          | 72.91 [12] | —          | http://www.tcoffee.org/
MUMMALS [17]   | Consistency | —          | 73.10 [16] | —          | http://prodata.swmed.edu/mummals/
DbClustal [24] | Matrix      | Profiles   | —          | —          | http://bips.u-strasbg.fr/PipeAlign/
PRALINE [9]    | Matrix      | Profiles   | —          | 50.2 [9]   | http://zeus.cs.vu.nl/programs/pralinewww/
PROMALS [16]   | Consistency | Profiles   | 79.00 [16] | —          | http://prodata.swmed.edu/promals/
SPEM [28]      | Matrix      | Profiles   | 77.00 [28] | —          | http://sparks.informatics.iupui.edu/Softwares-Services_files/spem.htm
Expresso [13]  | Consistency | Structures | —          | 71.9 [11]a | http://www.tcoffee.org/
T-Lara [29]    | Consistency | Structures | —          | —          | https://www.mi.fu-berlin.de/w/LiSA/

Validation values were compiled from several sources, and selected for comparability. PreFab validations were made using PreFab version 3. HOMSTRAD validations were made on datasets having less than 30% identity. The source of each value is indicated by the accompanying reference citation.
(a) The Expresso value comes from a slightly more demanding subset of HOMSTRAD (HOM39) made of sequences less than 25% identical.
doi:10.1371/journal.pcbi.0030123.t001

further improvement using only sequence information remains an elusive goal. Progress is nonetheless needed, and, at this point, the most promising approach is probably to incorporate within the datasets any information likely to improve the alignments, such as structural and homology data. Template-based alignment methods [13] follow this approach. Structural extension was initially described by Taylor [23]. The principle is fairly straightforward (Figure 2) and involves identifying with BLAST a structural template in the Protein Data Bank for each sequence, aligning the templates using a structure superposition method, and mapping the original sequences onto their templates' alignment. The resulting sequence alignments are compiled in the primary library and used by a consistency-based method to compute the final MSA. Homology extension was originally introduced in the DbClustal package [24] and works along the same lines, using a profile rather than a structure. PSI-BLAST is used to build a profile for each sequence, and these profiles are used as templates to generate better sequence alignments, thanks to the evolutionary information they contain. The only difference between homology and structure extension is the templates' nature and the associated alignment method. This generic approach can easily be extended to any kind of template. For instance, Expresso [13] uses SAP [25,26] and FUGUE [27] to align structural templates identified by a BLAST search against the Protein Data Bank. PROMALS [16], PRALINE [9], and SPEM [28] make a profile-profile alignment with PSI-BLAST profiles used as templates. In PRALINE and PROMALS, the profile can be complemented with a secondary structure prediction in an attempt to improve the alignment accuracy. PROMALS uses ProbCons Bayesian consistency to fill its library with the posterior decoding of a pair hidden Markov model. T-Lara [29] uses

doi:10.1371/journal.pcbi.0030123.g001
Figure 1.
Typical Output of M-Coffee. This output was obtained on the kinase1_ref5 BaliBase dataset by combining MUSCLE, MAFFT, POA, Dialign-T, T-Coffee, ClustalW, PCMA, and ProbCons with M-Coffee. Correctly aligned residues (as judged from the reference) are uppercase; incorrect ones are lowercase. The color of each residue indicates the agreement of the individual MSAs with respect to the alignment of that specific residue. Red indicates residues aligned in a similar fashion among all the individual MSAs; blue indicates very low agreement between MSAs. Dark yellow, orange, and red residues can be considered to be reliably aligned.

doi:10.1371/journal.pcbi.0030123.g002
Figure 2. Framework of a Template-Based Method. Structural templates are first identified, mapped onto the sequences, and aligned using SAP. The sequence-template mapping is then used to guide the alignment of the original sequences. This alignment is integrated into the library that is used to compute the final MSA.

RNA secondary structure predictions as templates and fills a T-Coffee library with the Lara pairwise algorithm. With the exception of PRALINE and SPEM, which use a regular progressive algorithm, most template-based methods described here are consistency-based (some of them taking advantage of T-Coffee's modular structure). Their main advantage is increased accuracy. Recent benchmarks on PROMALS (Table 1) show that homology extension results in a ten-point improvement over existing methods. Likewise, structure-based methods such as Expresso produce alignments much closer to the structural references than do any of their sequence-based counterparts. One must, however, be careful not to over-interpret validation values like that given for Expresso in Table 1, since both the reference and the Expresso alignments were computed using the same structural information.
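The transitive step at the heart of structural extension (sequence to template, template to template, back to sequence) reduces to a simple composition of mappings. In the sketch below, map_a and map_b stand in for BLAST sequence-to-template hits and t_align for a SAP-style structural superposition; all positions are invented for the illustration.

```python
# Sketch of the template-based transfer: two sequences are aligned to
# each other *through* their structural templates. The mappings stand in
# for BLAST (sequence -> template) and SAP (template -> template) output.

def transfer_alignment(map_a, map_b, t_align):
    """map_a: position in sequence A -> position in template A.
    map_b: same for sequence B.
    t_align: dict, template A position -> template B position.
    Returns the induced (seq A pos, seq B pos) aligned pairs."""
    inv_b = {t: s for s, t in map_b.items()}   # template B -> sequence B
    pairs = []
    for sa, ta in sorted(map_a.items()):
        tb = t_align.get(ta)
        if tb is not None and tb in inv_b:
            pairs.append((sa, inv_b[tb]))
    return pairs
```

For example, transfer_alignment({0: 10, 1: 11, 2: 12}, {0: 5, 1: 7}, {10: 5, 11: 6, 12: 7}) pairs sequence residues 0-0 (through template positions 10 and 5) and 2-1 (through 12 and 7); template position 11 aligns to a template B position with no counterpart in sequence B, so residue 1 of A stays unpaired. Pairs obtained this way populate the primary library exactly like ordinary pairwise alignments would.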
This over-interpretation caveat raises the broader issue of method validation and benchmarking. A recent study [4] shows that, with the exception of artificial datasets, benchmarks carried out on most reference databases tend to deliver compatible results. It also suggests that the best methods have become indistinguishable, except when considering remote homologs (less than 25% identity). Unfortunately, remote homologs are poorly suited to generating reference alignments, owing to the fact that their superposition often yields alternative sequence alignments that are structurally equivalent [30]. However, one can bypass the reference alignment stage by directly comparing the evaluated alignment to some idealized 3-D superposition. Such an alignment-independent evaluation has been described and used by several authors [17,31,32]. Another trend, not well accounted for by current reference collections, is the alignment of very large datasets. While many new methods incorporate special algorithms for aligning several hundred sequences [6,8,18], current reference databases do not allow the evaluation of very large datasets, thus making it unclear how the published accuracies scale with the number of sequences. While this last issue could probably be satisfyingly solved in the current benchmarking framework, another problem remains that is much harder to address. All the existing validation approaches have in common their reliance on the "one size fits all" assumption that structurally correct alignments are the best possible MSAs for modeling any kind of biological signal (evolution, homology, or function). A report on profile construction [33] has recently challenged this view by showing that structurally correct alignments do not necessarily result in better profiles.
Likewise, it may be reasonable to ask whether better alignments always result in better phylogenetic trees and, more systematically, to question and quantify the relationship between the accuracy of MSAs and the biological relevance of any model drawn upon them.

In this review, I have presented some of the latest additions to the MSA computation arsenal. An interesting milestone has been the development of meta-methods able to seamlessly combine the output of several methods. Aside from easing the user's work, the main advantage of these consensus methods is probably the local estimation of reliability they provide (Figure 1). Using this estimation to filter out unreliable regions has already proven useful in homology modeling [34] and could probably be exploited further. The main improvement reported here, however, is probably the notion of template-based alignment. Template-based alignment is more than a trivial extension of consistency-based methods. Under this new model, the purpose of an MSA is not to squeeze a dataset and extract all the information it may contain, but rather to use the dataset as a starting point for exploring and retrieving all the related information contained in public databases. This information is to be used not only for mapping purposes, but also for driving the MSA computation. Such a usage of sequence information makes template-based methods a real paradigm shift and a major step toward global biological data integration.

Acknowledgments
The author thanks the two anonymous reviewers for suggesting several missing references.
Author contributions. CN analyzed the data and wrote the paper.
Funding. CN is funded and supported by the Centre National de la Recherche Scientifique, France.
Competing interests. The author has declared that no competing interests exist.

References
1. Edgar RC, Batzoglou S (2006) Multiple sequence alignment. Curr Opin Struct Biol 16: 368-373.
2.
Wallace IM, Blackshields G, Higgins DG (2005) Multiple sequence alignments. Curr Opin Struct Biol 15: 261-266.
3. Gotoh O (1999) Multiple sequence alignment: Algorithms and applications. Adv Biophys 36: 159-206.
4. Blackshields G, Wallace IM, Larkin M, Higgins DG (2006) Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol 6: 321-339.
5. Hogeweg P, Hesper B (1984) The alignment of sets of sequences and the construction of phylogenetic trees: An integrated method. J Mol Evol 20: 175-186.
6. Edgar RC (2004) MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5: 113.
7. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 15: 330-340.
8. Katoh K, Kuma K, Toh H, Miyata T (2005) MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33: 511-518.
9. Simossis VA, Heringa J (2005) PRALINE: A multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res 33: W289-W294.
10. Notredame C, Higgins DG, Heringa J (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302: 205-217.
11. O'Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C (2004) 3DCoffee: Combining protein sequences and structures within multiple sequence alignments. J Mol Biol 340: 385-395.
12. Wallace IM, O'Sullivan O, Higgins DG, Notredame C (2006) M-Coffee: Combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res 34: 1692-1699.
13. Armougom F, Moretti S, Poirot O, Audic S, Dumas P, et al. (2006) Expresso: Automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee. Nucleic Acids Res 34: W604-W608.
14. Thompson J, Higgins D, Gibson T (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4690.
15. Pei J, Sadreyev R, Grishin NV (2003) PCMA: Fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 19: 427-428.
16. Pei J, Grishin NV (2007) PROMALS: Towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics 23: 802-808.
17. Pei J, Grishin NV (2006) MUMMALS: Multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 34: 4364-4374.
18. Lassmann T, Sonnhammer EL (2005) Kalign: An accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6: 298.
19. Lassmann T, Sonnhammer EL (2006) Kalign, Kalignvu and Mumsa: Web servers for multiple sequence alignment. Nucleic Acids Res 34: W596-W599.
20. Morgenstern B, Dress A, Werner T (1996) Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci U S A 93: 12098-12103.
21. Notredame C, Abergel C (2003) Using multiple alignment methods to assess the quality of genomic data analysis. In: Andrade M, editor. Bioinformatics and genomes: Current perspectives. Wymondham (United Kingdom): Horizon Scientific Press. pp. 30-50.
22. Lassmann T, Sonnhammer EL (2005) Automatic assessment of alignment quality. Nucleic Acids Res 33: 7120-7128.
23. Taylor WR (1986) Identification of protein sequence homology by consensus template alignment. J Mol Biol 188: 233-258.
24. Thompson JD, Plewniak F, Thierry J, Poch O (2000) DbClustal: Rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res 28: 2919-2926.
25. Taylor WR, Orengo CA (1989) Protein structure alignment. J Mol Biol 208: 1-22.
26. Taylor WR (1999) Protein structure comparison using iterated double dynamic programming. Protein Sci 8: 654-665.
27. Shi J, Blundell TL, Mizuguchi K (2001) FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 310: 243-257.
28. Zhou H, Zhou Y (2005) SPEM: Improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics 21: 3615-3621.
29. Bauer M, Klau G, Reinert K (2005) Multiple structural RNA alignment with Lagrangian relaxation. Lect Notes Comput Sci 3692: 303-314.
30. Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS (2000) ProSup: A refined tool for protein structure alignment. Protein Eng 13: 745-752.
31. O'Sullivan O, Zehnder M, Higgins D, Bucher P, Grosdidier A, et al. (2003) APDB: A novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics 19: i215-i221.
32. Armougom F, Moretti S, Keduas V, Notredame C (2006) The iRMSD: A local measure of sequence alignment accuracy using structural information. Bioinformatics 22: e35-e39.
33. Griffiths-Jones S, Bateman A (2002) The use of structure information to increase alignment accuracy does not aid homologue detection with profile HMMs. Bioinformatics 18: 1243-1249.
34. Claude JB, Suhre K, Notredame C, Claverie JM, Abergel C (2004) CaspR: A web server for automated molecular replacement using homology modelling. Nucleic Acids Res 32: W606-W609.

A novel, noncanonical mechanism of cytoplasmic polyadenylation operates in Drosophila embryogenesis

Olga Coll, Ana Villalba, Giovanni Bussotti, et al.

Genes Dev.
2010, 24: 129-134. doi:10.1101/gad.568610.
Supplemental material: http://genesdev.cshlp.org/content/suppl/2009/12/29/24.2.129.DC1.html
Copyright © 2010 by Cold Spring Harbor Laboratory Press

RESEARCH COMMUNICATION

A novel, noncanonical mechanism of cytoplasmic polyadenylation operates in Drosophila embryogenesis

Olga Coll,1 Ana Villalba,1 Giovanni Bussotti,2 Cédric Notredame,2 and Fátima Gebauer1,3

1Gene Regulation Programme, Centre de Regulació Genòmica (CRG-UPF), 08003 Barcelona, Spain; 2Bioinformatics Programme, Centre de Regulació Genòmica (CRG-UPF), 08003 Barcelona, Spain

Cytoplasmic polyadenylation is a widespread mechanism to regulate mRNA translation that requires two sequences in the 3′ untranslated region (UTR) of vertebrate substrates: the polyadenylation hexanucleotide and the cytoplasmic polyadenylation element (CPE). Using a cell-free Drosophila system, we show that these signals are not relevant for Toll polyadenylation but that, instead, a "polyadenylation region" (PR) is necessary. Competition experiments indicate that PR-mediated polyadenylation is required for viability and is mechanistically distinct from the CPE/hexanucleotide-mediated process. These data indicate that Toll mRNA is polyadenylated by a noncanonical mechanism, and suggest that a novel machinery functions for cytoplasmic polyadenylation during Drosophila embryogenesis. Supplemental material is available at http://www.genesdev.org.
Received July 20, 2009; revised version accepted November 23, 2009.

Oocyte maturation and early development in many organisms occur in the absence of transcription. Developmental progression at these times depends largely on differential translation of maternal mRNAs, and cytoplasmic polyadenylation is a major component of this control. In general, mRNAs with a short poly(A) tail remain translationally silent, while elongation of the poly(A) tail in the cytoplasm results in translational activation. Most of the accumulated knowledge on cytoplasmic polyadenylation derives from studies in oocytes of Xenopus (for review, see Belloc et al. 2008; Radford et al. 2008). Two cis-acting sequences in the 3′ untranslated region (UTR) of substrate mRNAs are essential for this process: the conserved polyadenylation hexanucleotide, also required for nuclear polyadenylation, with the structure A(A/U)UAAA, and the U-rich cytoplasmic polyadenylation element (CPE), which generally consists of U4-5A1-3U. The hexanucleotide is recognized by the multisubunit complex CPSF (cleavage and polyadenylation specificity factor), and the CPE is recognized by CPEB, a protein with a dual function that acts as a switch between translational repression and activation. In immature oocytes, CPEB represses translation by recruiting a set of factors that functionally block the two ends of the mRNA. On the one hand, CPEB recruits Maskin (or 4E-T in growing oocytes), which in turn binds to eIF4E and prevents its recognition by eIF4G during translation initiation (Stebbins-Boaz et al. 1999; Minshall et al. 2007). On the other hand, CPEB recruits the deadenylase PARN, which keeps the poly(A) tail short (Kim and Richter 2006).

[Keywords: CPE; hexanucleotide; polyadenylation; Toll]
3Corresponding author. E-mail: fatima.gebauer@crg.es; fax 34-93-3969983.
Article is online at http://www.genesdev.org/cgi/doi/10.1101/gad.568610.
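The two canonical signals defined above lend themselves to a simple motif scan. The sketch below encodes the hexanucleotide A(A/U)UAAA and a strict U4-5A1-3U CPE as regular expressions; a real analysis would allow the looser CPE variants found in vivo, and the UTR fragment is invented for the example.

```python
# Illustrative scan for the two canonical cis-elements: the CPE
# (here strictly U4-5 A1-3 U) and the hexanucleotide A(A/U)UAAA.
# The UTR fragment is invented; real CPEs are more loosely defined.

import re

CPE = re.compile(r"U{4,5}A{1,3}U")
HEX = re.compile(r"A[AU]UAAA")

def find_elements(utr):
    """Return (start, matched string) for every CPE and hexanucleotide hit."""
    return {"CPE": [(m.start(), m.group()) for m in CPE.finditer(utr)],
            "hexanucleotide": [(m.start(), m.group()) for m in HEX.finditer(utr)]}

utr = "GCUUUUAUGCAAUAAAGC"      # invented 3' UTR fragment
hits = find_elements(utr)
```

On this fragment the scan reports one CPE ("UUUUAU" at position 2) and one hexanucleotide ("AAUAAA" at position 10), the arrangement, CPE upstream of the hexanucleotide, typical of vertebrate cytoplasmic polyadenylation substrates.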
Upon meiotic maturation, CPEB phosphorylation results in eviction of PARN and enhanced recruitment of CPSF. Together, CPEB and CPSF recruit the cytoplasmic poly(A) polymerase GLD-2, leading to elongation of the poly(A) tail and translational activation (Barnard et al. 2004). The distance between the CPE and the hexanucleotide dictates the timing and extent of polyadenylation (Piqué et al. 2008). Additional elements reported to function early during oocyte maturation are the U-rich polyadenylation response elements (PREs), which bind the protein Musashi (Charlesworth et al. 2002, 2006). CPEB belongs to a conserved family of four members in vertebrates that, in addition to acting in oocyte maturation, contribute to the regulation of local protein synthesis at synapses that underlies long-term changes in synaptic plasticity (for review, see Richter 2007). In Drosophila, the CPEB1 homolog Orb plays a role in mRNA localization and regulates the polyadenylation of oskar and cyclin B mRNAs during oogenesis (Chang et al. 1999; Castagnetti and Ephrussi 2003; Benoit et al. 2005). Orb2, the homolog of CPEB2-4, is required for long-term memory, but its role in cytoplasmic polyadenylation has not been demonstrated (Keleman et al. 2007). Other conserved factors that contribute to cytoplasmic polyadenylation during Drosophila oogenesis are the canonical poly(A) polymerase Hiiragi and the GLD-2 homolog Wispy (Wisp) (Juge et al. 2002; Benoit et al. 2008; Cui et al. 2008). Cytoplasmic polyadenylation also occurs during embryogenesis, but the sequences and factors responsible for polyadenylation at these times remain poorly understood. In Drosophila, translation of the transcripts encoding Bicoid, Toll, and Torso is activated by cytoplasmic polyadenylation in early embryogenesis, and this activation is required for appropriate axis formation (Sallés et al. 1994; Schisa and Strickland 1998).
How this polyadenylation occurs is intriguing, as Orb is barely detectable in early embryos (Vardy and Orr-Weaver 2007). Furthermore, no cis-acting elements responsible for cytoplasmic polyadenylation have been described yet in this organism. Therefore, an important question is whether similar signals, factors, and mechanisms operate for cytoplasmic polyadenylation in different biological settings. To address this question, we used an in vitro cytoplasmic polyadenylation system derived from Drosophila early embryos. Using Xenopus cyclin B1 (CycB1) mRNA as a substrate, we found that the canonical cytoplasmic polyadenylation signals, the CPE and the hexanucleotide, function in Drosophila. Surprisingly, however, these sequences are not necessary for Toll polyadenylation. Rather, a region of the 3′ UTR that we term the "polyadenylation region" (PR) is required. Consistently, competition experiments indicate that PR-mediated polyadenylation is mechanistically distinct from the CPE/hexanucleotide-mediated process, implying that a novel machinery for cytoplasmic polyadenylation operates during Drosophila embryogenesis.

GENES & DEVELOPMENT 24:129-134 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 0890-9369/10; www.genesdev.org

Results and Discussion

A Drosophila cell-free system for cytoplasmic polyadenylation

Cytoplasmic polyadenylation has been observed in oocytes and early embryos of Drosophila. Maximal polyadenylation of Toll and bicoid mRNAs occurs at ~90 min of development (Sallés et al. 1994). To study cytoplasmic polyadenylation, we first tested whether extracts from 90-min embryos could recapitulate this process. Similar extracts obtained from nonstaged embryos have been shown previously to support translation (Gebauer et al. 1999).
Because translation is the consequence of cytoplasmic polyadenylation for Toll and bicoid, we tested whether polyadenylation of these substrates in embryonic extracts could occur under translation conditions. We incubated nonadenylated full-length bicoid mRNA and the 3′ UTR of Toll in staged 90-min embryo extracts. In addition, we used the mRNA encoding the ribosomal protein Sop as a negative control, as this transcript contains a canonical hexanucleotide and has been shown to undergo nuclear but not cytoplasmic polyadenylation (e.g., see Benoit et al. 2008). After incubation, we measured the length of the poly(A) tail by the PCR-based poly(A) test (PAT) assay. We found that both bicoid and Toll RNAs gained a poly(A) tail of ~150 nucleotides (nt) while sop mRNA remained nonadenylated, indicating that this system recapitulates the cytoplasmic polyadenylation process (Fig. 1A). Toll mRNA was selected for further studies because it was consistently polyadenylated more efficiently than bicoid mRNA in the cell-free system.

To evaluate whether cytoplasmic polyadenylation resulted in increased translation, we first tested the correlation between both processes in a time-course experiment. We fused the 3′ UTR of Toll to the firefly luciferase ORF to yield the Luc-toll transcript. Translation of this transcript closely paralleled polyadenylation of the Toll 3′ UTR (Fig. 1B, cf. the left panel and the Luc-toll curve in the right panel). In addition, treatment of the mRNA with the chain elongation inhibitor cordycepin (3′-deoxyadenosine) dramatically reduced the efficiency of translation, decreasing it to the levels of nonadenylated luciferase mRNA (Fig. 1B, right panel). These data show that both cytoplasmic polyadenylation and polyadenylation-dependent translation can be recapitulated in 90-min Drosophila embryo extracts.

The canonical cytoplasmic polyadenylation signals function in Drosophila

The cis-acting elements for cytoplasmic polyadenylation in Drosophila are unknown.
To test whether the CPE and the hexanucleotide were recognized as polyadenylation elements in this organism, we analyzed the polyadenylation of the best-characterized vertebrate substrate, CycB1. The 3′ UTR of this transcript contains three CPEs, one of them overlapping with the hexanucleotide (Fig. 1C), and has been shown to undergo strong polyadenylation at a late time during Xenopus oocyte maturation and in early embryos (Groisman et al. 2002; Piqué et al. 2008). The 3′ UTR of CycB1 was polyadenylated in the Drosophila cell-free system and was sufficient to confer polyadenylation when fused to a CAT reporter (Fig. 1C). Importantly, mutation of the hexanucleotide (Fig. 1C, left panel) or the CPEs (Fig. 1C, right panel, CPE0) completely abrogated polyadenylation, indicating that the vertebrate cytoplasmic polyadenylation signals function in Drosophila. These results suggest that a canonical cytoplasmic polyadenylation machinery exists in Drosophila embryos. Intriguingly, however, poly(A) tail length control must occur in the absence of Orb, which is undetectable in early embryos, and PARN, which is not conserved in Drosophila.

Figure 1. The canonical cytoplasmic polyadenylation signals function in Drosophila. (A) Cytoplasmic polyadenylation in Drosophila embryo extracts. Nonadenylated full-length sop and bcd mRNAs, as well as the 3′ UTR of Toll, were incubated in 90-min embryo extracts, and the poly(A) tail was measured using the PAT assay. A schematic representation of the oligonucleotides used in this assay is shown in the top panel. For each transcript, a specific 5′ oligonucleotide was combined with either a specific 3′ primer to reveal the size of the nonadenylated RNA (3′ lanes), or with an oligo(dT) anchor to visualize the poly(A) tail (dT lanes). Molecular weight markers are also shown. (B) Cytoplasmic polyadenylation promotes translation of Toll mRNA. (Left panel) Polyadenylation time course of the Toll 3′ UTR, as measured by PAT assay. (Right panel) A firefly luciferase reporter containing the 3′ UTR of Toll (Luc-toll) was incubated in embryo extracts for different times, and the efficiency of translation was determined by measuring the luciferase activity. As controls, the translation efficiencies of nonadenylated Luciferase and cordycepin-treated Luc-toll mRNAs were determined. Curves represent the average of five independent experiments. (C) The CPE and the hexanucleotide are functional CPEs in Drosophila. (Left panel) Polyadenylation of radioactively labeled wild-type and hexanucleotide-mutated CycB1 3′ UTRs. The nature of the mutation is indicated in gray lowercase letters. RNAs were separated in a 6% denaturing acrylamide gel and visualized using the PhosphorImager. (Right panel) Polyadenylation of CAT reporter mRNAs containing either a wild-type CycB1 3′ UTR or a derivative with U-to-G transversions in all three CPEs. RNAs were visualized by Northern blot using a probe against the CAT ORF. A schematic representation of the Xenopus CycB1 3′ UTR is shown, with the three CPEs and the polyadenylation hexanucleotide (HN) depicted as gray and white boxes, respectively.

Toll mRNA is polyadenylated in a CPE-independent and hexanucleotide-independent fashion

Toll mRNA contains a canonical hexanucleotide, followed by a putative CPE (Fig. 2A). To determine whether these sequences were responsible for polyadenylation, we analyzed the behavior of Toll mutant derivatives. To visualize the polyadenylated products, we used Northern blots, which allow a more accurate measurement of the efficiency of polyadenylation as compared with PAT assays. Surprisingly, mutation or deletion of the CPE and/or the hexanucleotide did not affect polyadenylation of Toll (Fig. 2A).
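The signal definitions discussed above lend themselves to a simple motif scan. The sketch below is illustrative only: the CPE consensus chosen (UUUUA(A/U)U) is one common vertebrate form among several in the literature, and the example sequences are invented.

```python
import re

# Canonical cytoplasmic polyadenylation signals, RNA alphabet.
# The CPE consensus below is an assumption for illustration; functional
# definitions vary between studies.
HEXANUCLEOTIDE = re.compile(r"AAUAAA")
CPE = re.compile(r"UUUUA[AU]U")

def scan_signals(utr):
    """Return start positions of putative CPEs and hexanucleotides in a 3' UTR."""
    utr = utr.upper().replace("T", "U")  # accept DNA or RNA input
    return {
        "hexanucleotide": [m.start() for m in HEXANUCLEOTIDE.finditer(utr)],
        "cpe": [m.start() for m in CPE.finditer(utr)],
    }

# Invented fragment: a CPE-like stretch followed by AAUAAA.
print(scan_signals("GCUUUUAAUGGCAAUAAAGC"))
# {'hexanucleotide': [12], 'cpe': [2]}

# The single point mutation discussed in the text (AAUAAA to AAGAAA)
# abolishes the hexanucleotide match.
print(scan_signals("GCUUUUAAUGGCAAGAAAGC"))
# {'hexanucleotide': [], 'cpe': [2]}
```

Genome-scale surveys of hexanucleotide incidence, like the EST analysis cited later in the Discussion, amount to running such a scan over large UTR collections.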
Treatment with oligo(dT) and RNase H confirmed that the size increase of Toll upon incubation was due to polyadenylation (Supplemental Fig. 1). That Toll polyadenylation was unaffected by deletion of the CPE and the hexanucleotide was unexpected, as both elements function as polyadenylation signals in Drosophila (Fig. 1C), and a single point mutation in the hexanucleotide (AAUAAA to AAGAAA) is sufficient to hinder cytoplasmic polyadenylation in vertebrates (e.g., see McGrew and Richter 1990). Thus, it appeared that Toll polyadenylation was independent of the CPE and the hexanucleotide. However, the canonical polyadenylation machinery could, in principle, bind to functional variations of these elements that could pass unrecognized by sequence inspection. To exclude this possibility, we performed competition assays. CycB1 effectively competed the polyadenylation of radiolabeled CAT-CycB1 (Fig. 2B, lanes 10–15). Polyadenylation of Toll was readily competed by an excess of cold Toll 3′ UTR but, remarkably, not by an excess of CycB1 (Fig. 2B, lanes 1–9; see also lanes 16–21, showing the same reactions as lanes 4–9 taken 15 min later). These data argue that polyadenylation of Toll is independent of the CPE and the hexanucleotide. Addition of excess Toll competitor also destabilized the Toll substrate, while addition of CycB1 did not (Fig. 2B, lanes 16–21). In addition, nonadenylated CAT-CycB1 was often stabilized in the presence of CycB1 competitor (Fig. 2B, lanes 13–15), suggesting that the Drosophila extracts can be used to monitor both stability and adenylation changes, but that these two processes are not necessarily linked.

A proximal region in the Toll 3′ UTR directs noncanonical cytoplasmic polyadenylation

To identify the elements of Toll that were responsible for cytoplasmic polyadenylation, we performed mutational analysis. The distal 40% of the Toll 3′ UTR could be deleted without significant consequences for cytoplasmic polyadenylation (Fig. 3A, fragment Δ3).
Further deletions progressively reduced the efficiency of polyadenylation (fragments Δ4 and Δ5). A region of 183 nt within the first half of the 3′ UTR was sufficient to provide detectable levels of polyadenylation (fragment Δ6), while other regions of the 3′ UTR were not (fragments Δ5 and Δ7). We refer to the Δ6 fragment as the PR. Importantly, deletion of the PR from an otherwise wild-type 3′ UTR severely blocked polyadenylation and translation of Toll, indicating that the PR is essential for expression of this mRNA (Fig. 3B). Although the PR is necessary for polyadenylation, other sequences within the Toll 3′ UTR seem to influence both the efficiency of polyadenylation and the length of the poly(A) tail. Deletions downstream from the PR reduce the polyadenylation efficiency, while deletions upstream of the PR reduce the length of the poly(A) tail (Fig. 3A, cf. fragments Δ3, Δ4, and Δ6). This illustrates the complexity and fine regulation of the process, which is likely to be mediated by an interplay of multiple activities. Auxiliary elements located both upstream of and downstream from the polyadenylation signal have also been described for nuclear polyadenylation (Chen and Wilusz 1998; Zarudnaya et al. 2003), including sequences and factors that mediate hexanucleotide-independent polyadenylation (Venkataraman et al. 2005). Elements other than the canonical CPE have been shown previously to stimulate cytoplasmic polyadenylation in other organisms. In Xenopus oocytes, the U-rich PRE and the TCS (translational control sequence) stimulate the polyadenylation of a number of mRNAs early after progesterone induction (Charlesworth et al. 2002, 2004; Wang et al. 2008). Similarly, poly(U) and poly(C) sequences promote polyadenylation during Xenopus embryogenesis (Simon et al. 1992; Paillard et al. 2000), and undefined elements other than the CPE and the hexanucleotide direct polyadenylation of lamin B1 mRNA in Xenopus embryos (Ralle et al. 1999).
However, no direct evidence exists that these elements function independently of the canonical polyadenylation machinery.

To confirm that the PR is responsible for noncanonical polyadenylation, we performed competition assays. Polyadenylation of Toll was competed with the PR, but not with the distal 119-nt fragment of the Toll 3′ UTR that contained the CPE and the hexanucleotide (Fig. 4A, lanes 1–6). Conversely, the PR did not compete polyadenylation of CAT-CycB1 (Fig. 4A, lanes 7–10), while polyadenylation of this transcript was efficiently competed by an excess of CycB1, as well as by any fragment of Toll containing the CPE and the hexanucleotide, including the full-length Toll 3′ UTR (Fig. 4A, lanes 11–18). Consistent with the results of the polyadenylation assays, the PR competed translation of a Toll reporter but not of a CycB1 reporter (Supplemental Fig. 2). The PR competed both polyadenylation and translation less efficiently than the full-length Toll 3′ UTR, in agreement with its lower polyadenylation efficiency. The competition results cannot be explained by different affinities of the same factors for the PR compared with the canonical sequences, because the PR does not compete CycB1 mRNA polyadenylation, nor does CycB1 compete Toll polyadenylation. Thus, we conclude that polyadenylation of Toll is driven by a complex that binds to the PR and differs from the canonical machinery in at least one limiting component.

Previously, a region of the Toll 3′ UTR that lies downstream from the PR (located between nucleotides 582 and 774) was shown to promote polyadenylation of this transcript (Schisa and Strickland 1998). In our hands, this region could not compete for polyadenylation of Toll (data not shown), suggesting that it does not function for polyadenylation in early embryos. Nevertheless, it should be noted that translation of Toll is also required at later times in development, where these sequences and/or the canonical signals could become relevant.

To map more finely the sequences within the PR that were responsible for polyadenylation, we first searched for regions ultraconserved among Drosophilids. In addition, we looked for sequence words within the PR significantly overrepresented in the 3′ UTRs of Drosophila melanogaster transcripts. We found two ultraconserved regions and two related sequence words distributed along the PR (Supplemental Fig. 3A,B, observe shadowed regions and words within red boxes). Fragments of the PR containing or lacking these sequences were used in functional competition assays (Supplemental Fig. 3C). The results suggest the existence of a complex element responsible for polyadenylation of Toll that is not associated with a simple linear sequence. We speculate that a structure—or multiple redundant, interdependent linear sequences—within the PR is necessary for polyadenylation.

Interestingly, the conserved region at the 3′ end of the PR consisting of TGTTATCTGTAAGC behaved as a stabilization element. All fragments containing this region destabilized Toll when added in excess, while fragments lacking it did not (Supplemental Fig. 3C). Importantly, fragment 9 (F9) strongly destabilized Toll but did not compete for polyadenylation, showing that polyadenylation and stability of Toll are separable processes.

Figure 2. Polyadenylation of Toll mRNA is independent of the CPE and the hexanucleotide. (A, top panel) Schematic representation of the Toll 3′ UTR (1256 nt) and mutant derivatives. The location and sequence of the putative CPE and the hexanucleotide are detailed. (Bottom panel) Polyadenylation of these mRNAs was measured by Northern blot. (B) Polyadenylation competition assays. Polyadenylation of a 32P-labeled Toll 3′ UTR or CAT-CycB1 after addition of excess Toll or CycB1 3′ UTRs. RNAs were separated in a 1% denaturing agarose gel. Input (i) RNA probes are also shown. The percentage of polyadenylated transcript with respect to total transcript within each lane is indicated (%pA). Lanes 16–21 in the bottom panel show samples of the same reactions as lanes 4–9 taken 15 min later.

Figure 3. Elements required for cytoplasmic polyadenylation of Toll mRNA. (A, left panel) Schematic representation of Toll 3′ UTR deletion derivatives. Nucleotide positions are shown according to the annotated Drosophila Toll sequence, taking as reference the first nucleotide of the 3′ UTR. The location of unique restriction sites and the PR is indicated. Restriction sites at both ends of Toll belong to the vector in which this sequence was cloned. Typical patterns of these RNAs are shown in the middle panel, before (−) or after (+) incubation in the Drosophila extracts. The sizes of the respective poly(A) tails are indicated. (Right panel) Quantification of the efficiency of polyadenylation, measured as the percentage of RNA that was polyadenylated versus the total signal (polyadenylated and nonadenylated) within the ''+'' lane. Values represent the average of at least three independent experiments. (B) The PR is essential for polyadenylation and translation of Toll. (Left panel) Polyadenylation time course of a wild-type Toll 3′ UTR or a mutant derivative lacking the PR. (Right panel) Translation efficiencies of reporter mRNAs containing either wild-type or ΔPR Toll 3′ UTRs. Renilla luciferase mRNA was cotranslated as an internal control. The firefly values were corrected for Renilla expression, and the data are represented relative to the translation of Luc-toll mRNA at the last time point. Curves represent the average of three independent experiments using a single batch of extract.

Figure 4. The PR is required for noncanonical cytoplasmic polyadenylation in vitro and in vivo. (A) Polyadenylation competition experiments using the RNAs schematically represented in the top panel as competitors. Experiments were performed as indicated in the legend for Figure 2B. (B) Excess PR disrupts polyadenylation of endogenous Toll and reduces viability. (Left panel) Embryos (0–30 min) were microinjected with different amounts of the PR or an unrelated RNA of similar length (159 nt) at a concentration of 10 ng/µL. Samples were collected 1 h after injection, the RNA was extracted, and the poly(A) tail length was tested by PAT assay. Amplified products were visualized by Southern blot using a random-primed probe against the Toll 3′ UTR. The results for two independent injections are shown. (Right panel) Embryos were microinjected with RNA solutions at different concentrations or with water as control, and survival was scored as the percentage of hatched embryos as indicated in the Materials and Methods. The average of at least three independent injections of 100 embryos per injection is shown.

We next wanted to test the relevance of the PR in vivo. We performed in vivo competition experiments by injecting wild-type early embryos with either the PR or an unrelated RNA of the same length as control. We then tested polyadenylation of endogenous Toll mRNA and survival of injected embryos, as timely expression of Toll is essential for early development. The results showed that the PR specifically competed polyadenylation of endogenous Toll and reduced the viability of early embryos (Fig. 4B). These results indicate that the PR directs noncanonical polyadenylation in vitro and in vivo. The Toll polyadenylation mechanism described here may affect a variety of mRNAs.
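The two quantification rules used in the figure legends, the per-lane %pA and the Renilla-corrected translation efficiency, are simple ratios. A minimal sketch with invented readings (the function names are ours, not the authors'):

```python
def percent_polyadenylated(adenylated_signal, nonadenylated_signal):
    """%pA as in Figs. 2B and 3A: polyadenylated signal over the total
    signal (polyadenylated plus nonadenylated) within one lane."""
    total = adenylated_signal + nonadenylated_signal
    return 100.0 * adenylated_signal / total

def relative_translation(firefly, renilla):
    """Firefly luciferase corrected by the Renilla co-translation control,
    expressed relative to the last time point (as for Luc-toll in Fig. 3B)."""
    corrected = [f / r for f, r in zip(firefly, renilla)]
    return [c / corrected[-1] for c in corrected]

# Invented phosphorimager counts and luminescence readings.
print(percent_polyadenylated(300.0, 700.0))        # 30.0
print(relative_translation([100.0, 400.0, 1000.0],
                           [100.0, 100.0, 100.0]))
```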
Recent in silico EST database analyses indicate an incidence of the hexanucleotide A(A/U)UAAA of 60%–70%, suggesting that a significant fraction of mRNAs lack consensus hexanucleotide signals and may undergo AAUAAA-independent polyadenylation (for review, see MacDonald and Redondo 2002). In addition, activation of Drosophila pgc (polar granule component) mRNA in early embryos appears independent of Orb (Rangan et al. 2009).

Which factors could be involved in polyadenylation of Toll? Mutations in cortex and grauzone were identified genetically to affect polyadenylation of Toll (Lieberfarb et al. 1996). Cortex is an activator of the anaphase-promoting complex, and mutations in this gene prevent the completion of meiosis (Chu et al. 2001). Grauzone, in turn, is a transcription factor necessary for activation of Cortex (Harms et al. 2000). Thus, Cortex and Grauzone may affect polyadenylation of Toll indirectly, by precluding the normal initiation of embryogenesis. Similarly, embryos from wisp mutant mothers are defective in polyadenylation of several maternal mRNAs, including Toll (Cui et al. 2008). Wisp is present until ~2 h of development and, therefore, could be directly involved in polyadenylation of Toll. However, Wisp is also required for expression of Cortex (Benoit et al. 2008), so it is unclear to what extent the observed effects on Toll polyadenylation are due to primary defects in Cortex expression. Direct biochemical dissection using the cell-free polyadenylation system, combined with Drosophila genetics, will allow us to decipher the components of both the CPE-dependent and CPE-independent cytoplasmic polyadenylation machineries.

Materials and methods

Extract preparation

Extracts were prepared from staged 90-min embryos as described in Gebauer et al. (1999). To stage embryos, collecting trays were exchanged every 90 min, three times, and the third batch of trays was used to prepare extracts.
After preparation, extracts were adjusted to 10% glycerol, snap-frozen in liquid nitrogen, and stored at −80°C.

In vitro polyadenylation and translation

Reactions using 90-min embryo extracts were assembled as described in Gebauer et al. (1999), without tRNA. In these reactions, both polyadenylation and translation could be observed. Typically, 0.01 pmol of substrate mRNA was used in the reaction. The use of small amounts of substrate is relevant, as we found that polyadenylation is saturable in this system. To account for batch-to-batch differences in polyadenylation kinetics and efficiency, polyadenylation of each RNA construct was tested in different batches of extract, carrying in parallel the full-length Toll 3′ UTR as a positive control. In some cases, Renilla mRNA was cotranslated as an internal control. After incubation, the translation efficiency was determined by measuring the luciferase activity using the Dual Luciferase Assay System (Promega), and firefly luciferase values were corrected for Renilla expression. Polyadenylation was tested by PAT assay, Northern blot, or direct visualization using radioactively labeled RNAs. For Northern blots, a random-primed probe against the full-length 3′ UTR of Toll was used. Radioactively labeled RNAs were resolved in denaturing 6% acrylamide gels and visualized in a PhosphorImager. PAT assays were performed as described by Sallés and Strickland (1995) after treatment of RNA samples with Turbo DNase (Ambion). The gene-specific oligonucleotides used for these assays are described in the Supplemental Material. To amplify endogenous Toll, RNA was extracted from embryos using Trizol (Invitrogen), and 150–300 ng of total RNA (100 embryos) were used in the reaction. Amplified products were resolved in 1% agarose gels. For competition assays, extracts were preincubated for 10 min on ice with increasing amounts of ApppG-capped RNA competitor. The remaining reagents needed for translation were subsequently added, and the reaction was further incubated for 10 min before adding the radioactively labeled substrate mRNA.

Plasmids and in vitro transcription

DNA constructs are detailed in the Supplemental Material. mRNAs were synthesized as described previously (Gebauer et al. 1999). mRNAs lacked a poly(A) tail and contained a 7mGpppG cap. RNAs used as competitors contained an ApppG cap. Cordycepin was incorporated at the 3′ end of Luc-toll mRNA with yeast poly(A) polymerase (GE Healthcare), following the recommendations of the vendor.

Microinjections

Oregon embryos (0–30 min old) were injected in a ventral–posterior position as described previously (Schisa and Strickland 1998). Embryos were allowed to develop for 72 h at 18°C, and the number of hatched embryos was scored to estimate the percentage of viability. For PAT assays, microinjected embryos were allowed to develop for 1 h before extraction of RNA with Trizol (Invitrogen).

Acknowledgments

We thank Raúl Méndez, Juan Valcárcel, Martine Simonelig, Josep Vilardell, and Antoine Graindorge for critically reading this manuscript and for useful suggestions. We also thank Cornelia de Moor and Raúl Méndez for CycB1 plasmids. This work was supported by grants BMC2003-04108 and BFU2006-01874 from the Spanish Ministry of Education and Science, and grant 2005SGR00669 from the Department of Universities, Information, and Sciences of the Generalitat of Catalunya (DURSI). F.G. is supported by the I3 Program of the Spanish Ministry of Education and Science.

References

Barnard DC, Ryan K, Manley JL, Richter JD. 2004. Symplekin and xGLD-2 are required for CPEB-mediated cytoplasmic polyadenylation. Cell 119: 641–651.
Belloc E, Piqué M, Méndez R. 2008. Sequential waves of polyadenylation and deadenylation define a translation circuit that drives meiotic progression. Biochem Soc Trans 36: 665–670.
Benoit B, Mitou G, Chartier A, Temme C, Zaessinger S, Wahle E, Busseau I, Simonelig M. 2005. An essential cytoplasmic function for the nuclear poly(A) binding protein, PABP2, in poly(A) tail length control and early development in Drosophila. Dev Cell 9: 511–522.
Benoit P, Papin C, Kwak JE, Wickens M, Simonelig M. 2008. PAP- and GLD2-type poly(A) polymerases are required sequentially in cytoplasmic polyadenylation and oogenesis in Drosophila. Development 135: 1969–1979.
Castagnetti S, Ephrussi A. 2003. Orb and a long poly(A) tail are required for efficient oskar translation at the posterior pole of the Drosophila oocyte. Development 130: 835–843.
Chang JS, Tan L, Schedl P. 1999. The Drosophila CPEB homolog, orb, is required for oskar protein expression in oocytes. Dev Biol 215: 91–106.
Charlesworth A, Ridge JA, King LA, MacNicol MC, MacNicol AM. 2002. A novel regulatory element determines the timing of Mos mRNA translation during Xenopus oocyte maturation. EMBO J 21: 2798–2806.
Charlesworth A, Cox LL, MacNicol AM. 2004. Cytoplasmic polyadenylation element (CPE)- and CPE-binding protein (CPEB)-independent mechanisms regulate early class maternal mRNA translational activation in Xenopus oocytes. J Biol Chem 279: 17650–17659.
Charlesworth A, Wilczynska A, Thampi P, Cox LL, MacNicol AM. 2006. Musashi regulates the temporal order of mRNA translation during Xenopus oocyte maturation. EMBO J 25: 2792–2801.
Chen F, Wilusz J. 1998. Auxiliary downstream elements are required for efficient polyadenylation of mammalian pre-mRNAs. Nucleic Acids Res 26: 2891–2898.
Chu T, Henrion G, Haegeli V, Strickland S. 2001. Cortex, a Drosophila gene required to complete oocyte meiosis, is a member of the Cdc20/fizzy protein family. Genesis 29: 141–152.
Cui J, Sackton KL, Horner VL, Kumar KE, Wolfner MF. 2008. Wispy, the Drosophila homolog of GLD-2, is required during oogenesis and egg activation. Genetics 178: 2017–2029.
Gebauer F, Corona DF, Preiss T, Becker PB, Hentze MW. 1999. Translational control of dosage compensation in Drosophila by Sex-lethal: Cooperative silencing via the 5′ and 3′ UTRs of msl-2 mRNA is independent of the poly(A) tail. EMBO J 18: 6146–6154.
Groisman I, Jung MY, Sarkissian M, Cao Q, Richter JD. 2002. Translational control of the embryonic cell cycle. Cell 109: 473–483.
Harms E, Chu T, Henrion G, Strickland S. 2000. The only function of Grauzone required for Drosophila oocyte meiosis is transcriptional activation of the cortex gene. Genetics 155: 1831–1839.
Juge F, Zaessinger S, Temme C, Wahle E, Simonelig M. 2002. Control of poly(A) polymerase level is essential to cytoplasmic polyadenylation and early development in Drosophila. EMBO J 21: 6603–6613.
Keleman K, Krüttner S, Alenius M, Dickson BJ. 2007. Function of the Drosophila CPEB protein Orb2 in long-term courtship memory. Nat Neurosci 10: 1587–1593.
Kim JH, Richter JD. 2006. Opposing polymerase-deadenylase activities regulate cytoplasmic polyadenylation. Mol Cell 24: 173–183.
Lieberfarb ME, Chu T, Wreden C, Theurkauf W, Gergen JP, Strickland S. 1996. Mutations that perturb poly(A)-dependent maternal mRNA activation block the initiation of development. Development 122: 579–588.
MacDonald CC, Redondo JL. 2002. Reexamining the polyadenylation signal: Were we wrong about AAUAAA? Mol Cell Endocrinol 190: 1–8.
McGrew LL, Richter JD. 1990. Translational control by cytoplasmic polyadenylation during Xenopus oocyte maturation: Characterization of cis and trans elements and regulation by cyclin/MPF. EMBO J 9: 3743–3751.
Minshall N, Reiter MH, Weil D, Standart N. 2007. CPEB interacts with an ovary-specific eIF4E and 4E-T in early Xenopus oocytes. J Biol Chem 282: 37389–37401.
Paillard L, Maniey D, Lachaume P, Legagneux V, Osborne HB. 2000. Identification of a C-rich element as a novel cytoplasmic polyadenylation element in Xenopus embryos. Mech Dev 93: 117–125.
Piqué M, López JM, Foissac S, Guigó R, Méndez R.
2008. A combinatorial code for CPE-mediated translational control. Cell 132: 434–448.
Radford HE, Meijer HA, de Moor CH. 2008. Translational control by cytoplasmic polyadenylation in Xenopus oocytes. Biochim Biophys Acta 1779: 217–229.
Ralle T, Gremmels D, Stick R. 1999. Translational control of nuclear lamin B1 mRNA during oogenesis and early development of Xenopus. Mech Dev 84: 89–101.
Rangan P, DeGennaro M, Jaime-Bustamante K, Coux RX, Martinho RG, Lehmann R. 2009. Temporal and spatial control of germ-plasm RNAs. Curr Biol 19: 72–77.
Richter JD. 2007. CPEB: A life in translation. Trends Biochem Sci 32: 279–285.
Sallés FJ, Strickland S. 1995. Rapid and sensitive analysis of mRNA polyadenylation states by PCR. PCR Methods Appl 4: 317–321.
Sallés FJ, Lieberfarb ME, Wreden C, Gergen JP, Strickland S. 1994. Coordinate initiation of Drosophila development by regulated polyadenylation of maternal messenger RNAs. Science 266: 1996–1999.
Schisa JA, Strickland S. 1998. Cytoplasmic polyadenylation of Toll mRNA is required for dorsal-ventral patterning in Drosophila embryogenesis. Development 125: 2995–3003.
Simon R, Tassan JP, Richter JD. 1992. Translational control by poly(A) elongation during Xenopus development: Differential repression and enhancement by a novel cytoplasmic polyadenylation element. Genes & Dev 6: 2580–2591.
Stebbins-Boaz B, Cao Q, de Moor CH, Méndez R, Richter JD. 1999. Maskin is a CPEB-associated factor that transiently interacts with eIF-4E. Mol Cell 4: 1017–1027.
Vardy L, Orr-Weaver TL. 2007. The Drosophila PNG kinase complex regulates the translation of cyclin B. Dev Cell 12: 157–166.
Venkataraman K, Brown KM, Gilmartin GM. 2005. Analysis of a noncanonical poly(A) site reveals a tripartite mechanism for vertebrate poly(A) site recognition. Genes & Dev 19: 1315–1327.
Wang YY, Charlesworth A, Byrd SM, Gregerson R, MacNicol MC, MacNicol AM. 2008.
A novel mRNA 3′ untranslated region translational control sequence regulates Xenopus Wee1 mRNA translation. Dev Biol 317: 454–466.
Zarudnaya MI, Kolomiets IM, Potyahaylo AL, Hovorun DM. 2003. Downstream elements of mammalian pre-mRNA polyadenylation signals: Primary, secondary and higher-order structures. Nucleic Acids Res 31: 1375–1386.

Sociological Methods & Research
http://smr.sagepub.com

How Much Does It Cost? Optimization of Costs in Sequence Analysis of Social Science Data
Jacques-Antoine Gauthier, Eric D. Widmer, Philipp Bucher and Cédric Notredame
Sociological Methods & Research 2009; 38: 197
DOI: 10.1177/0049124109342065
The online version of this article can be found at: http://smr.sagepub.com/cgi/content/abstract/38/1/197
Published by SAGE Publications: http://www.sagepublications.com
Downloaded from http://smr.sagepub.com at Unithèque cantonale et universitaire de Lausanne on October 26, 2009

How Much Does It Cost? Optimization of Costs in Sequence Analysis of Social Science Data

Jacques-Antoine Gauthier, University of Lausanne, Switzerland
Eric D. Widmer, University of Geneva, Switzerland
Philipp Bucher, Swiss Institute of Bioinformatics and Swiss Institute for Experimental Cancer Research, Lausanne, Switzerland
Cédric Notredame, Centre National de la Recherche Scientifique, Marseille, France, and Centre for Genomic Regulation, Barcelona, Spain

Sociological Methods & Research, Volume 38, Number 1, August 2009, 197-231
© 2009 SAGE Publications

One major methodological problem in the analysis of sequence data is the determination of the costs from which distances between sequences are derived. Although this problem is currently not optimally dealt with in the social sciences, it has some similarity with problems that have been solved in bioinformatics over the past three decades. In this article, the authors propose an optimization of substitution and deletion/insertion costs based on computational methods. The authors provide an empirical way of determining costs for cases, frequent in the social sciences, in which theory does not clearly promote one cost scheme over another. Using three distinct data sets, the authors tested the distances and cluster solutions produced by the new cost scheme in comparison with solutions based on cost schemes associated with other research strategies. The proposed method performs well compared with other cost-setting strategies, while it alleviates the justification problem of cost schemes.

Keywords: sequence analysis; optimal matching; trajectories; empirical cost optimization

Authors' Note: Please address correspondence to Jacques-Antoine Gauthier, University of Lausanne, SSP–MISC, Bâtiment de Vidy, CH-1015 Lausanne, Switzerland; e-mail: JacquesAntoine.Gauthier@unil.ch.

Optimal matching analysis (OMA) has emerged since the 1990s as a main methodological innovation in the social sciences for finding patterns in sequences of social events (Abbott and Tsay 2000).
It is based on the assumption that successions of social statuses or events constitute stories throughout the life course that can be measured in a set of data (Abbott 1984, 1990a, 1990b, 1995a, 2001). Usual measures of distance, such as the Euclidean distance, are ineffective for many kinds of sequential data, for example, when sequence lengths differ (Kruskal 1983; Abbott 1995b, 2001). Therefore, multivariate statistical methods falling within the framework of dynamic programming procedures and stemming from molecular biology (e.g., Needleman and Wunsch 1970) have been adapted to the study of social trajectories (Abbott and Hrycak 1990; Erzberger and Prein 1997; Giele and Elder 1998; Wilson 1998; Aisenbrey 2000; Rohwer and Pötter 2002) and embodied in various software packages (TDA, Optimize, and CLUSTALG). One problem identified as major in this set of methods, however, lies in the cost schemes on which empirical analyses are based. As a matter of fact, optimal matching methods decompose the total difference between any two sequences into a collection of individual elementary differences using substitution, deletion, and insertion operations (Kruskal 1983). The determination of the costs attributed to those operations is the subject of an ongoing debate in the social sciences (Abbott and Tsay 2000; Wu 2000), since the setting of costs is in most cases not based on explicit and strong theoretical stances. For example, given a pair of sequences to be aligned, one can wonder whether it is the same to substitute 1 year of full-time employment with either 1 year of part-time employment or 1 year of being exclusively at home. If it is not, we should consider weighting the costs of those operations so that they contribute differently to the final alignment of the two sequences.
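The decomposition just described, in which the total difference between two sequences is the minimal summed cost of substitutions and insertions/deletions, is the dynamic program of Needleman and Wunsch (1970). Below is a minimal sketch; the employment states (F, P, H), the cost values, and the sequences are invented for illustration.

```python
def oma_distance(a, b, sub_cost, indel_cost):
    """Optimal-matching distance between two state sequences via
    Needleman-Wunsch dynamic programming: the minimal total cost of
    substitutions and insertions/deletions turning a into b."""
    n, m = len(a), len(b)
    # dist[i][j] holds the cost of aligning a[:i] with b[:j].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dist[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
                dist[i - 1][j] + indel_cost,   # delete a[i-1]
                dist[i][j - 1] + indel_cost,   # insert b[j-1]
            )
    return dist[n][m]

# Hypothetical yearly states: F = full-time, P = part-time, H = at home.
# A theory-driven scheme might make F<->P cheaper than F<->H.
costs = {("F", "H"): 2.0, ("F", "P"): 1.0, ("H", "P"): 1.0}
sub = lambda x, y: 0.0 if x == y else costs[tuple(sorted((x, y)))]
print(oma_distance("FFPH", "FPHH", sub, indel_cost=1.5))  # 2.0
```

With an identity matrix, `sub` would simply return one constant for any mismatch, which is exactly the first cost-setting strategy discussed in the next section.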
Some scholars emphasize the large impact that cost setting has on the final results of their analysis (Rohwer and Pötter 2002), whereas others take the opposite stance, underlining its marginal impact on similarity scores among sequences (Levine 2000). However, most argue for both sensitivity and stability of the effect of cost variations on the results of the analysis (Abbott and Hrycak 1990).4 Therefore, researchers in the social sciences are left wondering to what extent the final results of their analyses are reproducible and valid. This article first describes the usual solutions proposed by social scientists in regard to the problem of the determination of costs in sequence analysis. Then, it proposes a method that computationally derives costs from the empirical data, based on state-of-the-art approaches in bioinformatics (Henikoff and Henikoff 1992; Müller and Vingron 2000; Ng, Henikoff, and Henikoff 2000; Yu and Altschul 2005). The proposed algorithm is then tested on three distinct social science data sets. We further discuss the consequences of the results for empirical analyses of sequence data in the social sciences.

Gauthier et al. / How Much Does It Cost?

How Are Costs Determined in the Social Sciences?

The issue of costs concerns two operations in sequence analysis: substitution and insertion/deletion. Because this stage of sequence analysis is critical for further results, all publications that use OMA provide some sense of how costs are set, but with unequal degrees of detail. Based on a literature review in the field, we found five strategies regarding the setting of substitution costs as they are used in the social sciences. A first strategy is to set all substitution costs to a constant, that is, to use an identity matrix (Dijkstra and Taris 1995; Rohwer and Trappe 1997; Pentland et al.
1998; Wilson 1998; Schaeper 1999; Billari 2001). Those using this strategy argue that they have no rational way to set costs otherwise. This strategy is used largely when no theoretical rationale is available for supporting the setting of costs. It has been criticized, however, for its inability to reflect unequal differences between a given set of social characteristics, on one hand, and the distribution of those different positions, on the other. Abbott and Hrycak (1990) gave the example of occupational positions such as senior executive, first-level supervisor, and line worker, whose unequal proximities constant costs cannot express. They proposed that in this case substituting or inserting the rarest one should be more costly. A second research strategy uses differentiated costs following theoretical intuitions concerning the "social weight" of substituting one status with another (Chan 1995; Erzberger and Prein 1997; Halpin and Chan 1998; Blair-Loy 1999; Giuffre 1999; Schaeper 1999; Scherer 2001; Widmer, Levy, et al. 2003). For instance, Chan (1995) underlined that decisions about costs have to be grounded in theoretically important divisions between social classes. One may agree in principle with this and comparable statements, but the social sciences are currently characterized by various contradictory theories rather than by a common theoretical framework such as evolutionary theory in biology (Grauer and Li 2000; Turner 2001; Giddens, Duneier, and Applebaum 2003). Therefore, backing costs with theoretical statements often proves difficult because of the large number of alternatives, depending on the theory chosen. Also, because results from sequence analysis are used to support and contradict theoretical statements at the same time, there is some circularity in building the costs on the same theoretical statements that they are supposed to help prove or disprove.
This is as true for research on social classes as for other research areas in the social sciences. A third strategy consists of applying some empirical coding scheme based on common sense or face value. Aisenbrey (2000) set the substitution costs according to a hierarchical ordering of the statuses that constitute the sequences. Abbott and Forrest (1986, 1989), for instance, categorized the statuses of sequences according to the number of steps up the hierarchy necessary to put them under a common heading. The substitution cost is computed as the ratio of this number to the total number of steps possible. Applying the "garbage can model" to estimate the institutional influence on the textbook publishing process in physics and sociology by means of sequence comparison techniques, Levitt and Nass (1989) based the setting of their substitution costs on a list of topics and subtopics used in structuring textbooks. The cost was set to 1 for a change from one topic to another (e.g., stratification vs. ideology) and to 0.5 for a change between subtopics of the same topic (race vs. gender as substructures of stratification). Studying the structure of sociological articles across time, Abbott and Barman (1997) defined two levels of elementary states of sociological articles. Level 1 comprised statuses such as "introductory," "hypotheses," and "literature," whereas Level 2 encompassed subdivisions such as "topic," "state of affairs," "questions," and "author's theory/assertion" for the introductory heading. A substitution cost of 1 was attributed to subheadings falling under different headings and 0.25 to subheadings falling under the same heading. In all cases reviewed, the setting of costs is not done on strong theoretical bases, but rather on rules that make empirical sense considering the problem at hand.
Fourth, some authors set costs based on a combination of common sense (the third strategy) with the likelihood of transitions between statuses in the empirical data (Abbott and Hrycak 1990; Stovel, Savage, and Bearman 1996; Stovel and Bolan 2004). For instance, in their programmatic study of musicians' careers, Abbott and Hrycak (1990) first distinguished for each musician nine spheres of activity (court, town, church, etc.) and 15 positions (vocalist, composer, Kapellmeister, etc.). Among the 135 combinations, they finally kept 35 different occupational positions as statuses in a musician's career. To set the costs of substitution, they proposed that a change in both sphere and position is more drastic than a change in only one of the two. They set to 0.75 the cost for a change within either a sphere or a position. The cost was set to 1 when the change occurred on both levels. Second, in order to take into account the fact that some pairs of occupational positions seem to be closely connected with mobility (i.e., they often lie on the same career line), they combined the distance matrix, based on mobility, with a position/sphere dissimilarity matrix. This matrix was constructed by classifying all moves in all careers according to their frequency. The final substitution matrix is then a linear combination of the corresponding entries of the two matrices. An alternative to the development of substitution costs is represented by the use of transition costs, estimated directly on collections of trajectories. Such an option is available in the TDA software, but to our knowledge, no empirical results based solely on this way of determining substitution costs have yet been published. In a transition matrix, low costs indicate pairs of symbols that are likely to co-occur in a specific life trajectory (such as work and retirement).
In a substitution matrix, low substitution costs indicate symbols that are likely to occur simultaneously in two different trajectories. A low substitution cost does not imply any transition, but rather an equivalence of some sort between the two considered statuses. While transition matrices are ideal for analyzing individual strings and identifying trajectory anomalies, they are much less suitable for comparisons of alternative trajectories, which rely on the comparison of symbols occurring simultaneously in different trajectories. Some scholars have used costs based on transitions, combined with some additional criteria. Stovel et al. (1996) derived the substitution costs from an analysis of the complete transition matrix reporting the distribution of work transitions of all workers of Lloyds Bank over the period 1890 through 1970. They then distinguished costs for positions and for branch changes and combined them. Considering residential trajectories, Stovel and Bolan (2004) used a similar strategy. They first constructed a place-type variable (nine categories) based on a continuum ranging from small rural towns to large metropolitan cities. This theoretically based distinction was then combined with the empirical distribution of the frequency of all possible transitions among types of places. The substitution matrix was constructed as a repeated adjustment between the initial theoretical model and the empirical transition rates. In contrast to the previous three strategies, this strategy marks a significant improvement as it is at least partially empirically driven. There are, however, various problems with the solutions currently proposed. First, all reviewed solutions are at least partially driven by intuition or face value, or by some kind of theoretical stance.
Second, the choice of simple frequencies (or a linear function of them) to weight the substitution cost is supported neither by any formalized computational method nor by any statistical theoretical grounds. Third, even in cases where "pseudo" or intuitive iterative methods are used to set the substitution costs (cf. Stovel and Bolan 2004), no formal rules are presented that justify the solution chosen by the researchers. Fourth, none of those models succeeds in giving a systematic and fully empirically driven procedure for setting substitution costs. Finally, no attempts are made to optimize costs based on the empirical data at hand. In the fifth strategy, some researchers acknowledge having used a mix of several if not all approaches listed above, insisting on the exploratory dimension of the process and the fact that guidelines are few and rather fuzzy (Rohwer and Pötter 2002). To summarize, researchers in the field have underlined that the issue of the determination of costs in OMA remains presently open.

An Alternative: Deriving the Cost Empirically

To develop a method for cost setting that is more systematic and reliable than the ones currently existing in the social sciences, one should get back to the basics of sequence alignment. Given two strings I and J, a penalty for insertions and deletions (called INDEL), and a cost matrix C, where C_{S_i S_j} is the cost for aligning S_i, the ith symbol of I, against S_j, the jth symbol of J, the score of the optimal alignment can be computed using the following recursion:

OMA(i, j) = Best { OMA(i − 1, j − 1) + C_{S_i S_j},  OMA(i − 1, j) + INDEL,  OMA(i, j − 1) + INDEL }   (1)

In a general sequence comparison perspective, one considers that a substitution is equivalent to a deletion followed by an insertion.
Therefore, the value of an INDEL is often arbitrarily set to half of that of a substitution (Kruskal 1983). Each line in equation (1) corresponds to the optimal match score of two substrings. For instance, OMA(i − 1, j − 1) corresponds to the optimal match score of a subsequence containing the symbols 1 to i − 1 of Sequence 1, against a subsequence containing the symbols 1 to j − 1 of the second sequence. As such, this equation defines a recursion in which the score of any alignment OMA(i, j) can be estimated by considering an optimal extension of the three shorter alignments OMA(i − 1, j), OMA(i − 1, j − 1), and OMA(i, j − 1). Considering that each of these shorter alignments is already an optimal matching of the associated substrings, we know that OMA(i, j) is optimal. This strategy relies on the assumption that each position is independent and that the alignment scores are additive. The alignment of Sequences A and B in Figure 1 is produced by applying the recursion in equation (1) and iteratively filling up the OMA(i, j) array until the optimal matching score OMA(I, J) is obtained (Kruskal 1983). By recording the results of all the comparisons made at each step of the recursion, it is possible to trace back the optimal scores from the cell OMA(I, J), thus generating an alignment, as shown in Figure 1, where an identity substitution matrix has been used. Such a matrix assigns the value 0 to the matching of two identical letters and the value −2 to the substitution of two different letters.

Figure 1. Example of Optimal Matching Score Computation and Alignment
Insertions or deletions occurring at one extremity of the alignment take the value −1 (terminal INDEL) and the value −2 when they are used within the alignment (internal INDEL). In the OMA(i, j) array of Figure 1, the traceback is indicated in bold. Starting from the bottom-right corner of the array, vertical moves correspond to an INDEL in Sequence A, horizontal moves to an INDEL in Sequence B, and diagonal moves to a match or a substitution. One of the main issues that arises in equation (1) concerns the estimation of the substitution costs (C_ij). This issue is also central in biology. Given 20 amino acids, some so similar that they are almost interchangeable while others are very different, one cannot simply use any a priori substitution matrix; some modeling is required. Dayhoff, Schwartz, and Orcutt (1978) addressed this problem in the 1970s using a data-driven empirical approach. They manually aligned sets of highly similar, same-length sequences of amino acids and counted the number of mutations tolerated by evolution. A mutation is characterized by the presence of two different amino acids at the same position of the alignment. In a general sequence comparison perspective, this is called a substitution (Kruskal 1983). In this context, highly similar sequences are defined as those having more than 80 percent identity, where the percentage of identity is calculated by dividing the number of positions in the alignment in which the same letter appears in both sequences (identities) by the length of the alignment, as shown in equation (2). All positions with a gap in either sequence are nonidentities; thus, only the alignment of two identical sequences yields 100 percent identity:

Percentage of Identity = W = Number of Identical Matches / Alignment Length   (2)

Selecting sequences with a high percentage of identity for computing data-driven costs of substitution prevents biases due to uncontrolled heterogeneity.
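The recursion of equation (1) and the fill of the OMA array can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: for simplicity it applies a single uniform INDEL value instead of distinguishing terminal from internal indels, and all function and variable names are our own.

```python
def oma_score(seq_i, seq_j, cost, indel):
    """Optimal matching score via the dynamic-programming recursion of
    equation (1): each cell is the best extension of the three shorter
    alignments (diagonal = match/substitution, vertical/horizontal = INDEL)."""
    n, m = len(seq_i), len(seq_j)
    # oma[i][j] holds the optimal score for the prefixes of lengths i and j
    oma = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        oma[i][0] = i * indel            # leading deletions
    for j in range(1, m + 1):
        oma[0][j] = j * indel            # leading insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            oma[i][j] = max(
                oma[i - 1][j - 1] + cost(seq_i[i - 1], seq_j[j - 1]),
                oma[i - 1][j] + indel,   # deletion in seq_j
                oma[i][j - 1] + indel,   # insertion in seq_j
            )
    return oma[n][m]

# Identity cost scheme from Figure 1: 0 for a match, -2 for a substitution
identity_cost = lambda a, b: 0.0 if a == b else -2.0
```

For instance, `oma_score("AB", "B", identity_cost, -2.0)` matches B against B at no cost and pays one INDEL for the unmatched A.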
For instance, the alignment of Figure 1 displays 57 percent identity (four identical pairs of letters found over the seven positions of the alignment). Finally, for each pairwise alignment, the relative frequency of substitutions occurring between two particular amino acids is compared to what would be expected by chance alone. These values are computed as log odds and tabulated into a data-driven substitution matrix, as in equation (3):

Dayhoff Cost(a, b) = log( f_ab / (f_a × f_b) )   (3)

In equation (3), f_ab is the relative frequency with which the symbols a and b have actually matched at the same position of a given set of pairwise alignments, while f_a × f_b is the product of the relative frequencies of a and b in the same data set and therefore an estimation of the probability of seeing a and b aligned throughout all the alignments of the data set. If we consider f_ab to be an estimate of the probability of finding a and b matched in the data set, then it becomes possible to estimate the ratio of those two probabilities (their odds) and evaluate the extent to which a given substitution (match) between two symbols is over- or underrepresented in the alignments. The most notable property of log odds is to yield negative scores for events observed less often than expected by chance. In the context of optimal matching, this amounts to having a cost matrix that penalizes unexpected matches with negative values while expected matches or identities are rewarded with positive values. Since, in an alignment, two identical symbols do not systematically match, the Dayhoff cost for the match of two identical symbols is often different from zero. In biology, the matching of different pairs of identical symbols can thus be associated with different positive values. The rationale is that in biology, all conservations are not equally important.
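Equation (3) can be illustrated with a short sketch that tabulates log-odds costs from a list of matched symbol pairs. As a simplifying assumption, the symbol frequencies f_a are estimated from the same set of pairs; the function name is hypothetical.

```python
import math
from collections import Counter

def dayhoff_costs(aligned_pairs):
    """Data-driven log-odds costs (equation 3): compare the observed
    match frequency f_ab with the chance expectation f_a * f_b."""
    pair_counts = Counter(tuple(sorted(p)) for p in aligned_pairs)
    sym_counts = Counter(s for p in aligned_pairs for s in p)
    total_pairs = len(aligned_pairs)
    total_syms = 2 * total_pairs
    costs = {}
    for (a, b), n in pair_counts.items():
        f_ab = n / total_pairs                 # observed match frequency
        f_a = sym_counts[a] / total_syms       # symbol frequencies
        f_b = sym_counts[b] / total_syms
        costs[(a, b)] = math.log(f_ab / (f_a * f_b))
    return costs
```

On a toy set of pairs dominated by ('a', 'a') matches, the self-match cost comes out positive and larger than the mismatch cost, as the log-odds rationale predicts.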
In the social sciences, however, the decision was made early to set conservation costs to 0 and substitution to variable costs. This model suggests that all social statuses are equally conserved, regardless of their nature. This may or may not be true. For instance, one may ask whether the social cost should be the same for matching years spent as unemployed or years spent on the labor market. The equality of these statuses cannot be ruled out as long as it has not been formally demonstrated. For the time being, the proposed algorithm sticks to the mainstream procedure in the social sciences, but it would be trivial to adapt it so that different costs may be used for different types of identities. To get a cost of zero for the substitution of two identical symbols, we use a normalized cost (N_cost) that is derived from the cost defined in equation (3) as follows:

N_cost(a, b) = Dayhoff Cost(a, b) − ( Dayhoff Cost(a, a) + Dayhoff Cost(b, b) ) / 2   (4)

In equation (4), Dayhoff Cost refers to the original Dayhoff cost (equation [3]) that is positive and maximized for identities while yielding lower (often negative) values for mismatches. A substitution matrix based on N_costs has the same properties as the Dayhoff cost matrix except that it yields a null cost for the alignment of two identical sequences, a convenient property for cluster analysis based on a distance matrix. In biology, it is common practice to use log-odds matrices as a scoring scheme when applying the optimal matching algorithm. The main reason is that the versatility of the log-odds method makes it possible to discriminate between different types of mismatches in an objective and quantitative fashion.
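A minimal sketch of the normalization in equation (4), applied to a small Dayhoff-style matrix whose values are invented purely for illustration:

```python
def n_cost(dayhoff, a, b):
    """Normalized cost (equation 4): subtract the average of the two
    self-match costs so that identities cost exactly zero."""
    return (dayhoff[tuple(sorted((a, b)))]
            - (dayhoff[(a, a)] + dayhoff[(b, b)]) / 2)

# Hypothetical log-odds values, for illustration only
toy = {('a', 'a'): 2.0, ('b', 'b'): 1.0, ('a', 'b'): -0.5}
```

With these toy values, `n_cost(toy, 'a', 'a')` is 0 and `n_cost(toy, 'a', 'b')` is negative, as the text requires.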
As substitution and INDEL operations are mutually dependent, using cost matrices as defined in equation (3) or (4) calls for setting the value of the INDELs according to the cost matrix at hand, as shown in equation (5):

INDEL = [ Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} C_ij ] / [ (N² − N) × 0.5 ]   (5)

In equation (5), the cost for not matching a symbol (INDEL) was estimated using the Thompson formula (Thompson, Higgins, and Gibson 1994), where INDEL is set to the average substitution cost of the substitution matrix (i.e., the matrix average ignoring the values on the main diagonal). It is possible to distinguish two kinds of indels: the internal ones, which occur between two given symbols, and the terminal ones, which come at the end of the shorter sequence to make its length equal to that of the longer one. In the context of this work, we simply attributed the INDEL value of equation (5) to internal indels only and lowered it to INDEL/2 for terminal ones, thus making it easier for indels to be terminal rather than internal. Given a collection of sequences, the main difficulty is the proper estimation of an appropriate cost matrix. Using reference alignments is possible but may require some a priori knowledge. In the case of Dayhoff, using reference alignments was possible because closely related sequences were available whose alignment could be assembled in an unambiguous manner (i.e., without INDEL). In the social sciences, reference alignments are not available, and a strategy must therefore be worked out to generate them in a systematic and unbiased fashion. Over the last 15 years, several techniques have been introduced in biology aimed at training position-specific substitution matrices through iterative sequence-alignment procedures (Lawrence et al. 1993; Hughey and Krogh 1996; Altschul et al. 1997; Bateman et al. 1999).
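Equation (5) reduces to averaging the off-diagonal entries of the substitution matrix, since (N² − N) × 0.5 is exactly the number of unordered symbol pairs. A hedged sketch, assuming the matrix is keyed on sorted symbol tuples and the symbol list is sorted:

```python
def thompson_indel(matrix, symbols):
    """Equation (5): set INDEL to the average off-diagonal substitution
    cost; (n * n - n) * 0.5 counts the unordered pairs i < j."""
    n = len(symbols)
    total = sum(matrix[(symbols[i], symbols[j])]
                for i in range(n - 1) for j in range(i + 1, n))
    return total / ((n * n - n) * 0.5)
```

For a three-symbol alphabet the denominator is (9 − 3) × 0.5 = 3, i.e. the three unordered pairs.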
PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool), one of the most popular tools in biology, is the one whose principle most resembles the one developed here. In PSI-BLAST, a biological sequence is first compared to all the others in the database, using an off-the-shelf substitution matrix. The best alignments thus generated are selected according to their percentage of identity and used to update the matrix in a process that goes on, cycle after cycle, until successive cycles fail to modify the matrix, in which case the algorithm is said to have reached convergence. The proposed strategy is directly adapted from this iterative method and is outlined in the pseudocode shown in Figure 2.

Figure 2. Pseudocode Describing the Iterative Training Procedure

0 - Set the CURRENT matrix to the identity matrix
I - Optimal matching of the sequences with the CURRENT matrix
II - Estimation of the NEW matrix on the alignments produced in (I):
      Measure the percent identity on the alignments
      Select alignments yielding more than 60% identity
      Count the matches/mismatches on the selected alignments
      Weight the counts of each alignment with its percent identity
      Compute the NEW matrix
III - Comparison of the NEW matrix and the CURRENT matrix:
      If CURRENT == NEW, terminate
      Else set CURRENT to NEW and proceed to I

In this context, the matrix can be viewed as a model used for generating optimal matches of the sequences. In other words, a correct matrix must be able to generate alignments similar to those it was estimated from. This equivalence is sought in the iteration procedure, in which matrices and alignments are alternately generated until they both become invariant, suggesting an equivalent information content.
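The loop of Figure 2 can be condensed into a small runnable toy. This sketch makes several simplifying assumptions relative to the text: a single uniform INDEL (no Thompson formula and no terminal/internal distinction), smoothing applied only to the match frequency, and convergence tested by exact matrix equality. All names are ours, not the SALTT package's.

```python
import math
from collections import Counter
from itertools import combinations

def align(s, t, cost, indel=-1.0):
    """Needleman-Wunsch fill plus traceback; returns two gapped strings."""
    n, m = len(s), len(t)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * indel
    for j in range(1, m + 1):
        M[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + cost(s[i - 1], t[j - 1]),
                          M[i - 1][j] + indel, M[i][j - 1] + indel)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i and j and M[i][j] == M[i - 1][j - 1] + cost(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i and M[i][j] == M[i - 1][j] + indel:
            a.append(s[i - 1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); j -= 1
    return ''.join(reversed(a)), ''.join(reversed(b))

def train(seqs, threshold=0.6, max_cycles=10):
    """Toy version of the Figure 2 loop: start from an identity matrix,
    align all pairs, keep and weight alignments by percent identity,
    re-estimate log-odds costs, and stop when the matrix is invariant."""
    matrix = {}
    cost = lambda a, b: matrix.get(tuple(sorted((a, b))),
                                   0.0 if a == b else -2.0)
    for cycle in range(max_cycles):
        pair_w, sym_w = Counter(), Counter()
        tot_pair_w = tot_sym_w = 0.0
        for s, t in combinations(seqs, 2):
            a, b = align(s, t, cost)
            w = sum(x == y for x, y in zip(a, b)) / len(a)  # percent identity
            if w < threshold:
                continue                       # discard unreliable alignments
            for x, y in zip(a, b):
                if x != '-' and y != '-':      # count matched pairs, weighted
                    pair_w[tuple(sorted((x, y)))] += w
                    sym_w[x] += w; sym_w[y] += w
                    tot_pair_w += w; tot_sym_w += 2 * w
        new = {p: math.log((n / tot_pair_w + 0.001) /
                           ((sym_w[p[0]] / tot_sym_w) * (sym_w[p[-1]] / tot_sym_w)))
               for p, n in pair_w.items()}
        if new == matrix:                      # convergence: matrix invariant
            return matrix, cycle
        matrix = new
    return matrix, max_cycles
```

On a trivial data set of identical sequences the procedure converges after a single re-estimation, with a positive self-match cost and no entry for pairs that were never matched.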
Overall, this amounts to generating matrices whose purpose is to optimally summarize the information contained in the sequences. In this context, the alignments and the matrix can be viewed as two alternative models of the relationships among the sequences. The convergence is meant to ensure that these two models are equivalent.

Empirical Cost Matrix Estimation of Social Science Data

Given a set of sequences of social statuses and a preestimated matrix (in this case, an identity matrix), pairwise alignments are generated with the OMA algorithm. This can be done either by exhaustively considering all possible pairs of sequences or by restricting the training procedure (cf. Figure 2) to a random subset of the sequences if computation time is an issue. When computing matrix statistics from these alignments, the main caveat is the uneven alignment quality. While mismatches measured on almost identical strings can be expected to be meaningful substitutions, matches and mismatches measured on poorly matched strings may be suspicious. Dealing with low-quality alignments is a delicate issue in the social sciences as well as in biology. The simplest approach to deal with this limitation is to ignore alignments with a low percentage of identity, as done in PSI-BLAST (discussed previously). For instance, in the context of this article, we excluded all the alignments yielding less than 60 percent identity (equation [2]). Such a conservative threshold ensures the quality of the considered alignments and therefore the relevance of the observed substitutions. Furthermore, based on strategies developed in biology, we also applied an extra weight to the selected alignments in order to ensure that the best alignments contribute more to the final matrix.
This extra step is similar to the selection made for empirically estimating the costs of substitution (equation [2]), but it specifically helps smooth the convergence of the iterative process and also guarantees a stronger contribution of the most reliable alignments. We again used the percentage of identity (as measured on the alignment) as a weighting scheme. This parameter is often regarded as a good indicator of correctness and was successfully used by Notredame, Holm, and Higgins (1998) to design local scoring schemes. We therefore used equation (6) to estimate the weighted relative frequency f_ab of symbol a matching symbol b in the alignments:

f_ab = [ Σ_{i=1}^{S} W_i × N_i(a, b) ] / [ Σ_{i=1}^{S} W_i × L_i ]   (6)

In equation (6), W_i is the weight associated with the alignment i, L_i is the length of that alignment, and N_i(a, b) is the number of pairs ab matched in that alignment. The term N_i(a, b) indifferently represents identical matches if a and b are the same symbol or mismatches if a and b are different symbols. The weight W_i is meant to increase the contribution of trustworthy alignments, thus speeding the convergence process and decreasing the amount of noise contributed by spurious alignments. In the case of social science data, in which sequence patterns are shaped following less stringent rules than biological sequences and therefore show more diversity, this approach allows us at the same time to take a greater variability of sequences into account and to limit the influence of outliers. To prevent null frequencies (which would make the log odds undefined) caused by a rare mismatch or match, a small value (0.001) is added to every frequency.
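Equation (6) can be sketched directly. The representation of an alignment as a (matched pairs, weight, length) triple is an assumption made for the example; the function name is ours.

```python
def weighted_freq(alignments, a, b):
    """Equation (6): relative frequency of symbol a matched with symbol b,
    weighted by each alignment's percent identity W_i.  `alignments` is a
    list of (pairs, weight, length) triples, a toy representation."""
    target = tuple(sorted((a, b)))
    num = sum(w * sum(1 for p in pairs if tuple(sorted(p)) == target)
              for pairs, w, length in alignments)
    den = sum(w * length for pairs, w, length in alignments)
    return num / den
```

For two alignments of weights 0.8 and 0.6 and lengths 5 and 4, containing one and two a/b matches respectively, equation (6) gives (0.8 × 1 + 0.6 × 2) / (0.8 × 5 + 0.6 × 4) = 0.3125.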
Given frequencies tabulated for every possible pair of symbols, the substitution matrix is then computed using equation (3). This matrix is the new matrix that will be used in the next training round (cf. Figure 2). The iteration procedure is meant to optimize the cost matrix so that it summarizes as accurately as possible the information contained in the alignments from which it is estimated. That procedure is complete when a matrix is able to generate alignments with statistical properties similar to those it originates from. That convergence can easily be measured by estimating the difference Δ between two successive matrices in the evaluation procedure, using, for instance, the root mean square difference between them:

Δ = sqrt( Avg[ (M1(a, b) − M2(a, b))² ] )   (7)

The iterative procedure is stopped when Δ becomes equal to 0. However, this procedure is merely an attempt to reach optimality, with no proven guarantee (Hughey and Krogh 1996). In this context, the simplest criterion to ensure optimality is to check that alternative trainings converge on similar matrices, as indicated by low Δ values, as in equation (7). To validate this, we randomly selected 10 sets of 100 sequences in the test data set and trained the corresponding matrices, keeping the intermediate matrices obtained at every cycle. Figure 3 shows the average Δ measured between all these matrices against the iteration number. Given a data set of 100 sequences 40 symbols long, Figure 3 shows the typical profile of several training procedures. The Δ is an estimation of the difference between the matrices of two successive rounds (low Δs indicate highly similar matrices). While Δ values tend to decrease over cycles, increasing values (peaks) usually result from the training procedure escaping a local minimum.
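The convergence measure of equation (7) amounts to a root-mean-square difference over matrix entries; a minimal sketch, assuming matrices stored as dictionaries and treating missing entries as 0:

```python
import math

def matrix_delta(m1, m2):
    """Equation (7): root-mean-square difference between two successive
    cost matrices; a value of 0 signals that training has converged."""
    keys = set(m1) | set(m2)
    return math.sqrt(sum((m1.get(k, 0.0) - m2.get(k, 0.0)) ** 2
                         for k in keys) / len(keys))
```

Identical matrices yield 0; a single entry differing by 2 yields a delta of 2.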
Each curve in Figure 3 corresponds to one matrix estimation run. For each run, a set of 1,000 sequence pairs was chosen randomly (out of the 100*100 possible pairs) and kept through all the iterations. The results suggest that the estimation procedure is insensitive to this initial choice, with convergence systematically occurring in 5 to 6 cycles and final matrices highly correlated.5 Altogether, this high correlation and the constant number of cycles suggest an efficient and robust training procedure.

Figure 3. Value of Δ Against the Number of Iterations of the Training Procedure of Substitution Costs for 10 Randomly Selected Sets of Sequences From the Swiss Household Panel Data

All the procedures described here have been encapsulated in a sequence analysis package called SALTT (Search Algorithm for Life Trajectories and Transitions). It can be compiled and installed on any UNIX-like platform, including Linux, Cygwin, and Mac OS X. The package and its documentation are distributed under the General Public License and available free of charge from the authors (Notredame et al. 2005).

Criteria for Comparing Outcomes From Various Cost-Setting Strategies

To test whether the proposed solution provides more adequate results than previous methods, one may consider some criteria specific to the training procedure as well as a set of criteria widely used to estimate how well sequence analysis and cluster analysis perform. The first and simplest criterion to establish the validity of the proposed strategy is to apply it to biological sequences and train matrices that can be compared to standard biological matrices. We have done so on a
well-known collection of 500 related human sequences known as the kinome (Manning et al. 2002). The procedure delivered a substitution matrix highly correlated with a standard point accepted mutation (PAM) matrix, in which all the known mutational preferences between amino acids could easily be recognized. We then do the same by comparing three distinct sets of social science data representing the same sequential reality. Then, the training procedure is evaluated by testing its ability to correctly identify the closeness of two different symbols, using solely the information contained in the data. To do so, we use a set of sequences to compare the reference cost produced by the training procedure for the substitution of two given symbols (e.g., AB) with the cost produced by the training procedure for the same substitution when one of the symbols (e.g., A) has been randomly split into two new symbols (M and N) not belonging to the alphabet. As the symbols M and N are actually "hidden A"s, we expect the training procedure to determine the substitution costs AB, MB, and NB as equivalent. Figure 4 shows for two given sequences how a symbol is randomly split into two new symbols not belonging to the original alphabet.

Figure 4. Random Splitting of Symbol A into Two New Symbols (M and N) Not Belonging to the Original Alphabet

Seq1:  AABDEEBADBEDA
Seq2:  BDAEAEAADBADA
        A -> M or N
Seq1b: MNBDEEBNDBEDM
Seq2b: BDNEMEMNDBNDM

Testing the Quality of the Clustering

A third set of criteria pertains to quality testing of cluster analysis. One of the main difficulties with clustering methods lies in the determination of the number of clusters really present in the data (Milligan and Cooper 1985, 1987). There is no perfect method to establish this number, but several indicators may be used to help decide (Everitt 1979; Bock 1985; Hartigan 1985; Milligan and Cooper 1985; SAS Institute 2004).
For Milligan and Cooper (1987), there are two categories of tests concerning the quality of cluster analysis. The first considers that internal criteria are able to validate the results of the clustering, that is, to justify the number of clusters chosen. The second one uses external criteria. Such criteria represent information that is external to the cluster analysis and was not used at any other point in the cluster analysis (Milligan and Cooper 1987). In terms of internal criteria, Milligan and Cooper (1985) have evaluated and compared 30 statistics known as stopping rules that help in deciding how many "real" clusters are present in the data. The availability of such indices in the main statistical software packages (such as SAS or SPSS) is of course a nonnegligible element of choice concerning which criteria to use. Two of the most efficient indices among the 30 that Milligan and Cooper (1985, 1987) have evaluated are part of the SAS software. The first one is a pseudo-F developed by Calinski and Harabasz (1974); it represents an approximation of the ratio between intercluster and intracluster variance. The second index is expressed as Je(2)/Je(1) (Duda and Hart 1973) and may be transformed into a pseudo-t2.6 The third criterion we used is R2, which expresses the size of the experimental effect. It is reasonable to look for a consensus among the three criteria (Nargundkar and Olzer 1998; SAS Institute 2004).
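The Calinski-Harabasz pseudo-F mentioned above is the ratio of between-cluster to within-cluster variance, each scaled by its degrees of freedom. A minimal sketch for one-dimensional observations; the function name and sample values are illustrative, not from the article:

```python
def pseudo_f(clusters):
    """Calinski-Harabasz pseudo-F for a partition of 1-D observations.
    `clusters` is a list of lists of numeric values."""
    points = [x for c in clusters for x in c]
    n, k = len(points), len(clusters)
    grand = sum(points) / n
    # Between-cluster sum of squares: cluster sizes times squared mean offsets.
    between = sum(len(c) * (sum(c) / len(c) - grand) ** 2 for c in clusters)
    # Within-cluster sum of squares: deviations from each cluster's own mean.
    within = sum(sum((x - sum(c) / len(c)) ** 2 for x in c) for c in clusters)
    return (between / (k - 1)) / (within / (n - k))

# Two well-separated clusters yield a much higher pseudo-F than a poor split.
good = pseudo_f([[1.0, 1.1, 0.9], [9.0, 9.2, 8.8]])
bad = pseudo_f([[1.0, 1.1, 9.2], [9.0, 0.9, 8.8]])
print(good > bad)  # True
```

A local peak of this statistic over a sequence of cluster solutions is the stopping-rule signal described in the text.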
We can then define the stopping rule for a statistically optimal cluster solution as a local peak of the pseudo-F (high ratio between inter- and intracluster variance), associated with a low value of the pseudo-t2 that increases at the next fusion, and a marked drop of the overall R2.7 Generally, a cluster solution is said to be statistically optimal when the number of classes is kept constant across strategies, when the intercluster variance is highest, and when the intracluster variance is lowest. Put another way, clusters should exhibit two properties, external isolation and internal cohesion (Punj and Stewart 1983). Therefore, using comparative scree plots is a straightforward way of dealing with the issue of testing cluster solutions drawn from distances based on various cost schemes, including the computationally derived one. A given cluster solution is retained for analysis only if at least two among those three criteria (pseudo-F, pseudo-t2, and R2) support its validity. External criteria refer to the extent to which clusters drawn from the data correlate with either independent variables or outcomes (Milligan and Cooper 1987). Clusters that do not associate with these variables are of little help in social research, as the ultimate goal of social sciences is explanation rather than description. A third criterion is more intuitive: To what extent are empirical clusters easily comprehended, based on prior knowledge of the phenomenon and the central hypothesis of the research? This criterion can be approached by using experts and computing interreliability estimates. The procedure in that case is as follows: Provide cluster solutions based on the various cost schemes, and have a set of raters decide independently which is their favorite solution.
Then one may compute interrater reliability and see which coding scheme comes up first in the list. Given the importance of the debate concerning the influence of sociostructural factors on the occupational trajectories of women in the sociological field, and the availability of high-quality data on occupational status during entire life courses, we test these methods on data sets addressing this topic.

Description of the Test Samples

Considering the fact that women's labor market participation is more diverse than that of men (Myrdal and Klein 1956; Levy 1977; Mott 1978; Elder 1985; Moen 1985; Höpflinger, Charles, and Debrunner 1991; Moen and Yu 2000; Blossfeld and Drobnic 2001; Krüger and Levy 2001; Levy, Widmer, and Kellerhals 2002; Moen 2003; Widmer, Kellerhals, and Levy 2003; Bird and Krüger 2005; Levy, Gauthier, and Widmer 2006), and in order to facilitate comparisons between the data sets, for each database we selected only women who were married or living with a partner at the time of the interview. Moreover, in order to maximize the quality of the data, we retained only the trajectories that had less than 10 percent of missing values.

Sample Test 1: Social Stratification, Cohesion, and Conflict in Contemporary Families

The first sample of occupational trajectories is drawn from a retrospective questionnaire of the study Social Stratification, Cohesion, and Conflict in Contemporary Families (SCF), which was conducted in 1998 with 1,400 individuals living as couples in Switzerland (Widmer, Kellerhals, and Levy 2003; Widmer, Kellerhals, and Levy 2004). Respondents were asked to provide information about every year of their occupational trajectory from age 16 onward to age 64.
Every year of the trajectories was coded using a seven-category code scheme: full-time employment, part-time employment, positive interruption (sabbatical, trip abroad, etc.), negative interruption (unemployment, illness, etc.), housework, retirement, and full-time education. Data were right truncated, as most individuals had not yet reached the age of 64 at the time of the interview. Sociostructural indicators (socioeconomic status of the orientation family, educational level, number of children, and income) were measured for the time of the interviews only. The final sample size was 564 women.

Sample Test 2: The Swiss Household Panel

Since 1999, the Swiss Household Panel (SHP) has collected data on a representative sample of private households in Switzerland on a yearly basis.8 In its third wave, the SHP included a retrospective questionnaire sent to 4,217 households (representing 8,913 individuals). For reasons of validity, the analysis of the subsample of individuals who answered the retrospective questionnaire was restricted to those aged 30 and older, decreasing the sampled female population to 1,935. The SHP asked respondents to provide information on their educational and occupational status from their birth to the present. Each change in status is associated with a starting year and an ending year. We recoded these the same way as for Sample Test 1. Sociostructural indicators comparable to those in Sample 1 were also obtained. This sample included 1,107 women.

Sample Test 3: Female Job Histories From the Wisconsin Longitudinal Study

The Wisconsin Longitudinal Study (WLS) is a long-term study of a random sample of 10,317 men and women who graduated from Wisconsin high schools in 1957. This data set is for public use and available at the University of Wisconsin–Madison Web site (http://www.ssc.wisc.edu/wlsresearch).
The female job histories of 1957–1992 were constructed by Sheridan (1997) from the 1957, 1964, 1975, and 1992 WLS data collections. The data also include social background, youthful aspirations, schooling, military service, family formation, labor market experiences, and social participation. The "female job histories" data concern 5,042 women born in 1938 and 1939. We could retain only three main occupational statuses, namely, full-time paid work, part-time paid work, and full-time housewife. There were 2,243 women in this sample.

Results

Production of Data-Driven Costs of Substitution

From a sociological point of view, we could expect a relative stability of the costs of substitution from one set of sequences to another, the occupational trajectories of contemporary Swiss and North American women being to a certain extent comparable, at least with regard to the influence of the birth of children on the reduction or cessation of paid work. The individual sequences of occupational statuses are built by attributing a single symbol (a code corresponding to a given occupational status) to each year of life of the respondents.9 Table 1 compares the different costs of substitution, either set arbitrarily to identity, set following theoretical arguments concerning differences among types and rates of occupational activities (for details, see Widmer, Levy, et al. 2003), or obtained by means of a training procedure on the different databases. Table 1 shows that the training procedure produces costs that are more differentiated than identity costs. The range of costs is also broader, partly because the procedure is sensitive to very rare substitutions. The stability of the trained costs of substitution from one database to another confirms the ability of the training to produce meaningful cost schemes.
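The year-by-year coding described above (one symbol per year of life, derived from status spells that have a starting and an ending year, as in the SHP data) can be sketched as follows. The function name, the one-letter status codes, and the years are hypothetical illustrations, not taken from the article:

```python
def spells_to_sequence(spells, start_year, end_year, missing="X"):
    """Expand status spells (status, first_year, last_year) into a
    year-by-year sequence of one-letter status codes.
    Years not covered by any spell are coded as missing."""
    years = {}
    for status, first, last in spells:
        for y in range(first, last + 1):
            years[y] = status
    return "".join(years.get(y, missing) for y in range(start_year, end_year + 1))

# F = full-time, H = housework, P = part-time (hypothetical one-letter codes)
spells = [("F", 1980, 1984), ("H", 1985, 1989), ("P", 1990, 1992)]
print(spells_to_sequence(spells, 1980, 1994))  # FFFFFHHHHHPPPXX
```

Right truncation, as in the SCF data, simply shows up here as a run of missing codes at the end of the sequence.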
The training procedure reflects some relations between the different statuses that are sociologically relevant. Compared to identity costs, which cannot be differentiated between men and women, the trained costs reveal, for example, the relative ease (the low costs) with which women in the samples go from paid work to housework. The comparison of knowledge-based costs and trained costs of substitution shows a high similarity between the two sets of values: knowledge-based costs are correlated with trained costs at .68 (p < .01) for SCF data, at .63 (p < .01) for SHP data, and at .73 (p < .05) for WLS data. Table 2 shows Pearson's coefficient of correlation between the costs by method of cost setting and database. Table 2 shows that the trained costs of substitution are more strongly associated with each other from one data set to another than they are with costs set either to identity or to knowledge-based values. On the other hand, even if they remain relatively high, the associations between trained costs on one side and identity or knowledge-based costs on the other are systematically weaker than those between trained costs. This confirms the stability of the results stemming from the training procedure and explains, at least partly, the slightly but systematically different (and more highly correlated) results it provides compared to the two other strategies (identity and knowledge based).

Validation of the Training Procedure

An important issue in the use of a computerized data-based determination of substitution costs is to assess the extent to which this procedure is able to process information in a sociologically relevant way.
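The correlations between cost matrices reported here are ordinary Pearson correlations over the paired lists of substitution costs. A minimal sketch; the two five-value cost vectors are illustrative, not taken from Table 1:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of costs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (made-up) costs for the same five substitutions under two schemes
knowledge = [0.8, 1.0, 0.8, 1.0, 0.3]
trained = [0.6, 0.7, 0.7, 0.9, 0.5]
print(round(pearson(knowledge, trained), 2))  # 0.81
```

Flattening each cost matrix into such a vector (one entry per pair of statuses) and correlating the vectors is all that Table 2 requires.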
Three different tests were used.

Table 1. Comparisons of Identity, Knowledge-Based, and Trained Costs of Substitution for Three Data Sets: SCF, SHP, and WLS

For each possible substitution of occupational status (Full-Time * Part-Time, Full-Time * Negative Interruption, Full-Time * Positive Interruption, Full-Time * At Home, Full-Time * Retirement, Full-Time * Education, Full-Time * Missing, Part-Time * Negative Interruption, Part-Time * Positive Interruption, Part-Time * At Home, Part-Time * Retirement, Part-Time * Education, Part-Time * Missing, Negative Interruption * Positive Interruption, Negative Interruption * At Home, Negative Interruption * Retirement, Negative Interruption * Education, Negative Interruption * Missing, Positive Interruption * At Home, Positive Interruption * Retirement, Positive Interruption * Education, Positive Interruption * Missing, At Home * Retirement, At Home * Education, At Home * Missing, Retirement * Education, Retirement * Missing, Education * Missing) and for insertion or deletion, the table gives five costs: the identity cost, the knowledge-based cost, and the trained costs for the SCF, SHP, and WLS data sets. Identity costs are 1.0 for every substitution except those involving the Missing status (0.3), and an insertion or deletion costs 0.5 under every cost-setting scheme.

Note: SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.
The first one referred to the ability of the procedure to evaluate the closeness of a symbol belonging to the alphabet with an unknown symbol not belonging to it. The second one focused on the degree of agreement between classifications of social trajectories made by specialists in the field compared with classifications of the same data based on identity, knowledge-based, and trained costs of substitution. The third one consisted of measuring the extent to which clusters drawn from the data correlate with some independent sociostructural variables or outcomes.

Table 2. Pearson's Correlation Between Cost Matrices, by Method of Cost Setting and (Full) Data Sets

                   Identity   Knowledge   SCF Trained   SHP Trained   WLS Trained
Identity             1.00       .98***      .66***        .61***        .71*
Knowledge based      .98***    1.00         .68***        .63***        .73*
SCF trained          .66***     .68***     1.00           .96***        .97***
SHP trained          .61***     .63***      .96***       1.00           .94***
WLS trained          .71*       .73*        .97***        .94***       1.00

Note: UNIX command line to produce the trained matrix: saltt -e '-in dataset.dat -action +pavie_seq2pavie_mat _TGEPF50_THR60_TWE04_SAMPLE50000_'. SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study. *p < .05. ***p < .001.

Identifying the Proximity of Unknown Symbols

A first way of validating the training procedure consists of measuring the extent to which it is able to unravel the proximity of two given symbols, based on no other information than the data itself. We tested this for the SCF set of sequences by randomly replacing a given symbol of the sequences' alphabet A = {A, B, C, D, E, F, G, X}, which corresponds in this case to an occupational status, with two symbols that did not belong to the original alphabet of that set of sequences, that is, symbols whose actual identity was hidden.
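The random splitting of a symbol into two hidden substitutes (Figure 4) can be sketched as follows. The sequences are those of Figure 4; the function name and the fixed seed are our own illustrative choices:

```python
import random

def split_symbol(sequences, target, substitutes=("M", "N"), seed=0):
    """Randomly replace every occurrence of `target` with one of two
    new symbols not belonging to the original alphabet."""
    rng = random.Random(seed)
    return ["".join(rng.choice(substitutes) if s == target else s
                    for s in seq) for seq in sequences]

seqs = ["AABDEEBADBEDA", "BDAEAEAADBADA"]
hidden = split_symbol(seqs, "A")
print(hidden)  # every A is now an M or an N; all other symbols are untouched
```

Retraining the cost matrix on the modified sequences then lets one check whether the procedure assigns near-identical costs to AB, MB, and NB.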
Using the training procedure, we then compared the original costs for substituting, for example, Symbol A with Symbol B, with the costs we obtained after having randomly replaced every A with either the hidden symbol M or N (cf. Figure 4). In a second run, we did the same by replacing each B with the hidden symbol O or P. We finally got five different expressions of the same initial substitution (in this example, AB = NB = MB = AO = AP), each associated with a specific cost. This procedure was applied to all pairs of symbols of the data set in turn. If we consider Ei and Ej to be respectively the ith and jth elements of the original alphabet, and their two random substitutes to be, respectively, S1(Ei), S2(Ei) and S1(Ej), S2(Ej), there are five costs of substitution to take into account if we consider only the substitutions involving at least one symbol belonging to the original alphabet. Under these conditions, as they are actually different expressions of the same initial substitution, we should expect those five trained costs to be identical, or at least close to each other. To compare all those values in a synthetic way for the entire alphabet, we computed a standardized difference between the trained cost of substitution associated with a given pair of symbols belonging to the original alphabet and the trained costs of substitution between one of those original symbols and the substitute of the other one, as shown in equation (8):

Std Difference = [(cost(Ei[S1(Ej)]) − cost(EiEj)) + (cost([S1(Ei)]Ej) − cost(EiEj))] / (2 × cost(EiEj)).   (8)

The proximity of the five substitution costs associated with a given original pair of symbols and their substitutes was compared in two ways, using either the first substitute of that pair of symbols (as shown in equation [8]) or the second one (where S2 replaces S1 in equation [8]).
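Equation (8) translates directly into code. A minimal sketch; the three cost values passed in are hypothetical, not trained values from the article:

```python
def std_difference(cost_orig, cost_sub_j, cost_sub_i):
    """Standardized difference of equation (8): compares the trained cost
    of an original pair (Ei, Ej) with the two costs obtained when one
    member of the pair is replaced by its hidden substitute."""
    return ((cost_sub_j - cost_orig) + (cost_sub_i - cost_orig)) / (2 * cost_orig)

# Hypothetical trained costs: original pair vs. pairs with hidden substitutes
print(std_difference(0.50, 0.53, 0.51))  # about 0.04, i.e. a 4 percent deviation
```

Values near zero indicate that the procedure treats a hidden substitute just like the original symbol, which is what Table 3 summarizes across the whole alphabet.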
All those values are tabulated in Table 3. Its lower part contains the standardized differences between the substitutions of Ei, Ej, and their first substitute (cost EiEj compared to Ei[S1(Ej)] and [S1(Ei)]Ej), whereas the upper part contains the values associated with their second substitute (cost EiEj compared to Ei[S2(Ej)] and [S2(Ei)]Ej). Table 3 shows clearly that the training procedure identifies very precisely the closeness of two distinct, but actually identical, symbols.10 Among the 56 different costs of substitution in Table 3, 49 (87 percent) show a difference not larger than 10 percent compared with the original cost. The larger differences may be attributed to the fact that the training procedure is relatively sensitive to rare symbols. For example, symbols C, D, F, and X together represent only about 2 percent of the total symbols used in the sequences. The great majority of the hidden costs differing notably from their original costs involve such rare symbols. The difference is maximal when it concerns two rare symbols. The ability of the training procedure to identify the similarity of two unknown symbols based on the data set at hand is one of the main strengths of this way of setting costs of substitution. Even if it stays relatively close to identity costs of substitution, this procedure takes into account the real relations of the different symbols present in the sequences and is therefore highly informative. On one hand, it avoids particular relationships remaining undetermined; on the other hand, it works as a predictive tool in the sense
that two different symbols with low substitution costs can be predicted to substitute easily for one another in real life.

Table 3. Standardized Difference Between the Original Trained Costs of Substitution and Their Substitutes

     A (%)  B (%)  C (%)  D (%)  E (%)  F (%)  G (%)  X (%)  Relative Frequency (%)
A      —      0      7      6      0      6     10      0      33.5
B      8      —      0      6    -10     10      0      0      19.5
C      7      0      —      5     -7      0      0    -15       0.5
D      6      6      0      —     -6      0      6      6       1.0
E      0      0     -7      0      —     13      8      0      31.0
F      0     25      9     11     13      —      0      0       0.1
G     10      7      0      0      8      0      —     -7      14.0
X      0      7    -15    -28      0      0     -7      —       0.4

Note: Rows and columns are given the name of a symbol belonging to the alphabet, although each cell of the table compares the substitution costs of three pairs of symbols (the original one and two substitutes) according to equation (8).

Automatic Versus Classification by Judges

Another way to validate the training procedure is to test the extent to which automatic classification succeeds in replicating a classification of sequences made by experts on a small subset of well-identified sequences. To do so, we extracted a sample of 100 occupational trajectories of women from each data set. Four judges were asked to classify them in a number of clusters that corresponded to previous empirical findings (Widmer, Levy, et al. 2003; Levy, Gauthier, and Widmer 2006) and to theoretical schemes (i.e., Kohli 1986). In each case, we retained only the sequences that were classified the same way by at least three (out of four) judges. The interrater agreement lies between 83 percent and 88 percent. To keep the computation procedures as parsimonious as possible, we first exactly replicated with SALTT the results we obtained with TDA using two different cost settings (identity and knowledge based). That allowed us to produce optimal alignments and to compare the distance matrices for the three strategies (identity, knowledge based, and training) from within SALTT.
For each set of sequences, we ran three optimal matching analyses: the first one using identity costs of substitution (for details, see above), the second one using knowledge-based costs, and the third one using costs stemming from the training procedure. A distance matrix was computed for each set of sequences and for each strategy and then entered into a cluster analysis. Table 4 shows the degree of association, khi2 and λ (Goodman and Kruskal 1979; Olszak and Ritschard 1995), between the clusters made by the judges and those stemming from automatic classification.

Table 4. Association (khi2 and λ Symmetric) Between Judges and Automatic Classification, by Method of Cost Setting

SCF
  Identity * Judges:        khi2 = 213.2454* (df = 16), λ symmetric = 0.8034 (ASE = 0.0458)
  Knowledge Based * Judges: khi2 = 206.1951* (df = 16), λ symmetric = 0.8120 (ASE = 0.0440)
  Trained * Judges:         khi2 = 224.5436* (df = 16), λ symmetric = 0.8291 (ASE = 0.0434)
SHP
  Identity * Judges:        khi2 = 213.4108* (df = 16), λ symmetric = 0.7500 (ASE = 0.0582)
  Knowledge Based * Judges: khi2 = 228.4631* (df = 16), λ symmetric = 0.7705 (ASE = 0.0623)
  Trained * Judges:         khi2 = 235.1387* (df = 16), λ symmetric = 0.7797 (ASE = 0.0602)
WLS
  Identity * Judges:        khi2 = 143.9678* (df = 9), λ symmetric = 0.7037 (ASE = 0.0684)
  Knowledge Based * Judges: khi2 = 148.6864* (df = 9), λ symmetric = 0.7196 (ASE = 0.0677)
  Trained * Judges:         khi2 = 143.2652* (df = 9), λ symmetric = 0.7037 (ASE = 0.0677)

Note: ASE = asymptotic standard error; SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study. *p < .001.

Table 4 shows that the results provided by a trained matrix lead to significant associations with the classification by judges for the three data sets considered. For the Wisconsin study, results are about the same when using either identity or trained costs of substitution.
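Assuming the symmetric association measure reported in Table 4 is the symmetric Goodman-Kruskal lambda (the article cites Goodman and Kruskal 1979), it can be computed from the contingency table crossing the judges' clusters with the automatic ones. The table below is illustrative, not the article's data:

```python
def lambda_symmetric(table):
    """Symmetric Goodman-Kruskal lambda for a contingency table
    (list of rows of counts) crossing two classifications."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    sum_row_max = sum(max(row) for row in table)        # predicting columns from rows
    sum_col_max = sum(max(col) for col in zip(*table))  # predicting rows from columns
    num = sum_row_max + sum_col_max - max(row_tot) - max(col_tot)
    den = 2 * n - max(row_tot) - max(col_tot)
    return num / den

# Near-perfect agreement between judges (rows) and automatic clusters (columns)
table = [[40, 2, 1], [3, 30, 1], [0, 2, 21]]
print(round(lambda_symmetric(table), 2))  # 0.84
```

Lambda measures the proportional reduction in prediction error; a value of 1.0 means each classification perfectly predicts the other.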
Trained costs never lead to a weaker association (λ symmetric) with the judges' classifications than identity costs or knowledge-based costs for the SCF and SHP data sets. Results are less straightforward concerning the WLS data, with knowledge-based costs performing slightly better than trained costs. The fact that the Wisconsin data are less differentiated (sequences with only three different statuses, as opposed to seven in the other databases, and respondents all about the same age) may explain why trained costs do not lead to a markedly different solution than the two alternative strategies. In all cases, the associations are quite high and significant, suggesting the ability of the method to provide meaningful cost schemes. Given the fact that the reference classification based on the judges' responses was very consensual and based on predefined categories, the results of that test express the ability of the procedure to differentiate clear-cut, sociologically relevant categories out of the data, rather than to evaluate the extent to which those results and the underlying costs of substitution reflect the theoretical and subjective conceptual frame of an expert.

Association With External Criteria

A third validation procedure consisted of measuring the extent to which clusters drawn from the data correlate with either independent sociostructural variables or outcomes (Milligan and Cooper 1987; Rapkin and Luke 1993), an approach that seemingly few studies have used so far (Milligan and Cooper 1987). Clusters that do not associate with these variables are of little help, as the ultimate goal of social sciences is explanation rather than description.
For each strategy, the three stopping-rule criteria aimed at determining the number of clusters in the data (pseudo-t2, pseudo-F, and R2) suggested the presence of three clusters in the SCF and SHP data and of four clusters in the WLS data. A closer look at the data reveals that those clusters correspond precisely to typical female trajectories, as described elsewhere (Moen 1985; Höpflinger et al. 1991; Erzberger and Prein 1997; Widmer, Levy, et al. 2003; Levy et al. 2006), namely, trajectories characterized by full-time employment, part-time employment, and full-time housework. In the Wisconsin data, the clusters are the same, but with a fourth one representing a return to the labor market after a period at home. Such a cluster also appears when the clusters of the SCF and SHP data are further subdivided. The greater homogeneity of the WLS data in terms of age of respondents and completeness of the sequences (no right truncations) may explain the better visibility (consensus between stopping-rule criteria) of that fourth cluster, which is also documented in the literature (Widmer, Levy, et al. 2003; Levy et al. 2006).
We first ran a multinomial logistic regression11 on each data set (SCF, SHP, and WLS), using cluster membership (which represents in this case types of occupational trajectories) as the response variable and, as predictors, a set of indicators of social positioning (socioeconomic position of the orientation family, level of education, number of children, age, and household income) generally considered (cf. the description of the samples) as intervening variables in shaping female occupational trajectories. To be consistent with the stopping-rule criteria, that is, a consensus between pseudo-t2, pseudo-F, and R2, we retained in this first step the three-cluster solutions that those criteria pointed out for each data set. As they are more homogeneous, they represent about the same social reality in each data set and therefore remain sociologically relevant. We then performed the tests on the five-cluster solutions for the SCF and SHP data to check the efficiency of the different cost-setting methods on other empirically founded classifications (Widmer, Levy, et al. 2003; Levy et al. 2006).

Table 5. khi2 Values of the Likelihood Ratio Test by Database and Cost-Setting Method

                                    Identity             Knowledge Based      Trained
Data Sets                    df     khi2     p > khi2    khi2     p > khi2    khi2     p > khi2
Set 1: SCF data, 3 clusters  272    290.19   .2143       280.87   .3428       288.60   .2339
Set 1: SCF data, 5 clusters  596    553.02   .8956       547.03   .9250       522.11   .9867
Set 2: SHP data, 3 clusters  404    568.36   < .0001     562.04   < .0001     512.34   < .0002
Set 2: SHP data, 5 clusters  808    897.67   .0150       863.32   .0865       740.12   .9574
Set 3: WLS data, 4 clusters  258    307.35   .0189       323.81   .0034       288.37   .0939

Note: SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.
We felt justified in doing this because two new clusters emerged from further subdivision of the first three clusters defined by the proposed criteria (R2, pseudo-F, and pseudo-t2). Table 5 shows the likelihood ratio test applied to those multinomial regressions. The likelihood tests compare a given model with the saturated one (a model that exactly replicates the data), meaning in this case that the smaller the value of khi2 (i.e., the larger the p value), the better the model fits the data.12 One can read from Table 5 that the trained costs of substitution allow building a model that fits the data better in all cases compared to identity costs, and in four out of five cases compared to knowledge-based costs. Put another way, clusters produced by trained costs of substitution are more sensitive to predictors than clusters produced by either identity costs or knowledge-based costs. This is true, although not with the same strength, for the three sets of sequences. The fit is significantly better (i.e., the model stemming from trained costs does not differ significantly from the saturated model, whereas the two others do) in two cases and with two data samples.

Discussion

Setting costs of substitution in the process of aligning sequences of social statuses is controversial because it may significantly influence the results of the analysis. We propose a method to determine costs of substitution empirically, which we tested using three distinct sets of social science data. The training procedure that we present appears to be, to our knowledge, the only one that is exclusively empirically grounded and optimized.
First, we considered the correlation between the substitution matrices for a given alphabet across three data sets of the social sciences realm representing the same social reality (sequences of occupational statuses along the life course) and three cost-setting strategies. The training procedure leads to results that are very similar to those stemming from the two other methods (substitution costs represented as an identity matrix or following some knowledge-based weighting). In this sense, cost variability did not appear to modify the general results of the analysis. Nevertheless, the costs stemming from the training procedure may claim a greater legitimacy, as they reflect the actual relationships of the symbols considered. That legitimacy is reinforced by the very high correlation existing between the substitution matrices stemming from the training procedure applied to the three data sets at hand. In this sense, the values of the trained cost matrices may even be considered as an a posteriori validation of the use of the alternative costs of substitution (knowledge based or identity) found in the literature for the specific case of occupational trajectories. Moreover, the training procedure shows some interesting features that should be further explored, such as the possibility to differentiate specific substitution costs according, for example, to gender. The ability of the trained costs to provide a clustering that is better associated with a sociologically unequivocal reference classification than either the identity costs or the knowledge-based costs illustrates the ability of the training procedure to discover structural features of the data that are sociologically relevant.
Second, based on likelihood ratio tests of multinomial logistic regressions, we compared the associations between cluster solutions (response variables) and a set of relevant sociostructural variables (intervening variables) for the three cost-setting strategies across the three data sets at hand. Here again, the training procedure led to better results than the identity and the knowledge-based costs did. That is, the data-driven costs of substitution contributed to classifications that fit better with widely recognized sociological models of women's labor market participation than the two other strategies. Taking into account the actual structure of the data provides models that fit better with external factors than undifferentiated or knowledge-based cost schemes. Finally, the ability of the training procedure to discover certain actual internal relationships of the data, and therefore to offer an efficient and empirically grounded way to determine costs of substitution, is demonstrated in another way: it is able to accurately identify the closeness of two formally identical, but artificially differentiated, substitution costs (here, between two occupational statuses). Moreover, the degree of closeness between the substitution costs is also informative about the relative proximity of the symbols and the sociological reality they represent. The training procedure offers significant improvements compared to the methods generally used until now in the social sciences. By revealing every symmetric relationship among those symbols, this procedure avoids assigning a cost based on prior knowledge that would later appear to be erroneous when compared to the actual data.
The results show that for any pair of symbols of a given alphabet, the trained substitution costs remain remarkably similar from one data set to another. This means that those costs reflect important information concerning the actual (in this case, social) significance of the symbols constituting the sequences and do not represent merely abstract values varying from data set to data set (or from one training session to another). Therefore, these costs also constitute a predictive feature, in the sense that two different symbols with low substitution costs can be predicted to easily substitute for one another in real life. Identifying these low substitution costs therefore makes it possible to predict situations likely to occur in similar contexts, at similar ages, and at similar frequencies.

Gauthier et al. / How Much Does It Cost?

In comparison with approaches based on transition costs, which are computed within each single sequence taken separately, the proposed method determines substitution costs by looking for a match or mismatch at each specific position throughout all pairs of sequences. In this sense, the latter method is based on richer information and gives greater importance to time (i.e., to age and social age) and to the relations between sequences than cost schemes based on transition rates. There is, on one hand, a constant and clear similarity between the results stemming from the three cost-setting strategies (identity, knowledge based, and training) and, on the other hand, a significant improvement in the tests of internal and external validity of the results provided by the training procedure. The conditions under which the method is most appropriate remain to be systematically tested. The experiments presented in this article point in several directions.
First, the method provides strong leverage when few or no theoretical arguments can be advanced in support of a particular cost scheme, or when conflicting theories propose different cost schemes. In other words, it is best suited for an exploratory research design. Second, this method is ideal whenever too many statuses have been used to characterize the data. We show, for instance, in this study that the proposed procedure reveals the identity between two statuses that may have been coded separately. Finally, the cost estimation provides a means for quantifying the relationships among symbols; as such, it can be used to identify and discover equivalences among categories. In itself, this means of quantification may prove to be a useful investigative tool for the social sciences. There are several limitations to the solution proposed in this article. First, the method deals poorly with symbols occurring rarely in sequences. Whenever this happens, the estimations of substitution costs are less accurate and more variable. Second, a key property of the optimal matching algorithm is that it relies on the assumption that the events defining a life trajectory are chronologically ordered and collinear among the considered sequences. This is, of course, a simplification, but it seems to hold reasonably well when considering sequences with a high percentage of identity. However, if recurrent subsequences were to be found scattered across different periods of life, they could probably be recovered using techniques related to the one we describe here, such as Gibbs sampling (Lawrence et al. 1993; Abbott and Barman 1997) or the local alignment algorithm (Smith and Waterman 1981). Third, this algorithm, like other optimal matching algorithms, assumes the independence of each position constituting a sequence. This may be an oversimplification, as one can argue that life trajectories are not homogeneous.
They may be substructured into smaller units (life stages, transitions, turning points, specific life events, etc.), whose sizes may vary but should be kept intact in the alignments. This issue is likely to arise when comparing very distinct sequences. When this situation occurs, it may be worthwhile to modify the proposed algorithm. Nevertheless, the issue remains how to automatically identify meaningful boundaries defining those subsequences. In biology, multiple sequence alignments have been used successfully to identify the exact extent of subsequences conserved across related sequences (Notredame, Higgins, and Heringa 2000). It is certainly worthwhile to explore the potential of this method in the social sciences.

Notes

1. This freeware is available from the Ruhr-Universität Bochum Web site at http://steinhaus.stat.ruhr-uni-bochum.de/tda.html.
2. This freeware is available from the University of Chicago Web site at http://home.uchicago.edu/aabbott/om.html.
3. This freeware is available from the Strasbourg Bioinformatics Platform Web site at ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX.
4. "Thus, while substitution must be carefully handled, it is not a supersensitive task whose errors will be compounded by later stages in the analysis" (Abbott and Hrycak 1990:176).
5. Student's t tests performed on the 10 values generated by the training procedures for each substitution cost reveal that those values do not differ from the mean (p < .0001, df = 9).
6. Hotelling's T2 is a statistical measure of the multivariate distance of each observation from the center of the data set. It is an analytical way to find the most extreme points in the data.
7. This is the ratio between interclass variance and total variance.
8. This data set is for public use.
Access to the data is provided by the Swiss Household Panel (SHP) Web site at http://www.swisspanel.ch.
9. Given the availability of the data, the range considered is 16 to 65 years old for Social Stratification, Cohesion, and Conflict in Contemporary Families and SHP data, and 20 to 56 years old for Wisconsin Longitudinal Study data.
10. Spearman correlation coefficient = .734 (p < .01).
11. We used PROC CATMOD of the SAS software.
12. At p ≤ .05, the tested model is not statistically different from the saturated one.

References

Abbott, Andrew. 1984. "Event Sequence and Event Duration: Colligation and Measurement." Historical Methods 17:192-204.
Abbott, Andrew. 1990a. "Conceptions of Time and Events in Social Science Methods: Causal and Narrative Approaches." Historical Methods 23:140-50.
Abbott, Andrew. 1990b. "A Primer on Sequence Methods." Organization Science 1:375-92.
Abbott, Andrew. 1995a. "A Comment on 'Measuring the Agreement Between Sequences.'" Sociological Methods & Research 24:232-43.
Abbott, Andrew. 1995b. "Sequence Analysis: New Methods for Old Ideas." Annual Review of Sociology 21:93-113.
Abbott, Andrew. 2001. Time Matters: On Theory and Method. Chicago: University of Chicago Press.
Abbott, Andrew and Emily Barman. 1997. "Sequence Comparison Via Alignment and Gibbs Sampling: A Formal Analysis of the Emergence of the Modern Sociological Article." Sociological Methodology 27:47-87.
Abbott, Andrew and John Forrest. 1986. "Optimal Matching Methods for Historical Sequences." Journal of Interdisciplinary History XVI:471-94.
Abbott, Andrew and Alexandra Hrycak. 1990. "Measuring Resemblance in Sequence Data: An Optimal Matching Analysis of Musicians' Careers." American Journal of Sociology 96:144-85.
Abbott, Andrew and Angela Tsay. 2000.
"Sequence Analysis and Optimal Matching Methods in Sociology." Sociological Methods & Research 29:3-33.
Aisenbrey, Silke. 2000. Optimal Matching Analyse: Anwendungen in den Sozialwissenschaften (Optimal Matching Analysis: Applications in the Social Sciences). Opladen, Germany: Leske + Budrich.
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jianzhi Zhang, Zhu Zhang, Webb Miller, and David Lipman. 1997. "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs." Nucleic Acids Research 25:3389-3402.
Bateman, Alex, Ewan Birney, Richard Durbin, Sean R. Eddy, Robert D. Finn, and Erik L. Sonnhammer. 1999. "Pfam 3.1: 1313 Multiple Alignments and Profile HMMs Match the Majority of Proteins." Nucleic Acids Research 27:260-62.
Billari, Francesco C. 2001. "Sequence Analysis in Demographic Research and Applications." Canadian Studies in Population 28:439-58.
Bird, Katherine and Helga Krüger. 2005. "The Secret of Transitions: The Interplay of Complexity and Reduction in Life Course Analysis." Pp. 173-94 in Towards an Interdisciplinary Perspective on the Life Course, vol. 10, edited by R. Levy, P. Ghisletta, J.-M. Le Goff, D. Spini, and E. Widmer. Amsterdam: Elsevier JAI.
Blair-Loy, Mary. 1999. "Career Patterns of Executive Women in Finance: An Optimal Matching Analysis." American Journal of Sociology 104:1346-97.
Blossfeld, Hans-Peter and Sonja Drobnic. 2001. Careers of Couples in Contemporary Society: From Male Breadwinner to Dual-Earner Families. New York: Oxford University Press.
Bock, Hans H. 1985. "On Some Significance Tests in Cluster Analysis." Journal of Classification 2:77-108.
Calinski, Tadeusz and Joachim Harabasz. 1974. "A Dendrite Method for Cluster Analysis." Communications in Statistics 3:1-27.
Chan, Tak Wing. 1995. "Optimal Matching Analysis: A Methodological Note on Studying Career Mobility." Work and Occupations 22:467-90.
Dayhoff, Margaret O., Robert M. Schwartz, and Bruce C. Orcutt. 1978.
"A Model of Evolutionary Change in Proteins." Pp. 345-52 in Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, edited by M. O. Dayhoff. Washington, DC: National Biomedical Research Foundation.
Dijkstra, Wil and Toon Taris. 1995. "Measuring the Agreement Between Sequences." Sociological Methods & Research 24:214-31.
Duda, Richard O. and Peter E. Hart. 1973. Pattern Classification and Scene Analysis. New York: John Wiley.
Durbin, Richard, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.
Elder, Glen H. 1985. Life Course Dynamics: Trajectories and Transitions, 1968-1980. Ithaca, NY: Cornell University Press.
Erzberger, Christian and Gerald Prein. 1997. "Optimal-Matching-Technik: Ein Analyseverfahren zur Vergleichbarkeit und Ordnung individuell differenter Lebensverläufe" (Optimal Matching Technique: An Analytical Procedure for Comparing and Ordering Individually Different Life Courses). ZUMA-Nachrichten 40:52-80.
Everitt, Brian S. 1979. "Unresolved Problems in Cluster Analysis." Biometrics 35:169-81.
Forrest, John and Andrew Abbott. 1989. "The Optimal Matching Method for Studying Anthropological Sequence Data: An Introduction and Reliability Analysis." Journal of Quantitative Anthropology 1:151-70.
Giddens, Anthony, Mitchell Duneier, and Richard P. Applebaum. 2003. Introduction to Sociology. New York: W. W. Norton.
Giele, Janet Z. and Glen H. Elder. 1998. Methods of Life Course Research: Qualitative and Quantitative Approaches. Thousand Oaks, CA: Sage.
Giuffre, Katherine A. 1999. "Sandpiles of Opportunity: Success in the Art World." Social Forces 77:815-32.
Goodman, Leo A. and William H. Kruskal. 1979. Measures of Association for Cross Classification. New York: Springer.
Graur, Dan and Wen-Hsiung Li. 2000. Fundamentals of Molecular Evolution. Sunderland, MA: Sinauer.
Halpin, Brendan and Tak Wing Chan. 1998. "Class Careers as Sequences: An Optimal Matching Analysis of Work-Life Histories." European Sociological Review 14:111-30.
Hartigan, John A. 1985. "Statistical Theory in Clustering." Journal of Classification 2:63-76.
Henikoff, Steven and Jorja G. Henikoff. 1992. "Amino Acid Substitution Matrices From Protein Blocks." Proceedings of the National Academy of Sciences 89:10915-19.
Höpflinger, François, Maria Charles, and Annelies Debrunner. 1991. Familienleben und Berufsarbeit (Family Life and Professional Work). Zürich, Switzerland: Seismo.
Hughey, Richard and Anders Krogh. 1996. "Hidden Markov Models for Sequence Analysis: Extension and Analysis of the Basic Method." Computer Applications in the Biosciences 12:95-107.
Kohli, Martin. 1986. "The World We Forgot: A Historical Review of the Life Course." Pp. 271-303 in Later Life: The Social Psychology of Aging, edited by V. W. Marshall. London: Sage.
Krüger, Helga and René Levy. 2001. "Linking Life Courses, Work and the Family: Theorizing a Not So Visible Nexus Between Women and Men." Canadian Journal of Sociology 26:145-66.
Kruskal, Joseph. 1983. "An Overview of Sequence Comparison." Pp. 1-44 in Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, edited by D. Sankoff and J. Kruskal. Toronto, Canada: Addison-Wesley.
Lawrence, Charles E., Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, and John C. Wootton. 1993. "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment." Science 262:208-14.
Levine, Joel H. 2000. "But What Have You Done for Us Lately?" Sociological Methods & Research 29:34-40.
Levitt, Barbara and Clifford Nass. 1989. "The Lid on the Garbage Can: Institutional Constraints on Decision Making in the Technical Core of College-Text Publishers." Administrative Science Quarterly 34:190-207.
Levy, René. 1977. Der Lebenslauf als Statusbiographie: Die weibliche Normalbiographie in makrosoziologischer Perspektive (The Life Course as a Sequence of Statuses: The Female Standard Biography in a Macrosociological Perspective). Stuttgart, Germany: Enke.
Levy, René, Jacques-Antoine Gauthier, and Eric Widmer. 2006. "Entre contraintes institutionnelle et domestique: Les parcours de vie masculins et féminins en Suisse" (Between Institutional and Domestic Constraints: The Life Courses of Women and Men in Switzerland). Revue canadienne de sociologie 31:461-89.
Levy, René, Eric Widmer, and Jean Kellerhals. 2002. "Modern Family or Modernized Family Traditionalism? Master Status and the Gender Order in Switzerland." Electronic Journal of Sociology 6(4).
Manning, Gerard, David B. Whyte, Ricardo Martinez, Tony Hunter, and Sucha Sudarsanam. 2002. "The Protein Kinase Complement of the Human Genome." Science 298:1912-34.
Milligan, Glenn W. and Martha C. Cooper. 1985. "An Examination of Procedures for Determining the Number of Clusters in a Data Set." Psychometrika 50:159-79.
Milligan, Glenn W. and Martha C. Cooper. 1987. "Methodology Review: Clustering Methods." Applied Psychological Measurement 11:329-54.
Moen, Phyllis. 1985. "Continuities and Discontinuities in Women's Labor Force Activity." Pp. 113-55 in Life Course Dynamics: Trajectories and Transitions, 1968-1980, edited by G. H. Elder. Ithaca, NY: Cornell University Press.
Moen, Phyllis. 2003. It's About Time: Couples and Careers. Ithaca, NY: Cornell University Press.
Moen, Phyllis and Yan Yu. 2000. "Effective Work/Life Strategies: Working Couples, Work Conditions, Gender, and Life Quality." Social Problems 47:291-326.
Mott, Frank L. 1978. Women, Work and Family.
Lexington, MA: Lexington Books.
Müller, Tobias and Martin Vingron. 2000. "Modeling Amino Acid Replacement." Journal of Computational Biology 7:761-76.
Myrdal, Alva and Viola Klein. 1956. Women's Two Roles: Home and Work. London: Routledge.
Nargundkar, Satish and Timothy J. Olzer. 1998. "An Application of Cluster Analysis in the Financial Services Industry." Presented at the sixth annual conference of the South East SAS Users Group, September 13-15, Norfolk, VA.
Needleman, Saul B. and Christian D. Wunsch. 1970. "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins." Journal of Molecular Biology 48:443-53.
Ng, Pauline C., Jorja G. Henikoff, and Steven Henikoff. 2000. "PHAT: A Transmembrane-Specific Substitution Matrix. Predicted Hydrophobic and Transmembrane." Bioinformatics 16:760-66.
Notredame, Cédric, Philipp Bucher, Jacques-Antoine Gauthier, and Eric Widmer. 2005. T-Coffee/saltt: User Guide and Reference Manual. Lausanne: Swiss Institute of Bioinformatics. Retrieved from http://www.tcoffee.org/saltt.
Notredame, Cédric, Desmond G. Higgins, and Jaap Heringa. 2000. "T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment." Journal of Molecular Biology 302:205-17.
Notredame, Cédric, Liisa Holm, and Desmond G. Higgins. 1998. "COFFEE: An Objective Function for Multiple Sequence Alignments." Bioinformatics 14:407-22.
Olszak, Michael and Gilbert Ritschard. 1995. "The Behavior of Nominal and Ordinal Partial Association Measures." Statistician 44:195-212.
Pentland, Brian T., Malu Roldan, Ahmed A. Shabana, Louise L. Soe, and Sidne G. Ward. 1998. "Lexical and Sequential Variety in Organizational Processes." School of Labor and Industrial Relations, Michigan State University, East Lansing. Unpublished manuscript.
Punj, Girish and David W.
Stewart. 1983. "Cluster Analysis in Marketing Research: Review and Suggestions for Application." Journal of Marketing Research 20:134-48.
Rapkin, Bruce D. and Douglas A. Luke. 1993. "Cluster Analysis in Community Research: Epistemology and Practice." American Journal of Community Psychology 21:247-77.
Rohwer, Götz and Ulrich Pötter. 2002. TDA User's Manual. Bochum, Germany: Ruhr-Universität Bochum. Retrieved from http://www.stat.ruhr-uni-bochum.de/pub/tda/doc/tman63/tman-pdf.zip.
Rohwer, Götz and Heike Trappe. 1997. "Describing Life Courses: An Illustration Based on NLSY Data." Pp. 30 in POLIS Project Conference. Florence, Italy: European University Institute.
SAS Institute, Inc. 2004. SAS/STAT User's Guide. Cary, NC: Author.
Schaeper, Hildegard. 1999. "Erwerbsverläufe von Ausbildungsabsolventinnen und -absolventen: Eine Anwendung der Optimal-Matching-Technik" (Employment Histories After Completion of Vocational Education and Training: An Application of the Optimal Matching Technique). Sonderforschungsbereich 186, Universität Bremen, Germany.
Scherer, Stefani. 2001. "Early Career Patterns: A Comparison of Great Britain and West Germany." European Sociological Review 17:119-44.
Sheridan, Jennifer T. 1997. "The Effects of the Determinants of Women's Movement Into and Out of Male-Dominated Occupations on Occupational Sex Segregation." CDE Working Paper 97-07, Department of Sociology, University of Wisconsin, Madison.
Smith, Temple F. and Michael S. Waterman. 1981. "Identification of Common Molecular Subsequences." Journal of Molecular Biology 147:195-97.
Stovel, Katherine and Marc Bolan. 2004. "Residential Trajectories: Using Optimal Alignment to Reveal the Structure of Residential Mobility." Sociological Methods & Research 32:559-98.
Stovel, Katherine, Michael Savage, and Peter Bearman. 1996. "Ascription Into Achievement: Models of Career Systems at Lloyds Bank, 1890-1970." American Journal of Sociology 102:358-99.
Thompson, Julie D., Desmond G. Higgins, and Toby J. Gibson. 1994. "CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice." Nucleic Acids Research 22:4673-80.
Turner, Jonathan H. 2001. "Sociological Theory Today." Pp. 1-17 in Handbook of Sociological Theory, edited by J. H. Turner. New York: Kluwer Academic.
Widmer, Eric, Jean Kellerhals, and René Levy. 2003. Couples contemporains: Cohésion, régulation et conflits (Contemporary Couples: Cohesion, Regulation, Conflicts). Zürich: Seismo.
Widmer, Eric, Jean Kellerhals, and René Levy. 2004. "Quelle pluralisation des relations familiales?" (What Pluralization of Family Relations?). Revue française de sociologie 45:37-67.
Widmer, Eric, René Levy, Alexandre Pollien, Raphaël Hammer, and Jacques-Antoine Gauthier. 2003. "Entre standardisation, individualisation et sexuation: une analyse des trajectoires personnelles en Suisse" (Between Standardization, Individualization and Gendering: An Analysis of Personal Life Courses in Switzerland). Revue suisse de sociologie 29:35-67.
Wilson, W. Clarke. 1998. "Activity Pattern Analysis by Means of Sequence-Alignment Methods." Environment and Planning A 30:1017-38.
Wu, Lawrence L. 2000. "Some Comments on 'Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect.'" Sociological Methods & Research 29:41-64.
Yu, Yi-Kuo and Stephen F. Altschul. 2005. "The Construction of Amino Acid Substitution Matrices for the Comparison of Proteins With Non-Standard Compositions." Bioinformatics 21:902-11.

Jacques-Antoine Gauthier is a senior lecturer in sociology at the University of Lausanne and a member of the Center for Life Course and Lifestyle Studies (Pavie).
He has worked in the fields of health, addiction, and family sociology. His latest publications have appeared in the Canadian Journal of Sociology, the European Journal of Operational Research, and the Swiss Journal of Sociology.

Eric D. Widmer is a professor of sociology at the University of Geneva, with an appointment at the Center for Life Course and Lifestyle Studies (Pavie). His long-term interests include life course research, family research, and social networks. His latest publications have appeared in the Journal of Social and Personal Relationships, the European Sociological Review, and the Journal of Marriage and Family.

Philipp Bucher is a group leader at the Swiss Institute for Experimental Cancer Research and a founding member of the Swiss Institute of Bioinformatics. His long-term interests include the development of algorithms for the analysis of molecular sequences and the application of these algorithms in various areas of biomedical research. His latest publications have appeared in PLoS Computational Biology and Nucleic Acids Research.

Cédric Notredame is a group leader at the Centre for Genomic Regulation in Barcelona (Spain) and a research investigator for the Centre National de la Recherche Scientifique (France). The focus of his work is the development and improvement of multiple sequence alignment algorithms. His latest publications have appeared in the Journal of Molecular Biology and Nucleic Acids Research. He is also the coauthor, with J. M. Claverie, of a popular introductory textbook in bioinformatics, Bioinformatics for Dummies (New York: Wiley, 2003).

MULTICHANNEL SEQUENCE ANALYSIS APPLIED TO SOCIAL SCIENCE DATA

Jacques-Antoine Gauthier* Eric D.
Widmer,† Philipp Bucher,‡ and Cédric Notredame**

Applications of optimal matching analysis in the social sciences are typically based on sequences of specific social statuses that model the residential, family, or occupational trajectories of individuals. Despite the broadly recognized interdependence of these statuses, few attempts have been made to systematize the ways in which optimal matching analysis should be applied multidimensionally, that is, in an approach that takes multiple trajectories into account simultaneously. Based on methods pioneered in the field of bioinformatics, this paper proposes a method of multichannel sequence analysis (MCSA) that extends the usual optimal matching analysis (OMA) to multiple life spheres simultaneously. Using data from the Swiss Household Panel (SHP), we examine the types of trajectories obtained using MCSA. We also consider a random data set and find that, by locally aligning distinct life trajectories simultaneously, MCSA offers an alternative to the sole use of ex post summed distance matrices. Moreover, MCSA reduces the complexity of the typologies it produces without making them less informative. It is more robust to noise in the data, and it provides more reliable alignments than two independent OMAs.

We thank the editor and the anonymous reviewers for helpful comments and suggestions. Direct correspondence to Jacques-Antoine Gauthier, University of Lausanne, Faculty of Social and Political Sciences (SSP), research center Methodology, Inequalities and Social Change (MISC), Bâtiment de Vidy, CH-1015 Lausanne, Switzerland. Email: Jacques-antoine.gauthier@unil.ch.
*University of Lausanne
†University of Geneva
‡Swiss Institute of Bioinformatics and the École Polytechnique Fédérale de Lausanne
**Centre for Genomic Regulation, Barcelona, and Centre National de la Recherche Scientifique, Marseille

1. INTRODUCTION

Most multivariate analyses using longitudinal data are based on hard causal models in which one or several independent variables predict the future actualization of some state of a dependent variable. Optimal matching analysis (OMA) offers a more descriptive perspective that does not emphasize the causal priority of some variables over others but aims at elaborating a systemic view of the social phenomena that develop over time. However, most applications of OMA have been limited to one dimension at a time, a serious shortcoming for empirical analyses. This paper develops a multichannel sequence analysis (MCSA) that enables researchers to describe individual trajectories on several dimensions simultaneously. Various empirical studies (Elder 1985; Clausen 1986; Kohli 1986; Levy 1991, 1996; Giele and Elder 1998; Heinz and Marshall 2003; Mortimer and Shanahan 2003; Levy et al. 2005; Macmillan 2005) emphasize the multidimensionality of life trajectories based on social, psychological, and biological factors that interact over time (Wetzler and Ursano 1988; Spruijt and de Goede 1997; Repetti, Taylor, and Seeman 2002; Lesesne and Kennedy 2005). A major problem with research on life trajectories, however, lies in the fact that the researcher is confronted with a variety of unequally linked sequences unfolding at various speeds (Abbott 1992). Life course studies therefore require the integration of seemingly heterogeneous trajectories into a unique empirical model (Levy et al. 2005), an ambitious task that even regression-based models cannot accomplish (Esser 1996). In this perspective, Abbott (2001:151) insists on using sequential data as multicase narratives to uncover patterns of careers rather than looking for causal models. Many social scientists have used OMA to model life trajectories. Nevertheless, since its emergence in the social sciences, OMA has neglected the multidimensionality of life trajectories.
Social scientists have always lacked a standard approach for undertaking multidimensional sequence analysis of life trajectories. To fill this gap, the present study proposes multichannel sequence analysis (MCSA), a computational approach that makes practical improvements to optimal matching algorithms at three levels.1 First, it systematizes approaches for dealing with multidimensionality using OMA. Second, it accounts simultaneously for local interdependencies among the different social statuses present at each point of the alignment process for all channels.2 Third, it offers practical improvements toward visualizing parallel processes occurring in various life spheres, a key element in describing and interpreting sets of individual trajectories (Tufte 1997; Müller et al. 2008; Piccarreta and Lior 2010). In this study, we first present the quantitative methods currently available in the social sciences for dealing with the multidimensionality of the life course and describe in detail how the proposed method works. To this end, we also briefly present an example of substantive results produced by MCSA using social science data. We then illustrate the potential of MCSA by testing its validity and reliability using various formal criteria. Finally, we use random data to compare several multidimensional approaches using OMA.

2. QUANTITATIVE APPROACHES TO LIFE HISTORIES

There are a few methodological options for dealing with the multidimensionality of life trajectories. The most often used are event history analysis (EHA; Blossfeld and Rohwer 1995) and sequence analysis (SA; e.g., Sankoff and Kruskal 1983; Abbott and Hrycak 1990), while some attempts have been made with latent class methods (Macmillan and Eliason 2003) and life history graphs (Butts and Pixley 2004). The latter focuses on internal configurations of the life course to reveal general sociological patterns.
It uses a formal definition of life history that applies social network analysis at an intrapersonal level. Individual histories are expressed as structural relationships between life spells, such as centrality, betweenness, or closeness (Wasserman and Faust 1994). Life history graphs (LHG) take multidimensionality into account, but they put little emphasis on time, as they focus on the overlap of life spells for a given individual. The use of latent class methods (LCM) is based on transition probabilities. It allows identifying subsamples characterized by typical (i.e., most probable) configurations of family and occupational roles over time. Unfortunately, methodological limitations make it difficult for LCM to consider more than a few time points and to represent life courses at an individual level. Event history analysis estimates time-to-event or risk functions concerning, for instance, the occurrence of specific events, such as divorce or entry into the labor market, that are then used as dependent variables in different regression models (e.g., Kaplan-Meier or Cox). EHA provides strong information at the population level. However, the information concerning the unfolding of an individual life history is limited to a dichotomous variable (the occurrence or not of a given event).

1. The computations presented in this paper are encapsulated in the program SALTT (Search Algorithm for Life Trajectories and Transitions), an open-source freeware program written in C (Notredame, Bucher, Gauthier, and Widmer 2005). The package and its documentation can be downloaded from http://www.tcoffee.org/saltt/. Recently, TraMineR, a package of the R software environment, has allowed performing OMA and MCSA (Gabadinho et al. 2008). Otherwise, computations are made using SAS (SAS Institute 2004).
2. By "channel," we mean each sequence of statuses that constitutes the multidimensionality under study.
A major strength of these methods is that they allow statistical testing of the model, but they show limited sensitivity to temporality. Overall, LHG, LCM, and EHA are insufficient to account for the multidimensionality of life trajectories because they fail to "take a narrative approach to social reality" (Abbott 2001:185). In contrast, sequence analysis techniques and, more specifically, optimal matching analysis take the entire sequences of statuses held by individuals over a given period of time (e.g., family, occupational) as the analytical unit to find chronological patterns of stability and change (George 1993). Thus, each individual life course is modeled as a specific sequence of social statuses that may be expressed as a character string. For instance, the sequence aaaabbcccc may describe the family trajectory of an individual over ten years (e.g., between the ages of 18 and 27), with a standing for living with both biological parents, b for living alone, and c for living in a couple. Basically, OMA involves making pairwise comparisons between individual sequences of statuses to evaluate how similar they are.3 This is accomplished by counting the minimal (weighted) number of elementary operations (known as "costs") of insertion, deletion,4 and substitution that are necessary to transform one sequence into another (Sankoff and Kruskal 1983; Abbott and Hrycak 1990).5 For instance, in Figure 8, shown later in this paper, one has to delete one m and then substitute n for m twice to transform the sequence Ac = mmllm into Bc = nlln. Among all possible ways to transform sequence Ac into sequence Bc, the one associated with the smallest cost is obtained through dynamic programming (Needleman and Wunsch 1970) and is called the optimal distance between the two sequences.6

The alignment of two lives-as-sequences takes into account both the relative position of specific statuses in each individual trajectory and the process of their unfolding over time. Moreover, as the modeling of the sequences is limited only by the number of time points and the number of possible characters, the potential individual variability of sequences rapidly becomes huge. Thus, the distance computed by OMA summarizes in an elegant manner the extent to which life courses are similar: the smaller the distance between two life trajectories, the more similar they are. Once all pairwise alignments are computed, the researcher performs a cluster analysis on the resulting distance matrix to reveal types of individual trajectories.7 Eventually, the typologies stemming from these two latter steps may be used as categorical variables in secondary analyses (cross-tabulations, regressions, and so forth).

We now turn to the more general issue of the extent to which OMA can be systematically and straightforwardly applied to multidimensional trajectories. Our goal is to evaluate the ability of OMA to adequately model two main tenets of the life course paradigm. The first one (the principle of linked lives) states that individuals participate simultaneously in various social spheres and that the corresponding positions they hold in each are interdependent, as is the case between family and occupational careers (e.g., Heinz 2003). The second tenet (lifelong development) emphasizes the fact that the interdependence of multiple social participations at an individual level may vary continuously over time (Elder et al. 2003). To develop a methodology that corresponds to these two tenets, we investigate the options currently available and define the prerequisites for simultaneously modeling distinct life sphere alignments, while also taking their interdependence into account.

3. OPTIMAL MATCHING ANALYSIS

Due to methodological or computational limitations, sequence analysis in the social sciences has until recently focused mainly on (1) one-dimensional social trajectories; (2) statuses belonging to different social spheres recoded prior to data processing; or (3) summed interindividual distances measured independently on distinct one-dimensional trajectories. In doing this, the measurement of the similarity between two pairs of trajectories does not take full account of the possible interactions that may occur at some points of these linked sequences during the alignment process. Three different strategies have been used in OMA to measure life trajectories along several dimensions. The first consists of using typologies from one-dimensional analyses (e.g., occupational trajectories) as response variables in a logistic regression model that includes indicators of other trajectories (e.g., number of children) as predictor variables (Widmer et al. 2003; Levy, Gauthier, and Widmer 2006). This approach to a large extent disregards the longitudinal information provided by the predictor variables.

MULTICHANNEL SEQUENCE ANALYSIS 5

Footnote 3: There are promising techniques for multiple sequence alignment, whereby all sequences are simultaneously compared to all others, but these techniques are poorly suited to large samples and divergent sequences (Claverie and Notredame 2003).
Footnote 4: Insertion and deletion are equivalent and are referred to as indel.
Footnote 5: The question of the costs necessary to align sequences is a central methodological debate in the use of OMA by social scientists (e.g., Abbott and Tsay 2000; Levine 2000; Wu 2000). Recently, significant advances toward empirical, data-based cost-setting offer objective means of defining the relationships between the elements to be compared (Gauthier et al. 2009; Aisenbrey and Fasang 2010).
Footnote 6: For a closer description of the algorithm, see, for example, Kruskal (1983).
Footnote 7: The general principle of cluster analysis consists of grouping individuals according to a systematic rule. In this paper, we use the hierarchical Ward's algorithm, which aims to minimize the intragroup and to maximize the intergroup variance of interindividual distances.
A second strategy involves retrospectively combining the results obtained from various independent OMA into distinct types of trajectories (e.g., Han and Moen 1999). Since this approach sums interindividual distances from consecutive OMA, it is akin to cross-analyzing typologies stemming from distinct one-dimensional analyses of the same individuals. The main problem with such an approach is that it does not accurately take into account the local or temporal interdependence of the trajectories under study, because the respective types they belong to are modeled independently of one another. Moreover, this combination of typologies produced by cross-tabulating the categorical variables stemming from one-dimensional OMA may lead to an overestimation of the number of relevant types, with many types being poorly populated and therefore noninformative. In particular, the approach suffers from a lack of parsimony and a potential sensitivity to noisy data, as we demonstrate below. Furthermore, as each dimension is analyzed and clustered independently, it is impossible to use regular clustering quality estimates to decide on the number of types present in the data (Mojena 1977; Milligan and Cooper 1985, 1987). Moreover, the data in each dimension may not be equally reliable or informative. While the combination of alternative channels may compensate for this inequality, the separate treatment of dimensions will lead to spurious alignments, which may then result in the creation of artificial typologies. A third and more interesting strategy is based on combining two or more alphabets (e.g., Stovel, Savage, and Bearman 1996; Abbott and Hrycak 1990; Blair-Loy 1999; Pollock 2007; Dijkstra and Taris 1995;8 Elzinga 2003). An alphabet is a collection of characters bijectively associated with an ensemble of distinct statuses and characteristic of a given life course dimension (e.g., family, occupational, residential).
For this purpose, an extended alphabet is generated by combining the individual alphabets associated with specific channels. There is, however, a problem associated with this strategy: since it allows many possibilities for estimating the substitution costs associated with the extended alphabet, it becomes increasingly difficult to justify the choice of a given cost scheme as the number of categories grows larger and more heterogeneous. Furthermore, depending on the number of channels, the extended alphabet becomes uncomfortably large (Han and Moen 1999). Take, for instance, two channels with three statuses each. Family statuses are given a specific code for singlehood, marriage, or divorce/widowhood. Occupational status is recorded as "at home," "part-time," or "full-time." In this scenario, there is no rationale for deciding a priori how to set the costs stemming from the combination of "at home"/"marriage" versus "single"/"part-time," or any other combination of statuses. Moreover, each dimension's local contribution to the overall interindividual distance, as well as the particular unfolding of each set of linked trajectories, remains hidden or unknown.

Footnote 8: These authors use a different algorithm from that of Needleman and Wunsch (1970), on which many OMA are based.

4. MULTICHANNEL SEQUENCE ANALYSIS

The multidimensional approach we have developed is based on the assumption that taking this local contribution into account differs from using an extended alphabet, since each dimension differentially influences the final alignment. Therefore, a systematic approach is needed in order to deal with these issues. We propose a multidimensional approach in which (1) the dimensions under study are used simultaneously during the alignment process; (2) no enumeration of an extended alphabet is needed; and (3) the combination of cost estimations is as explicit as possible and is dealt with using a standard parameter.
Given an alphabet containing a finite number of characters, take two sequences I and J based on characters belonging to that alphabet.9 Consider the costs associated with insertions and deletions (henceforth called indels and abbreviated d), as well as the substitution costs given by a cost matrix C, where C_{s_i s_j} is the cost of aligning s_i, the ith character of I, against s_j, the jth character of J.10 In this paper, for simplification purposes, we set a cost of one for all substitutions involving two different characters. The substitution of a character with itself yields a cost of zero, and the cost associated with an indel is set to half that of a substitution.11 The optimal alignment score can then be computed using the following recursion:

F(i, j) = min { F(i − 1, j − 1) + C_{s_i s_j};  F(i − 1, j) + d;  F(i, j − 1) + d }    (1)

Each line in equation (1) defines a possible optimal match score of two subsequences, depending on whether it is less costly at this point to substitute, delete, or insert characters in order to align the subsequences. For instance, F(i − 1, j − 1) corresponds to the optimal match score of the subsequence containing characters 1 to i − 1 of sequence I against the subsequence containing characters 1 to j − 1 of sequence J. As such, this equation defines a recursion in which the score of any alignment F(i, j) can be estimated by considering an optimal extension of the three shorter alignments F(i − 1, j), F(i − 1, j − 1), and F(i, j − 1). Considering that each of these shorter alignments was already an optimal matching of the associated substrings, F(i, j) will also be optimal (Durbin et al. 2002:20).12

We take the OMA concept a step further and extend it to the use of different information sources associated with individual trajectories. We name this extension multichannel sequence analysis (MCSA). In MCSA, each individual is associated with two or more distinct channels, each tapping a distinct life trajectory within a specific sphere (e.g., occupation, family, housing, location, health) by means of a specific alphabet. The channels associated with a given individual are synchronized so that, for example, the xth character of the family channel and the xth character of the occupational channel correspond to the same year for that individual. For instance, given two individuals A and B, one can express the MCSA example given in Figure 8 as two bidimensional sequences: A = {(m, z), (m, t), (l, t), (l, t), (m, t)} and B = {(n, y), (l, z), (l, z), (n, z)}, where each doublet in parentheses characterizes the situation at a given time point; the first and second positions in the doublet correspond to the channels of family and occupational participation, respectively.

Footnote 9: In practice, most algorithms are based on existing sets of characters, such as the English alphabet (26 characters) or the ASCII character table (127 characters, which may be complemented with the 128 extended ASCII codes). Taking into account a greater number of characters is not a limitation per se but may require some programming.
Footnote 10: In the context of this paper, we consider that the matching of identical characters yields a null score and that mismatches are associated with same-sign, nonzero, finite costs, although other cost schemes may be found, notably in biology (Durbin et al. 2002).
Footnote 11: We choose this option to simplify the exposition. Many other weighting schemes for substitutions and/or insertions/deletions may be considered (Thompson et al. 1994; Durbin et al. 2002; Widmer et al. 2003). We propose elsewhere a method that estimates differentiated costs on an empirical basis (Gauthier et al. 2009).
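As an illustration only (ours, not the authors' SALTT program), the recursion in equation (1) can be sketched in Python with the unitary cost scheme described above: substitutions between distinct characters cost 1, matches cost 0, and an indel costs d = 0.5.

```python
# Sketch of the optimal matching distance of equation (1).
# Assumed cost scheme (from the text): substitution 1, match 0, indel 0.5.
def oma_distance(a, b, d=0.5):
    n, m = len(a), len(b)
    # F[i][j] = optimal score of aligning a[:i] with b[:j]
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d              # a[:i] against an empty prefix
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else 1.0
            F[i][j] = min(F[i - 1][j - 1] + sub,   # substitution or match
                          F[i - 1][j] + d,         # deletion
                          F[i][j - 1] + d)         # insertion
    return F[n][m]

# The example from the text: transforming Ac = mmllm into Bc = nlln takes
# one indel plus two substitutions, i.e., 0.5 + 2 * 1 = 2.5.
print(oma_distance("mmllm", "nlln"))  # → 2.5
```

The table F is the standard Needleman-Wunsch dynamic program; only the boundary values and the minimum over the three moves depend on the chosen cost scheme.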
Once defined for an individual, these doublets remain the same throughout the alignment procedure. Optimal matching analysis is based on a recursive algorithm that parses a pair of sequences from the first to the last element and estimates an optimal score at each point of the alignment (Sankoff and Kruskal 1983). A given optimal solution for two substrings of the sequences Ac and Bc does not imply that the optimal solution will be the same for any extension of these substrings. The optimal distance is given only after the algorithm has been applied to the entire sequences to be aligned.13 Therefore, our goal is to analyze multiple social participations while taking into account what each pair of nested individual statuses contributes over time to the overall similarity between two individual life courses. The method is general in the sense that it can use as many channels as needed, the only condition being their synchronization. In practice, taking synchronized channels into account within an OMA framework, as defined by equation (1), is relatively straightforward and only requires adapting the substitution costs C_{s_i s_j} and the indel terms so that they reflect the relationship between equivalent channels. The multichannel version of the substitution term can be expressed as follows:

C_{s_i s_j} = (1 / N_c) Σ_{c=1}^{N_c} C^c_{s_i s_j}    (2)

While a single cost matrix is used to match two individual life trajectories in standard OMA, our approach considers two or more channels per individual and uses one cost matrix for each channel. These cost matrices are standard and can be generated using any appropriate strategy, such as unitary, knowledge-based, or data-based (Gauthier et al. 2009;14 Aisenbrey and Fasang 2010). For instance, in equation (2), a channel-specific cost matrix (C^c) is associated with each channel.

Footnote 12: Of course, this strategy relies on the assumption that each position is independent and that the alignment scores are additive.
This matrix controls the cost of matching any character in the channel in question with any counterpart character for another individual. Formally, given two individuals A and B, each associated with two channels c and d, C^c_{s_i s_j} will be the cost associated with matching the ith character of channel c for individual A with the jth character of channel c for individual B. Likewise, C^d_{s_i s_j} will be the cost of matching the ith character of channel d for individual A with the jth character of channel d for individual B. Eventually, the contributions of channels c and d are averaged to yield the final cost associated with the matching of positions i and j for the two individuals, where N_c stands for the number of channels.15 Costs for the insertion/deletion (indel) of each channel are averaged in the same way. An alternative, used below, is to define the indel cost as the average off-diagonal value (AOD) of the corresponding substitution matrix (Thompson et al. 1994). This procedure can be extended to any number of channels. In the above example, as the cost matrices are unitary, matching the doublets (m, t) with (l, z) is more costly than matching (l, t) with (l, z), as the latter doublets share a common character. Hence, the optimal MCSA alignment presented in Figure 8 inserts a doublet of indels in order to match the most similar doublets.16 Eventually, the raw score of this bidimensional alignment is computed as 2 ∗ indels + 2 ∗ (mismatch/mismatch) + 2 ∗ (match/mismatch) = 2 ∗ 0.5 + 2 ∗ 2 + 2 ∗ 1 = 7.

Footnote 13: For instance, using the cost scheme presented above, aligning Ac = {m} with Bc = {l} implies either a substitution (mismatch) or two insertions/deletions, whereas aligning Ac' = {ml} (Ac plus the character l) with Bc' = {l} calls for an insertion/deletion followed by a match.
Footnote 14: In this paper, we present an empirical method for defining substitution costs using a data-based iterative procedure.
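Continuing the sketch above under the same assumptions, equation (2) replaces the single substitution cost with the average of the channel-specific costs; the indel cost of a whole doublet is likewise averaged (here 0.5). On the bidimensional sequences A and B from the text, this averaged version yields 3.5, that is, the raw (summed) score of 7 divided by N_c = 2.

```python
# Sketch of multichannel OMA (equation (2)): the cost of matching two
# multichannel positions is the mean of the unitary per-channel costs.
def mcsa_distance(a, b, d=0.5):
    n, m = len(a), len(b)

    def sub_cost(u, v):
        # u and v are tuples holding one character per channel
        return sum(0.0 if x == y else 1.0 for x, y in zip(u, v)) / len(u)

    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = min(F[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)
    return F[n][m]

# Bidimensional sequences from the text (family channel, occupational channel):
A = [("m", "z"), ("m", "t"), ("l", "t"), ("l", "t"), ("m", "t")]
B = [("n", "y"), ("l", "z"), ("l", "z"), ("n", "z")]
print(mcsa_distance(A, B))  # → 3.5
```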
There are several ways to compute a distance from an optimal pairwise alignment. We may use the raw score provided by the algorithm,17 or the percentage of identity (PID) between the two sequences (National Centre for Biotechnology Information 2004; May 2004). PID corresponds to the number of aligned identical characters divided by the length of the longer sequence (see examples in Figure 8). It is an interesting measure, as it is approximately normally distributed (Doolittle 1981) and gives a useful indication concerning the common structure of two sequences (Raghava and Barton 2006).

Footnote 15: To simplify the exposition, we have set the combination of the substitution costs at one point of the aligned sequences to the average value of the substitution costs involved at this point. Future developments should implement alternative ways of dealing with the relationship between local scores.
Footnote 16: Matching two doublets leads to either two matches, one match and one mismatch, or two mismatches. Following the cost scheme used, the resulting costs may be quite differentiated.
Footnote 17: When the lengths of the sequences to align differ, the resulting distance between them is normalized by dividing it by the length of the longer sequence.

5. EMPIRICAL ILLUSTRATION

To test this method and illustrate its strengths with an empirical example, we use data from the Swiss Household Panel (Tillmann and Zimmermann 2004). Its third wave includes a retrospective questionnaire that asks respondents to provide information on their educational, family, and occupational statuses from birth to the year of the interview. Each change in status is therefore associated with a starting date and an ending date.
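Returning to the PID measure described above, a minimal sketch (ours, not part of SALTT) computes it from a gapped alignment; the example reuses one optimal alignment of Ac = mmllm and Bc = nlln, written out by hand with "-" as the gap character.

```python
# Percentage of identity (PID): aligned identical characters divided by
# the length of the longer (ungapped) sequence.
def pid(aligned_a, aligned_b):
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != "-")
    longer = max(len(aligned_a.replace("-", "")),
                 len(aligned_b.replace("-", "")))
    return identical / longer

# One optimal alignment of mmllm and nlln: the two l's match, the second m
# is deleted, and the remaining m's are substituted by n's.
print(pid("mmllm", "n-lln"))  # → 0.4
```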
Every year of an occupational trajectory is coded using a seven-category scheme: full-time employment; part-time employment; positive interruption, such as a sabbatical or a trip abroad; negative interruption, such as unemployment or illness; full-time housework; retirement; and full-time education. A ten-category scheme is used for family trajectories: living with a biological father and mother; living with only one biological parent, either mother or father; living with one biological parent and her or his partner; living alone; living with a partner; living with a partner and one's own biological child; living with a partner and a nonbiological child; living with one's own biological child without a partner; living with friends; and other situations. A 12-category scale is used to describe education-to-work trajectories.18

Given that individual life trajectories require substantial time to differentiate from one another, and in order to build sequences that are as complete and informative as possible, we consider here only the individuals aged 45 and older who answered the retrospective questionnaire (N = 2,212). As we further restrict our sample to individuals whose trajectories contain less than 50% missing data, our final data set contains 1,847 individuals (54% women, 46% men) characterized by two sequences of statuses describing their family and occupational lives from birth to age 45.19 Technically, MCSA may be applied to any number of sequences, without restriction regarding censored or incomplete data. However, from a sociological point of view, it makes sense to compare life course sequences that have about the same size, which results here in the fact that the youngest cohorts are not taken into account.20

Our choice of an empirical example on which to test MCSA reflects the importance of debates about the influence of sociostructural factors on the divergent occupational and family trajectories of women and men (Moen 1985; Höpflinger and Debrunner 1991; Sheridan 1997; Levy, Gauthier, and Widmer 2006), as well as the availability of high-quality data on family and occupational statuses over entire lives. We use one channel to describe occupational statuses and another to model family trajectories over time. The interindividual distance matrix is computed by means of MCSA.21 We then run a cluster analysis on that matrix using Ward's hierarchical method to reveal coherent types of individual trajectories. We use stopping rules in order to estimate the relevant number of clusters to retain (e.g., Milligan and Cooper 1985; SAS Institute 2004).22 We eventually decided to keep three clear-cut bidimensional types of individual trajectories, in addition to one residual category not presented here.

Footnote 18: This scale is a combination of the seven categories of educational attainment in the classification of the Swiss Federal Statistical Office, which range from compulsory education to a university degree, and five post-educational occupational statuses (full-time, part-time, household, unemployment, other).
Footnote 19: When comparing education-to-work trajectories, we use sequences ranging from age 0 to 25, with 2,153 individuals aged 25 or older.
Footnote 20: More generally, we do not know the extent to which missing cases are missing at random or not. As occurs frequently with survey data, we may expect slight selection biases toward, for example, age, sex, occupation, or nationality (e.g., see Groves et al. 2004).
Footnote 21: We use a unitary substitution cost matrix for both channels; insertion or deletion costs are set to half of one substitution cost.
Footnote 22: We retain three criteria among those tested by Milligan and Cooper: (1) the pseudo F, which represents an approximation of the ratio between the intercluster and intracluster variance of sequences and measures the separation between all clusters at the current level; (2) Je(2)/Je(1) (Duda and Hart 1973), which may be transformed into a pseudo T2, an index that measures the separation between the two most recently joined clusters; and (3) the R squared, which expresses the size of the experimental effect. It is reasonable to look for consensus among the three criteria (Nargundkar and Olzer 1998; SAS Institute 2004). In the present study, a given cluster solution was retained for analysis only if at least one of these three criteria supported its validity.

Figures 1, 2, and 3 present three contrasting illustrations of the visualization possibilities offered by a multichannel approach to individual life courses, namely simultaneous local analysis and parallel visualization of interdependent social trajectories (e.g., see Tufte 1997).

FIGURE 1. "Parental and non-full-time employment" bidimensional trajectories (26%).

The first bidimensional type of trajectories (Figure 1, 26% of respondents) includes individuals who experience a quick transition to parenthood. After a long stay with their two biological parents, they live a few years alone or with a partner before entering a long and stable period of parental life in a nuclear family. The associated occupational trajectories of the same individuals show a short period of full-time work after completing education, followed by a long period out of the job market or working part-time. Women are significantly overrepresented in this type, which we label "parental and non-full-time employment" trajectories; indeed, 92% of individuals belonging to this type are female.

FIGURE 2. "Nonparental and full-time employment" bidimensional trajectories (24%).

The second type (Figure 2, 24% of the sample) brings together people who experienced a long stay in a family of orientation composed of two biological parents, followed by a relatively long period of predominantly single living and/or childless conjugal life.
The occupational trajectories of this type consist nearly exclusively of full-time activity. We name this nongendered type "nonparental and full-time employment" trajectories. In contrast to the first type, the proportions of men and women in this type are roughly equal.

FIGURE 3. "Parental and full-time employment" bidimensional trajectories (30%).

The third type (Figure 3, 30% of the sample) comprises a large majority of men (92%) who follow family trajectories similar to those presented in Figure 1 and whose employment activity is stable and full-time. Further decomposition of the residual category (not presented here) reveals interesting minority patterns, such as conjugal trajectories associated with non-full-time occupational activities (7%, women overrepresented), or parental trajectories combined with the long-term full-time paid work of individuals who experienced their own parents' separation during childhood (7%, no gender bias).

6. VALIDATION

In the following sections, which use the data from the Swiss Household Panel, we test the extent to which MCSA produces more consistent results than regular OMA according to three criteria: (1) parsimony (it reduces complexity), (2) reliability (it takes advantage of channel interdependence), and (3) robustness (it resists noise and distortion).

6.1. Reduction of Complexity

Based on the two distinct sequences of statuses for each individual in our data set, three distance matrices are produced. Two of them correspond to one-dimensional analyses performed separately on family and occupational trajectories, whereas the third stems from MCSA applied simultaneously to both trajectories and corresponds to the empirical example presented above.23 We then run a cluster analysis on these matrices, using Ward's hierarchical method (Ward 1963). The number of clusters actually present in the data is estimated using the stopping rules presented above.
For both one-dimensional types of trajectories (family and occupational), the presence of three or five clusters is supported in the data. The same procedure suggests the presence of four clusters in the distance matrix resulting from MCSA. If we cross-combine the solutions stemming from the one-dimensional sequence analyses to build ex post multidimensional trajectories, we find typologies ranging from nine to 25 types each,24 whereas an MCSA performed on the same data drastically reduces complexity, as the stopping rules indicate the presence in the data of only four types of bidimensional trajectories.25 In the following, we will not consider the respective semantic value of these typologies but will focus instead on the extent to which this reduction is associated with a loss of information. To measure the ability of multichannel analysis to both reduce complexity and preserve information, we cross-tabulate the four-cluster multidimensional typology stemming from MCSA with the corresponding cross-combinations of one-dimensional OMA described above. The Goodman-Kruskal lambda statistic is a measure of "proportionate reduction in error" (PRE), which reflects the percentage by which knowledge of the independent variable reduces errors in predicting the dependent variable (Goodman and Kruskal 1979; Siegel and Castellan 1988; Olzak and Ritschard 1995; Confais, Grelet, and Le Guen 2005). This statistic varies between 0 (absolute independence) and 1 (perfect association).

Footnote 23: For this exploratory analysis, in order to focus on the specific features of MCSA, we use two unitary matrices of substitution, and the cost of insertion/deletion is set to half that of a substitution.
Footnote 24: Cross-combinations of three or five types of family trajectories with three or five types of occupational trajectories form, respectively, nine, 15, 15, and 25 distinct types of family-and-occupational trajectories.
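As a minimal illustration of this PRE logic (ours; the paper's tables were computed with SAS), the asymmetric lambda that predicts the row category once the column category is known can be computed from a contingency table as follows:

```python
# Goodman-Kruskal asymmetric lambda (rows predicted from columns):
# the proportional reduction of prediction errors obtained by guessing
# the modal row within each column instead of the modal row overall.
def lambda_rows_given_cols(table):
    n = sum(sum(row) for row in table)
    base_errors = n - max(sum(row) for row in table)
    cond_errors = n - sum(max(col) for col in zip(*table))
    # base_errors is zero only if all cases fall in a single row
    return (base_errors - cond_errors) / base_errors

# Toy 2 x 3 table in which each column concentrates its cases in one row,
# so the column category predicts the row category perfectly.
perfect = [[10, 0, 5],
           [0, 8, 0]]
print(lambda_rows_given_cols(perfect))  # → 1.0
```

Conversely, a table whose rows are identical gives lambda = 0: knowing the column does not improve the prediction of the row at all.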
When lambda (R|C)26 has a value of 1, it means that each column of the table has only one cell different from zero. To efficiently reduce the complexity of a contingency table, we should capture the maximum information available in the rows, with minimum overlapping from one row to another in the same column, as schematically presented in Figure 4. This is exactly what we get from cross-tabulating MCSA with the combined OMA, that is, many cells with no cases or very few cases, and many cells with high column percentages and no cells in the same column with comparably high scores.27

FIGURE 4. Schematic representation of the association between the cross-combination of family (F) and occupational (O) monochannel typologies containing three types each (respectively F1, F2, F3 and O1, O2, O3) and the MCSA typology containing four types based on the same data. [Schematic table: rows MCSA1 to MCSA4, columns F1*O1 to F3*O3; X = most cases of a column are concentrated in a single cell; Y = an important proportion of cases is distributed over more than one cell of the same column.]

Table 1 shows the degree of association (lambda and contingency coefficients) between family and occupational types of trajectories computed either with MCSA (four clusters) or with cross-combined monochannel solutions (three or five clusters, respectively). The contingency coefficients in Table 1 show a strong association between multichannel and cross-combined monochannel solutions. The asymmetric lambda R|C is systematically higher than the asymmetric lambda C|R, indicating that knowing the distribution of the combined monochannel solutions allows for better predictions of the multichannel solution distribution. Put another way, MCSA efficiently reduces the complexity of the data while conserving most of the relevant information. More than 80% of the MCSA solution may be predicted by the cross-combined one-dimensional OMA distributions, whereas the reduction in complexity (i.e., the difference between the number of cells in the cross-combined and multichannel solutions, divided by the number of cells in the cross-combined solution) is, respectively, 56%, 73%, and 84%. The asymptotic standard error (ASE) values are much lower than the lambda values. This means that the 95% confidence interval limits of the lambdas do not contain zero (data not shown), suggesting that these results may be considered statistically significant (SAS technical support, private communication, 2006).

Footnote 25: The stopping rules also reveal a seven-type solution for the MCSA. Its association with the cross-combined monochannels is very similar to that of the four-cluster solution.
Footnote 26: This is called the asymmetric lambda, which predicts the row distribution (R) under the condition that one knows the column distribution (C).
Footnote 27: Due to the size of the contingency tables used in the tests, we decided to provide a schematic example of the situation (Figure 4) and to summarize the results by indicating only the values of the lambdas and the contingency coefficients (Table 1).

6.2. Interdependence

Starting from the results presented in Table 1, we now turn to the extent to which the statistical association between individual trajectories unfolding in distinct social spheres influences the quality and reliability of the MCSA features described above. We therefore first cross-tabulate the categorical variables corresponding to the one-dimensional
typologies of family trajectories with those of occupational trajectories (Table 2).28

TABLE 1
Association Between Categorical Variables (Asymmetric Lambda) Corresponding to Types of Trajectories Stemming from Either MCSA or Cross-Combined One-Dimensional OMA

Association with the MCSA typology (4 clusters):

Combination 1 (9 clusters): Family (3 clusters) ∗ Occupational (3 clusters); contingency table 1: 9 ∗ 4 = 36 cells
  Lambda C|R = 0.4641 (ASE 0.0124); Lambda R|C = 0.7975 (ASE 0.0110); Contingency coefficient = 0.8197
Combination 2 (15 clusters): Family (5 clusters) ∗ Occupational (3 clusters); contingency table 2: 15 ∗ 4 = 60 cells
  Lambda C|R = 0.3436 (ASE 0.0128); Lambda R|C = 0.8237 (ASE 0.0117); Contingency coefficient = 0.8237
Combination 3 (15 clusters): Family (3 clusters) ∗ Occupational (5 clusters); contingency table 3: 15 ∗ 4 = 60 cells
  Lambda C|R = 0.3772 (ASE 0.0128); Lambda R|C = 0.8006 (ASE 0.0113); Contingency coefficient = 0.8238
Combination 4 (25 clusters): Family (5 clusters) ∗ Occupational (5 clusters); contingency table 4: 25 ∗ 4 = 100 cells
  Lambda C|R = 0.2659 (ASE 0.0132); Lambda R|C = 0.8463 (ASE 0.0118); Contingency coefficient = 0.8320

ASE = asymptotic standard error.

Footnote 28: According to our stopping rules, we consider for both types of trajectories a three- and a five-type typology.

Table 2 shows that family and occupational types of trajectories have a strong statistical association. The value of the likelihood ratio chi-square is larger when the number of clusters is greater; the
MULTICHANNEL SEQUENCE ANALYSIS 21 TABLE 2 Association Between Categorical Variables (Likelihood Ratio Chi-square) Corresponding to Types of Family and Occupational Trajectories Stemming from One-dimensional OMA Cross-Tabulated Types of Trajectories Family (3 types) ∗ Occupational (3 types) Family (3 types) ∗ Occupational (5 types) Family (5 types) ∗ Occupational (3 types) Family (5 types) ∗ Occupational (5 types) df 4 8 8 16 LR χ 2 27.5998 32.2821 28.1240 35.4739 p Value <.0001 <.0001 0.0005 0.0034 Family (3 types) = three types of family trajectories; Occupational (5 types) = three types of occupational trajectories; LR χ 2 = likelihood ratio chi-square; df = degree of freedom. N = 1847. significance level stays under the threshold of 0.01 but decreases slightly as the number of types of trajectories increases. From this result we hypothesize that the use of MCSA provides better results when the types of one-dimensional trajectories are statistically associated. Two life spheres are considered interdependent when the types stemming from OMA performed independently on each of the corresponding trajectories are associated. 29 As mentioned earlier, it is the common information implied by interdependence that allows MCSA to reduce the complexity of multidimensional typologies by locally “deducing” a channel’s missing or hidden information. Therefore, in order to test this hypothesis, we focus on other multiple social participations over time—namely, family and education-to-work trajectories. To differentiate education-to-work from occupational trajectories, we limit the period of observation from birth to age 25. 
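The asymmetric lambdas reported in Table 1 are Goodman–Kruskal proportional-reduction-in-error statistics (Goodman and Kruskal 1979). As a minimal illustration of how lambda C|R is obtained (the function name and toy data below are ours, not part of SALTT or of the SAS runs used in the paper):

```python
from collections import Counter

def goodman_kruskal_lambda(pairs):
    """Asymmetric Goodman-Kruskal lambda for predicting the column
    variable C from the row variable R (lambda C|R): the proportional
    reduction in prediction error gained by knowing R."""
    n = len(pairs)
    col_counts = Counter(c for _, c in pairs)
    # Errors made when always predicting the overall modal category of C.
    e_without = n - max(col_counts.values())
    # Errors made when predicting the modal category of C within each row.
    by_row = {}
    for r, c in pairs:
        by_row.setdefault(r, Counter())[c] += 1
    e_with = sum(sum(cnt.values()) - max(cnt.values())
                 for cnt in by_row.values())
    return (e_without - e_with) / e_without

# Toy data in which R predicts C perfectly, so lambda C|R = 1.
data = [("r1", "a"), ("r1", "a"), ("r2", "b"), ("r2", "b"), ("r3", "c")]
print(goodman_kruskal_lambda(data))  # 1.0
```

A lambda of 1 means the predictor removes all prediction errors; the values around 0.80 for lambda R|C in Table 1 mean the cross-combined monochannel types remove about 80% of the errors made when predicting the multichannel type blindly.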
One-dimensional OMA performed on these trajectories, together with the usual stopping rules, indicates a two- or five-cluster solution for the first channel (family trajectories), a three- or five-cluster solution for the second (education-to-work trajectories), and a four-cluster solution for the typology stemming from MCSA.31 Cross-tabulations of the categorical variables based on each one-dimensional typology are also created (Table 3).

29 Association is measured using the likelihood ratio chi-square and asymmetric lambda.
30 To support the comparison with the results presented in Table 2, we measure the association between family and occupational trajectories over a 25-year period and still find similarly high degrees of association. We conclude that the absence of association between family and education-to-work trajectories is therefore not due primarily to the length of the trajectories.

TABLE 3
Association Between Categorical Variables (Likelihood Ratio Chi-square and Asymmetric Lambda) Corresponding to Types of Family and Education-to-Work Trajectories from Either MCSA or Cross-Combined One-Dimensional OMA

Cross-Tabulated Types of Trajectories               LR χ2      p Value   Lambda R|C
Family (2 types) * Educ-work (3 types)               2.7324    0.2551    0.0000
Family (2 types) * Educ-work (5 types)               4.2019    0.3794    0.0000
Family (5 types) * Educ-work (3 types)               5.8219    0.6672    0.0000
Family (5 types) * Educ-work (5 types)               8.1966    0.9428    0.0000
MCSA (4 types) * [Family (2) * Educ-work (3)]      833.3803    <.0001    0.3257
MCSA (4 types) * [Family (2) * Educ-work (5)]      836.3845    <.0001    0.3257
MCSA (4 types) * [Family (5) * Educ-work (3)]     1669.0510    <.0001    0.4397
MCSA (4 types) * [Family (5) * Educ-work (5)]     1675.4887    <.0001    0.4397

Family = family trajectories; Educ-work = trajectories of the transition between education and work; MCSA = multichannel sequence analysis of these trajectories. The number of types considered is indicated in parentheses.
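The likelihood-ratio chi-square used in Tables 2 and 3 can be sketched as follows. The counts below are invented for illustration; the paper's statistics were computed with SAS on N = 1847 trajectories:

```python
import math

def lr_chisquare(table):
    """Likelihood-ratio chi-square (G2) for a two-way contingency table,
    given as a list of rows of observed counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:
                expected = row_tot[i] * col_tot[j] / n
                g2 += 2.0 * obs * math.log(obs / expected)
    # Degrees of freedom for an r x c table: (r - 1)(c - 1).
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g2, df

# Hypothetical 3 x 3 cross-tabulation of family by occupational types;
# df = (3 - 1) * (3 - 1) = 4, as in the first row of Table 2.
g2, df = lr_chisquare([[40, 10, 5], [12, 35, 8], [6, 9, 30]])
```

G2 is compared to a chi-square distribution with df degrees of freedom, which is how the p values of Tables 2 and 3 are obtained.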
The results from Table 3 show that the categorical variables representing these one-dimensional types of trajectories are not statistically associated with one another, whereas the MCSA based on family and education-to-work sequences is, logically, significantly correlated with the cross-combination of these types. Lambda values in this case, however, are much lower than those stemming from the significantly correlated one-dimensional trajectories described above. This means that the percentage reduction in error in predicting the dependent variable given by MCSA is here two to four times lower than the lambda values obtained with more highly correlated trajectories. These results confirm, to a certain extent, that MCSA is more efficient at reducing data complexity when the considered trajectories are interdependent—that is, when they share a certain amount of information.

31 The stopping rules also suggest a six-cluster solution for the MCSA. Its association with the cross-combined monochannels is quite similar to the four-cluster solution.

6.3. Resistance to Noise

The third test comparing MCSA and unidimensional OMA concerns the ability of the two approaches to "resist" noise in the data. In other words, we test the extent to which these procedures are able to identify the same structure in the data when characters in sequences are progressively and randomly replaced by characters that do not belong to the alphabet building the original sequences. From our original data set of family and occupational life trajectories, which contains only valid values, we generate 15 alternative data sets for each type of trajectory. Each of these data sets contains a progressively greater proportion of a randomly assigned unknown status compared to the original data set (from 2% to 30%, in increments of 2%).
The unknown status is associated with the same unitary substitution cost as the other statuses, and the size of the sequences remains the same after the noising process.32 We then run cluster analyses on each of the distance matrices produced by MCSA and OMA for these data sets, and cross-tabulate the typologies stemming from the original data set with those obtained using the increasingly noisy versions of that same data set. For a given type of trajectory, the number of clusters is held constant and corresponds to the types presented above (cf. Section 4.1). The degree of association between typologies (lambda coefficient) is computed for each solution and plotted in Figure 5, which compares the ability of MCSA and cross-combined OMA to identify the original data structure from its degraded signal.

Figure 5 illustrates that the four-cluster multichannel typology resists noise much better than do the other typologies. The lambda values for the former remain stable at approximately 0.85, which indicates a rather strong association with the original solution. For one-dimensional types of trajectories, the lambda values decline rapidly and show greater variation than those of the four-cluster multichannel solution.

32 The "noising" of the data is a random procedure performed by SALTT on each individual sequence.

FIGURE 5. Value of asymmetric lambda, by increasing amount of missing values, for eight types of trajectories. (y axis: Lambda R|C, 0.4–1; x axis: increasing noise/missing values, 0%–30% in 2% steps; series: Multi_4, Multi_7, Combo_9, Combo_15a, Combo_15b, Combo_25, Fam_5, Prof_5.) Multi = multichannel analysis; Combo = cross-combination of monochannel analyses; Fam = types of family trajectories as categorical variables; Prof = types of professional trajectories as categorical variables; _n = number of clusters retained.
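The noising procedure described above (performed in practice by SALTT) can be sketched as follows; the function name, toy alphabet, and toy trajectories are ours:

```python
import random

def add_noise(sequences, rate, unknown="?", seed=0):
    """Return copies of the sequences in which a proportion `rate` of
    randomly chosen positions is replaced by an 'unknown' status that
    does not belong to the original alphabet; sequence length is kept."""
    rng = random.Random(seed)
    noisy = []
    for seq in sequences:
        chars = list(seq)
        k = round(len(chars) * rate)  # number of positions to degrade
        for pos in rng.sample(range(len(chars)), k):
            chars[pos] = unknown
        noisy.append("".join(chars))
    return noisy

# Three toy 8-year trajectories degraded at a 25% noise level
# (the paper uses 2%-30% in 2% increments).
print(add_noise(["AAABBBCC", "AABBBBCC", "ABBBCCCC"], 0.25))
```

Clustering each noised data set and cross-tabulating the result with the original typology then yields the lambda values plotted in Figure 5.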
As the noising of the data occurs before the clustering procedure, each noise level considered on the abscissa of Figure 5 produces a specific cluster solution. This may explain the substantial variation from one noise level to another that is visible in Figure 5, and the peak lambda value in Figure 6 for the three-cluster solution. In the latter case, the noisy sequences lead to a splitting of the two-cluster solutions, which is not the case when clustering the original sequences that serve as references for both graphs.

To focus specifically on the behavior of MCSA with noised data, we compute the values of the asymmetric lambda for the two- to 25-cluster solutions of the original multichannel trajectories for three levels of noise in the same data (10%, 20%, and 30%) and plot them in Figure 6. Figure 6 shows that noise resistance is weakened by the increasing number of clusters and by the level of noise in the sequences; more noise is systematically associated with a weaker lambda for any given cluster solution. We must also address the extent to which the resistance to noise of a given cluster solution suggests the reliability of that solution. For instance, to what extent can we use such a result to select one cluster solution over another?

FIGURE 6. Asymmetric lambda value between a given multichannel cluster solution and its corresponding noised solutions (10%, 20%, and 30% missing). (y axis: Lambda C|R, 0.5–1; x axis: number of clusters, 2–25.)

The best solution is also the one that is more resistant to internal variations, which suggests a more stable and informative data structure.
Comparing the 25-cluster solutions for the typologies based on MCSA (Figure 6) and one-dimensional OMA (Figure 5) shows that the noisy multichannel solution predicts the original solution better than does the combination of one-dimensional OMA, although this difference is small at a noise level of 10%.

6.4. Minimizing the Distortion of Alignments

Considering two distinct dimensions of the individual life course, we use the length variation resulting from pairwise alignment as an indicator of distortion. Minimizing this variation is of special interest because each position in a sequence represents a year of life, which corresponds to a specific age. Given that some statuses and some transitions are more common at certain ages than at others, alignments with greater length variation bias the actual relations between age and social statuses. For instance, Figure 8 exemplifies how MCSA contributes to limiting distortions: the optimal alignment of Channel d alone results in a length of six, whereas when both channels are aligned simultaneously (MCSA), the length of the final alignment is five for both dimensions. In this way, MCSA keeps the chronological order of both trajectories as close to the original as possible without using indels, which allows for a better structural conservation of sequences than do systematic substitutions.

The distortion due to an alignment is defined as the sum of the products of the number of characters shifted multiplied by the size of the shift (in time units), divided by the total number of aligned character pairs. This is a standardized measure that may be used to align sequences of different lengths, although the ones used here are of equal length.

FIGURE 7. Measuring the distortion resulting from a pairwise alignment.

Position:       0123456789
seq1 aligned:   A-BBBBA-CA
seq2 aligned:   AABBB-AB-A
(seq1 original: ABBBBACA)
(seq2 original: AABBBABA)
Figure 7 gives an example of the distortion measurement resulting from a pairwise alignment. Considering the aligned sequence seq1, the three characters 'B' from the original sequence seq1 (positions 2–4) and the character 'C' (position 8) are shifted by one position (time unit) to the right. There are six aligned character pairs in the alignment, so the distortion resulting from the pairwise alignment of seq1 and seq2 is [(3 * 1) + (1 * 1)] / 6 = 4/6 ≈ 0.67.

Our aim is to test whether MCSA provides less distorted alignments than one-dimensional OMA does. Therefore, using SHP data, we compare the age distortion stemming from two separate monochannel alignment procedures for each individual—one for family and the other for occupational trajectories—with the age distortion computed using MCSA based on the same trajectories. From three data sets containing 1,847 family, occupational, and multidimensional trajectories, we obtain 1,704,781 possible alignments for each of them.33 A distortion score is computed for each alignment. To compare the alignments produced by cross-combined one-dimensional OMA and MCSA, we subtract, for each individual, the distortion score stemming from the one-dimensional alignments (family, occupational, or the larger of the two) from that produced by MCSA.

33 Number of alignments = N * (N − 1)/2 = 1847 * 1846/2 = 1,704,781.

TABLE 4
Difference in Distortion Between Multichannel (Reference) and Family, Occupational, as well as Max(Family, Occupational) Pairwise Alignments

Sequences Aligned            Family   Occupational   Max(fam., occup.)
Multichannel is better (−)     28%        26%              48%
No difference (0)              44%        50%              46%
Multichannel is worse (+)      28%        24%               6%
Total                         100%       100%             100%

N = 1,704,781 in each column. Max(fam., occup.) = larger distortion resulting from the alignment of a pair of either family or occupational trajectories for the same individual.
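The distortion measure can be formalized in several ways. The sketch below, which is ours, takes the shift of each aligned character pair to be the distance between the two characters' positions in their original, ungapped sequences, divided by the number of aligned pairs; the worked example in the text additionally counts a character aligned against a gap, so its bookkeeping differs slightly:

```python
def pair_distortion(a1, a2, gap="-"):
    """Distortion of a pairwise alignment: the sum, over aligned
    character pairs, of the positional shift (in time units) between
    the two characters' positions in their original, ungapped
    sequences, divided by the number of aligned character pairs."""
    p1 = p2 = 0          # positions in the original (ungapped) sequences
    shifts = pairs = 0
    for c1, c2 in zip(a1, a2):
        if c1 != gap and c2 != gap:   # an aligned character pair
            shifts += abs(p1 - p2)
            pairs += 1
        if c1 != gap:
            p1 += 1
        if c2 != gap:
            p2 += 1
    return shifts / pairs

# Figure 7's alignment: six aligned pairs, the three B pairs shifted
# by one time unit each.
print(pair_distortion("A-BBBBA-CA", "AABBB-AB-A"))  # 0.5
```

On unshifted alignments the measure is zero, and it grows as indels push matched years of life further apart.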
A resulting negative value indicates that the distortion stemming from MCSA is smaller than that resulting from the one-dimensional alignments.34 Table 4 shows the distortion differences between MCSA (reference) and one-dimensional OMA based on family and occupational trajectories. For each individual, we also consider the larger distortion produced by either alignment.

Table 4 shows that in the majority of cases, MCSA provides alignments that are less distorted than, or as distorted as, those of regular one-dimensional OMA. Since MCSA represents a combination of two alignments, it would be understandable if it produced more distorted alignments than one-dimensional OMA. In fact, MCSA clearly produces better results than the one-dimensional OMA performed on the channel associated with the most distorted alignments: MCSA generates less distorted alignments in approximately 50% of cases, and distortion from MCSA is greater than that from one-dimensional OMA in only 6% of alignments. In other words, MCSA's distortions are almost always smaller than, or equal to, those of two one-dimensional OMA applied separately. By reducing the distortion of sequences in the alignment process, MCSA offers a better conservation of structural and temporal patterns (Lesnard and Saint Pol 2004). Table 5 presents the paired t-test values for these comparisons and shows that MCSA significantly reduces the structural and temporal distortion of aligned sequences (p < 0.0001).

34 The comparison of individual distortion scores equals the distortion score measured on MCSA minus the largest distortion score measured on the alignment of either family or occupational trajectories.

TABLE 5
Paired t-Test on Distortion Values Resulting from Either MCSA or One-Dimensional OMA

Comparison                   Mean     Standard Deviation   t-Test Value   p
Family − MCSA                0.2911        2.4628             154.33      <.0001
Occupational − MCSA          0.1659        1.7331             124.98      <.0001
Max(fam., occup.) − MCSA     1.0948        2.4066             594.00      <.0001

N = 1,704,781.
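The paired t statistics of Table 5 follow the usual definition, t = mean difference / (SD / √N). A minimal sketch (the helper function and toy differences are ours):

```python
import math

def paired_t(diffs):
    """Paired t statistic for a list of per-individual differences
    (e.g., family-alignment distortion minus MCSA distortion)."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Toy differences; a positive t favors the reference (MCSA).
t_toy = paired_t([0.3, 0.1, 0.5, 0.0, 0.2])

# The same formula recovers Table 5's Family - MCSA t value from its
# summary statistics: 0.2911 / (2.4628 / sqrt(1,704,781)) ~ 154.3.
t_family = 0.2911 / (2.4628 / math.sqrt(1704781))
```

With N as large as 1,704,781 even small mean differences yield very large t values, which is why the text also points to the standard deviations when discussing variability.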
This reduction is greatest when comparing MCSA to regular OMA performed on the sequences that produce the greatest distortions. Despite the large number of cases, which improves performance in significance tests, the relatively high standard deviations indicate substantial variability in the data. Moreover, non-paired t-tests on the same data (not shown in Table 5) indicate that MCSA produces significantly smaller standard deviations than does one-dimensional OMA (p < 0.0001).

7. FURTHER VALIDATION ON RANDOM DATA

Having already shown the favorable properties of MCSA compared to regular OMA performed on existing social science data, this section assesses the extent to which MCSA also provides qualitatively similar results when used on random data. To compare various multidimensional approaches using OMA, we use two simulated data sets (N = 2001 pairs of sequences), each corresponding to a specific channel. In these simulated data, the alphabet and length of sequences are kept constant: within each pair, the first sequence has a length of five characters and the second a length of four, and the alphabets of the first and the second channel contain three and four characters, respectively (cf. Figure 8).35

We first use the simulated data to evaluate whether different approaches to multidimensional sequence analysis produce similar results. We compare four ways of computing multidimensionality: the ex-post sum of the distance matrices produced by two independent OMA, MCSA, and two ex-ante recodings of both channels into one unique channel (called "extended 1 and 2" in Figure 8). For aligning pairs of sequences (nested or separately), we use unitary substitution cost matrices.

35 To create the sequences, we use Perl's rand() function, which produces uniformly distributed pseudo-random numbers (Wall et al. 2000).
For the extended alphabet, when comparing two characters of the recoded sequences, the cost is one unit if the two recoded characters have a character in common, and two units if they have no character in common.37 The cost is zero when both pairs of characters are identical for two given individuals. In the first case, the value of the indel is set to the average off-diagonal value (AOD) of the substitution cost matrix, while in the second case, the indel is set to half of this value38 (see Figure 8).

We compare these different approaches using the degree of similarity between all pairs of sequences, which is given by either the raw score of the alignment or the PID. Using the simulated data sets and the cost schemes described above, we compute linear correlation coefficients among alignment scores stemming from the different ways of assessing the distance between multidimensional sequences, as shown in Table 6. The distances produced using either extended alphabets, the ex-post sum of monochannel distances, or MCSA are strongly associated, although not identical. For the two latter methods, the use of either percent identity or raw score produces the same correlations with the other measures of multidimensional distances. Since it otherwise yields the most differentiated correlations, we retain PID to estimate the distance between sequences (May 2004).

The six measures of multidimensional distances between individual trajectories are all based on some linear function of one-dimensional OMA distances. They differ essentially in the timing of the contribution of each channel—that is, either before, during, or after the alignment process. As one can read from Table 6, the results produced by MCSA appear here as a representative common denominator of the other measures.
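The extended-alphabet cost scheme (zero, one, or two units depending on how many channels the combined characters share) can be sketched as follows, using the example of footnote 37; the function name and tuple encoding are ours:

```python
def extended_cost(pair_a, pair_b):
    """Substitution cost between two positions of sequences recoded
    over an extended (combined) alphabet.  Each argument is the tuple
    of statuses observed on the two channels at one position: cost 0
    if the combined characters are identical, 1 if they share the
    status of one channel, 2 if they share none."""
    if pair_a == pair_b:
        return 0
    shared = sum(1 for x, y in zip(pair_a, pair_b) if x == y)
    return 1 if shared >= 1 else 2

# Footnote 37's example: "f" = (m, z), "j" = (m, t), "g" = (n, z).
print(extended_cost(("m", "z"), ("m", "t")))  # 1: "m" in common
print(extended_cost(("m", "t"), ("n", "z")))  # 2: nothing in common
```

The indel cost is then derived from this matrix (the full AOD or half of it, as described above).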
This means that they are at the same time as variable as the five others but less sensitive to the computation options, which is a first indication of the robustness of MCSA.

36 This means that substituting any character with another one has a cost of 1, whereas substituting a character with itself has a cost of 0.
37 For instance, if the recoded character "f" stands for "m" and "z" at the same position in channels 1 and 2, "j" stands for "m" and "t," and "g" stands for "n" and "z," then the cost of substituting "f" and "j" is twofold lower than the cost of substituting "j" and "g."
38 At this point, we did not consider the differentiation between the gap opening penalty (GOP) and the gap extension penalty (GEP; Thompson et al. 1994), or between internal and external gaps.

FIGURE 8. Comparing MCSA to summing two distance matrices or using an extended alphabet (random data sets).

8. DISCUSSION

This paper explores two key points regarding the methodological potential of multichannel sequence analysis (MCSA). First, MCSA offers an overall advantage over conventional OMA, since it allows for the simultaneous analysis of multiple social trajectories without prior recoding of the data. MCSA produces an extended alphabet that corresponds to the combination of the two or more alphabets defining the different types of sequences used in the analysis. The main advantage of MCSA over other extended-alphabet methods (Dijkstra and Taris 1995; Stovel et al.
1996) is that it avoids defining, coding, and weighting all combinations prior to the analysis, and it therefore allows for the use of weighting strategies specific to each dimension (family, occupation, and so forth) considered separately, such as the data-based training procedure (Gauthier et al. 2009). Keeping the specific codification of each trajectory distinct allows for a better interpretation of MCSA's typologies, since nested trajectories are represented as parallel processes associated with substitution matrices that are in themselves informative.

TABLE 6
Similarity Matrix (Linear Correlation Coefficients) Between Six Measures of Two-Dimensional Pairwise Distances Using Random Data Sets (N = 2001)

              PID_MCSA  Sum_PID_OMA  Score_ext1  Score_ext2  PID_ext1  PID_ext2
PID_MCSA        1.000
Sum_PID_OMA     0.803      1.000
Score_ext1      0.898      0.773       1.000
Score_ext2      0.960      0.748       0.826       1.000
PID_ext1        0.754      0.517       0.704       0.798      1.000
PID_ext2        0.739      0.540       0.577       0.898      0.736     1.000

Applied to social science data from the Swiss Household Panel, the illustrative application of MCSA shows that it produces more convincing results than does independent OMA. Moreover, it provides semantically and graphically straightforward patterns of the ways multidimensional social participations unfold over time, a feature that represents one of the central developing fields of sequence analysis.

Second, our results on the same data show that MCSA performs best when the dimensions under study are interdependent. Comparing the analysis of correlated versus uncorrelated monochannels, we find that MCSA leads to a greater reduction of complexity when the trajectories are statistically associated and when the number of clusters is relatively small. This outcome provides a first indication of MCSA's range of applications: It is precisely when nested trajectories are interdependent—that is, when they share information—that MCSA is required.
Additionally, the results show the ability of MCSA to simplify the outputs obtained from regular OMA. This simplification is achieved by dramatically reducing the number of categories involved, while retaining a high proportion of the original information. This means that considering interdependence increases the complexity of the nested trajectories under study, but at the same time reduces the number of relevant combinations, compared to cross-combining results from independent one-dimensional OMA.

We also test the resistance of MCSA to noisy data. It appears that MCSA is less sensitive to increasing noise in the data than are combinations of regular one-dimensional OMA. MCSA uses the interdependence that exists between nested trajectories as an additional source of information to identify relevant multidimensional patterns, even when some of the data are missing.

This paper also examines the issue of sequence distortion produced by the use of insertions and deletions that change sequence length. Carrying out an alignment modifies the correspondence between actual age and the position in the sequence prior to alignment. In measuring the distortion resulting from MCSA and conventional OMA, we find that MCSA is nearly always superior or equivalent to conventional OMA in minimizing this distortion; that is, MCSA performs better by keeping the length of aligned sequences as close as possible to that of the original sequences. We demonstrate the ability of MCSA to produce less distorted alignments and to take the timing of episodes into account more accurately than combinations of conventional OMA. This feature is particularly important when considering not only relative duration but also dimensions such as social age, which take into account the fact that some social statuses or transitions are more common at certain points in life than at others.
In other words, MCSA produces alignments that optimize the relationship between age and social statuses over time. Finally, using random data, we demonstrate that the distances produced by MCSA differ from those produced by either summing pairwise distances from independent OMA or recoding the data prior to analysis. Furthermore, MCSA yields the strongest correlations with the results of alternative measures of multidimensional distances. It therefore appears to be the most representative technique among those that we examined.

Given the number of dimensions that may play a role in the variability of results obtained through either method, this paper provides initial guidance on the potential advantages of MCSA. Further developments of the method, along with in-depth testing, are needed to continue improving MCSA. Our main expectation regarding MCSA is that it significantly reduces the "signal differences" between channels when the channels are related. It is precisely the correlation between channels that allows the alignment procedure to benefit from the information contained in one sequence but not another, and ultimately to produce multichannel alignments that reduce the complexity, distortion, and loss of signal due to such noise or to missing values. This informational asymmetry between channels may vary over time (i.e., it is position-specific), and it may depend on specific stages of the life course or on specific features of social age. In some stages, for instance, occupational status is poorly or not at all informative (e.g., during school years). In such cases, information from the other channel(s) should be given preference. In other words, if one channel is more informative than another at a given point in the sequence, we should rely more heavily on the more informative channel to compute the multichannel alignment. In the same way, if there are missing values on one channel, we should "let the other channel talk" by giving it more weight.
Future developments should implement heuristic procedures to systematize methods for dealing with such information asymmetries between channels. In this paper, we have set the combination of substitution costs at one point to the average value of the two substitution costs involved at that point. An alternative would be to follow some theory-based weighting scheme (e.g., costs set to the highest or the lowest value), or to rely on empirically determined substitution costs. Overall, MCSA presents two main advantages over one-dimensional OMA: It allows both for the discovery of regularities within multidimensional trajectories and for the reduction of the effects of noise, whether due to missing data, poorly recorded information, or heterogeneous information content.

APPENDIX

The computations presented in this paper are encapsulated in the program SALTT (Search Algorithm for Life Trajectories and Transitions), an open-source program written in C (Notredame, Bucher, Gauthier, and Widmer 2005). It can be compiled and installed on any UNIX-like platform, including Linux, Cygwin, and MacOSX. The package and its documentation can be downloaded from: http://www.tcoffee.org/saltt/.

REFERENCES

Abbott, A. 1992. "From Causes to Events: Notes on Narrative Positivism." Sociological Methods and Research 20 (4):428–55.
———. 2001. Time Matters: On Theory and Method. Chicago, IL: University of Chicago Press.
Abbott, A., and A. Hrycak. 1990. "Measuring Resemblance in Sequence Data: An Optimal Matching Analysis of Musicians' Careers." American Journal of Sociology 96 (1):144–85.
Abbott, A., and A. Tsay. 2000. "Sequence Analysis and Optimal Matching Methods in Sociology." Sociological Methods and Research 29 (1):3–33.
Aisenbrey, S., and A. E. Fasang. 2010. "New Life for Old Ideas: The 'Second Wave' of Sequence Analysis Bringing the 'Course' Back Into the Life Course." Sociological Methods and Research 38 (3):420–62.
Blair-Loy, M. 1999.
"Career Patterns of Executive Women in Finance: An Optimal Matching Analysis." American Journal of Sociology 104 (5):1346–97.
Blossfeld, H.-P., and G. Rohwer. 1995. Techniques of Event History Modeling. Mahwah, NJ: Lawrence Erlbaum.
Butts, C., and J. Pixley. 2004. "A Structural Approach to the Representation of Life History Data." Journal of Mathematical Sociology 28 (2):81–124.
Clausen, J. A. 1986. The Life Course: A Sociological Perspective. Toronto: Prentice-Hall.
Claverie, J.-M., and C. Notredame. 2003. Bioinformatics for Dummies. New York: Wiley.
Confais, J., Y. Grelet, and M. Le Guen. 2005. "La Procédure FREQ de SAS: Tests d'indépendance et mesures d'association dans un tableau de contingence." La Revue Modulad 33:188–224.
Dijkstra, W., and T. Taris. 1995. "Measuring the Agreement Between Sequences." Sociological Methods and Research 24 (2):214–31.
Doolittle, R. F. 1981. "Similar Amino Acid Sequences: Chance or Common Ancestry." Science 214 (4517):149–59.
Duda, R. O., and P. E. Hart. 1973. Pattern Classification and Scene Analysis. New York: Wiley.
Durbin, R., S. Eddy, A. Krogh, and G. Mitchison. 2002. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, England: Cambridge University Press.
Elder, G. H., ed. 1985. Life Course Dynamics: Trajectories and Transitions, 1968–1980. Ithaca, NY: Cornell University Press.
Elder, G. H., M. Kirkpatrick Johnson, and R. Crosnoe. 2003. "The Emergence and Development of Life Course Theory." Pp. 3–19 in Handbook of the Life Course, edited by J. T. Mortimer and M. J. Shanahan. New York: Kluwer.
Elzinga, C. H. 2003. "Sequence Similarity: A Non-Aligning Technique." Sociological Methods and Research 31 (4):3–29.
Esser, H. 1996. "What Is Wrong with 'Variable Sociology'?" European Sociological Review 12 (2):159–66.
Gabadinho, A., G. Ritschard, M. Studer, and N. S. Müller. 2008. Mining Sequence Data in R with the TraMineR Package: A User's Guide.
University of Geneva. Retrieved January 21, 2010 (http://mephisto.unige.ch/traminer).
Gauthier, J.-A., E. D. Widmer, P. Bucher, and C. Notredame. 2009. "How Much Does It Cost? Optimization of Costs in Sequence Analysis of Social Science Data." Sociological Methods and Research 38 (1):197–231.
George, L. K. 1993. "Sociological Perspectives on Life Transitions." Annual Review of Sociology 19:353–73.
Giele, J. Z., and G. H. Elder Jr., eds. 1998. Methods of Life Course Research: Qualitative and Quantitative Approaches. Thousand Oaks, CA: Sage.
Goodman, L. A., and W. H. Kruskal. 1979. Measures of Association for Cross Classification. New York: Springer-Verlag.
Groves, R. M., F. J. Fowler, M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau. 2004. Survey Methodology. Wiley Series in Survey Methodology. New York: Wiley.
Han, S.-K., and P. Moen. 1999. "Clocking Out: Temporal Patterning of Retirement." American Journal of Sociology 105 (1):191–236.
Heinz, W. R. 2003. "From Work Trajectories to Negotiated Careers." Pp. 185–204 in Handbook of the Life Course, edited by J. T. Mortimer and M. J. Shanahan. New York: Kluwer.
Heinz, W. R., and V. W. Marshall, eds. 2003. Social Dynamics of the Life Course: Transitions, Institutions, and Interrelations. New York: Aldine de Gruyter.
Höpflinger, F. C., and A. Debrunner. 1991. Familienleben und Berufsarbeit. Zurich, Switzerland: Seismo.
Kohli, M. 1986. "The World We Forgot: A Historical Review of the Life Course." Pp. 271–303 in Later Life: The Social Psychology of Aging, edited by V. W. Marshall. London: Sage.
Kruskal, J. 1983. "An Overview of Sequence Comparison." Pp. 1–44 in Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, edited by D. Sankoff and J. Kruskal. Don Mills, Ontario: Addison-Wesley.
Lesesne, C. A., and C. Kennedy. 2005. "Starting Early: Promoting the Mental Health of Women and Girls Throughout the Life Span." Journal of Women's Health 14 (9):754–63.
Lesnard, L., and T. de Saint Pol. 2006. "Introduction aux méthodes d'appariement optimal (Optimal Matching Analysis)." Bulletin of Sociological Methodology 90:5–25.
Levine, J. H. 2000. "But What Have You Done for Us Lately?" Sociological Methods and Research 29 (1):34–40.
Levy, R. 1991. "Status Passages as Critical Life Course Transition: A Theoretical Sketch." Pp. 87–114 in Theoretical Advances on Life Course Research, edited by W. R. Heinz. Weinheim, Germany: Deutscher Studien Verlag.
———. 1996. "Toward a Theory of Life Course Institutionalization." Pp. 83–108 in Society and Biography, edited by A. Weymann and W. R. Heinz. Weinheim, Germany: Deutscher Studien Verlag.
Levy, R., J.-A. Gauthier, and E. D. Widmer. 2006. "Entre contraintes institutionnelle et domestique: les parcours de vie masculins et féminins en Suisse." Canadian Journal of Sociology 31 (4):461–89.
Macmillan, R., ed. 2005. The Structure of the Life Course: Standardized? Individualized? Differentiated?, Vol. 9. Amsterdam: JAI Press.
Macmillan, R., and S. R. Eliason. 2003. "Characterizing the Life Course as Role Configurations and Pathways: A Latent Structure Approach." Pp. 529–54 in Handbook of the Life Course, edited by J. T. Mortimer and M. J. Shanahan. New York: Kluwer Academic.
May, A. C. W. 2004. "Percent Sequence Identity: The Need to Be Explicit." Structure 12:737–38.
Milligan, G. W., and M. C. Cooper. 1985. "An Examination of Procedures for Determining the Number of Clusters in a Data Set." Psychometrika 50 (2):159–79.
———. 1987. "Methodology Review: Clustering Methods." Applied Psychological Measurement 11 (4):329–54.
Moen, P. 1985. "Continuities and Discontinuities in Women's Labor Force Activity." Pp. 113–55 in Life Course Dynamics: Trajectories and Transitions, 1968–1980, edited by G. H. Elder. Ithaca, NY: Cornell University Press.
Mojena, R. 1977. "Hierarchical Grouping Methods and Stopping Rules: An Evaluation." The Computer Journal 20:359–63.
Mortimer, J. T., and M. J.
Shanahan, eds. 2003. Handbook of the Life Course. New York: Kluwer Academic. MULTICHANNEL SEQUENCE ANALYSIS 37 ¨ Muller, N. S., A. Gabadinho, G. Ritschard, and M. Studer. 2008. “Extracting Knowledge from Life Courses: Clustering and Visualization.” Pp. 176–85 in Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery. Turin, Italy: Springer-Verlag. Nargundkar, S., and T. J. Olzer. 1998. “An Application of Cluster Analysis in the Financial Services Industry.” Presented at the Sixth annual meeting of the South East SAS Users Group (SESUG), Norfolk, Virginia. National Centre for Biotechnology Information (NCBI) 2004. Glossary. Retrived October 15, 2004 (http://www.ncbi.nlm.nih.gov/ Education/BLASTinfo/glossary2.html). Needleman, S. B., and C. D. Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology 48:443–53. Notredame, C., P. Bucher, J.-A. Gauthier, and E. Widmer. 2005. TCoffee/SALTT: User Guide and Reference Manual. Retrieved October 15, 2005. (http://www.tcoffee.org/saltt) Olzak, M., and G. Ritschard. 1995. “The Behaviour of Nominal and Ordinal Partial Association Measures.” The Statistician 44 (2):195–212. Piccarreta, R., and O. Lior. 2010. “Exploring Sequences: A Graphical Tool Based on Multi-Dimensional Scaling.” Journal of the Royal Statistical Society, Series A: Statistics in Society 173 (1):165–84. Pollock, G. 2007. “Holistic Trajectories: A Study of Combined Employment, Housing, and Family Careers Using Multiple Sequence Analysis.” Journal of the Royal Statistical Society, Series A: Statistics in Society 170:167–83. Raghava, G. P. S., and G. Barton. 2006. “Quantification of the Variation in Percentage Identity for Protein Sequence Alignments.” BMC Bioinformatics 7 (1):415. Repetti, R. L., S. E. Taylor, and T. E. Seeman. 2002. 
“Risky Families: Family Social Environments and the Mental and Physical Health of Offspring.” Psychological Bulletin 128(2):330–66. Sankoff, D., and J. Kruskal. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Don Mills, Ontario: AddisonWesley. R SAS Institute. 2004. SAS/STAT 9.1 User’s Guide. Cary, NC: SAS Institute Inc. Sheridan, J. T. 1997. The Effects of the Determinants of Women’s Movement into and out of Male-Dominated Occupations on Occupational Sex Segregation. Madison: Department of Sociology, Center for Demography and Ecology, University of Wisconsin. Siegel, S., and N. J. Castellan. 1988. Nonparametric Statistics for the Behavioural Sciences, 2nd ed. New York: McGraw-Hill. Spruijt, E., and M. de Goede. 1997. “Transitions in Family Structure and Adolescent Well-Being.” Adolescence 32(128):897–911. Stovel, K., M. Savage, and P. Bearman. 1996. “Ascription into Achievement: Models of Career Systems at Lloyds Bank, 1890–1970.” American Journal of Sociology 102(2):358–99. 38 GAUTHIER ET AL. Thompson, J., D. G. Higgins, and T. Gibson. 1994. “CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice.” Nucleic Acids Research 22:4673–80. Tillmann, R., and E. Zimermann. 2004. “Introduction: The Swiss Household Panel and the Nature of This Book.” Pp. 1–25 in Vivre en Suisse 1999–2000 [Living in Switzerland 1999–2000], edited by R. Tillmann and E. Zimermann. Bern, Switzerland: Peter Lang. Tufte, E. R. 1997. Visual Explanation, Images and Quantities, Evidence and Narrative. Cheshire, CO: Graphic Press. Wall, L., T. Christiansen, and J. Orwant. 2000. Programming Perl, 3rd ed. Sebastopol, CA: O’Reilly. Ward, J. H. 1963. “Hierarchical Grouping to Optimize an Objective Function.” Journal of the American Statistical Association 58(301):236–44. Wasserman, S., and K. Faust. 1994. 
Social Network Analysis: Methods and Applications. Cambridge, England: Cambridge University Press. Wetzler, H. P., and R. J. Ursano. 1988. “A Positive Association Between Physical Health Practices and Psychological Well-Being.” The Journal of Nervous and Mental Disease 176 (5):280–83. Widmer, E., R. Levy, A. Pollien, R. Hammer, and J.-A. Gauthier. 2003. “Une analyse exploratoire des insertions professionnelles et familiales: Trajectoires de couples r´ sidant en Suisse.” Revue suisse de Sociologie 29(1):35–67. e Wu, L. L. 2000. “Some Comments on ‘Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect’.” Sociological Methods and Research 29(1):41–64. BIOINFORMATICS Vol. 19 no. 1 2003, pages i1–i7 DOI: 10.1093/bioinformatics/btg1029 APDB: a novel measure for benchmarking sequence alignment methods without reference alignments Orla O’Sullivan 1, Mark Zehnder 3, Des Higgins 1, Philipp Bucher 3, ´ Aurelien Grosdidier 3 and Cedric Notredame 2, 3,∗ 1 Department of Biochemistry, University College, Cork, Ireland, 2 Information ´ ´ Genetique et Structurale, CNRS UMR-1889, 31, Chemin Joseph Aiguier, 13402 Marseille, France and 3 Swiss Institute of Bioinformatics, Chemin des Boveresse, 155, 1066 Epalinges, Switzerland Received on January 6, 2000; revised on Month xx, 2000; accepted on February 20, 2000 Author please check use of A and B heads is correct ABSTRACT Motivation: We describe APDB, a novel measure for evaluating the quality of a protein sequence alignment, given two or more PDB structures. This evaluation does not require a reference alignment or a structure superposition. APDB is designed to efficiently and objectively benchmark multiple sequence alignment methods. Results: Using existing collections of reference multiple sequence alignments and existing alignment methods, we show that APDB gives results that are consistent with those obtained using conventional evaluations. 
We also show that APDB is suitable for evaluating sequence alignments that are structurally equivalent. We conclude that APDB provides an alternative to more conventional methods used for benchmarking sequence alignment packages.

Availability: APDB is implemented in C; its source code and its documentation are available for free on request from the authors.

Contact: cedric.notredame@gmail.com

* To whom correspondence should be addressed.

INTRODUCTION

We introduce APDB (Analyze alignments with PDB), a new method for benchmarking and improving multiple sequence alignment packages with minimal human intervention. We show how it is possible to avoid the use of reference alignments when PDB structures are available for at least two homologous sequences in a test alignment. Using this method it should become possible to systematically benchmark or train multiple sequence alignment methods using all known structures, in a completely automatic manner.

There are strong justifications for improving multiple sequence alignment methods. Many sequence analysis techniques used in bioinformatics require the assembly of a multiple sequence alignment at some point. These include phylogenetic tree reconstruction, detection of remote homologues through the use of profiles or HMMs, secondary and tertiary structure prediction and, more recently, the identification of the nsSNPs (non-synonymous Single Nucleotide Polymorphisms) that are most likely to alter a protein function. All of these important applications demonstrate the need to improve existing multiple sequence alignment methods and to establish their true limits and potential. Doing so is complicated, however, because most multiple sequence alignment methods rely on a complicated combination of greedy heuristic algorithms meant to optimize an objective function. This objective function is an attempt to quantify the biological quality of an alignment.
Almost every multiple alignment package uses a different empirical objective function of unknown biological relevance. In practice, most of these algorithms are known to perform well on some protein families and less well on others, but it is difficult to predict this in advance. It can also be very hard to establish the biological relevance of a multiple alignment of poorly characterized protein families. See Duret and Abdeddaim (2000) and Notredame (2002) for two recent reviews of the wide variety of techniques that have been used to make multiple alignments.

Given such a wide variety of methods and such poor theoretical justification for most of them, the main option for a rational comparison is systematic benchmarking. This is usually accomplished by comparing the alignments produced by various methods with 'reference' alignments of the same sequences assembled by specialists with the help of structural information. Barton and Sternberg (1987) made an early systematic attempt to validate a multiple sequence alignment method using structure based alignments of globins and immunoglobulins. Later on, Notredame and Higgins (1996) used another collection of such alignments assembled by Pascarella and Argos (1992). More recently, it has become common practice to use BAliBASE (Thompson et al., 1999); a collection of multiple sequence alignments assembled by specialists and designed to systematically address the different types of problems that alignment programs encounter, such as the alignment of a distant homologue or long insertions and deletions. In this work, we examined two such reference collections: BAliBASE and HOMSTRAD (Mizuguchi et al., 1998), a collection of high quality multiple structural alignments.

Bioinformatics 19(1) (c) Oxford University Press 2003; all rights reserved.

There are two simple ways to use a reference alignment for the purpose of benchmarking (Karplus and Hu, 2001).
One may count the number of pairs of aligned residues in the target alignment that also occur in the reference alignment and divide this number by the total number of pairs of residues in the reference. This is the Sum of Pairs Score (SPS). The main drawback is that it is not very discriminating and tends to even out differences between methods. The more popular alternative is the Column Score (CS) where one measures the percentage of columns in the target alignment that also occur in the reference alignment. This is widely used and is considered to be a stringent measure of alignment performance. In order to avoid the problem of unalignable sections of protein sequences (i.e. segments that cannot be superimposed), it is common practice to annotate the most reliable regions of a multiple structural alignment and to only consider these core regions for the evaluation. In BaliBase, the core regions make up slightly less than 50% of the total number of alignment columns. Such use of multiple sequence alignment collections for benchmarking is very convenient because of its simplicity. However, a major problem is the heavy reliance on the correctness of the reference alignment. This is serious because, by nature, these reference alignments are at least partially arbitrary. Although structural information can be handled more objectively than sequence information, the assembly of a multiple structural alignment remains a very complex problem for which no exact solution is known. As a consequence, any reference multiple alignment based on structure will necessarily reflect some bias from the methods and the specialist who made the assembly. The second drawback is that given a set of structures there can be more than one correct alignment. This plurality results from the fact that a structural superposition does not necessarily translate unambiguously into one sequence alignment. 
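As an aside, both benchmark scores described above are simple to compute once an alignment is expressed as sets of residue pairs and columns. The following sketch is our own toy illustration, not code from the paper: `sps`, `cs`, and the two-sequence alignments are invented, and the core-region restriction used in BAliBASE is deliberately skipped.

```python
# Sketch: Sum of Pairs Score (SPS) and Column Score (CS) of a test
# alignment against a reference alignment of the same sequences.
# Alignments are lists of equal-length gapped strings; toy data only.

def aligned_pairs(aln):
    """Set of residue pairs ((seq_i, res_i), (seq_j, res_j)) aligned in aln."""
    counters = [0] * len(aln)  # next residue index in each sequence
    pairs = set()
    for col in range(len(aln[0])):
        residues = []
        for s, seq in enumerate(aln):
            if seq[col] != '-':
                residues.append((s, counters[s]))
                counters[s] += 1
        for a in range(len(residues)):
            for b in range(a + 1, len(residues)):
                pairs.add((residues[a], residues[b]))
    return pairs

def columns(aln):
    """List of columns keyed by residue indices, with None marking gaps."""
    counters = [0] * len(aln)
    cols = []
    for col in range(len(aln[0])):
        key = []
        for s, seq in enumerate(aln):
            if seq[col] != '-':
                key.append(counters[s])
                counters[s] += 1
            else:
                key.append(None)
        cols.append(tuple(key))
    return cols

def sps(test, ref):
    """Fraction of reference residue pairs recovered by the test alignment."""
    rp = aligned_pairs(ref)
    return len(aligned_pairs(test) & rp) / len(rp)

def cs(test, ref):
    """Fraction of gap-free reference columns reproduced exactly in test."""
    rc = [c for c in columns(ref) if all(x is not None for x in c)]
    tc = set(columns(test))
    return sum(c in tc for c in rc) / len(rc)

ref = ["GARFIELD", "GARF-ELD"]
test = ["GARFIELD", "GAR-FELD"]
print(sps(test, ref), cs(test, ref))  # both equal 6/7 on this toy pair
```

On this pair, a single displaced gap costs one residue pair and one column, so both scores come out at 6/7; on realistic data CS tends to drop faster, since one misplaced residue invalidates an entire column.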
For instance, if we consider that the residues to be aligned correspond to the residues whose alpha carbons are the closest in the 3-D superposition, it is easy to imagine that sometimes an alpha carbon can be equally close to the alpha carbons of two potentially homologous residues. Most structure based sequence alignment procedures break this tie in an arbitrary fashion, leading to a reference alignment that represents only one possible arrangement of aligned residues. This problem becomes most serious when the sequences one is considering are distantly related (less than 30% identity). Unfortunately, this is also the most interesting level of similarity, where most sequence alignment methods make errors and where it is important to accurately benchmark existing algorithms. The APDB method that we describe in this work has been designed to specifically address this problem and remove, almost entirely, the need for arbitrary decisions when using structures to evaluate the quality of a multiple sequence alignment. In APDB, a target alignment is not evaluated against a reference alignment. Rather, we measure the quality of the structural superposition induced by the target alignment, given any structures available for the sequences it contains. By treating the alignment as the result of some sort of structure superposition, we simply measure the fraction of aligned residues whose structural neighborhoods are similar. This makes it possible to avoid the most expensive and controversial element of the MSA benchmarking methods: the reference multiple sequence alignment. APDB requires just three parameters. This is tiny if we compare it with any reference alignment, where each pair of aligned residues can arguably be considered as a free parameter. In this work we show how the APDB measure was designed and characterized on a few carefully selected pairs of structures.
Among other things we explored its sensitivity to parameter settings and various sequence and structure properties, such as similarity, length, or alignment quality. Finally, APDB was used to benchmark known methods using two popular data sets: BaliBase and Homstrad. These were either used as standard reference alignments or as collections of structures suitable for APDB.

It should be noted that there are several methods for evaluating the quality of structure models and predictions using known structures. The development of these has been driven by the need to evaluate entries in the CASP protein structure prediction competition, and they have been reviewed by Cristobal et al. (2001). These all depend on generating structure superpositions between the model and the target and evaluating the quality of the match using, for example, the RMSD between the two, or using some measure of the number of alpha carbons that superimpose well (e.g. MaxSub by Siew et al. (2000)). In principle, this could also be used to benchmark alignment methods. However, one serious disadvantage is the requirement for a superposition, which is itself a difficult problem. A second disadvantage is the way RMSD measures behave with different degrees of sequence divergence and their sensitivity to local or global alignment differences. We have carefully designed APDB so that on the one hand it remains very simple, but on the other hand it is able to measure the similarity of the structural environments in a manner that lends itself to measuring alignment quality.

SYSTEM AND METHODS

The APDB scoring function

APDB is a measure designed to evaluate how consistent an alignment is with the structure superposition this alignment implies. Let us imagine that A and B are two homologous structures. If the structure of sequence A tells us that the residues X and Z are 9 Å apart, then we expect to find a similar distance between the two residues Y and W of sequence B that are aligned with X and Z. The difference between these two distances is an indicator of the alignment quality.

              _________ 9 Å _________
A  aaaaaaaaaaaXaaaaaaaaaaaaaaaZaaaaaaa
B  bbbbbbbbbbbYbbbbbbbbbbbbbbbWbbbbbbb
              _________ 9 Å? ________

In APDB we take this idea further by measuring the differences of distances between X:Y (X aligned with Y) and Z:W within a bubble of fixed radius centered around X and Y. The bubble makes APDB a local measure, less sensitive than a classic RMSD measure to the existence of non-superposable parts in the structures being considered. Furthermore it ensures that a bad portion of the alignment does not dramatically affect the overall alignment evaluation. The typical radius of this bubble is 10 Å, and it contains 20 to 40 amino acids. We consider two residues to be properly aligned if the distances from these two residues to the majority of their neighbors within the bubble are consistent between the two structures. In other words, we check whether a structural neighborhood is supportive of the alignment of the two residues that sit at its center. This can be formalized as follows:

X:Y is a pair of aligned residues in the alignment.
N is the number of aligned pairs of residues.
d(X,Z) is the distance between the Cα of the two residues X and Z within one structure.
Brad is the radius of the bubble set around residues X and Y (Brad = 10 Å by default).
T1 is the maximum difference of distance between d(X,Z) and d(Y,W) (T1 = 1 Å by default).
T2 is the minimal percentage of residues that must respect the criterion set by T1 for X and Y to be considered correctly aligned (70% by default).
considered_X:Y(Z:W) is equal to 1 if the pair Z:W is in the bubble defined by pair X:Y.
correct_X:Y(Z:W) is equal to 1 if d(X,Z) and d(Y,W) are sufficiently similar, as set by T1.
aligned(X:Y) is equal to 1 if most pairs Z:W in the X:Y bubble are correct, as set by T2.

considered_X:Y(Z:W) = 1 if d(X,Z) < Brad and d(Y,W) < Brad    (1)

correct_X:Y(Z:W) = 1 if d(X,Z) < Brad and d(Y,W) < Brad and |d(X,Z) - d(Y,W)| < T1    (2)

aligned(X:Y) = 1 if [ Σ_Z:W correct_X:Y(Z:W) / Σ_Z:W considered_X:Y(Z:W) ] × 100 > T2    (3)

Finally, the APDB measure for the entire alignment is defined as:

APDB Score = [ Σ_X:Y aligned(X:Y) ] / N    (4)

Given a multiple alignment of sequences with known structures, the APDB score can easily be turned into a sum of pairs score by summing the APDB score of each pair of structures and dividing it by the total number of sequence pairs considered.

Design of a benchmark system for APDB

In order to study the behavior of APDB, we used two established collections of reference alignments: BAliBASE (Thompson et al., 1999) and HOMSTRAD (Mizuguchi et al., 1998). First we extracted 9 structure based pair-wise sequence alignments from HOMSTRAD, which we refer to as HOM 9. These reference alignments were chosen so that their sequence identities (as measured on the HOMSTRAD reference alignments) evenly cover the range 17 to 90%. These alignments are between 200 and 300 residues long and are used for detailed analysis and parameterization of APDB. The PDB names of the pairs of structures are given in the figure legend for Figure 2. Next, in order to assemble a discriminating test set, we selected the most difficult alignments from HOMSTRAD. We chose alignments which had at least 4 sequences and where the average percent identity was 25% or less. This resulted in a selection of 43 alignments, which we refer to as HOM 43. BAliBASE version 1 has 141 alignments divided into 5 reference groups.
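Read together, Equations (1) to (4) amount to a few nested loops over the aligned residue pairs. The sketch below is our own illustration, not the authors' implementation: `pairs` lists aligned residue index pairs (X, Y), `coords_a` and `coords_b` hold Cα coordinates, and the input at the bottom is a toy stand-in for real PDB data.

```python
import math

# Illustrative sketch of the APDB score (Equations 1-4) for one pair of
# aligned structures. Toy data only; a real run would parse Ca coordinates
# from PDB files and the pair list from a sequence alignment.

BRAD = 10.0  # bubble radius, in Angstroms
T1 = 1.0     # maximum difference of distances, in Angstroms
T2 = 70.0    # minimal percentage of correct pairs within the bubble

def apdb_score(pairs, coords_a, coords_b):
    n_aligned = 0
    for X, Y in pairs:
        considered = correct = 0
        for Z, W in pairs:
            if (Z, W) == (X, Y):
                continue
            dxz = math.dist(coords_a[X], coords_a[Z])
            dyw = math.dist(coords_b[Y], coords_b[W])
            if dxz < BRAD and dyw < BRAD:      # Eq. (1): Z:W is in the bubble
                considered += 1
                if abs(dxz - dyw) < T1:        # Eq. (2): the distances agree
                    correct += 1
        # Eq. (3): the neighborhood supports the alignment of X and Y
        if considered and 100.0 * correct / considered > T2:
            n_aligned += 1
    return n_aligned / len(pairs)              # Eq. (4)

# Toy input: two identical eight-residue "structures", aligned one-to-one
coords = [(float(i), 0.0, 0.0) for i in range(8)]
pairs = [(i, i) for i in range(8)]
print(apdb_score(pairs, coords, coords))  # identical structures score 1.0
```

On real data, the per-pair scores would then be averaged over all structure pairs to obtain the sum-of-pairs extension described above.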
We chose all alignments where 2 or more of the sequences had a known structure. This resulted in a subset of 91 alignments from the first 4 reference groups of BAliBASE, which we refer to as BALI 91. Minor adjustments had to be made to ensure consistency between BAliBASE sequences and the corresponding PDB files. The HOM 43 and BALI 91 test sets are available in the APDB distribution.

Comparison of APDB with other standard measures

In order to compare the APDB measure with more conventional measures, we used the Column Score (CS) measure as provided by the aln compare package (Notredame et al., 2000). CS measures the percentage of columns in a test alignment that also occur in the reference alignment. In BAliBASE this measure is restricted to those columns annotated as core regions in the reference. Although alternative measures have recently been introduced (Karplus and Hu, 2001), CS has the advantage of being one of the most widely used and the simplest method available today.

Generation of multiple alignments

We compared the performance of APDB on two different multiple alignment methods. We tested the widely used ClustalW version 1.80 (Thompson et al., 1994). We also tested the more recent T-Coffee version 1.37 (Notredame et al., 2000) using default parameters.

Generation of suboptimal alignments

In order to evaluate the sensitivity of APDB to the quality of an alignment, we used an improved version of the genetic algorithm SAGA (Notredame and Higgins, 1996) in order to generate populations of sub-optimal alignments. In each case a pair of sequences was chosen in HOM 9 and 50 random alignments were generated and allowed to evolve within SAGA so that their quality gradually improved (as measured by their similarity with the HOMSTRAD reference alignment). Ten alignments were sampled at each generation in order to build a collection of alternative alignments with varying degrees of quality. This algorithm was stopped when optimality was reached, thus typically yielding collections of a few hundred alignments.

A second method for generating sub-optimal alignments was based on the PROSUP package (Lackner et al., 2000). PROSUP takes two structures, makes a rigid body superposition and generates all the sequence alignments that are consistent with this superposition, thus producing alternative alignments that are equivalent from a structural point of view. Typically PROSUP yields 5 to 25 alternative alignments within a very narrow range of RMSDs.

RESULTS AND DISCUSSION

Fine tuning of APDB

Three parameters control the behaviour of APDB: Brad (the bubble radius), T1 (the difference of distance threshold) and T2 (the fraction of the bubble neighbourhood that must support the alignment of two residues). We exhaustively studied the tuning effect of each of these parameters using HOM 9 and parameterised APDB so that its behaviour is as consistent as possible with the behaviour of CS on HOM 9. In Figure 1 we show the relationship between CS and APDB for 250 sub-optimal alignments generated by genetic algorithm for one of the 9 test cases from HOM 9, over 4 different settings of Brad, the bubble radius. While the two scoring schemes are in broad agreement, the correlation improves dramatically as Brad increases.

Fig. 1. Tuning of Brad, the bubble radius, using sub-optimal alignments of two sequences from HOM 9. Each graph represents the correlation between CS and APDB for 4 different bubble radius values (Brad of 6, 8, 10 and 12 Å). In each graph, each dot represents a sub-optimal alignment from HOM 9, sampled from the genetic algorithm.

This trend can be summarised using the correlation coefficient measured on each of the graphs similar to those shown in Figure 1. The overall results for all nine HOM 9 test cases are shown in Figure 2. These results clearly show that the behaviour of APDB is best for values of Brad of 10 Å or above.
With these values the level of correlation between CS and APDB increases and so does the agreement across all 9 test cases. We chose 10 Å as the default value in order to ensure a proper behaviour while retaining as much as possible the local character of the measure. Given the default value of 10 Å for Brad, we examined T1 and T2 in a similar fashion and found the most appropriate values to be 1 Å for T1 and 70% for T2.

Fig. 2. Correlation between the Column Score measure (CS) and APDB on HOM 9. Each HOM 9 test set is labelled according to its average percent sequence identity as measured on the HOMSTRAD reference. The horizontal axis indicates the value of Brad. The vertical axis indicates the correlation coefficient between CS and APDB as measured on a population of sub-optimal alignments similar to the ones in Figure 1. Each dot indicates a correlation coefficient measured on one HOM 9 test set, using the indicated value of Brad. Each HOM 9 test set is an alignment between two sequences whose PDB names are as follows: 17: 2gar versus 1fmt, 18: 1jfl versus 1b74, 33: 1isi versus 11be, 43: 2cev versus 1d3v, 52: 1aq0 versus 1ghs, 63: 2gnk versus 2pii, 71: 1hcz versus 1cfm, 82: 1dvg versus 1qq8, 89: 1k25 versus 1qme.

Sensitivity of APDB to sequence and structure similarity

It is important to verify that the behaviour of APDB remains consistent across a wide range of sequence similarity levels. It is especially important to make sure that when two different alignments of the same sequences are evaluated, the best one (as judged by comparison with the HOMSTRAD reference) always gets the best APDB score. In order to check for this, we used the genetic algorithm to generate sub-optimal alignments for each test case in HOM 9. In each case, we gathered a collection of 250 sub-optimal alignments with CS scores of 0-40%, 41-60%, 61-80% and 81-100%. The CS score measures the agreement between an alignment and its reference in HOMSTRAD.
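The grouping just described, a population of sub-optimal alignments binned by CS score before the APDB scores are averaged per bin, can be sketched as follows. The (cs, apdb) score pairs below are invented stand-ins, not measurements from the paper.

```python
from statistics import mean

# Sketch of the binning behind Figure 3: sub-optimal alignments are grouped
# by their CS score and the APDB score is averaged within each bin.
# `population` holds hypothetical (cs, apdb) score pairs, not real data.

BINS = [(0, 40), (41, 60), (61, 80), (81, 100)]

def average_apdb_by_cs_bin(population):
    """Map each CS bin to the mean APDB score of the alignments it holds."""
    out = {}
    for lo, hi in BINS:
        scores = [apdb for cs, apdb in population if lo <= cs <= hi]
        if scores:
            out[(lo, hi)] = mean(scores)
    return out

population = [(12, 20.0), (35, 30.0), (50, 45.0), (55, 50.0),
              (70, 62.0), (78, 70.0), (90, 85.0), (99, 93.0)]
print(average_apdb_by_cs_bin(population))
```

If APDB tracks alignment quality, the per-bin averages should increase monotonically from the lowest CS bin to the highest, which is exactly the never-crossing-lines pattern reported for Figure 3.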
We then measured the average APDB score in each of these collections. Each of these measures corresponds to a dot in Figure 3, where vertically aligned series of dots correspond to different measures made on the same HOM 9 test set. Figure 3 clearly shows that regardless of the percent identity within the HOM 9 test set being considered, alignments with better CS scores always correspond to a better APDB score (this results in the lines never crossing one another on Fig. 3). We did a similar analysis using the RMSD as measured on the HOMSTRAD alignment in place of sequence identity. The behaviour was the same and clearly indicates that APDB gives consistent results regardless of the structural similarity between the structures being considered.

Fig. 3. Estimation of the sensitivity of APDB to sequence identity. On this graph, each set of vertically aligned dots corresponds to a single HOM 9 test set. The 9 HOM 9 test sets are arranged according to their average identity (17-89%, see Figure 2). Each dot represents the average APDB score of a population of 250 sub-optimal alignments (generated by genetic algorithm) with a similar CS score (binned in four groups representing CS of <40%, 41-60%, 61-80% and 81-100%) generated for one of the 9 HOM 9 test sets.

Suitability of APDB for analysing sub-optimal alignments

Collections of sub-optimal alignments for each of the nine HOM 9 test sets were generated using SAGA and evaluated for their CS scores and APDB scores. These results were pooled and are displayed on the graph shown in Figure 4. This figure indicates good agreement between the CS and the APDB score regardless of the level of optimality within the alignment being considered. It suggests that APDB is informative over the complete range of CS values. It also confirms that APDB is not 'too generous' with sub-optimal alignments.

Fig. 4. Correlation between CS and APDB on the complete HOM 9 test set. Each dot corresponds to a sub-optimal alignment of one of the HOM 9 test cases, generated by genetic algorithm. For each alignment the graph plots the APDB score against its CS counterpart.

We also checked whether sequence alignments that are structurally equivalent obtain similar APDB scores even if they are different at the sequence level. For this purpose, we used PROSUP (Lackner et al., 2000). Given a pair of structures, PROSUP generates several alignments that are equally good from a structure point of view (similar RMSD), but can be very different at the sequence level (different Column Score). We manually identified two such test sets in HOMSTRAD and the results are summarized in Table 1. For each of these two test sets, we selected in the output of PROSUP two alignments (aln1 and aln2) to which PROSUP assigns similar RMSDs.

Table 1. Evaluating PROSUP suboptimal alignments with APDB

Set  St1    St2    ALN   RMSD     CS     APDB
1    1e96B  1a17   aln1  1.45 Å   100.0  80.2
1    1e96B  1a17   aln2  1.50 Å    65.6  80.7
2    1cd8   1qfpa  aln1  2.95 Å   100.0  18.7
2    1cd8   1qfpa  aln2  2.95 Å    55.1  17.9

Set indicates the test set index, St1 and St2 indicate the two structures being aligned by PROSUP, ALN indicates the alignment being considered, RMSD shows the RMSD associated with this alignment, CS indicates its CS score, with the CS score of aln1 alignments being set to 100 because they are used as references. APDB indicates the APDB score.

In both test sets, using aln1 as a reference for the CS measure leads to the conclusion that aln2 is mostly incorrect (cf. CS column of Table 1). This is not true, since these alignments are structurally equivalent as indicated by their RMSDs. In such a situation, APDB behaves much more appropriately and gives each couple aln1/aln2 scores that are nicely consistent with their RMSD, thus indicating that APDB can equally well reward two suboptimal alignments when these are equivalent from a structural point of view.
aln1 is used as a reference and therefore gets a CS score of 100, while the CS score of the second alignment (aln2) is computed by direct comparison with its aln1 counterpart.

Using APDB to benchmark alignment methods

Table 2 shows the average CS and APDB scores for the test sets in each of the four Bali 91 categories being considered here and in HOM 43. The highest scores in all cases, for both measures, come from the reference column (the last column). This is desirable providing the reference alignments really are consistent with the underlying structures. If we now compare the columns two by two, we find that every variation on CS from one column to another agrees with the corresponding variation of APDB. For instance in row 1 (Bali 91 Ref1), when T-Coffee/CS is lower than ClustalW/CS, T-Coffee/APDB is also lower. This observation is true for the whole table, regardless of the pair of results being considered. When considering the 134 alignments one by one, this observation remains true in more than 70% of the cases.

Table 2. Correlation between APDB and CS on BaliBase and Homstrad

Set      N    ClustalW       T-Coffee       Reference
              CS     APDB    CS     APDB    CS     APDB
B91 R1   35   70.1   59.9    67.7   58.3    100    64.7
B91 R2   23   32.7   26.6    33.9   47.1    100    55.2
B91 R3   22   46.4   38.5    48.6   46.9    100    53.2
B91 R4   11   52.0   59.5    52.5   64.5    100    65.7
H43      43   35.4   60.2    38.9   61.6    100    72.9

Set indicates the test set being considered, either one of the BaliBase 91 references (B91 R#) or HOM 43 (H43), a subset of HOMSTRAD. N indicates the number of test alignments in this category. ClustalW indicates a set of measures made on alignments generated with ClustalW. T-Coffee indicates similar measures made on T-Coffee generated alignments. Reference indicates measures made on the reference alignments as provided in BaliBase or in Homstrad. CS columns are the Column Score measures, while APDB columns are similar measures made using APDB.

CONCLUSION

This work introduces APDB, a novel method that makes it possible to evaluate the quality of a sequence alignment when two or more tertiary structures of the sequences it contains are available. This method does not require a reference alignment and it does not depend on any complex procedure such as structure superposition or sequence alignment. We show here that the sensitivity of APDB is comparable with that of CS, a well-established measure that compares a target alignment with a reference alignment. Our results also indicate that APDB can discriminate better than CS between structurally correct sub-optimal sequence alignments and structurally incorrect sequence alignments, even when the structures being considered are distantly related.

Apart from the cost associated with their assembly, a serious problem with reference alignments is that they need to be annotated to remove from the evaluation regions that correspond to non-superposable portions of the structures. This is necessary because otherwise these regions (whose alignment cannot be trusted) will bias a CS evaluation toward rewarding the arbitrary alignment conformation displayed in the reference. Table 2 illustrates well the fact that such an annotation is not necessary in APDB. In our measure, thanks to the combination of local evaluation and the absence of a reference alignment, the only possible effect of non-superposable regions is to decrease the proportion of residues found aligned in a structurally optimal sequence alignment, thus yielding scores lower than 100 in the case of distantly related structures.

A key advantage of APDB is its simplicity. It only requires three parameters and a few PDB files. Most importantly, APDB does not require any arbitrary manual intervention such as the assembly of a reference alignment. In the short term, all the existing collections of reference alignments could easily be integrated and extended with APDB. In the longer term, APDB could also be used to evaluate and compare existing collections of alignments such as profiles, when structures are available.

REFERENCES

Barton,G.J. and Sternberg,M.J.E. (1987) A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol., 198, 327-337.
Cristobal,S., Zemla,A., Fischer,D., Rychlewski,L. and Elofsson,A. (2001) A study of quality measures for protein threading models. BMC Bioinformatics, 2, 5.
Duret,L. and Abdeddaim,S. (2000) Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. In Higgins,D. and Taylor,W. (eds), Bioinformatics, Sequence, Structure and Databanks. Oxford University Press, Oxford.
Karplus,K. and Hu,B. (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 17, 713-720.
Lackner,P., Koppensteiner,W.A., Sippl,M.J. and Domingues,F.S. (2000) ProSup: a refined tool for protein structure alignment. Protein Eng., 13, 745-752.
Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469-2471.
Notredame,C. (2002) Recent progress in multiple sequence alignments. Pharmacogenomics, 3, 131-144.
Notredame,C. and Higgins,D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24, 1515-1524.
Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel algorithm for multiple sequence alignment. J. Mol. Biol., 302, 205-217.
Pascarella,S. and Argos,P. (1992) A data bank merging related protein structures and sequences. Protein Eng., 5, 121-137.
Siew,N., Elofsson,A., Rychlewski,L. and Fischer,D. (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776-785.
Thompson,J., Higgins,D. and Gibson,T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
Thompson,J., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics, 15(1), 87-88.

BIOINFORMATICS Vol. 14 no. 5 1998, pages 407-422

COFFEE: an objective function for multiple sequence alignments

Cédric Notredame 1, Liisa Holm 1 and Desmond G. Higgins 2

1 EMBL Outstation - The European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1SD, UK and 2 Department of Biochemistry, University College, Cork, Ireland

Received on January 19, 1998; revised and accepted on February 24, 1998

Abstract

Motivation: In order to increase the accuracy of multiple sequence alignments, we designed a new strategy for optimizing multiple sequence alignments by genetic algorithm. We named it COFFEE (Consistency based Objective Function For alignmEnt Evaluation). The COFFEE score reflects the level of consistency between a multiple sequence alignment and a library containing pairwise alignments of the same sequences.

Results: We show that multiple sequence alignments can be optimized for their COFFEE score with the genetic algorithm package SAGA. The COFFEE function is tested on 11 test cases made of structural alignments extracted from 3D_ali. These alignments are compared to those produced using five alternative methods. Results indicate that COFFEE outperforms the other methods when the level of identity between the sequences is low. Accuracy is evaluated by comparison with the structural alignments used as references. We also show that the COFFEE score can be used as a reliability index on multiple sequence alignments. Finally, we show that given a library of structure-based pairwise sequence alignments extracted from FSSP, SAGA can produce high-quality multiple sequence alignments. The main advantage of COFFEE is its flexibility.
With COFFEE, any method suitable for making pairwise alignments can be extended to making multiple alignments.
Availability: The package is available along with the test cases through the WWW: http://www.ebi.ac.uk/∼cedric
Contact: cedric.notredame@ebi.ac.uk

Introduction

Multiple alignments are among the most important tools for analysing biological sequences. They can be useful for structure prediction, phylogenetic analysis, function prediction and polymerase chain reaction (PCR) primer design. Unfortunately, accurate multiple alignments may be difficult to build. There are two main reasons for this. First of all, it is difficult to evaluate the quality of a multiple alignment. Secondly, even when a function is available for the evaluation, it is algorithmically very hard to produce the alignment having the best possible score (optimal alignment). Cost functions or scoring functions roughly fall into two categories. First of all, there are those that rely on a substitution matrix. These are the most widely used. They require a substitution matrix (Dayhoff, 1978; Henikoff and Henikoff, 1992) that gives a score to each possible amino acid substitution, a set of gap penalties that gives a cost to deletions/insertions (Altschul, 1989), and a set of sequence weights (Altschul et al., 1989; Thompson et al., 1994b). Under this scheme, an optimal multiple alignment is defined as the one having the lowest cost for substitutions and insertions/deletions. One of the most widely used scoring methods of this type is the 'weighted sums of pairs with affine (or semi-affine) gap penalties' (Altschul and Erickson, 1986). The main limitation of these scoring schemes is that they rely on very general substitution matrices, usually established by statistical analysis of a large number of alignments. These may not necessarily be adapted to the set of sequences one is interested in.
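As an illustration of this first category of scoring scheme, a minimal sum-of-pairs scorer can be sketched as follows. This is an illustrative sketch only: a toy substitution dictionary and a simple linear gap cost stand in for a real substitution matrix and for the affine penalties discussed above.

```python
def sum_of_pairs(msa, subst, gap=-4):
    """Score an MSA column by column: every pair of rows contributes a
    substitution score, or a gap penalty when one residue faces a gap.
    Columns where both residues are gaps contribute nothing."""
    total = 0
    nseq = len(msa)
    for col in zip(*msa):                  # iterate over alignment columns
        for i in range(nseq - 1):
            for j in range(i + 1, nseq):
                a, b = col[i], col[j]
                if a == '-' and b == '-':
                    continue               # gap-gap pairs are not scored
                if '-' in (a, b):
                    total += gap           # residue facing a gap
                else:
                    total += subst[tuple(sorted((a, b)))]
    return total
```

Note that this version charges each gap position independently (a linear, not affine, cost), which is the main simplification relative to the schemes cited in the text.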
To compensate for this drawback, a second type of scoring scheme was designed: profiles (Gribskov et al., 1987) and Hidden Markov Models (HMMs) (Krogh and Mitchison, 1995). Profiles allow the design of a sequence-specific scoring scheme that will take into account patterns of conservation and substitution characteristic of each position in the multiple alignment of a given family. To some extent, HMMs can be regarded as generalized profiles (Bucher and Hofmann, 1996). In HMMs, sequences are used to generate statistical models. The sequences of interest are then aligned to the model one after another to generate the multiple sequence alignment. The main drawback of HMMs is that to be general enough, the models require large numbers of sequences. However, this can be partially overcome by incorporating in the model some extra information such as Dirichlet mixtures (the equivalent of a substitution matrix in an HMM context) (Sjolander et al., 1996). Whatever scoring scheme one wishes to use, the optimization problem may be difficult. There are two types of optimization strategies: the greedy ones that rely on pairwise alignments and those that attempt to align all the sequences simultaneously. The main tool for making pairwise alignments is an algorithm known as dynamic programming (Needleman and Wunsch, 1970), which is often used for optimizing the sums of pairs. The complexity of the algorithm makes it hard to apply to more than two sequences (or two alignments) at a time. Nevertheless, it allows greedy progressive alignments as described by Feng and Doolittle (1987) or Taylor (1988). In such a case, the sequences are aligned in an order imposed by some estimated phylogenetic tree. The alignment is called progressive because it starts by aligning together closely related sequences and continues by aligning these alignments two by two until the complete multiple alignment is built.
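The dynamic programming algorithm at the heart of these pairwise steps can be sketched as follows: a minimal Needleman–Wunsch global aligner with a toy match/mismatch scheme and linear gap penalties (real packages use substitution matrices and affine gaps; the function name and parameters are illustrative).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment by dynamic programming."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ai, bi, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            ai.append(a[i - 1]); bi.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ai.append(a[i - 1]); bi.append('-'); i -= 1
        else:
            ai.append('-'); bi.append(b[j - 1]); j -= 1
    return ''.join(reversed(ai)), ''.join(reversed(bi))
```

A progressive method repeats this step up the guide tree, aligning alignments rather than single sequences as it goes.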
Some of the most widely used multiple sequence alignment packages like ClustalW (Thompson et al., 1994a), Multal (Taylor, 1988) and Pileup (Higgins and Sharp, 1988) are based on this algorithm. They have the advantage of being fast and simple, as well as reasonably sensitive. Their main drawback is that mistakes made at the beginning of the procedure are never corrected and can lead to misalignments due to the greediness of the strategy. It is to avoid this pitfall that the second type of method has been designed. These methods mostly involve aligning all the sequences simultaneously. For the sums of pairs, this is a difficult problem that has been shown to be NP-complete (Wang and Jiang, 1994). However, using the Carrillo and Lipman (1988) algorithm implemented in the Multiple Sequence Alignment program MSA (Lipman et al., 1989), one can simultaneously align up to 10 sequences. Other global alignment techniques using the sums of pairs cost function involve the use of stochastic heuristics such as simulated annealing (Ishikawa et al., 1993a; Godzik and Skolnick, 1994; Kim et al., 1994), genetic algorithms (Ishikawa et al., 1993b; Notredame and Higgins, 1996) or iterative methods (Gotoh, 1996). Simulated annealing can also be used to optimize HMMs (Eddy, 1995). The stochastic methods have two main advantages over the deterministic ones. First of all, they have a lower complexity. This means that they do not have strong limitations on the number of sequences to align or on the length of these sequences. Secondly, these methods are more flexible regarding the objective function they can use. For instance, MSA is restricted to an approximation of the sums of pairs using semi-affine gap penalties (Lipman et al., 1989) instead of the natural ones shown to be biologically more realistic (Altschul, 1989). This is not the case with simulated annealing (Kim et al., 1994). The main drawback of stochastic methods is that they do not guarantee optimality.
However, in some previous work, we showed that with the Sequence Alignment Genetic Algorithm (SAGA), results similar to MSA could be obtained (Notredame and Higgins, 1996). We also showed that the package was able to handle test cases with sizes much beyond the scope of MSA. The robustness of SAGA as an optimizer was confirmed by results obtained on a different objective function for RNA alignment (Notredame et al., 1997) and motivated our choice to use SAGA for optimizing the new objective function described here. The main argument for aligning all the sequences simultaneously instead of making a greedy progressive alignment is that using all the available information should improve the final result. However, one limitation of such methods is that regions of low similarity may induce some noise that will weaken the signal of the correct alignment (Morgenstern et al., 1996). In order to avoid this, one would like a scheme that filters some of the initial information and allows its global use. The approach we propose here is an attempt to do so. The underlying principle is to generate a set of pairwise alignments and look for consistency among these alignments. In this case, we define the optimal multiple alignment as the most consistent one and produce it using the SAGA package. The idea of using the consistency information in a multiple sequence alignment context is not new (Gotoh, 1990; Vingron and Argos, 1991; Kececioglu, 1993). In his scheme, Gotoh (1990) proposed the identification of regions that are fully consistent among all the pairwise alignments. These regions are used as anchor points in the multiple alignment, in order to decrease complexity. A similar strategy was described by Vingron and Argos (1991), allowing the computation of a multiple alignment from a set of dot matrices. 
Although very interesting, these methods had several pitfalls, including a sensitivity to noise (especially when some sequences are highly inconsistent with the rest of the set) and a high computational complexity. The work of Kececioglu (1993) bears a stronger similarity to the method we propose here. Kececioglu directly addressed the problem of finding a multiple alignment that has the highest level of similarity with a collection of pairwise alignments. Such an alignment is named a 'maximum weight trace alignment' (MWT), and its computation was shown to be NP-complete. An optimization method was also described, based on dynamic programming and limited to a small number of sequences (six at most). More recently, a method was described that allows the construction of a multiple alignment using consistent motifs identified over the whole set of sequences by a variation of the dynamic programming algorithm (Morgenstern et al., 1996). This algorithm should be less sensitive to noise than the one described by Vingron and Argos, but its main drawback is that it relies on a greedy strategy for assembling the multiple alignment. An important aspect of multiple sequence alignment that is often overlooked is the estimation of reliability. Since all the alignment scoring functions available are known to be intrinsically inaccurate, identifying the biologically relevant portions of a multiple alignment may be more important than increasing the overall accuracy of this alignment. A few techniques have been proposed to identify accurately aligned positions in pairwise (Vingron and Argos, 1990; Mevissen and Vingron, 1996) and multiple sequence alignments (Gotoh, 1990; Rost, 1997). We show here that our method allows a reasonable estimation of the local reliability of a multiple alignment.
The measure we use for reliability is in fact very simple and could easily be extended much further to incorporate other methods such as the one described by Mevissen and Vingron (1996).

Methods

The overall approach relies on the definition of an objective function (OF) describing the quality of multiple protein sequence alignments. Given a set of sequences and an 'all-against-all' collection of pairwise alignments of these sequences (library), the score of a multiple sequence alignment is defined as the measure of its consistency with the library. This objective function was optimized with the SAGA package. Sets of sequences with a known structure and for which a multiple structural alignment is available were extracted from the 3D_ali database (Pascarella and Argos, 1992) and used in order to validate the biological relevance of the new objective function. Two other test cases were designed using the DALI server (Holm and Sander, 1996a) and aligned using libraries made of structural pairwise alignments extracted from the FSSP database (Holm and Sander, 1993).

Objective function

The OF is a measure of quality for multiple sequence alignments. Ideally, the better its score, the more biologically relevant the multiple alignment. The method proposed here requires two components: (i) a set of pairwise reference alignments (library); (ii) the OF that evaluates the consistency between a multiple alignment and the pairwise alignments contained in the library. We named this objective function COFFEE (Consistency based Objective Function For alignmEnt Evaluation).

Creation of the library

A library is specific for a given set of sequences and is made of pairwise alignments. Taken together, these alignments should contain at least enough information to define a multiple alignment of the sequences in the set. In practice, given a set of N sequences, we included in the library a pairwise alignment for each of the (N² − N)/2 possible pairs of sequences.
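A minimal sketch of this library construction. The names `build_library` and `align_pair` are illustrative, not part of the package: `align_pair` stands in for whatever pairwise method is plugged in (ClustalW runs, structural alignments extracted from FSSP, etc.) and is assumed to return two equal-length gapped strings.

```python
from itertools import combinations

def build_library(seqs, align_pair):
    """One pairwise alignment per pair of sequences: (N^2 - N)/2 entries.

    seqs:       list of N ungapped sequences
    align_pair: callable returning a (gapped_a, gapped_b) tuple"""
    library = {}
    for i, j in combinations(range(len(seqs)), 2):
        library[(i, j)] = align_pair(seqs[i], seqs[j])
    return library
```

As the text notes, the time to build such a library grows quadratically with N, since every pair is aligned once.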
This choice is arbitrary, since in theory there is no limit regarding the amount of redundancy one can incorporate into a library. For instance, instead of each pair of sequences being represented by a single pairwise alignment, one could use several alternative alignments of this pair, obtained by various methods. In fact, the library is mostly an interface between any method one can invent for generating pairwise alignments, and the COFFEE function optimized by SAGA. However, the method follows the rule 'garbage in/garbage out', and the overall properties of the COFFEE function will most likely reflect the properties of the method used to build the library. The amount of time it takes to build the library depends on the alignment method used and increases quadratically with the number of sequences. Inside the evaluation algorithm, the library is stored in a look-up table. If each pair of sequences is represented only once, the amount of memory required for the storage increases quadratically with the number of alignments and linearly with their length. For the analyses presented here, two types of libraries were built. The first one relies on ClustalW. Given a set of N sequences, each possible pair of sequences was aligned using ClustalW with default parameters. The collection of output files obtained that way constitutes the library (ClustalW library). The motivation for using ClustalW as a pairwise method stems from the fact that Clustal uses local gap penalties, even for two sequences. In order to show that COFFEE is not dependent on the method used to construct the library, a second category of library was created using the FSSP database (Holm and Sander, 1996b). FSSP is a database containing all the PDB structures aligned with one another in a pairwise manner. For each test case, a set of sequences was chosen and the (N² − N)/2 pairwise structure alignments involving these sequences were extracted from the FSSP database to construct an FSSP library.
We also used as references the multiple alignments contained in FSSP. An FSSP entry is always based around a guide structure to which all the other structures are aligned in a pairwise manner. This collection of pairwise alignments can be regarded as a pairwise-based multiple alignment. This means that if one is interested in a set of N protein structures, FSSP contains the N corresponding pairwise-based multiple alignments, each using one structure of the set as a guide. Generally speaking, these N multiple alignments do not have to be consistent with one another, but only consistent with the subset of the pairwise alignments that was used to produce them.

Evaluation procedure: the COFFEE function

Let us assume an alignment of N sequences and an appropriate library built for this set. Evaluation is made by comparing each pair of aligned residues (i.e. two residues aligned with each other or a residue aligned with a gap) observed in the multiple alignment to those present in the library (Figure 1). In such a comparison, residues are identified by their position in the sequence (gaps are not distinguished from one another). In the simplest scheme, the overall consistency score is equal to the number of pairs of residues present in the multiple alignment that are also found in the library, divided by the total number of pairs observed in the multiple sequence alignment. This measure gives an overall score between 0 and 1. The maximum value a multiple alignment can have depends on the library. For the optimal score to be 1, all the alignments in the library need to be compatible with one another (e.g. when all the pairwise alignments have been extracted from the same multiple sequence alignment or when the sequences are almost identical). In practice, this scheme needs extra readjustments to incorporate some important properties of the sequence set.
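A minimal sketch of this basic, unweighted consistency count (names and data structures are illustrative, not the SAGA implementation). The library is represented here as a dict mapping each sequence pair to the set of residue-index pairs it aligns; for brevity only residue–residue pairs are counted, whereas the paper's scheme also considers residue–gap pairs.

```python
def consistency_score(msa, library):
    """Fraction of residue pairs in the MSA that are also aligned in the
    library (the simplest scheme, before weighting).

    msa:     list of equal-length gapped strings
    library: {(i, j): set of (x, y) residue-index pairs, i < j}"""
    shared = total = 0
    nseq = len(msa)
    for i in range(nseq - 1):
        for j in range(i + 1, nseq):
            xi = yi = 0               # residue indices in sequences i and j
            for a, b in zip(msa[i], msa[j]):
                if a != '-' and b != '-':
                    total += 1
                    if (xi, yi) in library[(i, j)]:
                        shared += 1
                if a != '-':
                    xi += 1
                if b != '-':
                    yi += 1
    return shared / total if total else 0.0
```

With a fully compatible library this returns 1.0; any disagreement between the MSA and the library pulls the score toward 0.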
For instance, the significance of the information content of each pairwise alignment is not identical. Several schemes have been described in the literature for weighting sequences according to the amount of information they bring to a multiple alignment (Altschul et al., 1989; Sibbald and Argos, 1990; Vingron and Sibbald, 1993; Thompson et al., 1994a). In COFFEE, our main concern was to decrease the amount of noise produced by inaccurate pairwise alignments in the library. To do so, each pairwise alignment in the library is given a weight that is a function of its quality. For this purpose, we used a very simple criterion: the weight of a pairwise alignment is equal to the per cent identity between the two aligned sequences in the library. This may seem counter-intuitive, since weighting schemes are normally used in order to decrease the amount of redundancy in a set of sequences (i.e. down-weighting sequences that have lots of close relatives). Doing so makes sense in the context of profile searches (Gribskov et al., 1987; Thompson et al., 1994b), where it is important to prevent domination of the profile by a given subfamily. However, in the case of multiple sequence alignments made by global optimization, it is more important to make sure that closely related pairs of sequences are correctly aligned, regardless of the background noise introduced by other less related sequences. In such a context, a weight can be regarded as a constraint. The consequence is that the alignment of a given sequence will mostly be influenced by its closest relatives. On the other hand, if a sequence lacks any really close relative, its alignment will mostly be influenced by the consistency of its pairwise alignments with the rest of the library. The COFFEE function can be formalized as follows.
Given N aligned sequences S1 … SN in a multiple alignment, Ai,j is the pairwise projection (obtained from the multiple alignment) of the sequences Si and Sj, LEN(Ai,j) is the length of this alignment, SCORE(Ai,j) is the overall consistency (level of identity) between Ai,j and the corresponding pairwise alignment in the library, and Wi,j is the weight associated with this pairwise alignment. Given these definitions, the COFFEE score is defined as follows:

\[ \mathrm{COFFEE\ score} = \left[ \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j} \times \mathrm{SCORE}(A_{i,j}) \right] \Bigg/ \left[ \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j} \times \mathrm{LEN}(A_{i,j}) \right] \quad (1) \]

with:

\[ \mathrm{SCORE}(A_{i,j}) = \text{number of aligned pairs of residues that are shared between } A_{i,j} \text{ and the library} \quad (2) \]

The COFFEE function presents some similarities with the 'weighted sums of pairs' (Altschul and Erickson, 1986). Here as well, we consider all the pairwise substitutions in the multiple alignment, and weight these in a way that reflects the relationships between the sequences. The library plays the role of the substitution matrix. The main differences between the COFFEE function and the 'weighted sums of pairs' are that (i) no extra gap penalties are applied in our scheme, since this information is already contained in the library, (ii) the COFFEE score is normalized by the value of the maximum score (i.e. its value is between 0 and 1) and (iii) the cost of the substitutions is made position dependent, thanks to the library (i.e. two similar pairs of residues will have potentially different scores if the indices of the residues are different). Under this formulation, an alignment having an optimal COFFEE score will be equivalent to an MWT alignment using a 'pairwise alignment graph' (Kececioglu, 1993). The score defined above is a global measure for an entire alignment. It can also be adapted for local evaluation. We have defined two types of local scores: the residue score and the sequence score. The residue score is given below.
\( S_i^x \) is the residue x in sequence \( S_i \) and \( A_{i,j}^{x,y} \) is the pair of aligned residues \( S_i^x \) and \( S_j^y \) in the pairwise alignment \( A_{i,j} \):

\[ \mathrm{residue\ score}(S_i^x) = \left[ \sum_{j=1,\, j \neq i}^{N} W_{i,j} \times \mathrm{OCCURRENCE}(A_{i,j}^{x,y}) \right] \Bigg/ \left[ \sum_{j=1,\, j \neq i}^{N} W_{i,j} \right] \quad (3) \]

OCCURRENCE(\( A_{i,j}^{x,y} \)) is equal to the number of occurrences of the pair \( A_{i,j}^{x,y} \) in the reference library (0 or 1 when using the libraries described here). The sequence score is the natural extension of the residue score. It is defined as the sum of the score of each residue in a sequence divided by the number of residues in that sequence.

Fig. 1. COFFEE scoring scheme. This figure indicates how a column of an ALIGNMENT is evaluated by the COFFEE function using a REFERENCE LIBRARY. Each pair in the alignment is evaluated (SCORE MATRIX). In the score matrix, a pair receives a score of 0 if it does not appear in the library, or a score equal to the WEIGHT of the pair of sequences in which it occurs in the PAIRWISE LIBRARY. Since the matrix is symmetrical, the column score is equal to the sum of half of the matrix entries, excluding the main diagonal. This value is divided by the maximum score of each entry (i.e. the sum of the weights contained in the library). The residue score is equal to the sum of the entries contained in one line of the matrix, divided by the sum of the maximum score of these entries.

Optimizing an alignment for its COFFEE score: SAGA-COFFEE

The aim is to create an alignment having the best possible COFFEE score (optimal alignment). Doing so is a difficult task. The computational complexity of a dynamic programming solution is known to be NP-complete (Kececioglu, 1993). For reasons discussed in the Introduction, we used SAGA V0.93 (Notredame and Higgins, 1996).

Table 1. Accuracy of the prediction made on the categories of substitution

Test case    Length  Nseq  Proportion    Avg Id.  COFFEE score    Accuracy (H+E) %  Accuracy (ov.) %  CPU time  N.G.
                           (H+E) (%)     (%)      Clustal  SAGA   Clustal  SAGA     Clustal  SAGA     (s)
ac prot      248     14    57            21       0.48     0.56   39.2     50.2     35.2     45.9     21 009    535
binding      500     7     68            31       0.72     0.84   50.0     64.5     50.0     61.7     1003      166
cytc         146     6     43            42       0.84     0.87   89.1     90.7     88.3     86.1     699       259
fniii        136     9     48            17       0.49     0.62   42.0     47.0     35.7     43.6     936       480
gcr          52      8     57            36       0.86     0.89   80.8     83.1     76.7     80.2     91        55
globin       183     17    74            24       0.78     0.80   86.4     85.2     82.1     81.7     28 477    222
igb          194     37    53            24       0.63     0.67   74.8     78.1     65.6     69.4     110 453   132
lzm          213     6     53            39       0.87     0.87   72.2     72.3     72.3     72.4     256       105
phenyldiox   90      8     67            22       0.59     0.65   58.5     64.7     60.4     61.4     388       110
sbt          331     7     57            61       0.96     0.97   96.9     96.9     93.6     93.6     644       127
s prot       229     15    51            27       0.69     0.74   62.5     66.6     57.7     61.2     44 978    744
ceo          882     7     /             14       /        /      /        /        /        /        13 756    882
vjs          1207    8     /             12       /        /      /        /        /        /        43 568    1400

Test case: generic name of the test case, taken from 3D_ali for the first 11 (ac prot: acid proteases; binding: sugar/amino acid binding proteins; cytc: cytochrome c'; fniii: fibronectin type III; gcr: crystallins; globin: globins/phycocyanins/colicins; igb: immunoglobulin fold; lzm: lysozymes/lactalbumin; phenyldiox: dihydroxybiphenyl dioxygenase; sbt: subtilisin; s prot: serine protease fold) and from the DALI server for the last two. ceo includes 1cbg, 1ceo, 1edg, 1byb, 1ghr, 1xyzA; vjs includes 1cnv, 1vjs, 1smd, 2aaa, 1pamA, 2amg, 1ctn, 2ebn. Length: length of the reference alignment. Nseq: number of sequences in the alignment. Proportion (H+E): percentage of the substitutions involving E→E or H→H. Avg Id.: average level of identity between the sequences. COFFEE score: score measured with the COFFEE function using a ClustalW library on the ClustalW or the SAGA alignments. Accuracy (H+E): percentage of the (H+E) substitutions found identical between the SAGA (or ClustalW) alignment and the reference. Accuracy (ov.): percentage of substitutions similar in the SAGA (or ClustalW) alignment and in the reference.
CPU time: CPU time in seconds on an alpha 8300 machine. N.G.: number of generations needed by SAGA to find the solution. The results for the last two test cases are presented in Table 6.

SAGA follows the general principles of genetic algorithms as described by Goldberg (1989) and Davis (1991). In SAGA, a population of alignments evolves through recombination, mutation and selection. This process goes on through a series of cycles known as generations. At every generation, the alignments are evaluated for their score (COFFEE). This score is turned into a fitness measure. In an analogy with natural selection, the fitter an alignment, the more likely it is to survive and produce offspring. From one generation to the next, some alignments will be kept unchanged (statistically the fittest), others will be randomly modified (mutations), combined with another alignment (cross-over) or will simply die (statistically the less fit). The new population generated that way will once again undergo the same chain of events, until the best-scoring alignment cannot be improved for a specified number of generations (typically 100). Operators play a central role in the GA strategy. They can be subdivided into two categories. First the cross-overs, which combine the content of two multiple alignments. Thanks to them, and to the pressure of selection, good blocks tend to be merged into the same alignments. On their own, cross-overs cannot create new blocks; this needs to be done by the second category of operators: the mutations. These are specific programs that take an alignment as input and modify it by inserting or moving patterns of gaps. Mutations can be slightly greedy (attempt to make some local optimization) or completely random. A key concept in the genetic algorithm strategy is that the fitness-based selection is not absolute but statistical. To select an individual, a virtual wheel is created.
On this wheel, each individual is given a number of slots proportional to its fitness. To make a selection, the wheel is spun. Therefore, the best individuals are simply more likely to survive, or to be selected for a cross-over or a mutation. This form of selection protects the GA search from excessive greediness, hence preventing it from converging onto the first local minimum encountered during the search. SAGA V0.93 is mostly similar to Version 0.91 described in Notredame and Higgins (1996). Most of the changes between Versions 0.93 and 0.91 have to do with improvements in the implementation and the user interface, but do not affect the algorithm itself. To optimize the COFFEE scores, SAGA was run using the default parameters described for SAGA 0.91 in Notredame and Higgins (1996). SAGA was also modified so that it could evaluate any alignment (including a ClustalW alignment) using the COFFEE function.

Test cases

To assess the biological accuracy of the COFFEE function and the efficiency of its optimization by SAGA, 13 test cases were designed. They are based on sequences with known three-dimensional structures, for which a structural alignment is available. This choice was guided by the fact that structure-based alignments are usually biologically more correct than any other alternative, especially when they involve proteins with low sequence similarity. For this reason, we used these structure-based alignments as a standard of truth in our analyses. Eleven test cases were extracted from 3D_ali Release 2.0 (Pascarella and Argos, 1992). Alignments were selected according to the following criteria: alignments with more than five structures and with a consensus length larger than 50. In 3D_ali, 18 alignments meet this requirement. Among these, we removed those for which ClustalW produces a multiple alignment >95% identical to its structural counterpart (four alignments).
We also removed three alignments which were impossible to align accurately using ClustalW or SAGA/COFFEE. These consist of sets of very distantly related sequences with unusually long insertions/deletions (barrel, nbd and virus in 3D_ali). These alignments are beyond the scope of conventional global sequence alignment algorithms. This leaves a total of 11 alignments used in our analyses. Their characteristics are shown in Table 1. The last two test cases were extracted from the FSSP database. As opposed to the 11 other test cases, they have been specifically designed for making a multiple sequence alignment using a structure-based reference library. This explains their low level of average sequence identity, as can be seen in Table 1 (the last two entries, vjs and ceo).

Evaluation of the COFFEE function accuracy

When evaluating a new OF with SAGA, two main issues are involved: the quality of the optimization and the biological relevance of the optimal alignment. Another aspect involves the comparison of the new OF with already existing methods. The evaluation of the biological relevance of the COFFEE function required the use of some references. The structural alignments described above were used for this purpose. Comparison between a sample alignment and its reference was made following the protocol described in Notredame and Higgins (1996), inspired by the method used by Vogt et al. (1995) for substitution matrix comparison and by Gotoh (1996). All the pairs (excluding gaps) observed in the sample alignment were compared to those present in the reference. The level of similarity is defined as the number of identical pairs in the two multiple alignments divided by the total number of pairs in the reference. This procedure gives access to an overall comparison.
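A minimal sketch of this comparison (function names are illustrative): collect the residue pairs aligned in each MSA, then score the test alignment as the fraction of reference pairs it reproduces.

```python
def aligned_pairs(msa):
    """Residue pairs (i, x, j, y) aligned in an MSA: sequence indices i < j,
    residue positions x and y. Pairs involving a gap are skipped."""
    pairs = set()
    nseq = len(msa)
    idx = [0] * nseq                       # next residue index per sequence
    for col in zip(*msa):
        resid = [(i, idx[i]) for i in range(nseq) if col[i] != '-']
        for k in range(len(resid) - 1):
            for l in range(k + 1, len(resid)):
                pairs.add(resid[k] + resid[l])   # concatenates to (i, x, j, y)
        for i in range(nseq):
            if col[i] != '-':
                idx[i] += 1
    return pairs

def accuracy(test_msa, ref_msa):
    """Pairs shared with the reference / total pairs in the reference."""
    ref = aligned_pairs(ref_msa)
    return len(aligned_pairs(test_msa) & ref) / len(ref)
```

An identical test and reference give 1.0; shifting a residue against a gap removes its pairs from the intersection and lowers the score accordingly.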
This overall comparison does not reflect the fact that, in a global structural alignment, some positions are not correctly aligned because they cannot be aligned (this is true of any position where the two structures cannot be superimposed). In practice, structural alignment procedures may deal with these situations in different ways, producing sequence alignments that are sometimes locally arbitrary (especially in the loops). While in DALI these regions are explicitly excluded from the alignment, it is not so obvious to identify them in the multiple sequence alignments contained in 3D_ali. To overcome this type of noise, a procedure was designed that should be less affected by misalignments. For this alternative measure of biological relevance, we only take into account substitutions that involve a conservation of secondary structural state in the reference alignment (helix to helix and beta strand to beta strand). In the text and the tables, this category of substitution is referred to as (E+H). In most of the test cases, the (E+H) category makes up the majority of the observed pairs, as can be seen in Table 1.

For each of the first 11 test cases (3D_ali), the evaluation procedure involved making multiple alignments with five different methods (cf. the next section) and a ClustalW library (default parameters). The ClustalW library was used with SAGA to produce a multiple alignment having an optimal COFFEE score. The biological relevance of this alignment was then assessed by comparison with the structural reference, and compared to the accuracy obtained with the other methods on the same sequences. For the last two test cases (vjs and ceo), alignments were made using FSSP libraries. Alignment accuracy was assessed using the DALI scoring measure. Given a pairwise alignment, this is a measure of the quality of the structure superimposition implied by the alignment.
The program used for this purpose (trimdali) returns the DALI score (Holm and Sander, 1993) and two other values: the length of the consensus (the number of residues that could be superimposed) and the average RMS (the average deviation between equivalent Cα atoms). These values were computed for each possible pairwise projection of the multiple alignments and averaged. The scores obtained in this way for the SAGA alignments were compared with similar scores measured on the FSSP pairwise-based multiple alignments.

Comparison of COFFEE with other methods

In total, six alignment methods were used to align the 3D_ali test cases: ClustalW v1.6 (Thompson et al., 1994a); SAGA with the MSA objective function (SAGA-MSA) (Notredame and Higgins, 1996); PRRP (v2.5.1), the iterative alignment method recently described by Gotoh (1996); PILEUP (Higgins and Sharp, 1988) in GCG v9.1; SAM (v2.0), a HMM package (Hughey and Krogh, 1996); and SAGA-COFFEE.

413 C.Notredame, L.Holm and D.G.Higgins

Apart from SAM, all these methods were used with the default parameters provided with the package. In the case of SAM, since it is known that HMMs usually require large sets of sequences in order to estimate a model, we used the Dirichlet mixture regularizer provided in the package, which is meant to compensate for this type of problem. SAGA-MSA is the package previously described (Notredame and Higgins, 1996). Rationale 2 weights (Altschul et al., 1989) were computed using the MSA package (Lipman et al., 1989). When possible, MSA was used on the same sequences as SAGA-MSA in order to control the quality of the optimization. Results were consistent with those previously reported.

Implementation

The COFFEE function and the procedure for building ClustalW pairwise libraries have been implemented in ANSI C. These programs have been integrated in Version 0.93 of the SAGA package, also written in ANSI C. They are available upon request from the authors (http://www.ebi.ac.uk/∼cedric).

Results

Accuracy and complexity of the optimization

Since our approach relies on the ability of SAGA to optimize the COFFEE function, we checked that this optimization was performed correctly. For each test case, a dummy library was created, containing sets of pairwise alignments identical to those observed in the reference multiple structure alignment. In such a case, the structural alignment has a score of 1, since it agrees completely with the library. Therefore, the maximum score that can be reached by SAGA also becomes 1. Since, under these artificial conditions, the score of the optimum is known, we could test the accuracy of SAGA's optimization. Several runs made on the same set reached the optimum value in an average of 5.4 runs out of 10. The lowest reproducibility was found with the largest test cases of Table 1 (igb and s prot, with a score of 1 being reached, respectively, one and two times out of 10 runs). However, even if the optimal score is not reached, we found that it is always possible to produce an alignment with a score better than 0.94. Although they do not constitute a full proof, these results support the assumption that SAGA is a good choice for optimizing the COFFEE function.

An important aspect is the complexity of the program and the factors that influence it. As we previously reported when optimizing the sums of pairs with SAGA (Notredame and Higgins, 1996), establishing the complexity is not straightforward. The evaluation of a COFFEE score is quadratic with the number of sequences and linear with the consensus length. Using a given population size, the time required for one generation will vary accordingly. For instance, on a fast workstation, it takes ∼4 s/generation for the gcr test case and ∼7 min/generation for the igb test case. Unfortunately, establishing the complexity in terms of the number of generations needed to reach a global optimum is much harder. This depends on several factors: the number of sequences, the length of the consensus, the relative similarity of the sequences, the complexity of the pattern of gaps needed for optimality, and the operators used for mutations and cross-overs. Since the pattern of gaps is an unknown factor, it is impossible to predict how many generations will be required for a specific test case. On the other hand, judging from the data in Table 1 (N generations column), it seems that the length of the alignment has a stronger effect than the number of sequences.

Comparison of the COFFEE function with other methods

Multiple alignments were produced with SAGA-COFFEE using ClustalW libraries (best-scoring alignment out of 10 runs). These alignments were compared to the structural references. Multiple alignments of the same sets, generated with five other methods, were also compared to the references in order to evaluate relative performances. Since, in the way it is used here, the COFFEE function depends heavily on ClustalW, special emphasis was given to the comparison of these two methods (Table 1). The results are unambiguous. When considering the overall comparison, nine test cases show an improvement of SAGA over ClustalW (in two of these, the improvement is >10%). The trend is similar when looking only at (E+H) substitutions, where 10 test cases out of 11 present an improvement. In the few cases where it occurs, the degradation caused by SAGA is always <2%. The extent of the observed improvements usually correlates well with the differences in the scores measured with the COFFEE function. Degradation is only observed in the cases where the ClustalW alignment already has a high level of consistency with the reference library (>75%), as can be seen with the globin (COFFEE score of the ClustalW alignment = 0.78) and the cytochrome C (COFFEE score of the ClustalW alignment = 0.84) test cases.

In order to put SAGA-COFFEE in a wider context, comparisons were made using five other methods (Table 2). These results show that in most of the cases SAGA-COFFEE does reasonably well. When its alignment is not the best, it is usually within 3% of the best (except for the binding and gcr test cases, for which the difference is greater). Apart from the HMM method (SAM), which performs poorly, it is relatively hard to rank the existing methods. PRRP is one of the newest methods available. It has been described as being one of the most accurate (Gotoh, 1996) and happens to be the only one that significantly outperforms SAGA-COFFEE on some test cases. It is also interesting to notice that SAGA-COFFEE is always among the best for test cases having a low level of identity. This trend is confirmed by the results shown in Table 3, where the sequences are grouped according to their average similarity with the rest of their family (as measured on the reference structural alignment). In this table, we analysed the overall performance of each method and compared it with SAGA-COFFEE by counting (i) the overall percentage of (E+H) residues correctly aligned and (ii) the number of sequences for which SAGA-COFFEE makes a better (b)/worse (w) alignment than a given method. Overall, the results confirm that SAGA-COFFEE seems to do better than the other methods when the sequences have a low level of identity with the rest of their set. The poor performance of SAM can probably be explained by two factors: the small number of sequences in each test case and perhaps some inadequate default settings in the program (in practice, SAM is often used as an alignment improver rather than on its own).

Table 2. Method comparison on the 3D_ali test cases

Test case    Nseq.  Avg. id. (%)  SAGA-COFFEE (%)  PRRP (%)  ClustalW (%)  PILEUP (%)  SAGA-MSA (%)  SAM (%)
ac prot      14     21            50.2             48.8      39.2          40.9        *51.2         27.9
binding      7      31            64.5             *76.2     50.0          66.6        64.2          36.9
cytc         6      42            90.7             89.4      89.1          *94.6       67.3          67.3
fniii        9      17            *47.0            36.3      42.0          37.8        45.2          16.2
gcr          8      36            83.1             *92.8     80.8          80.8        80.8          85.7
globin       17     24            85.2             *87.0     86.4          72.6        78.0          67.8
igb          37     24            *78.1            74.9      74.8          52.4        70.1          67.2
lzm          6      39            *72.3            71.1      72.2          *72.3       *72.3         55.3
phenyldiox   8      22            *64.7            49.9      58.5          37.4        55.6          45.7
sbt          7      61            96.9             96.7      96.9          *97.4       96.0          90.6
s prot       15     27            66.6             64.3      62.5          57.9        *68.5         61.7

*Indicates the method performing best on a test case.
Test case: generic name of the test case, taken from 3D_ali (see 3D_ali for PDB identifiers); see Table 1 for more details. Nseq.: number of sequences in the alignment. Avg. id.: average level of identity between the sequences. SAGA-COFFEE: accuracy of the alignments obtained with SAGA-COFFEE as judged by comparison with the structural alignment, only considering the (E+H) substitutions. ClustalW: similar but with ClustalW alignments. PRRP: similar but with alignments produced with the Gotoh PRRP algorithm (see the text). PILEUP: pileup method from the GCG package. SAGA-MSA: SAGA using the MSA objective function. SAM: sequence alignment modelling by hidden Markov model.

Table 3. Method comparison on the 3D_ali test cases: global results

Sequence identity  Nseq.  Nres.   SAGA-COFFEE (%)  PRRP (%)  ClustalW (%)  PILEUP (%)  SAGA-MSA (%)  SAM (%)
[00.0–20.0]        28     3424    *63.3            62.2      49.7          42.4        56.9          36.4
[20.0–40.0]        88     12010   *76.2            74.6      66.1          60.2        69.7          59.1
[40.0–100.0]       18     3808    89.3             *90.9     84.6          89.8        87.8          64.3

*Indicates the method performing best on a given range of identity.
Sequence identity: minimum and maximum average identity of the sequences of each category with the rest of their alignment, as measured on the reference structural alignment. Nseq.: number of sequences in a category. Nres.: number of residues. SAGA-COFFEE: percentage of the (E+H) substitutions present in the reference structural alignment observed in the SAGA-COFFEE alignment. ClustalW: (%), similar but using the ClustalW alignment; (b), number of sequences for which SAGA-COFFEE produces a better alignment than ClustalW; (w), number of sequences for which SAGA-COFFEE produces a worse alignment than ClustalW. [Note that the (b) and (w) categories do not necessarily add up to the overall number of sequences because they do not include sequences having the same score with the two methods compared.] PRRP: similar but with alignments produced with the Gotoh PRRP algorithm (see the text). PILEUP: pileup method from the GCG package. SAGA-MSA: SAGA using the MSA objective function. SAM: sequence alignment modelling by hidden Markov model.

Fig. 2. Correlation between sequence score and alignment accuracy. (a) The average level of identity of each sequence with the rest of its alignment was measured on the reference structural alignment. The average level of accuracy of the SAGA-COFFEE alignment of each of these sequences was also estimated on the (E+H) category. The two values are plotted against one another. (b) For each sequence, the sequence score was measured on the SAGA-COFFEE alignment; this value was plotted against the accuracy of the sequence alignment. The coefficient of linear correlation was estimated on these points (r = 0.65).

These results also indicate that there is no such thing as an ideal method. Even if COFFEE seems to do better on average, one can see in Tables 2 and 3 that the alignments it produces are not always the best. In fact, it seems that, depending on the test case, any method can do better than the others. Unfortunately, as discussed by Gotoh (1996), it is hard to discriminate the factors that should guide the choice of a method.
For this reason, being able to identify the correct portions in a multiple sequence alignment may be even more important than being able to produce a very accurate alignment.

Correlation between COFFEE score and alignment accuracy

As mentioned in Methods, the score can be assessed at a local level (sequence score or residue score). One of the benefits of such an evaluation is that local score and accuracy can be correlated, thus allowing the identification of potentially correct portions of an alignment with a known risk of error. The 3D_ali structure-based alignments were used once more to validate this approach. Generally speaking, a high residue score will indicate that the pairs in which a given residue is involved are also found in the pairwise library. On the other hand, if none of the pairings in which a given residue is involved are found in the library, this residue will be considered unaligned. We evaluated the COFFEE score of each sequence in each alignment. In each of these sequences, the (E+H) average accuracy was also measured. The graph in Figure 2b shows the relationship between sequence score and (E+H) average accuracy. The correlation between these two quantities is reasonable (r = 0.65). When considering the values used for this graph, we found that for >85% of the sequences it is possible to predict the actual accuracy of the alignment with a ±10% error rate. In terms of prediction, this is a substantial improvement over what can be obtained when measuring the average level of identity between one sequence and its multiple alignment, as shown in Figure 2a.

Table 4. Average accuracy of the alignment of each sequence as a function of its sequence score (3D_ali test cases)

Sequence score   N. residues (%)        N. sequences (%)       Accuracy (E+H) (%)
                 ClustalW    SAGA       ClustalW    SAGA       ClustalW    SAGA
[0.00–0.33]      5.8         2.6        6.7         3.0        14.3        9.9
[0.33–0.66]      36.8        33.7       40.3        38.1       63.2        67.2
[0.66–1.00]      57.4        63.7       53.0        59.0       82.0        82.5
TOTAL            19 242 residues        134 sequences

Sequence score: minimum and maximum score of the sequences in each category. N. residues: percentage of residues belonging to each category, estimated on the SAGA or ClustalW alignments. N. sequences: percentage of the total sequences belonging to each category of score, as measured on the SAGA and the ClustalW alignments. Accuracy (E+H): accuracy associated with each category of score in the SAGA and ClustalW alignments. TOTAL: total number of residues and sequences in the comparison.

Table 5. Accuracy of the prediction made on the category 5 of substitution

Test case   Accuracy (%)          Correct substitutions (%)
            ClustalW   SAGA       ClustalW   SAGA
ac prot     56.8       68.2       9.6        15.7
binding     64.3       61.4       31.5       40.0
cytc        96.2       93.9       72.1       73.5
fniii       81.5       77.7       13.8       15.6
gcr         75.5       77.4       63.4       74.5
globin      97.2       95.0       63.1       66.5
igb         88.8       85.5       39.0       42.3
lzm         91.5       91.8       61.3       61.5
phenyl      78.0       72.5       34.3       40.0
s prot      82.2       82.4       45.2       50.1
sbt         89.7       89.7       85.2       87.0

Test case: generic name of the test case, taken from 3D_ali. Accuracy: number of category 5 substitutions present in the reference alignment divided by the total number of category 5 substitutions (in the SAGA and ClustalW alignments). Correct substitutions: percentage of the correct substitutions (over the total number, all substitution categories included) identified in the category 5 of substitution.

The correlation between score and accuracy becomes slightly more apparent when looking at the data in a more global way (Table 4). In this case, the sequences have been grouped according to their score, and the accuracy of their alignment was measured. One can see that the higher the score of a sequence, the higher its average alignment accuracy.
We also found that the distribution of the sequences among the three categories changes when using ClustalW instead of SAGA: SAGA produces more sequences with a high score than ClustalW does. This means that not only are SAGA alignments more accurate than ClustalW alignments, it is also possible to identify them as such. In practice, the sequence score, as imperfect as it may seem, provides a fast and simple way to identify sequences that do not really belong to a set, or that are so remotely related to the rest that their alignment should be considered with care.

The sequence score is a global measure. It does not reflect the local variations that occur at the residue level. To analyse these kinds of data and assess their utility for predicting correct portions of an alignment, the score of each residue in each multiple sequence alignment was evaluated using equation (3). These scores were scaled in a range varying from 0 to 9. A residue has a score of 9 if >90% of the pairs in which it is involved are also present in the reference library, and so forth for 8 ([80–90%]), etc. Once residue scores have been evaluated, substitution classes can be defined. For instance, the class 5 of substitutions includes all the residues of a multiple alignment having a residue score greater than or equal to 5 (Figure 3a); the class 0 of substitutions includes all the residues in the alignment. Figure 3a gives an example of such an evaluation. In this alignment, each residue is replaced by its score, and the residues that belong to the category 5 of substitution are boxed. Figure 3b shows the correctly aligned residues in this category. Using our measure, one can thus identify some of the correct portions in the SAGA fniii alignment. As can be gathered from Table 1, fniii is a very demanding test case. Except for the first two sequences, which are almost identical, all the other members of this set have a very low level of identity with one another.
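The scaling and category definitions above can be written down in a few lines. This is a minimal sketch under assumed data shapes (a list of per-residue supported-pair fractions), not the SAGA implementation:

```python
def scaled_residue_score(fraction_supported):
    """Map the fraction of a residue's pairs found in the library
    (0.0-1.0) onto the 0-9 scale used in the text: 9 means >=90%
    of the pairs are present, 8 means 80-90%, and so on."""
    return min(int(fraction_supported * 10), 9)

def substitution_category(fractions, k):
    """Category k of substitution: the positions whose scaled residue
    score is >= k. Category 0 therefore contains every residue."""
    return [pos for pos, f in enumerate(fractions)
            if scaled_residue_score(f) >= k]

# Example: three residues with 95%, 30% and 55% of their pairs supported
# by the library; only the first and third reach category 5.
print(substitution_category([0.95, 0.30, 0.55], 5))  # -> [0, 2]
```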
The sequence 2hft_1 illustrates particularly well both the advantages and the limits of our method. This sequence is not the most remotely related of the set: it has an average identity of 14%, whereas two other members (3hhr_c and 2hft) are more distantly related, with 12% average identity. Despite this fact, 2hft_1 gets the lowest sequence score in the multiple alignment (0.29). This correlates well with the fact that it also has the lowest alignment accuracy of the set [18% overall, 20% for the (E+H) category]. Similarly, the only non-terminal stretch of this sequence that belongs to the category 5 is also one of the few portions to be correctly aligned (Figure 3a and b). The same type of analysis was carried out on the 10 other test cases (Table 5). Our measures indicate that, using the category 5 of substitution, a substantial portion of correctly aligned residues can be identified. When comparing ClustalW and SAGA, we found that more correct residues can be identified with SAGA. This improvement is sometimes achieved at the cost of a slightly lower accuracy (more false positives) in the SAGA alignments.

Fig. 3. Evaluation of the accuracy of the fniii test case (fibronectin type III family). (a) Sequences in the fniii test case were aligned by SAGA-COFFEE using a ClustalW library. The alignment obtained that way was evaluated locally. The sequence names are the PDB identifiers. The suffix _1, _2, etc. indicates that several portions of the same sequence have been used (cf. 3D_ali for further details). In this alignment, each residue has been replaced by its score. The gray boxes indicate all the residues that belong to category 5 of substitution (i.e. having a score ≥5). The sequence score box lists the values measured on each sequence. (b) The accuracy in the category 5 of substitution (boxes) was evaluated by comparison with the reference alignment. Residues shadowed in gray in the boxes are correctly aligned to one another. Boxed residues not shadowed are not correctly aligned with each other or with the rest of the category 5 residues. Residues not contained in the boxes are not taken into account for this evaluation.

A global estimation was made to evaluate the accuracy that can be expected when using any of the 10 substitution categories on a SAGA alignment. The proportion of correct substitutions predicted that way was also measured. These results are presented in Figure 4a and b, respectively. Residues are grouped in three classes, depending on the score of the sequences in which they occur. Figure 4a confirms that high-scoring residues are usually correctly aligned (high accuracy). However, the higher the substitution category, the smaller the number of residues on which a prediction can be made, as shown in Figure 4b. These graphs confirm that the residue score can be used as a measure for predicting accuracy; they also indicate that the sequence score is informative when making a prediction on a residue.

Making a multiple structural alignment

The analysis carried out with the ClustalW libraries represents only one possible application of the COFFEE function. Generally speaking, the COFFEE scheme allows the combination of the information contained in any reference library, regardless of the method used for its construction. To illustrate this fact, we show that it is possible to build a structure-based multiple sequence alignment when a library of high-quality pairwise structural alignments is available. We used COFFEE on two sets of proteins (vjs and ceo) using appropriate FSSP libraries. It was impossible to improve significantly over FSSP for the ceo test case, made of endoglucanases and other related carbohydrate degradation enzymes.
This can be explained by the fact that the FSSP alignment with the best DALI score (the one using 1ceo as a guide) already has a high level of consistency with the library (COFFEE score = 0.82). This shows quite clearly in the fact that this alignment is 88% similar to the SAGA-COFFEE one. The second set (vjs) is made of amylases and other carbohydrate degradation enzymes. Table 6 compares the SAGA-COFFEE alignment of these sequences with the corresponding FSSP pairwise-based multiple alignments. These results clearly indicate that the alignment produced by SAGA is better than any of the FSSP multiple alignments, regardless of the criterion used to evaluate this improvement (DALI score, consensus length or RMS). This result was to be expected, since SAGA makes use of much more information than any of the FSSP alignments. In Table 6, entries are sorted according to the DALI score. This allows one to see that the DALI and COFFEE scores correlate well for the FSSP alignments, and supports the idea that the COFFEE score is also a good indicator of alignment quality when the library is based on structural alignments.

Fig. 4. Prediction of correctly aligned residues using the residue COFFEE score. (a) The accuracy of the alignments (number of correct substitutions in one of the categories divided by the total number of substitutions in this category) of each sequence was measured. To do so, sequences were divided into three groups, depending on their sequence score. The graph was made for each of the three groups. (b) For each sequence, the number of correct substitutions contained in each category was evaluated and divided by the overall number of substitutions involving that sequence. This value was plotted against the category of substitution.

Table 6. Comparison of FSSP and SAGA multiple alignments

Guide sequence  Average DALI score  Average consensus length  Average RMS (Å)  COFFEE score
2ebn            1152.6              186.5                     3.73             0.53
1cnv            1205.2              196.4                     3.63             0.59
1vjs            1258.4              198.8                     3.62             0.50
1ctn            1331.2              196.9                     3.53             0.60
1smd            1667.1              219.4                     3.40             0.65
2amg            1672.9              217.7                     3.42             0.67
2aaa            1766.8              224.9                     3.45             0.69
1pamA           1786.3              225.8                     3.30             0.70
SAGA-COFFEE     1860.0              230.9                     3.20             0.79

Guide sequence: sequence used as a guide in the FSSP multiple alignment (SAGA-COFFEE indicates the alignment obtained with SAGA-COFFEE). Average DALI score: average DALI score of each pair of sequences in the alignment. The table is sorted according to the values in this column. Average consensus length: average number of residues superimposable by DALI in each pair of sequences. Average RMS: average of the RMS values measured by DALI on each pair of the alignment, in Ångströms. COFFEE score: score given by SAGA to the multiple alignment using the same library.

In theory, we could have used the DALI score as an objective function and optimized it with SAGA. In such a context, DALI would be used to evaluate all the pairwise projections in order to give a score similar to the one shown in the 'DALI score' column of Table 6. However, this is not possible in practice, because the computation of a DALI score is much more expensive than the evaluation of a COFFEE score. The DALI score of a multiple alignment is quadratic with the number of sequences and quadratic with the length of the alignment. The COFFEE score is also quadratic with the number of sequences, but only linear with the length of the alignment.
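To make the trade-off concrete, here is a back-of-the-envelope sketch. The operation counts are illustrative orders of magnitude only, not actual DALI or COFFEE code, and the example sizes are arbitrary:

```python
def pairwise_projections(nseq):
    # Both scores are evaluated over every pairwise projection of the
    # multiple alignment, hence the quadratic dependence on nseq.
    return nseq * (nseq - 1) // 2

def dali_operations(nseq, length):
    # DALI compares intra-molecular distance matrices, so each pairwise
    # projection costs on the order of length**2 operations.
    return pairwise_projections(nseq) * length ** 2

def coffee_operations(nseq, length):
    # COFFEE only checks each aligned pair of columns against the
    # library, so each projection costs on the order of length operations.
    return pairwise_projections(nseq) * length

# For an alignment of 8 sequences and 230 columns (roughly the size of
# the vjs consensus), the gap is a factor of the alignment length itself:
print(dali_operations(8, 230) // coffee_operations(8, 230))  # -> 230
```

Inside a genetic algorithm that re-scores thousands of candidate alignments per generation, this per-evaluation factor is what makes a DALI-based objective function impractical.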
In consequence, even if we were to assume a global DALI score to be biologically more realistic than the FSSP library-based COFFEE score, COFFEE still appears to be a good trade-off between approximating DALI and saving on computational cost.

Discussion

In this work, we show that alignments can be evaluated for their maximum weight trace (MWT) score using the COFFEE function and subsequently optimized with the genetic algorithm package SAGA. Given a heterogeneous, non-consistent collection of pairwise alignments, one can extract the corresponding multiple alignment with COFFEE and SAGA. We have shown here that the SAGA-COFFEE scheme outperforms most of the commonly used alternative packages when applied to sequences having low levels of identity. The comparison made with other global optimization techniques, such as SAGA-MSA and PRRP, indicates that the method is better not only because it performs a global optimization, but also probably because of the way it uses information, filtering some of the noise through the library of pairwise alignments. The weighting scheme also plays a role in this improvement. It helps turn the relationships between the sequences into some of the constraints that define the optimal alignment. It is because all these constraints (library and weights) are unlikely to be consistent that the genetic algorithm strategy proves to be a very appropriate means of optimization. There is little doubt that the performance of our method will also depend on the relationships between the sequences. Sets with many intermediate sequences (i.e. a dense phylogenetic tree) are likely to lead to more accurate alignments. However, the fact that COFFEE proves able to deal with sequences having a very low level of identity is quite encouraging regarding the robustness of the method. One of the main advantages of the COFFEE strategy is the flexibility given to the user for defining the library.
Here, by using two completely different pairwise alignment methods, we managed to produce high-quality multiple alignments in both cases. This is interesting, but constitutes only a first step. The structure of the libraries we have been using is very simple. They only rely on an 'all-against-all' comparison using one type of pairwise alignment algorithm per library. In practice, this scheme can easily be extended to much more complex library structures. It is common sense to have a higher confidence in results that can be reproduced using independent methods. Some prediction methods rely on this type of assumption, such as the block definition strategy described by Henikoff et al. (1995). These methods usually limit themselves to identifying highly conserved patterns among a set of solutions. With the COFFEE strategy, we go much further and make it possible to find a consensus solution whatever the number of constraints and whatever their relative compatibility. Of course, it is not enough for a solution to exist; one also needs to know how accurate this solution is. In this work, we have shown that the level of consistency of a solution is a good indicator of such accuracy. This accuracy prediction constitutes the other main aspect of the COFFEE function. Several methods have been proposed that attempt to predict the correct portions of a pairwise alignment given a set of sub-optimal alignments (Gotoh, 1990; Vingron and Argos, 1990; Mevissen and Vingron, 1996). Using these methods, libraries could be designed with large numbers of sub-optimal alignments. Here again, the difference between the COFFEE method and previously proposed approaches is that not only is it possible to predict the correct portions of an alignment, it is also possible to optimize a multiple alignment to contain as many reliable regions as possible. SAGA-COFFEE still needs to be improved on several accounts.
For instance, further approaches will involve the definition of more complex libraries that will hopefully lead to more meaningful consistency indices. The main source of inspiration when doing so will be the work done on pairwise alignment stability (Mevissen and Vingron, 1996). The other direction we plan to take has to do with the combination of scoring schemes. We have seen here that there is no uniform solution to the multiple sequence alignment problem. For this reason, it would make sense to generate libraries containing alternative alignments made by all the available methods (PRRP, ClustalW, HMM, etc.). COFFEE could then be used to merge this information and hopefully extract the best of each alignment. This will require some improvement of the COFFEE function and its adaptation to highly redundant libraries. Another crucial aspect will be increasing the efficiency of the algorithm. At present, SAGA-COFFEE is an extremely slow method; however, we hope to improve on this by using a more appropriate type of seeding. Finally, another important aspect of our approach will involve the refinement of the method used here for building multiple structural alignments. The project will be based on a procedure similar to the one described above: the design of more efficient weights and an attempt to use the alternative structural alignments that can be produced by the DALI method, using a wider range of DALI test cases.

Acknowledgements

The authors wish to thank Miguel Andrade and Thure Etzold for very useful comments and corrections. They also wish to thank the referees for their useful remarks and interesting suggestions.

References

Altschul,S.F. (1989) Gap costs for multiple sequence alignment. J. Theor. Biol., 138, 297–309.
Altschul,S.F. and Erickson,B.W. (1986) Optimal sequence alignment using affine gap costs. Bull. Math. Biol., 48, 603–616.
Altschul,S.F., Carroll,R.J. and Lipman,D.J.
(1989) Weights for data related by a tree. J. Mol. Biol., 207, 647–653.
Bucher,P. and Hofmann,K. (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, St Louis, MO.
Carrillo,H. and Lipman,D.J. (1988) The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48, 1073–1082.
Davis,L. (1991) The Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York.
Dayhoff,M.O. (1978) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC.
Eddy,S.R. (1995) Multiple alignment using hidden Markov models. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England. AAAI Press, Menlo Park, CA.
Feng,D.-F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360.
Godzik,A. and Skolnick,J. (1994) Flexible algorithm for direct multiple alignment of protein structures and sequences. Comput. Applic. Biosci., 10, 587–596.
Goldberg,D.E. (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, New York.
Gotoh,O. (1990) Consistency of optimal sequence alignments. Bull. Math. Biol., 52, 509–525.
Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838.
Gribskov,M., McLachlan,M. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.
Henikoff,S., Henikoff,J., Alford,W. and Pietrokovsky,S. (1995) Automated construction and graphical representation of blocks from unaligned sequences.
Gene, 163, GC17–26.
Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.
Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138.
Holm,L. and Sander,C. (1996a) Alignment of three-dimensional protein structures: network server for database searching. Methods Enzymol., 266, 653–662.
Holm,L. and Sander,C. (1996b) The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res., 24, 206–210.
Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Applic. Biosci., 12, 95–107.
Ishikawa,M., Toya,T., Hoshida,M., Nitta,K., Ogiwara,A. and Kanehisa,M. (1993a) Multiple sequence alignment by parallel simulated annealing. Comput. Applic. Biosci., 9, 267–273.
Ishikawa,M., Toya,T. and Tokoti,Y. (1993b) Parallel iterative aligner with genetic algorithm. In Artificial Intelligence and Genome Workshop, 13th International Conference on Artificial Intelligence, Chambery, France.
Kececioglu,J.D. (1993) The maximum weight trace problem in multiple sequence alignment. Lecture Notes Comput. Sci., 684, 106–119.
Kim,J., Pramanik,S. and Chung,M.J. (1994) Multiple sequence alignment using simulated annealing. Comput. Applic. Biosci., 10, 419–426.
Krogh,A. and Mitchison,G. (1995) Maximum entropy weighting of aligned sequences of proteins or DNA. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England. AAAI Press, Menlo Park, CA.
Lipman,D.J., Altschul,S.F. and Kececioglu,J.D. (1989) A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA, 86, 4412–4415.
Mevissen,H.T. and Vingron,M. (1996) Quantifying the local reliability of a sequence alignment. Protein Eng., 9, 127–132.
Morgenstern,B., Dress,A. and Werner,T.
(1996) Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl Acad. Sci. USA, 93, 12098–12103.
Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Notredame,C. and Higgins,D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24, 1515–1524.
Notredame,C., O'Brien,E.A. and Higgins,D.G. (1997) RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res., 25, 4570–4580.
Pascarella,S. and Argos,P. (1992) A data bank merging related protein structures and sequences. Protein Eng., 5, 121–137.
Rost,B. (1997) AQUA Server. http://www.ebi.ac.uk/∼rost/Aqua/aqua.html
Sibbald,P.R. and Argos,P. (1990) Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J. Mol. Biol., 216, 813–818.
C.Notredame, L.Holm and D.G.Higgins
Sjolander,K., Karplus,K., Brown,M., Hughey,R., Krogh,A., Saira,M. and Haussler,D. (1996) Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Applic. Biosci., 12, 327–345.
Taylor,W.R. (1988) A flexible method to align large numbers of biological sequences. J. Mol. Evol., 28, 161–169.
Thompson,J., Higgins,D. and Gibson,T. (1994a) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994b) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Applic. Biosci., 10, 19–29.
Vingron,M. and Argos,P. (1990) Determination of reliable regions in protein sequence alignment. Protein Eng., 3, 565–569.
Vingron,M. and Argos,P. (1991) Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol., 218, 33–43.
Vingron,M. and Sibbald,P.
(1993) Weighting in sequence space: a comparison of methods in terms of generalized sequences. Proc. Natl Acad. Sci. USA, 90, 8777–8781.
Vogt,G., Etzold,T. and Argos,P. (1995) An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol., 249, 816–831.
Wang,L. and Jiang,T. (1994) On the complexity of multiple sequence alignment. J. Comput. Biol., 1, 337–348.

Using multiple alignment methods to assess the quality of genomic data analysis

Cédric Notredame and Chantal Abergel
Information Génétique et Structurale UMR 1889, 31 Chemin Joseph Aiguier, 13 006 Marseille
Email: cedric.notredame@igs.cnrs-mrs.fr, chantal.abergel@igs.cnrs-mrs.fr

ABSTRACT

The analysis of multiple sequence alignments can generate essential clues in genomic data analysis. Yet, to be informative, such analyses require some means of estimating the reliability of a multiple alignment. In this chapter we describe a novel method allowing the unambiguous identification of the residues correctly aligned within a multiple alignment. This method uses an index named CORE (Consistency of the Overall Residue Evaluation) based on the T-Coffee multiple sequence alignment algorithm. We provide two examples of application: one where the CORE index is used to identify correct blocks within a difficult multiple alignment, and another where the CORE index is used on genomic data to identify the proper start codon and a frameshift within one of the sequences.

Introduction

Biological analysis largely relies on the assembly of elaborate models meant to summarize our knowledge of life's complex mechanisms. For that purpose, vast amounts of data are collected, analyzed, validated and then integrated within a model. In an ideal world, an existing model would be available to explain every bit of experimental data. In the real world, this is rarely the case, and every day existing models need to be modified to accommodate new findings.
Sometimes, data that cannot be explained are kept at bay until the accumulation of new evidence prompts the design of an entirely new model. Unexplained data can be viewed as the stuff inflating an inconsistency bubble: eventually, the bubble bursts and a new model is designed. A multiple alignment is nothing less than such a model. Given a series of sequences and an alignment criterion (structural similarity, common phylogenetic origin), the multiple alignment contains a series of hypotheses regarding the relationships between the sequences it is made of. This alignment can accommodate data generated experimentally (e.g. the alignment of two homologous catalytic residues) or combine the results of various sequence analysis methods. The importance of multiple sequence alignments in the context of sequence analysis has been recognized for a long time, and is so well established that most bioinformatics protocols make use of them. Multiple alignments have been turned into profiles (Gribskov et al., 1987) and hidden Markov models (Krogh et al., 1994) to enhance the sensitivity and specificity of database searches (Altschul et al., 1997). State-of-the-art methods for protein structure prediction depend on the proper assembly of a multiple sequence alignment (Jones, 1999), as do phylogenetic analyses (Duret et al., 1994). Over recent years, multiple sequence alignment techniques have been instrumental to improvements made in almost every key area of sequence analysis. Yet, despite its importance, the accurate assembly of a multiple sequence alignment remains a complex process: the biological knowledge and the computational power it requires are far beyond our current capacities. As a consequence, biologists are left to use approximate programs that attempt to assemble proper alignments without providing any guarantee that they may do so. The lack of a 'perfect', or at least reasonably robust, method explains why so many multiple sequence alignment packages exist.
The variations among these packages are not only cosmetic; they include the use of very different algorithms, different parameters and, generally speaking, different paradigms. For a recent review of state-of-the-art techniques, see (Duret and Abdeddaim, 2000). Database searches, structure prediction and phylogenetic analysis are enough on their own to make multiple alignment compulsory in a genome analysis task. Yet, thanks to the sanity checks they provide, multiple alignments can also be instrumental in tackling the plague of genomic analysis: faulty data. When dealing with genomes, faulty data arise from two major sources: sequencing errors and wrong predictions. The consequence is that a predicted protein sequence may have accumulated errors both at the DNA level and when its frame was predicted (this will be especially true in eukaryotic genes, where exons may be missed, added or improperly predicted). In the worst cases, the effect of such errors will be amplified in the higher-level analysis, leading to an improper analysis of the available data. On the other hand, once they have been identified, these errors are usually easily corrected, either by extra sequencing or by data extrapolation. Therefore, any method providing a reasonable sanity check that earmarks areas of a genome likely to be problematic would be a major improvement. In this chapter we will show how multiple sequence alignments can be used to carry out part of this task. For that purpose we will focus on the applications of T-Coffee, a recently described method (Notredame et al., 2000).

Generating Multiple Alignments With T-Coffee

Despite the large variety of multiple sequence alignment methods publicly available, the number of packages effectively used for data analysis is surprisingly small, and a vast majority of the alignments found in the literature are produced using only two programs: ClustalW (Thompson et al., 1994) and its X-Window implementation, ClustalX.
ClustalW uses the progressive alignment strategy described by Taylor (Taylor, 1988) and by Feng and Doolittle (Feng and Doolittle, 1987), refined in order to incorporate sequence weights and a local gap penalty scheme. Recently, the ClustalW algorithm was further modified in order to improve the accuracy of the produced alignments by making the evaluation of the substitution costs position-dependent. This improved algorithm is implemented in the T-Coffee package (Notredame et al., 2000). The aim of T-Coffee is to build a multiple alignment that has a high level of consistency with a library of pre-computed pairwise alignments. This library may contain as many alignments as one wishes, and it may also be redundant and inconsistent with itself. For instance, it may contain several alternative alignments of the same sequences aligned using various gap penalties. It may also contain alternative alignments obtained by applying different methods to the sequences. Overall, the library is a collection of alignments believed to be correct. Within this library, each alignment receives a weight that is an estimation of its biological likelihood (i.e. how much one trusts this alignment to be correct). For that purpose, one may use any suitable criterion, such as percent identity, a P-value estimation or any other appropriate measure. The T-Coffee algorithm uses this library in order to compute the score for aligning two residues with one another in the multiple alignment. This score is named the extended weight because it requires an extension of the library. The extended weight takes into account the compatibility of the alignment of two residues with the rest of the alignments observed within the library; its derivation is extensively described in (Notredame et al., 2000).
The principle is straightforward: in order to compute the extended weight associated with two residues R and S of two different sequences, one considers whether, when R is found aligned in the library with some residue X of a third sequence, S is also found aligned with that same residue X in another entry of the library. If that is the case, then the weight associated with R and S is increased by the minimum of the two weights RX and SX. The final extended weight is obtained when every possible X has been considered and the resulting contributions summed up. Although this operation seems very expensive from a computational point of view, its effective cost is kept low thanks to the sparseness of the primary library (i.e. for most pairs of residues RS, very few Xs need to be considered). In the end, a pair of residues is highly consistent (and has a high extended weight) if most of the other sequences contain at least one residue that is found aligned both to R and to S in two different pairwise alignments. A key property of this weight extension procedure is to concentrate information: the extended score of RS incorporates information coming from all the sequences in the set, and not only from the two sequences contributing R and S. The main advantage of the extended weights is that they can be used in place of a substitution matrix. While standard substitution matrices do not discriminate between two identical residues (e.g. all cysteines are the same for a PAM (Dayhoff et al., 1979) or a BLOSUM (Henikoff and Henikoff, 1992) matrix), the extended weights are truly position-specific and make it possible to discriminate between two identical residues that differ only by their positions. Once the library has been assembled (potential ways of assembling that library are described later) and the extended weights computed, T-Coffee closely follows the ClustalW procedure, using the extended weights instead of a substitution matrix.
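To make the triplet extension concrete, here is a minimal Python sketch of it. This is an illustration, not T-Coffee's actual implementation: the library is modeled as a plain dictionary mapping residue pairs (each residue being an arbitrary hashable label, a hypothetical representation) to their primary weights, and the direct weight is assumed to contribute first to the extended weight.

```python
from collections import defaultdict

def extend_library(library):
    """Compute extended weights from a primary library.

    `library` maps (residue1, residue2) pairs to primary weights.
    For each pair (R, S), every third residue X found aligned to
    both R and S contributes min(w(R,X), w(S,X)) to the extended
    weight of (R, S), as described in the text.
    """
    # Index: for each residue, the residues it is aligned to.
    partners = defaultdict(dict)
    for (r, s), w in library.items():
        partners[r][s] = w
        partners[s][r] = w

    extended = {}
    for (r, s), w in library.items():
        ext = w  # the direct weight contributes first (an assumption)
        for x, w_rx in partners[r].items():
            if x in (r, s):
                continue
            w_sx = partners[s].get(x)
            if w_sx is not None:
                ext += min(w_rx, w_sx)  # consistent triplet R-X-S
        extended[(r, s)] = ext
    return extended
```

With the toy library `{("a","b"): 10, ("a","c"): 8, ("b","c"): 6}`, the pair ("a","b") keeps its direct weight 10 and gains min(8, 6) = 6 from the consistent triplet through "c", for an extended weight of 16. The sparseness mentioned in the text corresponds to `partners[r]` being short for most residues.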
The overall T-Coffee strategy is outlined in Figure 1. All the sequences are first aligned two by two, using dynamic programming (Needleman and Wunsch, 1970) with the extended library in place of a substitution matrix. The distance matrix thus obtained is then used to compute a neighbor-joining tree (Saitou and Nei, 1987). This tree guides the progressive assembly of a multiple sequence alignment: the two closest sequences are first aligned by normal dynamic programming, using the extended weights to align the residues in the two sequences; no gap penalty is applied (because it has already been applied to generate the alignments contained in the library). This pair of sequences is then fixed, and any gaps that have been introduced cannot be shifted later. The program then aligns the next closest two sequences, or adds a sequence to the existing alignment of the first two, depending on which is suggested by the guide tree. The procedure always joins the next two closest sequences or pre-aligned groups of sequences, and continues until all the sequences have been aligned. To align two groups of pre-aligned sequences, one uses the extended weights as before, but taking the average library scores in each column of the existing alignments. The key feature of T-Coffee is the freedom given to the user to build a custom library following whatever protocol may seem appropriate. For this purpose, one may mix structural information with database results, knowledge-based information or pre-established collections of multiple alignments. It may also be necessary to explore a wide range of parameters of a given computer package. A simple library format was designed to fit that purpose; it is shown in Figure 2. A library is a straightforward ASCII file that contains a listing of every pair of aligned residues that needs to be described. Any knowledge-based information can easily be added manually to an automatically generated library, or the other way round.
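Figure 2 is not reproduced here, but the flavour of such a library file can be conveyed with a small hand-written fragment. The fragment below follows the general layout of the T-Coffee library format (a header, one line per sequence giving name, length and residues, then one block per sequence pair listing `residue1 residue2 weight` triples); the sequence names, residues and weights are invented for illustration, and the exact header keywords may differ between T-Coffee versions and from the figure.

```
! TC_LIB_FORMAT_01
2
seqA 5 MKVLA
seqB 5 MRVLA
#1 2
1 1 80
2 2 80
3 3 80
! SEQ_1_TO_N
```

Adding a knowledge-based constraint by hand amounts to appending one more `residue1 residue2 weight` line to the relevant pair block.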
This figure also shows clearly that the library can contain ambiguities and inconsistencies (i.e. two possible alignments for the first residue of Seq1 with Seq2). These ambiguities are resolved while the alignment is being assembled, on the basis of the score given by the extended weights. The library does not need to contain a weight for each possible pair of residues. On the contrary, an ideal library contains only pairs that will effectively occur in the correct multiple alignment (i.e. N²L pairs rather than N²L² pairs). While this flexibility to design and assemble one's own library is a very desirable property, in practice it is also convenient to have a standard automatic protocol available. Such a protocol exists and is fully integrated within the T-Coffee package. It is run in the default mode and does not require the user to be aware of T-Coffee's underlying concepts (library, extension, progressive alignment). This default protocol, extensively described and validated in (Notredame et al., 2000), requires two distinct libraries to be compiled and combined within the primary library before the extension. The first one contains a ClustalW pairwise alignment of each possible pair of sequences within the dataset. For that purpose, ClustalW (Thompson et al., 1994) is run using default parameters. This library is global because it is generated by aligning the sequences over their whole length (global alignments) using a linear-space version of the Needleman and Wunsch algorithm (Needleman and Wunsch, 1970). The second library is local: for each possible pair of sequences, it contains the ten best non-overlapping local alignments, as reported by the Lalign program (Huang and Miller, 1991) run with default parameters. In both the local and the global libraries, each pair of residues found aligned is associated with a weight equal to the average level of identity within the alignment it came from.
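This weighting scheme is easy to restate in code. The sketch below is illustrative Python, not the T-Coffee source; the gap character, the residue addressing and the percent-identity definition are assumptions. It turns one gapped pairwise alignment into library entries weighted by that alignment's percent identity, and a merge helper sums the weights of pairs that recur across libraries, mirroring the rule stated in the text that the weights of repeated pairs are added.

```python
def alignment_to_pairs(aligned1, aligned2, id1, id2, gap="-"):
    """Convert one gapped pairwise alignment into library entries.

    Every pair of aligned residues is weighted by the percent
    identity of the alignment it comes from.
    """
    cols = list(zip(aligned1, aligned2))
    aligned_cols = [(a, b) for a, b in cols if a != gap and b != gap]
    matches = sum(a == b for a, b in aligned_cols)
    pct_id = 100.0 * matches / len(aligned_cols) if aligned_cols else 0.0

    pairs = {}
    i = j = 0  # 1-based residue positions in the ungapped sequences
    for a, b in cols:
        if a != gap:
            i += 1
        if b != gap:
            j += 1
        if a != gap and b != gap:
            pairs[((id1, i), (id2, j))] = pct_id
    return pairs


def merge_libraries(*libraries):
    """Combine libraries, adding the weights of recurring pairs."""
    merged = {}
    for lib in libraries:
        for pair, w in lib.items():
            merged[pair] = merged.get(pair, 0.0) + w
    return merged
```

Running `alignment_to_pairs` once per ClustalW global alignment and once per Lalign local alignment, then merging the results, gives a toy version of the default primary library described above.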
When a specific pair is found more than once, the weights associated with each occurrence are added. The main strength of this protocol is to combine local and global information within a multiple alignment. The level of consistency within the library will depend on the nature of the sequences. For instance, if the sequences are very diverse, the requirement for long insertions/deletions will often cause the global alignments to be incorrect and inconsistent, while the local alignments will be less sensitive to this type of problem. In such a situation, the measure of consistency will enhance the local alignment signal and let it drive the multiple alignment assembly. Conversely, if the global alignments are good enough, they will help remove the noise associated with the collection of local alignments (local alignments do not have any positional constraints). Overall, the current default T-Coffee protocol contains three distinct elements that lead to the collection of extended weights: the global library, the local library and the library extension that turns the sum of the two libraries into an extended library. Earlier work demonstrated that each of these components plays a significant part in improving the overall accuracy of the program. Table 1 shows that the current version of T-Coffee (version 1.29) outperforms other popular multiple sequence alignment methods, as judged by comparison on BaliBase (Thompson et al., 1999), a database of hand-made reference alignments based on structural comparison (see the Table 1 legend for a description of BaliBase and the comparison protocol). These results illustrate well the good performance of T-Coffee over the wide range of situations that occur in BaliBase.
It is especially interesting to point out that T-Coffee is the only method equally well suited to situations that require a global alignment strategy (categories 1, 2 and 3) and to situations that are better served by a local alignment strategy (categories 4 and 5, with long internal and terminal insertions/deletions). The other methods are good either for global alignments (like ClustalW) or for local alignments (like Dialign2 (Morgenstern et al., 1998)). It should be noted that T-Coffee still uses ClustalW 1.69 to construct the primary global library, because this was the last 'naïve' version of ClustalW, not tuned on BaliBase. The latest version (1.81) has been tuned on the BaliBase references (hence its improved performance over the results originally reported for ClustalW). Using this ClustalW 1.81 version when benchmarking T-Coffee would make the process circular. Nonetheless, as good as it may seem, T-Coffee still suffers from the same shortcoming as any other package available today: it is not always the best method. Even if on average it does better than any of its counterparts, one cannot guarantee that T-Coffee will always generate the best alignment. For instance, although Dialign2 is significantly less accurate overall, it outperforms T-Coffee on 17 test sets (11%). ClustalW is better than T-Coffee in 24% of the cases. We may conclude from this that, in practice, there will always be situations where some alternative method beats T-Coffee. Furthermore, even in cases where the T-Coffee improvement over any alternative method is very significant, it may still produce an alignment much less than 100% correct. For practical usage, it would be much more helpful to know where the correctly aligned portions lie: a method that is only 20% correct but comes with a proper estimation of its reliability would be much more useful than a method that is merely more accurate 'on average'.
Several situations exist in which a biologist can make use of this reliability information. For instance, if the purpose of the alignment is to extrapolate some experimental data onto an otherwise uncharacterized genomic sequence, one will need to be very careful not to deduce anything from an unreliable portion of the alignment. More generally, unreliable positions within a multiple sequence alignment should not be used for predictive purposes. For instance, when turning a multiple alignment into a profile in order to scan databases for remote homologues, it is essential to exclude regions whose alignment cannot be trusted and that may obscure some otherwise highly conserved position. Used in this fashion, reliability information allows a significant decrease of the noise induced by locally spurious alignments. The other important application of a reliability measure is the identification of regions within a multiple alignment that are properly aligned without being highly conserved. These regions are extremely important when the alignment is used in conjunction with a predictive method that bases its analysis on mutation patterns. For instance, structure and phylogeny prediction methods require the presence of non-conserved positions to yield informative results. Any scheme that allows discrimination between positions that are degenerate but correctly aligned and positions that are simply misaligned may induce a dramatic improvement in the accuracy of these prediction methods. Furthermore, a reliability measure will help identify faulty data and provide some clues on how to correct it. In the next section, we show how consistency can be measured on a T-Coffee alignment and how that measure provides a fairly accurate reliability estimator.
Measuring The Consistency Of A Multiple Sequence Alignment

T-Coffee is a heuristic algorithm that attempts to optimize the consistency between a multiple alignment and a list of pre-computed pairwise alignments known as a library (Figure 2). By consistency we mean that a pair of residues described as aligned in the library will also be found aligned in the multiple alignment. While the theoretical maximum for the consistency is 100%, the score of an optimal alignment will only be equal to the level of self-consistency within the library. Figure 2 shows the example of a library that is not self-consistent because it is ambiguous regarding the alignment of some of the residues it contains. Of course, the more ambiguous the library, the less consistency it will yield. For instance, given two residues R1 and R2 taken from two different sequences S1 and S2, one can easily measure the consistency CS(R1,R2) between the alignment of these two residues and all the other alignments contained in the library, by comparing ES(R1,R2), the extended score of the pair R1 and R2, with the sum of the extended scores of all the other potential pairs that involve S1 and S2 and either R1 or R2. If we want to use it as a quality factor, this measure suffers from two major drawbacks. Firstly, it is expensive to compute: given a multiple alignment of N sequences and of length L, each pair of residues found in the multiple alignment needs O(L) extension operations, each of which requires a minimum of O(N) operations. "O(L)" is the standard big-O notation, meaning that the computation time is proportional to L, up to a constant factor. Since there are of the order of L·N² pairs of residues in a multiple alignment, this leads to O(L²N³) operations for an estimate of the CS of every pair. This cubic complexity in the number of sequences becomes problematic with large datasets.
The second limitation of this measure is that with sequences rich in repeats, the summation factor can become artificially high and cause a dramatic decrease of the consistency score. Formally, the consistency measure is defined as

CS(R1,R2) = ES(R1,R2) / [ Σx ES(R1,x) + Σx ES(x,R2) ]   (1)

where the sums run over the other potential pairs involving R1 or R2. In practice, we found it much more effective to use the extended score of the best-scoring pair contained in the alignment as a normalization factor. This defines the aCS (approximate Consistency measure):

aCS(R1,R2) = ES(R1,R2) / Max{ES(Ri,Rj)}   (2)

with R1, R2 any two residues found aligned in the multiple alignment. Our measurements on the BaliBase dataset indicate that the CS and the aCS are well correlated. An important criterion, when using the aCS as a reliability measure, is its ability to discriminate between correct and incorrect alignments within the so-called twilight zone (Sander and Schneider, 1991). Given two sequences, the twilight zone is a range of percent identity (between 0 and 30%) that has been shown to be non-informative regarding the relationship that exists between two sequences: two sequences whose alignment yields less than 30% identity can be unrelated, related and incorrectly aligned, or related and perfectly aligned. To check how good the aCS is when used as an accuracy measure, every one of the 142 BaliBase datasets was aligned using T-Coffee 1.29, and the similarity of each pair of sequences was measured within the obtained alignments. Pairs of sequences with less than 30% identity (5088 pairs) were extracted, and the accuracy of their alignment was assessed by comparison with their counterparts in the reference BaliBase alignment; the aCS score was also computed on each pair of aligned residues and averaged along the sequences. Figure 3a shows the scatter graph of identity vs. accuracy (see the Figure legend for the definitions). Although there is a weak correlation between these two measurements, the percent identity is a poor predictor of the alignment accuracy.
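Equations (1) and (2) translate directly into code. This is an illustrative Python sketch, not part of T-Coffee; the extended scores are assumed to be given as a dictionary keyed by residue pairs, and the CS denominator is read as the sum over the competing pairs involving either residue, excluding the pair itself (the text leaves this detail ambiguous).

```python
def cs(extended, r1, r2):
    """CS(R1,R2) as in equation (1): ES(R1,R2) divided by the summed
    extended scores of the other pairs involving R1 or R2. Excluding
    the pair itself from the denominator is an assumption."""
    competing = sum(w for (a, b), w in extended.items()
                    if (a, b) != (r1, r2) and (r1 in (a, b) or r2 in (a, b)))
    return extended[(r1, r2)] / competing if competing else 1.0

def acs(extended, r1, r2):
    """aCS(R1,R2) as in equation (2): ES(R1,R2) normalised by the best
    extended score in the alignment, scaled here to a 0-100 range."""
    return 100.0 * extended[(r1, r2)] / max(extended.values())
```

The aCS normalisation is what keeps the measure cheap: a single pass finds the best extended score, after which each pair is scored in constant time, avoiding the per-pair summations that make the full CS expensive.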
For 75% of the sequence pairs (identity lower than 25%), the accuracy indication given by the percent identity falls in a 40% range (i.e. the average identity indicates the average accuracy +/- 20%). On the other hand, when the accuracy is plotted against the aCS score (Figure 3b), the correlation is improved, and for pairs of sequences having an aCS higher than 20 (true for 60% of the 5088 pairs) this measure is a much better predictor of alignment accuracy than the percent identity. While they do not solve the twilight zone problem, these results indicate that the aCS measure provides us with a powerful means of assessing an alignment's reliability within the twilight zone. Nonetheless, from a practical point of view, the aCS measure alone is of limited use, since one is often more concerned with the overall quality (i.e. is residue r of sequence S correctly aligned with the rest of the sequences?) than with pairwise relationships. In order to answer this type of question, the aCS measure was used to derive three very useful non-pairwise indexes. The Consistency of the Overall Residue Evaluation (CORE) is obtained by averaging the scores of the aligned pairs that a residue forms with the other residues found in the same column. The CORE index and equivalent approaches have been shown in the literature to be good indicators of the local quality of a multiple sequence alignment (Heringa, 1999; Notredame et al., 1998), as judged by comparison with reference biological alignments. In the T-Coffee package, an option makes it possible to output multiple alignments with the CORE index (a rounded value between 0 and 9) replacing each residue.
It is also possible to produce a colorized version (PDF, PostScript or HTML) of that same multiple alignment, where residues receive a background coloration proportional to their CORE index (blue/green for low-scoring residues, orange/red for high-scoring ones). Such an output is shown in Figures 5 and 6. The CORE index is merely an average aCS measure:

CORE(Rx) = Σy≠x aCS(Rx,Ry) / (N − 1)   (3)

where Rx and Ry are residues found aligned in the same column of an alignment of N sequences. Whether that measure provides some indication of the multiple alignment quality is a key question. We tested that hypothesis on the complete BaliBase dataset. Given each T-Coffee alignment, residues were divided into four categories: (i) true positives (TP) are correctly aligned residues rightfully predicted to be so; (ii) true negatives (TN) are incorrectly aligned residues rightfully predicted to be so; (iii) false positives (FP) are residues predicted to be well aligned when they are not; (iv) false negatives (FN) are residues wrongly predicted to be misaligned. Following previously described definitions (Notredame et al., 1998), a residue is said to be correctly aligned if at least 50% of the residues to which it was aligned in the reference alignment are found in the same column in the T-Coffee alignment. Each of the 10 CORE indexes (0 to 9) was used in turn as a threshold to discriminate correctly and incorrectly aligned residues in the T-Coffee alignments. The BaliBase reference alignments were then used to evaluate the TP, TN, FP and FN. The sensitivity and the specificity were then computed according to Sneath and Sokal (Sneath and Sokal, 1973) and plotted on a graph (Figure 4). Our results indicate that the best trade-off between sensitivity and specificity is obtained when CORE=3 is used as a threshold (i.e. every residue with a score greater than or equal to 3 is considered properly aligned). In that case the specificity is 84% and the sensitivity is 82%.
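The threshold evaluation just described can be sketched in a few lines of illustrative Python. The data layout (one CORE score and one boolean 'correctly aligned' flag per residue) is an assumption; so are the formulas sensitivity = TP/(TP+FN) and specificity = TP/(TP+FP), since the original follows Sneath and Sokal (1973) and the exact definitions used there are not reproduced here.

```python
def evaluate_threshold(core_scores, correct, threshold):
    """Classify each residue as 'predicted aligned' when its CORE
    index reaches the threshold, count TP/TN/FP/FN against the
    reference truth, and derive sensitivity and specificity."""
    tp = tn = fp = fn = 0
    for score, is_correct in zip(core_scores, correct):
        predicted = score >= threshold
        if predicted and is_correct:
            tp += 1
        elif not predicted and not is_correct:
            tn += 1
        elif predicted:
            fp += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity
```

Sweeping `threshold` over the ten CORE values (0 to 9) and plotting the resulting pairs reproduces the kind of trade-off curve shown in Figure 4.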
These high figures partly reflect the overall quality of the T-Coffee alignments, in which 80.5% of the residues are correctly aligned according to the criteria used here. It is therefore more interesting to note that when the CORE index reaches 7, the specificity is 97.7% and the sensitivity is close to 50%. This means that, thanks to the CORE index, half of the residues properly aligned in a multiple alignment can be unambiguously identified (i.e. more than 40% of all the residues contained in BaliBase). In the next section we will see that this identification sometimes occurs in cases that are far from trivial, even for an expert eye. Similar results were observed when applying the CORE index to multiple alignments obtained with another method (i.e. ClustalW alignments evaluated with a standard T-Coffee library). This suggests that the CORE measure may be used to evaluate the local quality of a multiple alignment produced by any method. However, one should be well aware that the relevance of the CORE measure regarding the reliability of an alignment is entirely dependent on the way in which the library was derived. All the conclusions drawn here only apply to libraries derived using the standard T-Coffee protocol. The sequence CORE (sCORE) is obtained by averaging the CORE scores over all the residues contained within one sequence. That measure can be helpful for identifying an outlier among the sequences, i.e. a sequence that should not be part of the set, either because it is not homologous or because it is too distantly related to the other members to yield an informative alignment. The alignment CORE (alCORE) may be obtained by averaging the sCOREs over all the sequences. Our analysis suggests that this index may be a reasonable indicator of the overall accuracy of the alignment. Yet, to be fully informative, it requires the sequence set to be homogeneous (i.e. the standard deviation of the sCOREs should be as low as possible).
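The two derived indexes can be sketched in a few lines. This is illustrative only; gap handling and the exact averaging used by the T-Coffee implementation are not specified in the text:

```python
from statistics import mean, pstdev

def score_summaries(core_per_sequence):
    """Derive sCORE (one value per sequence) and alCORE (one value for
    the whole alignment) by averaging CORE indexes, as described above.

    core_per_sequence: dict mapping a sequence name to the list of
    per-residue CORE indexes (0-9) of that sequence, gaps excluded.
    """
    scores = {name: mean(vals) for name, vals in core_per_sequence.items()}
    alcore = mean(scores.values())
    # A low spread of the sCOREs is what makes alCORE informative.
    spread = pstdev(scores.values())
    return scores, alcore, spread
```

A sequence whose sCORE sits far below the others is a candidate outlier; a large spread warns that the alCORE average should not be trusted.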
    sCORE(S_x) = \sum_{i=1}^{L_x} CORE(R_i^x) / L_x    (4)

where L_x is the number of residues in sequence S_x.

Using the CORE Measure To Assess Local Alignment Quality

The driving force behind the development of the CORE index is the identification of correctly aligned blocks of residues within a multiple sequence alignment. It is common practice to identify these blocks by scanning the multiple alignment and marking highly conserved regions as potentially meaningful. ClustalW and ClustalX provide a measure of conservation that may help the user carry out this task. Unfortunately, situations exist where it is difficult to make a decision regarding the correct alignment of some residues within an alignment. Such an example is provided in Figure 5 with the BaliBase alignment known as 1pamA_ref1, made of six alpha-amylases. This set is difficult to align because it contains highly divergent sequences. Not only have these sequences accumulated mutations while diverging, but they have also undergone many insertions/deletions that make it difficult to reconstruct their relationships with accuracy. The average level of identity measured on the BaliBase reference is 18%, the two closest sequences being less than 20% identical. As such, 1pamA_ref1 constitutes a classic example of a test set deceptive for most multiple sequence alignment methods. The fact that less than one third of the 1pamA_ref1 reference alignment is annotated as trustworthy in BaliBase confirms that suspicion. When run on this test set, existing alignment programs generate different results: Prrp finds 37% of the columns annotated as trustworthy in BaliBase, ClustalW (1.81) 40%, T-Coffee 54% and Dialign2 56%. Regardless of the method used, such an alignment is completely useless unless correctly aligned portions can be identified. This is exactly the information that the CORE index provides. An alignment colorized according to its CORE indexes is shown in Figure 5. The results are in good agreement with those reported in Figure 4.
Out of the 905 correctly aligned residues (42% of the total), 267 have a score higher than 7. No incorrectly aligned residue has a score higher than or equal to 7. Using 7 as a prediction threshold gives a sensitivity of 29% and a specificity of 100%. Residues with a CORE index of 3 or higher (pale yellow) yield a sensitivity of 65% and a specificity of 79%. In this alignment, the main features are the red/dark-orange blocks: they are 100% correct. These blocks could be fed as they are to any suitable method (structure prediction, phylogeny, etc.). They are not very well conserved at the sequence level and are therefore very informative for structural and phylogenetic analysis. For instance, block II in Figure 5 is perfectly aligned although, within that block, the average pairwise identity is lower than 30% (41% for the two most closely related sequences). The measure of consistency can also help question positions that may seem unambiguous from a sequence point of view. In the column annotated as I, the position marked with a "*" could easily be mistaken for a correct one: it lies within a block, and aromatic positions are usually fairly well conserved and, owing to their relative rarity, unlikely to occur by chance. Yet the green color code indicates that this position may be incorrectly aligned (the green tyrosine has a CORE index of 1). This is confirmed by comparison with the reference, which shows that the correct alignment incorporates another tyrosine at this position. When analyzing these patterns, one should always keep in mind that the consistency information only has a positive value. In other words, inconsistent regions are those where the library does not support the alignment. This does not mean they are incorrectly aligned, but rather that no information is at hand to support or disprove the observed alignment.
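The block extraction this implies can be sketched as follows. This is a hypothetical helper, not part of T-Coffee; using the per-column minimum CORE and the particular threshold are assumptions:

```python
def reliable_blocks(column_min_core, threshold=7):
    """Return (start, end) column ranges (inclusive, 0-based) in which
    every column's minimum CORE index reaches the threshold -- the
    candidate blocks to feed to downstream methods such as structure
    prediction or phylogeny.
    """
    blocks, start = [], None
    for i, core in enumerate(column_min_core):
        if core >= threshold and start is None:
            start = i                      # a reliable block opens here
        elif core < threshold and start is not None:
            blocks.append((start, i - 1))  # the block closes at i-1
            start = None
    if start is not None:
        blocks.append((start, len(column_min_core) - 1))
    return blocks
```

Lowering the threshold trades specificity for sensitivity, exactly as in the threshold scan of Figure 4.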
Identifying Faulty Gene Predictions

Another possible application of the T-Coffee CORE index is to reveal, and help resolve, sequence ambiguities in predicted genes. In the structural genomics era, many projects involve hypothetical proteins for which an accurate prediction of the start and stop codons is needed to properly express the gene product. Since over-predicted N- or C-terminal extensions are rarely conserved at the amino acid level, sequence comparison provides a very powerful means of identifying this type of problem. A simple procedure consists of multiply aligning the most conserved members of a protein family and then measuring the T-Coffee CORE index on the resulting alignment. Inspection of the CORE patterns offers a diagnosis regarding the correctness of the data. This approach can also be applied to frame-shift detection, where the identification of abnormally low-scoring segments may lead to their correction. Such an alignment will make it possible to decide whether the abnormal length of a coding region could be due to a sequencing error (and the resulting frame-shift). At the very least, the CORE measure will indicate that a thorough examination is needed. Of course, one could also detect these frame-shifts using standard pairwise comparison methods such as GeneWise (Birney and Durbin, 2000), but the advantage of using a multiple sequence alignment is that the simultaneous comparison of several sequences can strengthen the evidence that the frame-shift is real. Furthermore, thanks to the multiple alignment, one may be able to detect mistakes in sequences that lack a very close homologue. To illustrate this potential usage of T-Coffee, we chose the example of an Escherichia coli K-12 gene (accession # U00096) predicted to encode a protein of unknown function, yifB. Orthologous genes were found in complete genomes using BLAST (Altschul et al., 1997) and the four most conserved sequences (identity >70% relative to the Escherichia coli K-12
gene; see figure for ID numbers) were retrieved along with their flanking regions (80 nucleotides on the N-terminal side), in order to check whether these supposedly non-coding regions contained any coding information. The 'elongated' sequences were translated in the same frame as their core coding region, their multiple alignment was carried out using T-Coffee, and the CORE indexes were measured. The resulting alignment is displayed in Figure 6, with the CORE indexes color-coded (low CORE in blue and green, high CORE in orange and red). The main feature on the N-terminal side is an abrupt transition (II) from low to high CORE indexes. This position is also a conserved methionine. The combination of these two observations suggests that the starting point of these five sequences is probably where the transition occurs, ruling out other methionines as potential starting points in the first sequence (I). Another discrepancy in this alignment is also emphasized by the CORE analysis: the sequence yifB_SALTY_1 yields a very low N-terminal CORE index relative to the other family members. The CORE score of this sequence abruptly comes back into phase with the other sequences at the position marked III. This pattern is a clear indication of a frame-shift: a protein highly similar to the other members of its family but locally unrelated to them. To verify that hypothesis, we used data provided by SwissProt (Bairoch and Boeckmann, 1992) and found that in the corresponding entry, the nucleotide sequence had been corrected to remove the frame-shift we observed (entry P57015). The corrected sequence has been added at the bottom of the alignment in Figure 6 (non-colored sequence). The position where yifB_SALTY_1 and its corrected version start agreeing is also the position where the CORE score changes abruptly from a value of 2 (yellow) to a value of 7 (orange). That position also turns out to be the one where the frame-shift occurs in the genomic sequence.
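The search for abnormally low-scoring segments described above can be sketched as a sliding-window scan over the per-residue CORE indexes of one sequence. This is illustrative only; the window length and the cutoff are arbitrary assumptions:

```python
def low_scoring_segments(core_values, window=10, cutoff=3.0):
    """Return the start positions of all windows whose mean CORE index
    falls below the cutoff. In the frame-shift scenario above, a long
    run of flagged windows inside an otherwise well-conserved sequence
    is what prompts a closer look at the underlying nucleotides.
    """
    flagged = []
    for i in range(len(core_values) - window + 1):
        segment = core_values[i:i + window]
        if sum(segment) / window < cutoff:
            flagged.append(i)
    return flagged
```

On a toy profile with a stretch of low CORE values flanked by high ones (the yifB_SALTY_1 pattern), only the windows dominated by the low stretch are flagged.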
Conclusion

In this chapter, we introduced an extension of the T-Coffee multiple sequence alignment method: the CORE index. The CORE index is a means of assessing the local reliability of a multiple sequence alignment. Using the CORE index, correct blocks within a multiple sequence alignment can be identified. This measure also makes it possible to detect potential errors in genomic data, and to correct them. The CORE index is a relatively ad hoc measure and, even if it may prove extremely useful from a practical point of view, it still needs to be attached to a more theoretical framework. One would really need to be able to turn the consistency estimation into some sort of P-value. For instance, to assess efficiently the local value of an alignment, one would like to ask questions of the following kind: what is the probability that library X was generated using dataset Y? What is the probability that alignment A yields p% consistency with library X? Altogether these questions may open more avenues for the automatic processing of multiple alignments. That issue may prove crucial for the maintenance of resources that rely on a large-scale usage of multiple sequence alignments, such as HOBACGEN (Perriere et al., 2000), HOVERGEN (Duret et al., 1994) or ProDom (Corpet et al., 2000).

Figure Legends

Figure 1: Layout of the T-Coffee algorithm. This figure indicates the chain of events that leads from unaligned sequences to a multiple sequence alignment using the T-Coffee algorithm. Data processing steps are boxed while data structures are indicated by rounded boxes.

Figure 2: Library Format. An example of a library used by T-Coffee. The header contains the sequences and their names. '# 1 2' indicates that the following pairs of residues come from sequences 1 and 2. Each pair of aligned residues is described by three values: the index of residue 1, the index of residue 2 and the weight associated with the alignment of these two residues.
No order or consistency is expected within the library.

Figure 3: a) Percentage identity vs accuracy in the twilight zone: the 5088 pairs of sequences that have less than 30% identity in the BaliBase reference alignments were extracted. The accuracy of their alignment was measured by comparison with the reference, and the resulting graph was plotted. b) Approximate Consistency Score (aCS) vs accuracy in the twilight zone: the aCS was measured on the 5088 pairs of sequences previously considered and was plotted against the average accuracy previously reported. The vertical line indicates aCS=25 and separates the pairs for which the aCS is informative from those whose aCS seems to be non-informative.

Figure 4: Specificity and sensitivity of the CORE measure. The sensitivity and the specificity of the CORE index used as an alignment quality predictor were evaluated on the entire BaliBase dataset. They were measured on the T-Coffee alignments after considering that every residue with a CORE index higher than x was properly aligned (see text for details).

Figure 5: Identifying correct blocks with the CORE measure. An example of the T-Coffee output on a BaliBase test set (1pamA_ref1) that contains five alpha-amylases. This alignment was produced using T-Coffee 1.29 with default parameters, requesting the score_pdf output option. The color scale goes from blue (CORE=0, bad) to red (CORE=9, good). The residues in upper case are correctly aligned (as judged by comparison with the BaliBase reference); those in lower case are improperly aligned. Box I indicates a conserved position that is not properly aligned; box II indicates a block of distantly related segments that is correctly aligned by T-Coffee.
Figure 6: Identifying frame-shifts and start codons. The chosen sequences are YifB_ECOLIA (Escherichia coli, accession # AE005174), YifB_SALTY_1 (Salmonella typhi, C18 chromosome, Sanger Center), YifB_HAIN (Haemophilus influenzae, accession # L42023), YifB_PASMU (Pasteurella multocida, accession # AE004439) and YifB_PSEAE (Pseudomonas aeruginosa, accession # AE004091). They were aligned using the standard T-Coffee alignment procedure, requesting the score_pdf output option. The corrected sequence of the Salmonella typhi YifB protein was later added for further reference (YifB_SALTY, SP: P57015) but was used neither for coloring the residues (non-colored sequence) nor for improving the multiple alignment. The figure only shows the N-terminal portion of the alignment, and the arrows indicate the positions annotated as start codons in SwissProt (except for Salmonella typhi). Box I indicates a putative start codon in YifB_ECOLIA, Box II indicates the true start codon in most sequences, and Box III indicates the position where the frame-shift occurs in YifB_SALTY_1.

Table 1

              cat 1   cat 2   cat 3   cat 4   cat 5   avg 1   avg 2
    ---------------------------------------------------------------
    cw        79.53   32.91   48.72   74.02   67.84   67.89   61.82
    prrp      78.62   32.45   50.14   51.12   82.72   66.45   60.25
    dialign2  70.99   25.21   35.12   74.66   80.38   61.54   57.99
    T-Coffee

To produce this table, each dataset contained in BaliBase was aligned using one of the following methods: cw (ClustalW 1.81; Thompson et al., 1994), Prrp (Gotoh, 1996), dialign2 (Morgenstern et al., 1998) and T-Coffee 1.29 (Notredame et al., 2000). In BaliBase, reference alignments are classified in five categories: category 1 contains closely related sequences, category 2 contains a group of closely related sequences and an outlier, category 3 contains two groups of sequences that are distantly related to each other, category 4 contains families with long internal indels, and category 5 contains sequences with long terminal indels.
The resulting alignments were then compared to their reference counterparts in BaliBase, the comparison being restricted to the regions annotated as trustworthy in the reference alignment. Under this scheme, we define the accuracy of an alignment as the number of columns found totally conserved in the reference divided by the total number of columns within that reference, expressed as a percentage. avg 1 is the average of the results obtained on each of the 142 test cases; avg 2 is the average of the results obtained in each category. Bold numbers indicate the best performing method.

Bibliography

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Bairoch, A. and Boeckmann, B., 1992. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 20: 2019-2022.
Birney, E. and Durbin, R., 2000. Using GeneWise in the Drosophila annotation experiment. Genome Res. 10: 547-548.
Corpet, F., Servant, F., Gouzy, J. and Kahn, D., 2000. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28: 267-269.
Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C., 1979. A model of evolutionary change in proteins. Detecting distant relationships: computer methods and results. In: M.O. Dayhoff (Editor), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, D.C., pp. 353-358.
Duret, L. and Abdeddaim, S., 2000. Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. In: D. Higgins and W. Taylor (Editors), Bioinformatics: Sequence, Structure and Databanks. Oxford University Press, Oxford.
Duret, L., Mouchiroud, D. and Gouy, M., 1994.
HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res. 22: 2360-2365.
Feng, D.-F. and Doolittle, R.F., 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25: 351-360.
Gotoh, O., 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264: 823-838.
Gribskov, M., McLachlan, A.D. and Eisenberg, D., 1987. Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 84: 4355-4358.
Henikoff, S. and Henikoff, J.G., 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.
Heringa, J., 1999. Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Computers and Chemistry 23: 341-364.
Huang, X. and Miller, W., 1991. A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. 12: 337-357.
Jones, D.T., 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195-202.
Krogh, A., Brown, M., Mian, I.S., Sjölander, K. and Haussler, D., 1994. Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235: 1501-1531.
Morgenstern, B., Frech, K., Dress, A. and Werner, T., 1998. DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14: 290-294.
Needleman, S.B. and Wunsch, C.D., 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-453.
Notredame, C., Higgins, D.G. and Heringa, J., 2000. T-Coffee: a novel algorithm for multiple sequence alignment. J. Mol. Biol. 302: 205-217.
Notredame, C., Holm, L. and Higgins, D.G., 1998. COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
Perriere, G., Duret, L.
and Gouy, M., 2000. HOBACGEN: database system for comparative genomics in bacteria. Genome Research 10: 379-385.
Saitou, N. and Nei, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406-425.
Sander, C. and Schneider, R., 1991. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics 9: 56-68.
Sneath, P.H.A. and Sokal, R.R., 1973. Numerical Taxonomy. W.H. Freeman, San Francisco.
Taylor, W.R., 1988. A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28: 161-169.
Thompson, J., Higgins, D. and Gibson, T., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4690.
Thompson, J.D., Plewniak, F. and Poch, O., 1999. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27: 2682-2690.

A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary v.3
(for Nucleic Acids Research Database Issue)

Sabine Dietmann1, Jong Park1, Cedric Notredame2, Andreas Heger1, Michael Lappe1 and Liisa Holm1
1 Structural Genomics Group, EMBL-EBI, CB10 1SD Cambridge, UK
2 Structural and Genetic Information, C.N.R.S. UMR 1889, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France

Abstract

The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank. The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities.
Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to (1) supersecondary structural motifs (attractors in fold space), (2) the topology of globular domains (fold types), (3) remote homologues (functional families), and (4) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new. In September 2000, the Dali classification contained 10531 PDB entries comprising 17101 chains, which were partitioned into 5 attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures severalfold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores, and a comprehensive library of explicit multiple alignments of distantly related protein families.

Introduction

Improved methods of protein engineering, crystallography and NMR spectroscopy have led to a surge of new protein structures deposited in the Protein Data Bank (PDB), and a number of derived databases that organize these data into hierarchical classification schemes or in terms of structural neighbourhoods have appeared on the World Wide Web [1-4]. We maintain the Dali Domain Dictionary and FSSP database with continuous weekly updates. Because many structural similarities are between substructures (domains), i.e. parts of structures, protein chains are decomposed into domains using the criteria of recurrence and compactness [5]. Each domain is assigned a Domain Classification number D.C.l.m.n.p representing fold space attractor region (l), globular folding topology (m), functional family (n) and sequence family (p).
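As a small illustration of this numbering scheme, a D.C. number can be split into its four hierarchical levels. This is a hypothetical parser, not part of the Dali software, and the field names are taken from the description above:

```python
def parse_dc_number(dc):
    """Split a Domain Classification number 'D.C.l.m.n.p' into its four
    levels: attractor region (l), fold type (m), functional family (n)
    and sequence family (p). Values are kept as strings, since attractor
    labels need not be purely numeric.
    """
    parts = dc.split('.')
    if parts[:2] != ['D', 'C'] or len(parts) != 6:
        raise ValueError('expected a number of the form D.C.l.m.n.p')
    keys = ('attractor_region', 'fold_type',
            'functional_family', 'sequence_family')
    return dict(zip(keys, parts[2:]))
```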
The discrete classification presents views which are free of redundancy and simplify navigation in protein space. The structural classification is explicitly linked to sequence families with associated functional annotation, resulting in a rich network of biologically interesting relationships that can be browsed online. In particular, structure-based alignments increase our understanding of the more distant evolutionary relationships (Figure 1). A map of fold space The central concept underlying the classification is a ’map of fold space’. This map is based on exhaustive neighbouring of all protein structures in the PDB. The all-against-all structure comparison is carried out using the Dali program. As a result of the exhaustive comparisons, each structure in the PDB is positioned in an abstract, high-dimensional space according to its structural similarity score to all other structures. The graph of structural similarities (between domains) is partitioned into clusters at four different levels of granularity. Coarse-grained overviews yield few clusters with many members that share broad architectural similarities, while fine-grained clustering yields many clusters within which structural similarities between members can extend to atomic detail due to functional constraints, for example, in binding sites. Continuing the practice from the FSSP database, fold types are defined by agglomerative clustering so that the members of a fold type have average pairwise Z-scores above 2. The threshold has been chosen empirically to group together structures with topological similarity. Dali Domain Dictionary v.3 introduces two new levels to the fold classification, one above and one below the fold type abstraction. The top level of the fold classification corresponds to secondary structure composition and supersecondary structural motifs. We have previously identified five attractor regions in fold space [1]. 
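A minimal sketch of this average-linkage scheme on a toy Z-score matrix follows. This is illustrative only, not the Dali implementation, which operates on the full all-against-all structure comparison:

```python
def cluster_folds(z, threshold=2.0):
    """Agglomerative average-linkage clustering of domains: repeatedly
    merge the pair of clusters with the highest average pairwise
    Z-score, stopping when no pair exceeds the threshold (Z > 2 groups
    structures with topological similarity, as in the text).

    z: symmetric dict-of-dicts of pairwise Z-scores between domains.
    """
    clusters = [[d] for d in z]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                avg = sum(z[a][b] for a in clusters[i]
                          for b in clusters[j]) / (
                              len(clusters[i]) * len(clusters[j]))
                if avg > best:
                    best, pair = avg, (i, j)
        if pair is None:
            return [sorted(c) for c in clusters]
        i, j = pair
        clusters[i] += clusters.pop(j)  # merge the best-scoring pair
```

Two domains with a strong mutual Z-score end up in one fold type, while a weakly related third domain remains a singleton.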
We partition fold space so that each domain is assigned to one of attractors I-V, which are represented by archetype structures, using a shortest-path criterion. Structures which are disconnected from other structures are assigned to class X. Domains which are not clearly closer to one attractor than another are assigned to the mixed class Y. Currently, class Y comprises about one sixth of the representative domain set. In the future, some of these may be assigned to emerging new attractors.

An evolutionary classification

The other new level of the classification infers plausible evolutionary relationships from strong structural similarities which are accompanied by functional or sequence similarities. Conceptually, this functional family level is equivalent to the 'superfamily' level of SCOP [2]. The computational discrimination between physically convergent (analogous) and evolutionarily related, divergent (homologous) proteins has received much attention recently [6-8]. Structural similarity alone is insufficient to draw a line between the two classes. For example, lysozymes exhibit extreme structural divergence in regions supporting the active site, while coiled coils and beta-barrels are simple, geometrically constrained topologies which are believed to have emerged several times in protein evolution. To address the evolutionary classification problem, we have chosen to analyse functional and sequence-motif attributes on top of structural similarity in a numerical taxonomy. The more functional features two proteins have in common, the more likely it is that they share them through common descent rather than by chance. Currently, our feature set includes common sequence neighbours (overlap of PSI-Blast families), analysis of 3D clusters of identically conserved residues, enzyme classification (E.C. numbers) and keyword analysis of biological function. A neural network assigns weights to these qualitatively different features.
The neural network was trained against the superfamily-to-fold transition in a manual fold classification [2]. To unify families, we exploit the empirical observation that Dali's intramolecular distance comparison measure gives higher scores to pairs of homologues than to analogues. In practice, we require that functional families are nested within fold families in the fold dendrogram: functional families are branches of the fold dendrogram where all pairs have a high average neural network prediction for being homologous. The threshold for unification was chosen empirically and is conservative. 504 functional families unify two or more sequence families. Unified families have functional residues or sequence motifs that map to common sites in the 3D context of a fold. The strongest evidence is usually obtained for unifying enzyme catalytic domains. In some cases the expert system fails to capture enough evidence for the unification of domains which are believed to be homologous, such as within the varied set of helix-turn-helix motif-containing DNA binding domains, where several functional families are defined at the same fold type level.

A library of structure-based multiple alignments of remote homologues

The Dali Domain Classification can be browsed interactively at http://www.ebi.ac.uk/dali/domain. The server is implemented on top of a MySQL database. The classification may be entered from the top of the hierarchy, or the user may make a query about a protein identifier or a node in the classification hierarchy. Multiple structural alignments including attributes of the proteins are generated on the fly for any user selection of structural neighbours. Precomputed alignments are available for each functional family. The T-Coffee program [9] is used to generate genuine consensus alignments of multiple structures from the library of pairwise Dali alignments.
A reliability score is computed to indicate well-defined regions (the structural core) and regions where structural equivalences are ambiguous. Technically, T-Coffee improves alignment quality in a few known cases of functional families where active site residues were inconsistently aligned in some of the pairwise Dali comparisons. Scientifically, the definition of functional families, together with a reliable multiple structure alignment for each of them, opens the door to sensitive sequence database searches using position-specific profiles, and to benchmarking the alignment accuracy of threading predictions.

Acknowledgement

S.D. and J.P. were supported by EU contract BIO4-CT96-0166.

References

1 Holm, L. and Sander, C. (1996) Mapping the protein universe. Science, 273, 595-603.
2 Hubbard, T.J., Ailey, B., Brenner, S.E., Murzin, A.G. and Chothia, C. (1999) SCOP: a Structural Classification of Proteins database. Nucleic Acids Res., 27, 254-256.
3 Orengo, C.A., Pearl, F.M., Bray, J.E., Todd, A.E., Martin, A.C., Lo Conte, L. and Thornton, J.M. (1999) The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res., 27, 275-279.
4 Marchler-Bauer, A., Addess, K.J., Chappey, C., Geer, L., Madej, T., Matsuo, Y., Wang, Y. and Bryant, S.H. (1999) MMDB: Entrez's 3D structure database. Nucleic Acids Res., 27, 240-243.
5 Holm, L. and Sander, C. (1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88-96.
6 Russell, R.B., Saqi, M.A., Bates, P.A., Sayle, R.A. and Sternberg, M.J. (1998) Recognition of analogous and homologous protein folds: assessment of prediction success and associated alignment accuracy using empirical substitution matrices. Protein Eng., 11, 1-9.
7 Kawabata, T. and Nishikawa, K. (2000) Protein structure comparison using the Markov transition model of evolution. Proteins, 41, 108-122.
8 Wood, T.C. and Pearson, W.R. (1999) Evolution of protein sequences and structures. J. Mol. Biol., 291, 977-995.
9 Notredame, C., Higgins, D.G. and Heringa, J.
(2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.
10 Bewley, M.C., Jeffrey, P.D., Patchett, M.L., Kanyo, Z.F. and Baker, E.N. (1999) Crystal structures of Bacillus caldovelox arginase in complex with substrate and inhibitors reveal new insights into activation, inhibition and catalysis in the arginase superfamily. Structure, 7, 435-438.
11 Finnin, M.S., Donigian, J.R., Cohen, A., Richon, V.M., Rifkind, R.A., Marks, P.A., Breslow, R. and Pavletich, N.P. (1999) Structure of a histone deacetylase homologue bound to the TSA and SAHA inhibitors. Nature, 401, 188-193.
12 Kraulis, P. (1991) MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr., 24, 946-950.

Figure 1: Unification of the histone deacetylase and arginase families. Reuse and adaptation of existing structural frameworks for new cellular functions is widespread in protein evolution. Histone deacetylase and arginase are unified at the functional family level of the classification despite very little overall sequence similarity. The supporting evidence comes from structural and functional similarity. (A) Structure comparison of arginase (left: 1rlaA [10]) and histone deacetylase (right: 1c3pA [11]) yields a high Z-score of 12. Superimposition by Dali, drawing by Molscript [12]. (B) Joint structural, evolutionary and functional information for two segments around the active site. Structurally aligned positions are shaded. Arginase has a binuclear metal centre where residues D124, H126 and D234 bind one manganese ion and residues H101, H128 and H232 the other. The former site is structurally equivalent to the zinc binding site of histone deacetylase, made up of residues D168, H170 and D258. Sequence variability from multiply-aligned sequence neighbours in HSSP (* means values of 10 or larger; 0 means invariant) is shown above, and the secondary structure summary from DSSP (E, B: beta-sheet; S: bend; T, G: hydrogen-bonded turns) is shown below the amino acid sequences.
Using Genetic Algorithms for Pairwise and Multiple Sequence Alignments

Cédric Notredame
Information Génétique et Structurale, CNRS-UMR 1889, 31 Chemin Joseph Aiguier, 13006 Marseille
Email: cedric.notredame@igs.cnrs-mrs.fr

1 Introduction

1.1 Importance of Multiple Sequence Alignment

The simultaneous alignment of many nucleic acid or amino acid sequences is one of the most commonly used techniques in sequence analysis. Given a set of homologous sequences, multiple alignments are used to help predict the secondary or tertiary structure of new sequences [51]; to help demonstrate homology between new sequences and existing families; to help find diagnostic patterns for families [6]; to suggest primers for the polymerase chain reaction (PCR); and as an essential prelude to phylogenetic reconstruction [19]. These alignments may be turned into profiles [25] or hidden Markov models (HMMs) [27, 9] that can be used to scour databases for distantly related members of the family.

Multiple alignment techniques can be divided into two categories: global and local. When making a global alignment, the algorithm attempts to align the sequences chosen by the user over their entire length. Local alignment algorithms automatically discard portions of sequences that do not share any homology with the rest of the set. They constitute a greater challenge, since they increase the number of decisions made by the algorithm. Most multiple alignment methods are global, leaving it to the user to decide on the portion of each sequence to be incorporated in the multiple alignment. To aid that decision, one often uses local pairwise alignment programs such as Blast [3] or Smith and Waterman [56]. In this chapter, we will focus on global alignment methods, with a special emphasis on the alignment of protein and RNA sequences.

Despite its importance, the automatic generation of an accurate multiple sequence alignment remains one of the most challenging problems in bioinformatics.
The reasons behind this complexity are easily explained. A multiple alignment is meant to reconstitute the relationships (evolutionary, structural and functional) within a set of sequences that may have been diverging for millions and sometimes billions of years. To be accurate, this reconstitution would require an in-depth knowledge of the evolutionary history and structural properties of these sequences. For obvious reasons, this information is rarely available, and generic empirical models of protein evolution [18, 28, 8], based on sequence similarity, must be used instead. Unfortunately, these can prove difficult to apply when the sequences are less than 30% identical and lie within the so-called "twilight zone" [52]. Further, accurate optimization methods that use these models can be extremely demanding of computer resources for more than a handful of sequences [12, 62]. This is why most multiple alignment methods rely on approximate heuristic algorithms. These heuristics are usually a complex combination of ad hoc procedures mixed with some elements of dynamic programming. Overall, two key properties characterize them: the optimization algorithm and the criterion (objective function) this algorithm attempts to optimize.

1.2 Standard Optimization Algorithms

Optimization algorithms roughly fall into three categories: exact, progressive and iterative. Exact algorithms attempt to deliver an optimal or a sub-optimal alignment within some well-defined bounds [40, 57]. Unfortunately, these algorithms have very serious limitations with regard to the number of sequences they can handle and the type of objective function they can optimize. Progressive alignments are by far the most widely used [30, 14, 45].
They depend on a progressive assembly of the multiple alignment [31, 20, 58], in which sequences or alignments are added one by one, so that never more than two sequences (or multiple alignments) are simultaneously aligned using dynamic programming [43]. This approach has the great advantage of speed and simplicity, combined with reasonable sensitivity, even if it is by nature a greedy heuristic that does not guarantee any level of optimization.

Iterative alignment methods depend on algorithms able to produce an alignment and to refine it through a series of cycles (iterations) until no further improvement can be made. Iterative methods can be deterministic or stochastic, depending on the strategy used to improve the alignment. The simplest iterative strategies are deterministic. They involve extracting sequences one by one from a multiple alignment and realigning them to the remaining sequences [7, 24, 29]. The procedure is terminated when no further improvement can be made (convergence). Stochastic iterative methods include HMM training [39], simulated annealing (SA) [37, 36, 33] and evolutionary computation such as genetic algorithms (GAs) [44, 47, 34, 64, 5, 23] and evolutionary programming [11, 13]. Their main advantage is to allow a clean separation between the optimization process and the evaluation criterion (objective function). It is the objective function that defines the aim of any optimization procedure and, in our case, it is also the objective function that contains the biological knowledge one tries to project into the alignment.

1.3 The Objective Function

In an evolutionary algorithm, the objective function is the criterion used to evaluate the quality (fitness) of a solution (individual). To be of any use, the value that this function associates with an alignment must reflect its biological relevance and indicate the structural or evolutionary relations that exist among the aligned sequences.
In theory, a multiple alignment is correct if, in each column, the aligned residues have the same evolutionary history or play similar roles in the three-dimensional fold of the RNA or protein. Since evolutionary or structural information is rarely at hand, it is common practice to replace it with a measure of sequence similarity. The rationale is that similar sequences can be assumed to share the same fold and the same evolutionary origin [52], as long as their level of identity is above the so-called "twilight zone" (more than 30% identity over more than 100 residues). Accurate measures of similarity are obtained using substitution matrices [18, 28]. A substitution matrix is a pre-computed table of numbers (for proteins, this matrix is 20 x 20, covering all possible substitutions between the 20 naturally occurring amino acids) where each possible substitution or conservation receives a weight indicative of its likelihood, as estimated from data analysis. In these matrices, substitutions (conservations) observed more often than one would expect by chance receive positive values, while under-represented mutations are associated with negative values. Given such a matrix, the optimal alignment is defined as the one that maximizes the sum of the substitution (conservation) scores. An extra factor is also applied to penalize insertions and deletions (the gap penalty). The most commonly used model for that purpose is known as 'affine gap penalties'. It penalizes an insertion/deletion once for its opening (gap opening penalty, abbreviated GOP) and then with a factor proportional to its length (gap extension penalty, abbreviated GEP). Since any gap can be explained by a single mutational event, the aim of this scheme is to make sure that the best-scoring evolutionary scenario involves only a small number of insertions or deletions (indels). This results in an alignment with few long gaps rather than many short ones.
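The preference of the affine model for one long gap over several short ones of the same total length can be illustrated with a short sketch. The GOP and GEP values below are arbitrary assumptions chosen for the example, not the defaults of any particular program:

```python
def gap_penalty(row, gop=-10, gep=-1):
    """Affine gap penalty of one aligned sequence: each run of gaps is
    charged GOP once at its opening, plus GEP per gap position."""
    penalty, in_gap = 0, False
    for ch in row:
        if ch == '-':
            penalty += gep + (gop if not in_gap else 0)
            in_gap = True
        else:
            in_gap = False
    return penalty

long_gap = gap_penalty("AC---GT")   # one opening: GOP + 3 * GEP = -13
scattered = gap_penalty("A-C-G-T")  # three openings: 3 * (GOP + GEP) = -33
```

Both rows contain three gap positions, yet the single long gap costs far less, so the best-scoring alignment tends to group indels together.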
The resulting score can be viewed as a measure of similarity between two sequences (pairwise). This measure can be extended to the alignment of multiple sequences in many ways. For instance, it is common practice to set the score of the multiple alignment to be the sum of the scores of every pairwise alignment it contains (sums of pairs) [1]. While that scoring scheme is the most widely used, its main drawback stems from the lack of an underlying evolutionary scenario. It assumes that every sequence is independent, and this results in an overestimation of the number of substitutions. It is to counterbalance that effect that probability-based schemes were introduced in the context of HMMs. Their purpose is to associate each column of an alignment with a generation probability [39]. Estimations are carried out in a Bayesian context where the probability of the model (the alignment) is evaluated simultaneously with the probability of the data (the aligned sequences). In the end, the score of the complete alignment is the probability that the aligned sequences were generated by the trained HMM. The major drawbacks of this model are its strong dependency on the number of sequences being aligned (i.e. many sequences are needed to generate an accurate model) and the difficulty of the training. More recently, new methods based on consistency were described for the evaluation of multiple sequence alignments. Under these schemes, the score of a multiple alignment is a measure of its consistency with a list of pre-defined constraints [42, 46, 10, 45]. It is common practice for these pre-defined constraints to be sets of pairwise, multiple or local alignments. Quite naturally, the main limitation of consistency-based schemes is that they make the quality of the alignment greatly dependent on the quality of the constraints against which it is evaluated.
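A consistency-based evaluation can be sketched as the fraction of aligned residue pairs in the multiple alignment that also appear in a library of pre-defined pairwise constraints. This is a simplified, hypothetical scheme in the spirit of consistency-based functions, not the exact formulation of any published method:

```python
def residue_pairs(msa):
    """Yield (i, ri, j, rj) for every pair of sequences i, j whose
    residues ri, rj are placed in the same column of the alignment."""
    nseq, ncol = len(msa), len(msa[0])
    idx = [[None] * ncol for _ in range(nseq)]  # residue index at each column
    for s in range(nseq):
        r = 0
        for c in range(ncol):
            if msa[s][c] != '-':
                idx[s][c] = r
                r += 1
    for c in range(ncol):
        for i in range(nseq):
            for j in range(i + 1, nseq):
                if idx[i][c] is not None and idx[j][c] is not None:
                    yield (i, idx[i][c], j, idx[j][c])

def consistency_score(msa, library):
    """Fraction of aligned residue pairs supported by the constraint library."""
    pairs = list(residue_pairs(msa))
    if not pairs:
        return 0.0
    return sum(p in library for p in pairs) / len(pairs)
```

A perfect score of 1.0 means every aligned pair of residues is supported by at least one constraint; the alignment quality therefore depends entirely on the constraint list, as noted above.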
An objective function always defines a mathematical optimum, that is to say an alignment in which the sequences are arranged in such a manner that they yield a score that cannot be improved. The mathematically optimal alignment should never be confused with the correct alignment, the biological optimum. While the biological optimum is by definition correct, a mathematically optimal alignment is biologically only as good as it is similar to the biological optimum. This depends entirely on the quality of the objective function used to generate it. There is no limit to the complexity of the objective functions one may design, even if in practice the lack of appropriate optimization engines constitutes a major limitation. What is the use of an objective function if one cannot optimize it, and how is it possible to tell whether an objective function is biologically relevant or not? Evolutionary algorithms come in very handy to answer these questions. They make it possible to design new scoring schemes without having to worry, at least in the first stage, about optimization issues. In the next section, we introduce one of these evolutionary techniques, known as genetic algorithms (GAs). GAs are described along with another closely related stochastic optimization algorithm: simulated annealing. Three examples are reviewed in detail, in which GAs were successfully applied to sequence alignment problems.

2 Evolutionary Algorithms and Simulated Annealing

An evolutionary algorithm is a way of finding a solution to a problem by forcing sub-optimal solutions to evolve through perturbations (mutations and recombination). Most evolutionary algorithms are stochastic in the sense that the solution space is explored in a random rather than ordered manner.
In this context, randomness provides a non-null probability of sampling any potential solution, regardless of the size of the solution space, provided that the mutations allow such an exploration. The drawback of randomness is that all potential solutions may not be visited during the search (including the global optimum). To correct for this problem, a large number of heuristics have been designed that attempt to bias the way in which the solution space is sampled. They aim at improving the chances of sampling an optimal solution. For that reason, most stochastic strategies (including evolutionary computation) can be regarded as a trade-off between greediness and randomness. Two stochastic strategies have been widely used for sequence analysis: simulated annealing and genetic algorithms. Strictly speaking, SA does not belong to the field of evolutionary computation; yet, in practice, it has been one of the major sources of inspiration for the elaboration of the genetic algorithms used in sequence analysis.

2.1 Simulated Annealing

Simulated annealing (SA) [38] was the first stochastic algorithm used to attempt to solve the multiple sequence alignment problem [33, 37]. SA relies on an analogy with physics: the idea is to compare the solving of an optimization problem to a crystallization process (the cooling of a metal). In practice, given a set of sequences, a first alignment is randomly generated. A perturbation is then applied (shifting of an existing gap or introduction of a new one), and the resulting alignment is evaluated with the objective function. If the new alignment is better than the previous one, it replaces it; otherwise, it replaces it only with a probability that depends on the score difference and on the current temperature. The higher the temperature, the more likely a large score deterioration will be accepted. Every cycle, the temperature decreases slightly until it reaches 0.
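The acceptance rule just described can be written generically. The sketch below uses an illustrative geometric cooling schedule and a toy integer search problem standing in for an alignment; the parameter values are assumptions for the example, not those of any published SA aligner:

```python
import math
import random

def simulated_annealing(initial, score, perturb,
                        t0=10.0, cooling=0.99, steps=2000, seed=0):
    """Generic SA loop: always accept improvements, and accept a worse
    candidate with probability exp(delta / T), where delta < 0."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    t = t0
    for _ in range(steps):
        candidate = perturb(current, rng)
        delta = score(candidate) - current_score
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, current_score + delta
        t *= cooling  # temperature slowly decreases toward 0
    return current, current_score

# Toy objective with its maximum at x = 3; a random +/-1 shift plays
# the role of the gap-shifting perturbation.
best, best_score = simulated_annealing(
    initial=20,
    score=lambda x: -(x - 3) ** 2,
    perturb=lambda x, rng: x + rng.choice([-1, 1]))
```

Early on, the high temperature lets the search escape poor regions; as the temperature drops, only improvements are retained, which is the behaviour exploited when SA is used as an alignment improver.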
From the perspective of an evolutionary algorithm, SA can be regarded as a population with only one individual. Perturbations are similar to the mutations used in evolutionary algorithms. Apart from the population size of one, the main difference between SA and any true evolutionary algorithm is the extrinsic annealing schedule. While the principle is very sound, and its suitability for multiple alignment optimization and objective function evaluation is obvious, SA suffers from a very serious drawback: it is extremely slow. Most of the studies conducted on simulated annealing and multiple alignments concluded that although it does reasonably well, SA is too slow to be used for ab initio multiple alignment and must be restricted to use as an alignment improver (i.e. the improvement of a seed alignment). This serious limitation makes it much harder to use as the black box one needs to evaluate the design of new objective functions.

2.2 Genetic Algorithms

It is in an attempt to overcome the limits of SA that evolutionary algorithms were adapted to the multiple sequence alignment problem. Evolutionary algorithms are parallel stochastic search tools. Unlike SA, which maintains a single line of descent from parent to offspring, evolutionary algorithms maintain a population of trials for a given objective function. Evolutionary algorithms are among the most interesting stochastic optimization tools available today. One of the reasons why these algorithms have received so little attention in the context of multiple sequence alignment is probably that the implementation of an evolutionary algorithm dedicated to multiple alignment is much less straightforward than with simulated annealing. In other areas of computational biology, evolutionary algorithms have already been established as powerful tools. These include RNA [26, 55, 48] and protein structure analysis [53, 60, 41].
Among all the existing evolutionary algorithms (genetic algorithms, genetic programming, evolution strategies and evolutionary programming), genetic algorithms have been by far the most popular in the field of computational biology. Although one could argue about who exactly invented GAs, the algorithms we use today were formally introduced by Holland in 1975 [32] and later refined by Goldberg to give the Simple Genetic Algorithm [22]. GAs are based on a loose analogy with the phenomenon of natural selection. Their principle is relatively straightforward. Given a problem, potential solutions (individuals within a population) compete with one another (selection) for survival. These solutions can also evolve: they can be modified (mutations) or combined with one another (crossovers). The idea is that, acting together, variation and selection will lead to an overall improvement of the population via evolution. Most of the concepts developed here about GAs are taken from [22, 16].

Two ingredients are essential to the GA strategy: the selection method and the operators. Selection is established in order to lead the search toward improvement: the best individuals (as judged using the objective function) must be the most likely to survive. To serve the GA's purpose, this selection strategy cannot be too strict. It must allow some variety to be maintained throughout the search, in order to prevent the GA population from converging toward the first local minimum it encounters. Evolution toward the optimal solution also requires operators that modify existing solutions and create diversity (mutations), or exploit the existing diversity (crossovers) by combining existing motifs into an optimal solution. Given such a crude layout, the potential for variation is infinite, and the study of new GA models is a very active field in its own right.
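The selection/mutation/crossover loop can be made concrete on a toy problem. The sketch below is a generic simple GA in the spirit of [22], maximizing the number of 1-bits in a bit string ("one-max"); the population size, rates and fitness function are illustrative assumptions, not SAGA's actual settings:

```python
import random

def simple_ga(fitness, length=20, pop_size=30, generations=60, pmut=0.05, seed=0):
    """Minimal simple-GA sketch: fitness-proportional (roulette-wheel)
    selection, one-point crossover and point mutation on bit strings."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        total = sum(scores) or 1
        def pick():  # roulette-wheel selection
            r = rng.uniform(0, total)
            acc = 0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return ind
            return pop[-1]
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, length)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # point mutation: flip each bit with probability pmut
            child = [b ^ 1 if rng.random() < pmut else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = simple_ga(sum)  # "one-max": fitness is the number of 1-bits
```

Even this crude version illustrates the two essential ingredients discussed above: selection pressure (fitter parents are drawn more often) and operators that both exploit (crossover) and create (mutation) diversity.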
This being said, the main difficulty to overcome when adapting a GA to a problem like multiple sequence alignment is not the choice of a proper model, but rather the conception of a well-suited series of operators. This is a well-known problem that has also received some attention in the field of structure prediction, both for proteins [50] and RNA [54]. A simple justification is that the operators (and the problem representation) largely control the manner in which a solution landscape is explored. For instance, the neighborhood of a solution is mostly defined by the exploration capabilities of the operators. Well chosen, they can smooth very rugged landscapes. On the other hand, if they are too 'smart' and too greedy, they may prevent a proper exploration from being carried out. Finding the right trade-off can prove a rather complex task. When applying GAs to sequence alignments, previous work on SA proved instrumental: it provided researchers with a set of well-tested operators perfectly suitable for integration within most evolutionary algorithms.

Attempts to apply evolutionary algorithms to the multiple sequence alignment problem started in 1993, when Ishikawa et al. published a hybrid GA [34] that does not try to optimize the alignment directly, but rather the order in which the sequences should be aligned using dynamic programming. Of course, this limits the algorithm to objective functions that can be used with dynamic programming. Even so, the results obtained this way were convincing enough to prompt further development of GAs in sequence analysis. The first GA able to deal with sequences in a more general manner was described a few years later by Notredame and Higgins [44], shortly before a similar work by Zhang [64]. In these two GAs, the population is made of complete multiple sequence alignments, and the operators have direct access to the aligned sequences: they insert and shift gaps in a random or semi-random manner.
In 1997, SAGA was applied to RNA analysis [47] and parallelized for that purpose using an island model. This work was later reproduced by Anbarasu et al. [5], who carried out an extensive evaluation of this model, using ClustalW as a reference. Over the following years, at least three new multiple sequence alignment strategies based on evolutionary algorithms have been introduced [23, 13, 11]. Each of these relies on a principle similar to SAGA's: a population of multiple alignments evolves by selection, combination and mutation. The population is made of alignments, and the mutations are string-processing programs that shuffle the gaps using complex models. The main difference between SAGA and these recent algorithms is the design of better mutation operators that improve the efficiency and the accuracy of the algorithms. These new results have strengthened the idea that the essence of the adaptation of GAs to multiple sequence alignments is the design of proper operators, reflecting as well as possible the true mechanisms of molecular evolution. In order to expose each of the many ingredients that constitute a GA specialized in sequence alignments, the example of SAGA will now be reviewed in detail.

3 SAGA: a GA Dedicated to Sequence Alignment

3.1 The Algorithm

SAGA is a genetic algorithm dedicated to multiple sequence alignment [44]. It follows the general principles of the simple genetic algorithm (sGA) described by Goldberg [22] and later refined by Davis [17]. In SAGA, each individual is a multiple alignment. The data structure chosen for the internal representation of an individual is a straightforward two-dimensional array where each line represents an aligned sequence and each cell is either a residue or a gap. The population has a constant size and does not contain any duplicates (i.e. identical individuals). The pseudo-code of the algorithm is given in Figure 1. Each of these steps is developed over the next sections.
Initialization

The challenge of the initialization (also known as seeding) is to generate a population as diverse as possible in terms of 'genotype' and as uniform as possible in terms of score. In SAGA, generation 0 consists of 100 randomly generated multiple alignments that contain only terminal gaps. These initial alignments are less than twice the length of the longest sequence of the set (longer alignments can be generated later). To create one of these individuals, a random offset is chosen for each sequence (between 0 and the length of the longest sequence); each sequence is shifted to the right according to its offset, and empty spaces are padded with null signs in order to give the same length L to all the sequences. Seeding can also be carried out by generating sub-optimal alignments using an implementation of dynamic programming that incorporates some randomness. This is the case in RAGA [47], an implementation of SAGA specialized in RNA alignment.

Evaluation

Fitness is measured by scoring each alignment according to the chosen objective function. The better the alignment, the better its score and the higher its fitness (scores are inverted if the OF is meant to be minimized). To minimize sampling errors, raw scores are turned into a normalized value known as the expected offspring (EO). The EO indicates how many children an alignment is likely to have. In SAGA, EOs are stochastically derived using a predefined recipe, 'remainder stochastic sampling without replacement' [22]. This gives values that are typically between 0 and 2. Only the weakest half of the population is replaced with the new offspring, while the other half is carried over unchanged to the next generation. This practice is known as overlapping generations [16].

Breeding

It is during breeding that new individuals (children) are generated. The EO is used as a probability for each individual to be chosen as a parent.
This selection is carried out by weighted-wheel selection without replacement [22], and an individual's EO is decreased by one unit each time it is chosen to be a parent. An operator is also chosen and applied to the parent(s) to create the newborn child. Twenty-two operators are available in SAGA. They all have their own usage probability and can be divided into two categories: mutations, which require one parent, and crossovers, which require two parents. Since no duplicates are allowed in the population, a newborn child is only accepted if it differs from all the other members of the generation already created. When a duplicate arises, the whole series of operations that led to its creation is canceled. Breeding is over when the new generation is complete, and SAGA proceeds toward producing the next generation unless the finishing criterion is met.

Termination

Conditions that could guarantee optimality are not met in SAGA, and there is no valid proof that it may reach a global optimum, even in an infinite amount of time (as opposed to SA). For that reason, an empirical criterion is used for termination: the algorithm terminates when the search has been unable to improve for more than 100 generations. That type of stabilization is one of the most commonly used conditions for stopping a GA when working on a population with no duplicates (i.e. a population where all the individuals are different from one another) [17].

3.2 Designing the Operators

As mentioned earlier, the design of an adequate set of operators has been the main point of focus in the work that led to SAGA. According to the traditional nomenclature of genetic algorithms [22], two types of operators coexist in SAGA: crossovers and mutations. An operator is designed as an independent program that inputs one or two alignments (the parents) and outputs one alignment (the child). Each operator requires one or more parameters that specify how the operation is to be carried out.
For instance, an operator that inserts a new gap requires three parameters: the position of the insertion, the index of the sequence to modify and the length of the insertion. These parameters may be chosen completely at random (within some pre-defined range); in that case, the operator is said to be used in a stochastic manner [44]. Alternatively, all but one of the parameters may be chosen randomly, leaving the value of the remaining parameter to be fixed by exhaustive examination of all possible values; the value that yields the best fitness is kept. An operator applied this way is said to be used in semi-hill-climbing mode. Most operators may be used either way (stochastic or semi-hill climbing). For the robustness of the GA, it is also important to make sure that the operators are completely independent of any characteristic of the objective function, unless one is interested in creating a very specific operator for the sake of efficiency.

The Crossovers

Crossovers are meant to generate a new alignment by combining two existing ones. Two types of crossover coexist in SAGA: the one-point crossover, which combines two parents through a single point of exchange (Figure 2a), and the uniform crossover, which promotes multiple exchanges between two parents by swapping blocks between consistent positions (Figure 2b). The uniform crossover is much less disruptive than its one-point counterpart, but it can only be applied if the two parents share some consistency, a condition rarely met in the early stages of the search. Of the two children produced by a crossover, only the fittest is kept and inserted into the new population (if it is not a duplicate). Crossovers are essential for promoting the exchange of high-quality blocks within the population. They make it possible to use the existing diversity efficiently. However, the blocks present in the original population represent only a tiny proportion of all the possibilities.
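The one-point crossover can be sketched as follows: columns to the left of a cut point come from one parent, and, for each sequence, the remainder comes from the other parent, cut after the same number of residues, with the junction padded by gaps so the rows stay flush. This is a minimal sketch that builds only one of the two possible children (SAGA keeps the fitter of the two), and it assumes both parents align the same set of sequences:

```python
def one_point_crossover(parent1, parent2, cut):
    """Combine two alignments of the same sequences through one exchange point."""
    left, right = [], []
    for row1, row2 in zip(parent1, parent2):
        l = row1[:cut]
        n_res = sum(ch != '-' for ch in l)  # residues consumed on the left
        seen, pos = 0, 0
        while seen < n_res:                 # matching cut point in parent2
            if row2[pos] != '-':
                seen += 1
            pos += 1
        left.append(l)
        right.append(row2[pos:])
    # pad the junction with gaps so every row has the same length
    width = max(len(r) for r in right)
    return [l + '-' * (width - len(r)) + r for l, r in zip(left, right)]

child = one_point_crossover(["ACG-T", "-ACGT"], ["AC-GT", "ACGT-"], 2)
```

Note that each child row still spells out its original sequence once the gaps are removed; only the gap arrangement is recombined, which is exactly what makes the operator safe to apply blindly.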
These blocks may not be sufficient to reconstruct an optimal alignment, and since crossovers cannot create new blocks, another class of operators is needed: mutations.

Mutations: Example of the Gap Insertion Operator

SAGA's mutation operators are extensively described in [44]. We will only review here the gap insertion operator, a crude attempt to reconstitute, backward, some of the insertion/deletion events through which a set of sequences might have evolved. When this operator is applied, alignments are modified following the mechanism shown in Figure 3. The aligned sequences are split into two groups. Within each group, every sequence receives a gap insertion at the same position. The groups are chosen by randomly splitting an estimated phylogenetic tree (as given by ClustalW [59]). Both the stochastic and the semi-hill-climbing versions of this operator are implemented. In the stochastic version, the length of the inserted gaps and the two insertion positions are chosen randomly, while in semi-hill-climbing mode the second insertion position is chosen by exhaustively trying all the possible positions and comparing the scores of the resulting alignments.

Dynamic Scheduling of the Operators

When creating a child, the choice of the operator is just as important as the choice of the parents. Therefore, it makes sense to allow operators to compete for usage, just as the parents do for survival, in order to make sure that useful operators are more likely to be used. Since one cannot tell in advance the good operators from the bad ones, they all initially receive the same usage probability. Later during the run, these probabilities are dynamically reassessed to reflect each operator's individual performance. The recipe used in SAGA is the dynamic scheduling method described by Davis [16]. It easily allows the addition and removal of operators without any need for retuning.
Under Davis's model, an operator has a probability of being used that is a function of its recent efficiency (i.e. the improvement generated over the last 10 generations). The credit an operator gets when performing an improvement is also shared with the operators that came before and may have played a role in this improvement. Thus, each time a new individual is generated, if it yields some improvement over its parents, the operator that was directly responsible for its creation gets the largest part of the credit (e.g. 50%); then the operator(s) responsible for the creation of the parents also get their share of the remaining credit (50% of the remaining credit); and so on. This credit report goes on for some specified number of generations (e.g. 4). Every 10 generations, results are summarized for each operator, and the usage probabilities are reassessed based on the accumulated credit. To avoid the early loss of some operators, each of them has a minimum usage probability higher than 0. It is common practice to set these minimal usage probabilities so that they sum to 0.5; to that effect, one can use for each operator a minimum probability of 1/(2 x number of operators). A very interesting property of this scheme is that operators end up being used only when they are needed. For instance, uniform crossovers are generally more efficient than their one-point counterparts. Unfortunately, they cannot be properly used in the early stages of the optimization, because at that point there is not enough consistency within the population. Dynamic scheduling adapts very well to this situation by initially giving a high usage probability to the one-point crossover, and by shifting that credit to the uniform crossover once the population has become consistent enough to support its usage. It is interesting to notice that these two operators compete with one another even though the GA does not explicitly know that they both belong to the crossover category.
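The periodic reassessment step can be sketched as follows: every operator keeps its minimum usage probability as a floor, and the remaining probability mass (0.5 when the floors are set to 1/(2 x number of operators)) is shared in proportion to the accumulated credit. The operator names and credit values below are made up for illustration:

```python
def reassess_probabilities(credit, p_min):
    """Recompute operator usage probabilities from accumulated credit.
    Each operator keeps at least p_min; the leftover probability mass
    is distributed in proportion to each operator's credit."""
    n = len(credit)
    total = sum(credit.values())
    if total == 0:
        return {op: 1.0 / n for op in credit}  # no evidence yet: uniform
    free = 1.0 - n * p_min  # mass shared according to merit
    return {op: p_min + free * c / total for op, c in credit.items()}

# Hypothetical credit accumulated over the last scheduling window.
probs = reassess_probabilities(
    {'one_point': 8.0, 'uniform': 2.0, 'gap_insert': 0.0},
    p_min=1.0 / (2 * 3))
```

With these numbers, the idle operator retains exactly its floor of 1/6, so it can still be tried later in the run if the population changes, which is the behaviour described above for the uniform crossover.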
3.3 Parallelisation of SAGA

Long running times were SAGA's main limitation. This became especially acute when aligning very long sequences such as ribosomal RNAs (>1000 nucleotides). It is common practice to use parallelisation to alleviate such problems. The technique applied to SAGA is specific to GAs and is known as the island parallelisation model [21]. Instead of a single GA running, several identically configured GAs run in parallel on separate processors. Every 5 generations, they exchange some of their individuals. The GAs are arranged on the leaves and nodes of an N-branched tree, and the population exchange is unidirectional, from the leaves to the root of the tree (Figure 4). By default, the individuals migrating from one GA to another are those having the best scores. The source GA keeps a copy of them, while they replace low-scoring individuals in the accepting GA [44]. Initially implemented in RAGA, the RNA version of SAGA, this model was later extended to SAGA, using a 3-branched tree with a depth of 3 that requires 13 GAs. These processes are synchronous and wait for each other to reach the same generation number before exchanging populations. This distributed model benefits from the explicit parallelisation and is about 10 times faster than a non-parallel version (i.e. about 80% of the maximum speedup expected when distributing the computation over 13 processors). It also benefits from the new constraints imposed by the tree topology on the structure of the population. It seems to be the lack of feedback that makes it possible to retain within the population a much higher degree of diversity than a single unified population could afford. It is the terminal leaves that behave as a diversity reservoir and give the parallel GA a much higher accuracy than a non-parallel version with the same overall population.
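The unidirectional migration step can be sketched sequentially as follows (a toy, single-process sketch: the island contents, edge list and fitness function are illustrative stand-ins, with integers playing the role of scored alignments):

```python
def migrate(islands, edges, fitness, k=2):
    """Island-model migration: along each (source -> destination) edge,
    copy the k best individuals of the source GA into the destination GA,
    where they replace the k worst; the source keeps its copies."""
    for src, dst in edges:
        best = sorted(islands[src], key=fitness, reverse=True)[:k]
        keep = sorted(islands[dst], key=fitness, reverse=True)[:len(islands[dst]) - k]
        islands[dst] = keep + list(best)
    return islands

# Hypothetical two-island exchange, leaf to root, with fitness = value.
islands = {'leaf': [5, 9, 1], 'root': [2, 3, 4]}
migrate(islands, [('leaf', 'root')], fitness=lambda x: x, k=2)
```

Because migration only flows toward the root, the leaves never receive outside individuals, which is what lets them act as the diversity reservoir mentioned above.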
Nonetheless, these preliminary observations remain to be firmly established through more thorough benchmarking. 4 Applications: Choosing an Objective Function The main motivation behind SAGA's design was the creation of a robust platform, a black box on which any OF one can think of can be tested in a seamless manner. Such a black box makes it possible to discriminate between the functions that are biologically relevant and those that are not. For instance, let us consider the weighted sums of pairs. This function is one of the most widely used. It owes its popularity to the fact that algorithmic methods exist that allow its approximate optimization [43, 40]. Yet we know this function is not very meaningful from a biological point of view [4]. The three main limitations that have caught biologists' attention are the crude modeling of insertions/deletions (gaps), the assumed independence of each position, and the fact that the evaluation cannot be made position dependent. Thanks to SAGA, it was possible to design new objective functions that make use of more complex gap penalties, take into account non-local dependencies or use position-specific scoring schemes, and to ask whether this increased sophistication results in an improvement of the alignments' biological quality. The following section reviews three classes of objective functions that were successfully optimized using SAGA [44, 47, 46]. 4.1 The Weighted Sums of Pairs MSA [40] is a program that makes it possible to deliver an optimal (or a very close suboptimal) multiple sequence alignment under the sums-of-pairs measure. This sophisticated heuristic performs multi-dimensional dynamic programming in a bounded hyper-space. It is possible to assess the level of optimization reached by SAGA by comparing it to MSA while using exactly the same objective function.
The sums-of-pairs principle is to associate a cost with each pair of aligned residues in each column of an alignment (substitution cost), and another similar cost with the gaps (gap cost). The sum of these costs yields the global cost of the alignment. Major variations involve: i) using different sets of costs for the substitutions (PAM matrices [18] or BLOSUM tables [28]); ii) different schemes for the scoring of gaps [1]; iii) different sets of weights associated with each pair of sequences [2]. Formally, one can define the cost of a multiple alignment A as:

ALIGNMENT COST(A) = Σ(i=1 to N-1) Σ(j=i+1 to N) W(i,j) × COST(Ai, Aj)   (1)

where N is the number of sequences, Ai the aligned sequence i, COST(Ai, Aj) the alignment score between the two aligned sequences Ai and Aj, and W(i,j) the weight associated with that pair of sequences. The COST includes the sum of the substitution costs as given by a substitution matrix and the cost of the insertions/deletions under a model with affine gap penalties (a gap opening penalty and a gap extension penalty). Two schemes exist for scoring gaps: natural affine gap penalties and quasi-natural affine gap penalties [1]. Quasi-natural gap penalties are the only scheme that the MSA program can efficiently optimize. This is unfortunate, since these penalties are known to be biologically less accurate than their natural counterparts [1] because of a tendency to over-estimate the number of gaps. Under both schemes, terminal gaps are penalized for extension but not for opening. It is common practice to validate a new method by comparing the alignments it produces with references assembled by experts. In the case of multiple alignments, one often uses structure-based sequence alignments, which are regarded as the best standard of truth available [24]. For SAGA, validation was carried out using 3Dali [49]. Biological validation should not be confused with the mathematical validation also required for an optimization method.
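The weighted sums-of-pairs cost of equation (1) can be sketched as follows. This is a simplified illustration: the substitution table is passed as a function, and a gap run that crosses from one sequence to the other is counted as a single run, a detail that real implementations treat more carefully:

```python
from itertools import combinations

def pair_cost(a, b, subst, gap_open, gap_ext):
    """Cost of one pairwise projection: substitution costs for aligned
    residues plus affine penalties for each gap run. Columns gapped in
    both sequences are absent from the projection and skipped."""
    cost, in_gap = 0, False
    for x, y in zip(a, b):
        if x == '-' and y == '-':
            continue                       # column absent from the projection
        if x == '-' or y == '-':
            cost += gap_ext if in_gap else gap_open + gap_ext
            in_gap = True
        else:
            cost += subst(x, y)
            in_gap = False
    return cost

def sp_cost(alignment, weights, subst, gap_open=10, gap_ext=1):
    """Weighted sums-of-pairs cost: the weighted sum, over every pair of
    sequences, of the cost of their pairwise projection (equation 1)."""
    total = 0
    for i, j in combinations(range(len(alignment)), 2):
        total += weights[i][j] * pair_cost(alignment[i], alignment[j],
                                           subst, gap_open, gap_ext)
    return total
```

For example, with a 0/1 substitution function, a gap opening penalty of 2 and an extension penalty of 1, the alignment `AC-A` / `A-CA` costs 4.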
In the case of SAGA, both validations were carried out simultaneously, and a summary of the results obtained when optimizing the sums of pairs is shown in Table 1. First, SAGA was used to optimize the sums of pairs with quasi-natural gap penalties, using MSA-derived alignments as a reference. In two thirds of the cases, SAGA reached the same level of optimization as MSA. In the remaining test sets, SAGA outperformed MSA, and in every case that improvement correlated with an improvement of the alignment's biological quality, as judged by comparison with a reference alignment. Although they fall short of a demonstration, these figures suggest that SAGA is an adequate optimization tool that competes well with the most sophisticated heuristics. In a second aspect of that validation, SAGA was used to align test cases too large to be handled by MSA, using as an objective function the weighted sums of pairs with natural gap penalties. ClustalW was the non-stochastic heuristic used as a reference. As expected, the use of natural penalties led to some improvement over the optimization reached by ClustalW, and that mathematical improvement was also correlated with a biological improvement. Altogether, these results are indicative of the versatility of SAGA as an optimizer and of its ability to optimize functions that are beyond the scope of standard dynamic-programming-based algorithmic methods. 4.2 Consistency-Based Objective Functions: The COFFEE Score Ultimately, a multiple alignment aims at combining within a single unifying model every piece of information known about the sequences it contains. However, it may be the case that part of this information is not as reliable as one may expect. It may also be the case that some elements of information are not compatible with one another. The model will reveal these inconsistencies and require decisions to be made in a way that takes into account the overall quality of the alignment.
A new objective function can be defined that measures the fit between a multiple alignment and a list of weighted elements of information. Of course, the relevance of that objective function will depend greatly on the quality of the pre-defined list. This list can take whatever form one wishes. For instance, a convenient source is a list of pairwise alignments [46, 45] that, given a set of N sequences, will contain all the N(N-1)/2 possible pairwise alignments. In the context of COFFEE (Consistency Based Objective Function For alignmEnt Evaluation), that list of alignments is named a library, and the COFFEE function measures the level of consistency between a multiple alignment and its corresponding library. Evaluation is made by comparing each pair of aligned residues observed in the multiple alignment to the list of residue pairs that constitute the library. During the comparison, residues are identified only by their index within the sequences. The consistency score is equal to the number of pairs of residues that are simultaneously found in the multiple alignment and in the library, divided by the total number of pairs observed in the multiple sequence alignment. The maximum is 1, but the real optimum depends on the level of consistency found within the library. To increase the biological relevance of this function, each pair of residues is associated with a weight indicative of the quality of the pairwise alignment it comes from (a measure of the percentage of identity between the two sequences). The COFFEE function can be formalized as follows. Given N aligned sequences S1...SN in a multiple alignment, Ai,j is the pairwise projection (obtained from the multiple alignment) of the sequences Si and Sj, LEN(Ai,j) is the number of ungapped columns in this alignment, SCORE(Ai,j) is the overall consistency between Ai,j and the corresponding pairwise alignment in the library, and Wi,j is the weight associated with this pairwise alignment.
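The consistency evaluation just defined can be sketched in Python. The library is assumed here to be a dictionary mapping a sequence pair (i, j) to a set of aligned residue-index pairs, an illustrative simplification of the real data structure:

```python
from itertools import combinations

def projected_pairs(a, b):
    """Residue index pairs aligned in the pairwise projection of two
    gapped sequences (columns containing a gap are skipped)."""
    pairs, ri, rj = [], 0, 0
    for x, y in zip(a, b):
        if x != '-' and y != '-':
            pairs.append((ri, rj))
        ri += x != '-'
        rj += y != '-'
    return pairs

def coffee_score(alignment, library, weights):
    """Sketch of the COFFEE consistency score: the weighted fraction of
    residue pairs in the multiple alignment that also appear in the
    library of pairwise alignments. Residues are identified purely by
    their index within each (ungapped) sequence."""
    num = den = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        pairs = projected_pairs(alignment[i], alignment[j])
        num += weights[i][j] * sum(p in library[i, j] for p in pairs)
        den += weights[i][j] * len(pairs)
    return num / den if den else 0.0
```

A multiple alignment that agrees with every library pair scores 1.0; disagreements lower the score in proportion to the weights of the pairwise alignments involved.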
COFFEE score = [Σ(i=1 to N-1) Σ(j=i+1 to N) Wi,j × SCORE(Ai,j)] / [Σ(i=1 to N-1) Σ(j=i+1 to N) Wi,j × LEN(Ai,j)]

If we compare this function to the weighted sums of pairs described earlier, we find that the main difference is the library, which replaces the substitution matrix and provides a position-dependent means of evaluation. It is also interesting to note that under this formulation an alignment having an optimal COFFEE score will be equivalent to a Maximum Weight Trace alignment using a 'pairwise alignment graph' [35]. Table 2 shows some of the results obtained using SAGA/COFFEE on 3Dali. For that experiment, the library of pairwise alignments had been generated using ClustalW alignments, and the resulting alignments proved to be of a higher biological quality than those obtained with the alternative methods available at the time. Eventually, these results were convincing enough to prompt the development of a fast non-GA-based method for the optimization of the COFFEE function. That new algorithm, named T-Coffee, was recently made available to the public [45]. 4.3 Taking Non-Local Interactions Into Account: RAGA So far, we have reviewed the use of SAGA for sequence analysis problems that consider every position as independent from the others. While that approximation is acceptable when the sequence signal is strong enough to drive the alignment, this is not always the case when dealing with sequences that have a lower information content than proteins but carry explicit structural information, such as RNA or DNA. To illustrate one more usage of GAs, it is interesting to examine a case where SAGA was used to optimize an RNA structure superimposition in which the OF takes both local and non-local interactions into account. RNA was chosen because its fold, largely based on Watson and Crick base pairings [63], generates characteristic structures (stem-loops) that are easy to predict and analyze [65].
Since the pairing potential of two RNA bases can be predicted with reasonable accuracy, the evaluation of an alignment can easily take into account both structure (Se) and sequence (Pr) similarities. The version of SAGA in which that new function is implemented is named RAGA (RNA Alignment by Genetic Algorithm) [47]. In RAGA, the OF evaluates the alignment of two RNA sequences, one with a known secondary structure (master) and one that is homologous to the master but whose exact secondary structure is unknown (slave). It can be formalized as follows:

Alignment score = Pr + (λ × Se) - Gap penalty   (2)

where λ is a constant (1-3) and Gap penalty is the sum of the affine gap penalties within the alignment. Pr is simply the number of identities. Se uses the secondary structure of the master sequence and evaluates the stability of the folding it induces onto the slave sequence. If two bases form a base pair (part of a stem) in the master, then the two 'slave' bases they are aligned to should be able to form a Watson and Crick base pair as well. Se is the sum of the scores of these induced pairs. The energetic model used in RAGA is very simplified and assigns 3 to GC pairs and 2 to UA and UG pairs. Assessing the accuracy and the efficiency of RAGA is a problem very similar to the one encountered when analyzing SAGA. In this case, the reference alignments were chosen from mitochondrial ribosomal small subunit RNA sequence alignments established by experts [61]. The human sequence was used as a master and realigned by RAGA to seven other homologous mitochondrial sequences used as slaves. Evaluation was made by comparing the optimized pairwise alignments to those contained in the reference alignment. The results in Table 3 indicate very clearly that a proper optimization took place and that the secondary structure information was efficiently used to enhance the alignment quality.
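Equation (2) can be sketched as follows. The base-pair scores (3 for GC, 2 for AU and GU) follow the text; representing the master's base pairs as alignment-column indices and the exact gap model are simplifying assumptions of this sketch, not RAGA's actual implementation:

```python
# Simplified energetic model from the text: 3 for GC, 2 for AU and GU (wobble).
PAIR_SCORE = {("G", "C"): 3, ("C", "G"): 3,
              ("A", "U"): 2, ("U", "A"): 2,
              ("G", "U"): 2, ("U", "G"): 2}

def affine_gap_cost(seq, gap_open, gap_ext):
    """Affine gap cost of one gapped sequence: opening plus extension
    for the first '-' of a run, extension only afterwards."""
    cost, in_gap = 0.0, False
    for c in seq:
        if c == '-':
            cost += gap_ext if in_gap else gap_open + gap_ext
        in_gap = (c == '-')
    return cost

def raga_score(master, slave, master_pairs, lam=2.0,
               gap_open=5.0, gap_ext=1.0):
    """Sketch of the RAGA objective (equation 2): identities (Pr) plus
    lambda times the stability of the structure the master induces on
    the slave (Se), minus the affine gap penalties."""
    pr = sum(x == y and x != '-' for x, y in zip(master, slave))
    # Se: for each base pair (i, j) of the master, given here as
    # alignment-column indices, score the pairing potential of the
    # slave bases aligned to those columns.
    se = sum(PAIR_SCORE.get((slave[i], slave[j]), 0) for i, j in master_pairs)
    gaps = affine_gap_cost(master, gap_open, gap_ext) \
         + affine_gap_cost(slave, gap_open, gap_ext)
    return pr + lam * se - gaps
```

A slave sequence whose bases can still pair wherever the master's stems demand it keeps a high Se term even when its primary sequence has drifted, which is exactly why the structural term helps with divergent sequences.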
This is especially noticeable for very divergent sequences that do not contain enough information at the primary level for an accurate alignment to be determined on that basis alone. It is also worth pointing out that RAGA could take into account some elements of the tertiary structure known as pseudoknots, which were successfully added to the objective function. These elements, which are beyond the scope of most dynamic-programming-based methods, led to even more accurate alignment optimization [47]. 5 Conclusion: GAs versus Heuristic Methods Section 4 of this chapter illustrates three situations in which GAs proved able to solve very complex optimization problems with a reasonable level of accuracy. On its own, this clearly indicates the importance and the interest of these methods in the field of sequence analysis. Yet, when applied to this type of problem, GAs suffer from two major drawbacks: they are very slow and unreliable. By unreliable, we mean that, given a set of sequences, a GA may not deliver the same answer twice, owing to the stochastic nature of the optimization process and to the difficulty of the optimization. This may be a great cause of concern to the average biologist, who expects to use his multiple alignment as a prediction tool and possibly as a decision aid for the design of expensive wet-lab experiments. How severe is this problem? If we consider the protein test cases analyzed here, SAGA reaches its best score in half of the runs on average. For RAGA, maybe because the solution space is more complex, this proportion goes down to 20%. If one is only interested in validating a new objective function, this is not a major source of concern, since even in the worst cases the sub-optimal solutions are within a few percent of the best solution found. However, this instability is not unique to GAs and is not as severe as the second major drawback: the efficiency.
Although GAs are much more practical than SA, their slowness means that they cannot really be expected to become part of any of the very large projects that require millions of alignments to be made routinely over a few days [15]. More robust, if less accurate, techniques are required for that purpose. Is the situation hopeless, then? The answer is definitely no, since two important fields of application exist for which GAs are uniquely suited. The first one is the analysis of rare and very complex problems for which no other alternative is available, such as the folding of very long RNAs. The second aspect is more general. GAs provide us with a unique way of probing very complex problems with little concern, at least in the first stages, for the algorithmic issues involved. It is quite remarkable that even with a very simple GA one can quickly ask very important questions and decide whether a thread of investigation is worth pursuing or should simply be abandoned. The COFFEE project is a good example of such a cycle of analysis. It followed a three-step process: (i) an objective function was first designed without any concern for the complexity of its optimization or the algorithmic issues; (ii) SAGA was used to evaluate the biological relevance of that function; (iii) this validation was convincing enough to prompt the conception of a new dynamic programming algorithm, much faster and appropriate for this function [45]. This non-GA-based algorithm was named T-Coffee (Tree based COFFEE). The mere comparison of the two projects' respective development times makes a good case for the use of SAGA: the COFFEE project took four months to carry out, while completion of the T-Coffee project required more than a year and a half of algorithm development and software engineering.
Availability SAGA, RAGA, COFFEE and T-Coffee are all available free of charge from the author, either via e-mail (cedric.notredame@igs.cnrs-mrs.fr) or via the WWW (http://igs-server.cnrsmrs.fr/~cnotred).

Acknowledgements The author wishes to thank Dr Hiroyuki Ogata and Dr Gary Fogel for very helpful comments and for an in-depth review of the manuscript.

References
[1] S. F. Altschul, Gap costs for multiple sequence alignment, J. Theor. Biol., 138 (1989), pp. 297-309.
[2] S. F. Altschul, R. J. Carroll and D. J. Lipman, Weights for data related by a tree, J. Mol. Biol., 207 (1989), pp. 647-653.
[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, Basic local alignment search tool, J. Mol. Biol., 215 (1990), pp. 403-410.
[4] S. F. Altschul and D. J. Lipman, Trees, stars and multiple biological sequence alignment, SIAM J. Appl. Math., 49 (1989), pp. 197-209.
[5] L. A. Anabarasu, Multiple sequence alignment using parallel genetic algorithms, The Second Asia-Pacific Conference on Simulated Evolution (SEAL), Canberra, Australia, 1998.
[6] A. Bairoch, P. Bucher and K. Hofmann, The PROSITE database, its status in 1997, Nucleic Acids Res., 25 (1997), pp. 217-221.
[7] G. J. Barton and M. J. E. Sternberg, A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons, J. Mol. Biol., 198 (1987), pp. 327-337.
[8] S. A. Benner, M. A. Cohen and G. H. Gonnet, Response to Barton's letter: computer speed and sequence comparison, Science, 257 (1992), pp. 1609-1610.
[9] P. Bucher, K. Karplus, N. Moeri and K. Hofmann, A flexible motif search technique based on generalized profiles, Comput. Chem., 20 (1996), pp. 3-23.
[10] K. Bucka-Lassen, O. Caprani and J. Hein, Combining many multiple alignments in one improved alignment, Bioinformatics, 15 (1999), pp. 122-130.
[11] L. Cai, D. Juedes and E. Liakhovitch, Evolutionary computation techniques for multiple sequence alignment, Congress on Evolutionary Computation, 2000, pp. 829-835.
[12] H. Carrillo and D. J. Lipman, The multiple sequence alignment problem in biology, SIAM J. Appl. Math., 48 (1988), pp. 1073-1082.
[13] K. Chellapilla and G. B. Fogel, Multiple sequence alignment using evolutionary programming, Congress on Evolutionary Computation, 1999, pp. 445-452.
[14] F. Corpet, Multiple sequence alignment with hierarchical clustering, Nucleic Acids Res., 16 (1988), pp. 10881-10890.
[15] F. Corpet, F. Servant, J. Gouzy and D. Kahn, ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons, Nucleic Acids Res., 28 (2000), pp. 267-269.
[16] L. Davis, The Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, 1991.
[17] P. J. Davis and R. Hersh, The Mathematical Experience, Birkhäuser, Boston, 1980.
[18] M. O. Dayhoff, Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, DC, 1978.
[19] J. Felsenstein, PH

[Figure 4: Seq1b: MNBDEEBNDBEDM; Seq2b: BDNEMEMNDBNDM (two sequences in which one symbol has been randomly replaced by the new symbols M or N)]

well-known collection of 500 related human sequences known as the kinome (Manning et al. 2002). The procedure delivered a substitution matrix highly correlated with a standard point accepted mutation (PAM) matrix, in which all the known mutational preferences between amino acids could easily be recognized. We then do the same by comparing three distinct sets of social sciences data representing the same sequential reality. The training procedure is then evaluated by testing its ability to correctly identify the closeness of two different symbols, using solely the information contained in the data.
To do so, we use a set of sequences to compare a reference cost of substitution between two given symbols produced by the training procedure (e.g., AB) with the cost produced by the training procedure for the same substitution in the case where one of the symbols (e.g., A) has been randomly split into two new symbols (M and N) not belonging to the alphabet. As symbols M and N are actually ''hidden A's,'' we expect the training procedure to determine the substitution costs AB, MB, and NB as equivalent. Figure 4 shows, for two given sequences, how a symbol is randomly split into two new symbols not belonging to the original alphabet. Testing the Quality of the Clustering A third set of criteria pertains to testing the quality of the cluster analysis. One of the main difficulties with clustering methods lies in determining the number of clusters really present in the data (Milligan and Cooper 1985, 1987). There is no perfect method for establishing this number, but several indicators may be used to help decide (Everitt 1979; Bock 1985; Hartigan 1985; Milligan and Cooper 1985; SAS Institute 2004).
Downloaded from http://smr.sagepub.com at Unithéque cantonale et universitaire de Lausanne on September 7, 2009. 212 Sociological Methods & Research
For Milligan and Cooper (1987), there are two categories of tests concerning the quality of a cluster analysis. The first considers that internal criteria are able to validate the results of the clustering, that is, to justify the number of clusters chosen. The second uses external criteria. Such criteria represent information that is external to the cluster analysis and was not used at any other point in the cluster analysis (Milligan and Cooper 1987). In terms of internal criteria, Milligan and Cooper (1985) evaluated and compared 30 statistics known as stopping rules that help in deciding how many ''real'' clusters are present in the data.
The availability of such indices in the main statistical software packages (such as SAS or SPSS) is of course a non-negligible element of choice concerning which criteria to use. Two of the most efficient indices among the 30 that Milligan and Cooper (1985, 1987) have evaluated are part of the SAS software. The first one is a pseudo-F developed by Calinski and Harabasz (1974); it represents an approximation of the ratio between intercluster and intracluster variance. The second index is expressed as Je(2)/Je(1) (Duda and Hart 1973) and may be transformed into a pseudo-t2. The third criterion we used is R2, which expresses the size of the experimental effect. It is reasonable to look for a consensus among the three criteria (Nargundkar and Olzer 1998; SAS Institute 2004). We can then define the stopping rule for a statistically optimal cluster solution as a local peak of the pseudo-F (high ratio between inter- and intracluster variance), associated with a low value of the pseudo-t2 that increases at the next fusion, and a marked drop of the overall R2. Generally, a cluster solution is said to be statistically optimal when the number of classes is kept constant across strategies, when the intercluster variance is highest, and when the intracluster variance is lowest. Put another way, clusters should exhibit two properties: external isolation and internal cohesion (Punj and Stewart 1983). Therefore, using comparative scree plots is a straightforward way of dealing with the issue of testing cluster solutions drawn from distances based on various cost schemes, including the computationally derived one. A given cluster solution is retained for analysis only if at least two among those three criteria (pseudo-F, pseudo-t2, and R2) support its validity. External criteria refer to the extent to which clusters drawn from the data correlate with either independent variables or outcomes (Milligan and Cooper 1987).
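As an illustration, the Calinski and Harabasz pseudo-F can be computed from scratch as follows; this is a plain-Python sketch of the index, whereas in practice it is obtained directly from packages such as SAS as part of the clustering procedure:

```python
def pseudo_f(data, labels):
    """Sketch of the Calinski-Harabasz pseudo-F: the ratio of
    between-cluster to within-cluster sums of squares, each scaled
    by its degrees of freedom (k-1 and n-k respectively)."""
    n, k = len(data), len(set(labels))
    grand = [sum(col) / n for col in zip(*data)]        # overall centroid
    centroids = {}
    for c in set(labels):
        members = [x for x, l in zip(data, labels) if l == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sq = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    between = sum(sq(centroids[l], grand) for l in labels)
    within = sum(sq(x, centroids[l]) for x, l in zip(data, labels))
    return (between / (k - 1)) / (within / (n - k))
```

A high pseudo-F indicates compact, well-separated clusters; a local peak across candidate numbers of clusters is what the stopping rule described above looks for.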
Clusters that do not associate with these variables are of little help in social research, as the ultimate goal of the social sciences is explanation rather than description. A third criterion is more intuitive: To what extent are empirical clusters easily comprehended, based on prior knowledge of the phenomenon and the central hypothesis of the research?
Gauthier et al. / How Much Does It Cost? 213
This criterion can be approached by using experts and computing interreliability estimates. The procedure in that case is as follows: Provide cluster solutions based on the various cost schemes, and have a set of raters who decide independently which is their favorite solution. Then one may compute interrater reliability and see which coding scheme comes up first in the list. Given the importance of the debate concerning the influence of sociostructural factors on the occupational trajectories of women in the sociological field, and the availability of high-quality data on occupational status during entire life courses, we test these methods on data sets addressing this topic. Description of the Test Samples Considering the fact that women's labor market participation is more diverse than that of men (Myrdal and Klein 1956; Levy 1977; Mott 1978; Elder 1985; Moen 1985; Höpflinger, Charles, and Debrunner 1991; Moen and Yu 2000; Blossfeld and Drobnic 2001; Krüger and Levy 2001; Levy, Widmer, and Kellerhals 2002; Moen 2003; Widmer, Kellerhals, and Levy 2003; Bird and Krüger 2005; Levy, Gauthier, and Widmer 2006), and in order to facilitate the comparisons between the data sets, for each database we selected only women who were married or living with a partner at the time of the interview. Moreover, in order to maximize the quality of the data, we retained only the trajectories that had less than 10 percent missing values.
Sample Test 1: Social Stratification, Cohesion, and Conflict in Contemporary Families The first sample of occupational trajectories is drawn from a retrospective questionnaire of the study Social Stratification, Cohesion, and Conflict in Contemporary Families (SCF), conducted in 1998 with 1,400 individuals living as couples in Switzerland (Widmer, Kellerhals, Levy, et al. 2003; Widmer, Kellerhals, and Levy 2004). Respondents were asked to provide information about every year of their occupational trajectory from age 16 onward to 64. Every year of the trajectories was coded using a seven-category coding scheme: full-time employment, part-time employment, positive interruption (sabbatical, trip abroad, etc.), negative interruption (unemployment, illness, etc.), housework, retirement, and full-time education. Data were right-truncated, as most individuals had not yet reached the age of 64 at the time of the interview. Sociostructural indicators (socioeconomic status of the orientation family, educational level, number of children, and income) were measured for the time of the interviews only. The final sample size was 564 women. Sample Test 2: The Swiss Household Panel Since 1999, the Swiss Household Panel (SHP) has collected data on a representative sample of private households in Switzerland on a yearly basis. In its third wave, the SHP included a retrospective questionnaire sent to 4,217 households (representing 8,913 individuals). For reasons of validity, the analysis of the subsample of individuals who answered the retrospective questionnaire was restricted to those aged 30 and older, decreasing the sampled female population to 1,935. The SHP asked respondents to provide information on their educational and occupational status from birth to the present.
Each change in status is associated with a starting year and an ending year. We recoded these the same way as for Sample Test 1. Sociostructural indicators comparable to those in Sample 1 were also obtained. This sample included 1,107 women. Sample Test 3: Female Job Histories From the Wisconsin Longitudinal Study The Wisconsin Longitudinal Study (WLS) is a long-term study of a random sample of 10,317 men and women who graduated from Wisconsin high schools in 1957. This data set is available for public use at the University of Wisconsin-Madison Web site (http://www.ssc.wisc.edu/wlsresearch). The female job histories for 1957-1992 were constructed by Sheridan (1997) from the 1957, 1964, 1975, and 1992 WLS data collections. The data also include social background, youthful aspirations, schooling, military service, family formation, labor market experiences, and social participation. The ''female job histories'' data concern 5,042 women born in 1938 and 1939. We could retain only three main occupational statuses, namely, full-time paid work, part-time paid work, and full-time housewife. There were 2,243 women in this sample. Results Production of Data-Driven Costs of Substitution From a sociological point of view, we could expect a relative stability of the costs of substitution from one set of sequences to another, the occupational trajectories of contemporary Swiss and North American women being to a certain extent comparable, at least with regard to the influence of the birth of children on the reduction or cessation of paid work.
The individual sequences of occupational statuses are built by attributing a single symbol (a code corresponding to a given occupational status) to each year of life of the respondents. Table 1 compares the different costs of substitution: set arbitrarily to identity, set following theoretical arguments concerning differences among types and rates of occupational activities (for details, see Widmer, Levy, et al. 2003), or derived by means of a training procedure in the different databases. Table 1 shows that the training procedure produces costs that are more differentiated than identity costs. The range of costs is also broader, partly because the procedure is sensitive to very rare substitutions. The stability of the trained costs of substitution from one database to another confirms the ability of the training to produce meaningful cost schemes. The training procedure reflects some relations between the different statuses that are sociologically relevant. Compared to identity costs, which cannot be differentiated between men and women, the trained costs reveal, for example, the relative ease (the low costs) with which women in the samples go from paid work to housework. The comparison of knowledge-based and trained costs of substitution shows a high similarity between the two sets of values: knowledge-based costs are correlated with trained costs at .68 (p < .01) for SCF data, at .63 (p < .01) for SHP data, and at .73 (p < .05) for WLS data. Table 2 shows Pearson's coefficient of correlation between the costs, by method of cost setting and database. Table 2 shows that the trained costs of substitution are more strongly associated with each other from one data set to another than they are with costs set either to identity or to knowledge-based values. On the other hand, even if they remain relatively high, the associations between trained, knowledge-based, and identity costs are systematically weaker than those between trained costs.
This confirms the stability of the results stemming from the training procedure and explains, at least in part, the slightly but systematically different (and more highly correlated) results it provides compared with the two other strategies (identity and knowledge based).

Validation of the Training Procedure

An important issue in the use of a computerized, data-based determination of substitution costs is to assess the extent to which this procedure is able to process information in a sociologically relevant way. Three different tests were used. The first one referred to the ability of the procedure to evaluate the closeness of a symbol belonging to the alphabet with an unknown symbol not belonging to it. The second one focused on the degree of agreement between classifications of social trajectories made by specialists in the field compared with classifications of the same data based on identity, knowledge-based, and trained costs of substitution. The third one consisted of measuring the extent to which clusters drawn from the data correlate with some independent sociostructural variables or outcomes.

Downloaded from http://smr.sagepub.com at Unithèque cantonale et universitaire de Lausanne on September 7, 2009 (Sociological Methods & Research; Gauthier et al. / How Much Does It Cost?)

Table 1
Comparisons of Identity, Knowledge-Based, and Trained Costs of Substitution for Three Data Sets: SCF, SHP, and WLS
[Substitution costs for each pair of occupational statuses (Full-Time, Part-Time, Negative Interruption, Positive Interruption, At Home, Retirement, Education, Missing), plus insertion or deletion, under five cost-setting schemes: identity, knowledge based, and trained on the SCF, SHP, and WLS data; the individual cell values are not reliably recoverable from this extraction.]
Note: SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.

Table 2
Pearson's Correlation Between Cost Matrices, by Method of Cost Setting and (Full) Data Sets

                   Identity   Knowledge   SCF Trained   SHP Trained   WLS Trained
Identity             1.00       .98***      .66***        .61***        .71*
Knowledge based      .98***    1.00         .68***        .63***        .73*
SCF trained          .66***     .68***     1.00           .96***        .97***
SHP trained          .61***     .63***      .96***       1.00           .94***
WLS trained          .71*       .73*        .97***        .94***       1.00

Note: UNIX command line to produce the trained matrix: saltt -e '-in dataset.dat -action + pavie_seq2pavie_mat _TGEPF50_THR60_TWE04_SAMPLE50000_'. SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.
*p < .05. ***p < .001.
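The matrix-to-matrix correlations reported in Table 2 can be sketched in a few lines. This is an illustrative computation with toy 3-symbol matrices, not the SALTT implementation: Pearson's r is taken over the upper triangle, so each unordered pair of symbols is counted once.

```python
from math import sqrt

def matrix_correlation(m1, m2):
    """Pearson correlation between two substitution-cost matrices,
    computed over the upper triangle (each unordered pair counted once)."""
    n = len(m1)
    xs = [m1[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [m2[i][j] for i in range(n) for j in range(i + 1, n)]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Two toy 3-symbol cost matrices that rank the pairs almost identically,
# mimicking two cost-setting strategies that agree on the data.
a = [[0.0, 0.3, 1.0],
     [0.3, 0.0, 0.8],
     [1.0, 0.8, 0.0]]
b = [[0.0, 0.4, 1.2],
     [0.4, 0.0, 1.0],
     [1.2, 1.0, 0.0]]
r = matrix_correlation(a, b)  # close to 1: the two schemes agree
```

A correlation near 1, as between the three trained matrices in Table 2, means the two schemes order and scale the substitution pairs in nearly the same way.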
Identifying the Proximity of Unknown Symbols

A first way of validating the training procedure consists of measuring the extent to which it is able to unravel the proximity of two given symbols, based on no information other than the data itself. We tested this for the SCF set of sequences by randomly replacing a given symbol of the sequence alphabet A = {A, B, C, D, E, F, G, X}, which corresponds in this case to an occupational status, with two symbols that did not belong to the original alphabet of that set of sequences, that is, symbols whose actual identity was hidden. Using the training procedure, we then compared the original costs for substituting, for example, Symbol A with Symbol B, with the costs we obtained after having randomly replaced every A with either the hidden symbol M or N (cf. Figure 4). In a second run, we did the same by replacing each B with the hidden symbol O or P. We thus obtained five different expressions of the same initial substitution (in this example, AB = NB = MB = AO = AP), each associated with a specific cost. This procedure was applied in turn to all pairs of symbols of the data set. If we consider Ei and Ej to be, respectively, the ith and jth elements of the original alphabet, and their two random substitutes to be, respectively, S1(Ei), S2(Ei) and S1(Ej), S2(Ej), there are five costs of substitution to take into account if we consider only the substitutions involving at least one symbol belonging to the original alphabet. Under these conditions, as they are actually different expressions of the same initial substitution, we should expect those five trained costs to be identical, or at least close to each other.
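The masking step described above can be sketched as follows; the function name and toy sequences are illustrative and do not reproduce SALTT's actual interface.

```python
import random

def hide_symbol(sequences, target, substitutes, seed=0):
    """Replace every occurrence of `target` with a randomly chosen
    substitute drawn from outside the original alphabet, mimicking the
    masking step used to test whether training rediscovers identity."""
    rng = random.Random(seed)
    return ["".join(rng.choice(substitutes) if s == target else s for s in seq)
            for seq in sequences]

# Every A is hidden behind M or N; the other statuses are untouched.
seqs = ["AABEA", "ABBEE", "AAAEB"]
masked = hide_symbol(seqs, "A", ["M", "N"])
```

Retraining on the masked sequences then yields costs for MB, NB, and so on, which can be compared with the original AB cost.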
To compare all those values in a synthetic way for the entire alphabet, we computed a standardized difference between the trained cost of substitution associated with a given pair of symbols belonging to the original alphabet and the trained costs of substitution between one of those original symbols and the substitute of the other one, as shown in equation (8):

Std Difference = [ (cost(Ei[S1(Ej)]) − cost(EiEj)) + (cost([S1(Ei)]Ej) − cost(EiEj)) ] / (2 · cost(EiEj))   (8)

The proximity of the five substitution costs associated with a given original pair of symbols and their substitutes was compared in two ways, using either the first substitute of that pair of symbols (as shown in equation [8]) or the second one (where S2 replaces S1 in equation [8]). All those values are tabulated in Table 3. Its lower part contains the standardized differences between the substitutions of Ei, Ej, and their first substitute (cost EiEj compared to Ei[S1(Ej)] and [S1(Ei)]Ej), whereas the upper part contains the values associated with their second substitute (cost EiEj compared to Ei[S2(Ej)] and [S2(Ei)]Ej). Table 3 shows clearly that the training procedure identifies very precisely the closeness of two distinct, but actually identical, symbols (see note 10). Among the 56 different costs of substitution in Table 3, 49 (87 percent) show a difference of no more than 10 percent compared with the original cost. The larger differences may be attributed to the fact that the training procedure is relatively sensitive to rare symbols. For example, Symbols C, D, F, and X together represent only about 2 percent of the total symbols used in the sequences. The great majority of the hidden costs differing notably from their original costs involve such rare symbols. The difference is maximal when it concerns two rare symbols.
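Equation (8) translates directly into code; the dictionary of trained costs below is a toy stand-in for an actual trained matrix.

```python
def std_difference(cost, ei, ej, s_ei, s_ej):
    """Standardized difference of equation (8): how far the trained costs
    involving a hidden substitute stray from the original cost(Ei, Ej)."""
    base = cost[(ei, ej)]
    return ((cost[(ei, s_ej)] - base) + (cost[(s_ei, ej)] - base)) / (2 * base)

# Toy trained costs: M and O are hidden copies of A and B, respectively.
# Training recovered nearly identical costs for the masked pairs.
costs = {("A", "B"): 1.0, ("A", "O"): 1.05, ("M", "B"): 0.95}
d = std_difference(costs, "A", "B", "M", "O")  # ~ 0: deviations cancel
```

Values near zero, as on most of the diagonal neighborhood of Table 3, indicate that the procedure treats a hidden substitute just like the symbol it replaces.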
The ability of the training procedure to identify the similarity of two unknown symbols based on the data set at hand is one of the main strengths of this way of setting costs of substitution. Even if it stays relatively close to identity costs of substitution, this procedure takes into account the real relations of the different symbols present in the sequences and is therefore highly informative. On one hand, it prevents particular relationships from remaining undetermined; on the other hand, it works as a predictive tool, in the sense that two different symbols with low substitution costs can be predicted to substitute easily for one another in real life.

Table 3
Standardized Difference Between the Original Trained Costs of Substitution and Their Substitutes

     A (%)   B (%)   C (%)   D (%)   E (%)   F (%)   G (%)   X (%)   Relative Frequency (%)
A      —       0       7       6       0       6      10       0            33.5
B      8       —       0       6     –10      10       0       0            19.5
C      7       0       —       5      –7       0       0     –15             0.5
D      6       6       0       —      –6       0       6       6             1.0
E      0       0      –7       0       —      13       8       0            31.0
F      0      25       9      11      13       —       0       0             0.1
G     10       7       0       0       8       0       —      –7            14.0
X      0       7     –15     –28       0       0      –7       —             0.4

Note: Rows and columns are given the name of a symbol belonging to the alphabet, although each cell of the table compares the substitution costs of three pairs of symbols (the original one and two substitutes) according to equation (8).

Automatic Versus Classification by Judges

Another way to validate the training procedure is to test the extent to which automatic classification succeeds in replicating a classification of sequences made by experts on a small subset of well-identified sequences. To do so, we extracted a sample of 100 occupational trajectories of women from each data set. Four judges were asked to classify them into a number of clusters that corresponded to previous empirical findings (Widmer, Levy, et al.
2003; Levy, Gauthier, and Widmer 2006) and to theoretical schemes (i.e., Kohli 1986). In each case, we retained only the sequences that were classified the same way by at least three (out of four) judges. The interrater agreement lay between 83 percent and 88 percent. To keep the computation procedures as parsimonious as possible, we first exactly replicated with SALTT the results we had obtained with TDA using two different cost settings (identity and knowledge based). That allowed us to produce optimal alignments and to compare the distance matrices for the three strategies (identity, knowledge based, and training) from within SALTT. For each set of sequences, we ran three optimal matching analyses, the first one using identity costs of substitution (for details, see above), the second one using knowledge-based costs, and the third one using costs stemming from the training procedure. A distance matrix was computed for each set of sequences and for each strategy and then entered into a cluster analysis.

Table 4
Association (khi2 and Symmetric) Between Judges and Automatic Classification, by Method of Cost Setting

Database   Method                     khi2 (df)        Symmetric (Value)   ASE
SCF        Identity * Judges          213.2454* (16)   0.8034              0.0458
SCF        Knowledge Based * Judges   206.1951* (16)   0.8120              0.0440
SCF        Trained * Judges           224.5436* (16)   0.8291              0.0434
SHP        Identity * Judges          213.4108* (16)   0.7500              0.0582
SHP        Knowledge Based * Judges   228.4631* (16)   0.7705              0.0623
SHP        Trained * Judges           235.1387* (16)   0.7797              0.0602
WLS        Identity * Judges          143.9678* (9)    0.7037              0.0684
WLS        Knowledge Based * Judges   148.6864* (9)    0.7196              0.0677
WLS        Trained * Judges           143.2652* (9)    0.7037              0.0677

Note: ASE = asymptotic standard error; SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.
*p < .001.

Table 4 shows the degree of association (khi2 and symmetric; Goodman and Kruskal 1979; Olszak and Ritschard 1995) between the clusters made by the judges and those stemming from automatic classification. The results provided by a trained matrix lead to significant associations with the classification by judges for the three data sets considered. For the Wisconsin study, results are about the same when using either identity or trained costs of substitution. Trained costs never lead to a weaker association (symmetric) with judges' classifications than identity costs or knowledge-based costs for the SCF and SHP data sets. Results are less straightforward concerning the WLS data, with knowledge-based costs performing slightly better than trained costs. The fact that the Wisconsin data are less differentiated (sequences with only three different statuses, as opposed to seven in the other databases, and respondents all about the same age) may explain why trained costs do not lead to a markedly different solution than the two alternative strategies. In all cases, the associations are quite high and significant, suggesting the ability of the method to provide meaningful cost schemes.
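The distance matrices entered into the cluster analysis come from optimal matching. A minimal sketch of that distance, assuming a Needleman-Wunsch-style dynamic program with a user-supplied substitution-cost function and a flat insertion/deletion cost of 0.5 as in the tables (names and the toy call are illustrative, not the SALTT implementation):

```python
def om_distance(s1, s2, sub_cost, indel=0.5):
    """Optimal matching distance: minimal total cost of substitutions,
    insertions, and deletions turning sequence s1 into s2."""
    n, m = len(s1), len(s2)
    # d[i][j] = cost of aligning the first i symbols of s1 with the first j of s2
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if s1[i - 1] == s2[j - 1] else sub_cost(s1[i - 1], s2[j - 1])
            d[i][j] = min(d[i - 1][j - 1] + match,  # substitution (or match)
                          d[i - 1][j] + indel,      # deletion
                          d[i][j - 1] + indel)      # insertion
    return d[n][m]

# Flat substitution cost of 1.0, as in the identity scheme.
dist = om_distance("AABB", "ABBB", lambda a, b: 1.0)  # -> 1.0
```

Running this over all pairs of sequences, once per cost scheme, yields the distance matrices that are then clustered and compared with the judges' classification.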
Given that the reference classification based on the judges' responses was very consensual and based on predefined categories, the results of this test express the ability of the procedure to differentiate clear-cut, sociologically relevant categories out of the data, rather than the extent to which those results and the underlying costs of substitution reflect the theoretical and subjective conceptual frame of an expert.

Association With External Criteria

A third validation procedure consisted of measuring the extent to which clusters drawn from the data correlate with either independent sociostructural variables or outcomes (Milligan and Cooper 1987; Rapkin and Luke 1993), an approach that seemingly few studies have used so far (Milligan and Cooper 1987). Clusters that do not associate with such variables are of little help, as the ultimate goal of the social sciences is explanation rather than description. For each strategy, the three stopping-rule criteria aimed at determining the number of clusters in the data (pseudo-t2, pseudo-F, and R2) suggested the presence of three clusters in the SCF and SHP data and of four clusters in the WLS data. A closer look at the data reveals that those clusters correspond precisely to typical female trajectories, as described elsewhere (Moen 1985; Höpflinger et al. 1991; Erzberger and Prein 1997; Widmer, Levy, et al. 2003; Levy et al. 2006), namely, trajectories characterized by full-time employment, part-time employment, and full-time presence at home as a housewife. In the Wisconsin data, the clusters are the same, but with a fourth one representing a return to the labor market after a period at home. Such a cluster also appears when the clusters of the SCF and SHP data are further subdivided.
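Of the stopping-rule criteria just mentioned, the pseudo-F (Calinski and Harabasz 1974) is the simplest to illustrate. This one-dimensional toy version (not the actual SAS computation used in the study) shows the ratio of between- to within-cluster variance that the criterion maximizes:

```python
def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo-F for one-dimensional data:
    (B / (k - 1)) / (W / (n - k)), where B and W are the between- and
    within-cluster sums of squares."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    grand = sum(points) / n
    between = within = 0.0
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        mean = sum(members) / len(members)
        between += len(members) * (mean - grand) ** 2
        within += sum((p - mean) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Two well-separated groups give a very large pseudo-F;
# a scrambled partition of the same points gives a small one.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
f_good = pseudo_f(pts, [1, 1, 1, 2, 2, 2])
f_bad = pseudo_f(pts, [1, 2, 1, 2, 1, 2])
```

In practice the number of clusters retained is the one on which pseudo-F agrees with the pseudo-t2 and R2 criteria.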
The greater homogeneity of the WLS data in terms of age of respondents and completeness of the sequences (no right truncation) may explain the better visibility (consensus between stopping-rule criteria) of that fourth cluster, which is also documented in the literature (Widmer, Levy, et al. 2003; Levy et al. 2006). We first ran a multinomial logistic regression (see note 11) on each data set (SCF, SHP, and WLS), using cluster membership (which represents in this case types of occupational trajectories) as the response variable and a set of indicators of social positioning (socioeconomic position of the family of orientation, including level of education, number of children, age, and household income) generally considered (cf. description of the sample) to be intervening variables in shaping female occupational trajectories. To be consistent with the stopping-rule criteria (that is, a consensus between pseudo-t2, pseudo-F, and R2), we retained in this first step the three-cluster solutions that those criteria pointed out for each data set.

Table 5
khi2 Values of the Likelihood Ratio Test by Database and Cost-Setting Method

                                    Identity            Knowledge Based     Trained
Data Sets                    df     khi2     p > khi2   khi2     p > khi2   khi2     p > khi2
Set 1: SCF data, 3 clusters  272    290.19   .2143      280.87   .3428      288.60   .2339
Set 1: SCF data, 5 clusters  596    553.02   .8956      547.03   .9250      522.11   .9867
Set 2: SHP data, 3 clusters  404    568.36   < .0001    562.04   < .0001    512.34   < .0002
Set 2: SHP data, 5 clusters  808    897.67   .0150      863.32   .0865      740.12   .9574
Set 3: WLS data, 4 clusters  258    307.35   .0189      323.81   .0034      288.37   .0939

Note: SCF = Social Stratification, Cohesion, and Conflict in Contemporary Families; SHP = Swiss Household Panel; WLS = Wisconsin Longitudinal Study.
As they are more homogeneous, they represent about the same social reality in each data set and therefore remain sociologically relevant. We then performed the tests on the five-cluster solutions for the SCF and SHP data to check the efficiency of the different cost-setting methods on other empirically founded classifications (Widmer, Levy, et al. 2003; Levy et al. 2006). We felt justified in doing this because two new clusters emerged from further subdivision of the first three clusters defined by the proposed criteria (R2, pseudo-F, and pseudo-t2). Table 5 shows the likelihood ratio tests applied to those multinomial regressions. The likelihood tests compare a given model with the saturated one (a model that exactly replicates the data), meaning in this case that the smaller the value of khi2 (i.e., the larger the p value), the better the model fits the data (see note 12). One can read from Table 5 that the trained costs of substitution allow building a model that fits the data better in all cases compared with identity costs and in four out of five cases compared with knowledge-based costs. Put another way, clusters produced by trained costs of substitution are more sensitive to predictors than clusters produced by either identity costs or knowledge-based costs. This is true, although not with the same strength, for the three sets of sequences. The fit is significantly better (i.e., the model stemming from trained costs does not differ significantly from the saturated model, whereas the two others do) in two cases and with two data samples.

Discussion

Setting costs of substitution in the process of aligning sequences of social statuses is controversial because it may significantly influence the results of the analysis.
We propose a method to determine costs of substitution empirically, which we tested using three distinct sets of social science data. The training procedure that we present appears to be, to our knowledge, the only one that is exclusively empirically grounded and optimized. First, we considered the correlation between the substitution matrices for a given alphabet across three social science data sets representing the same social realities (sequences of occupational statuses along the life course) and three cost-setting strategies. The training procedure leads to results that are very similar to those stemming from the two other methods (substitution costs represented as an identity matrix or following some knowledge-based weighting). In this sense, cost variability did not appear to modify the general results of the analysis. Nevertheless, the costs stemming from the training procedure may claim a greater legitimacy, as they reflect the actual relationships of the symbols considered. That legitimacy is reinforced by the very high correlation between the substitution matrices stemming from the training procedures applied to the three data sets at hand. In this sense, the values of the trained cost matrices may even be considered an a posteriori validation of the alternative costs of substitution (knowledge based or identity) found in the literature for the specific case of occupational trajectories. Moreover, the training procedure shows some interesting features that should be further explored, such as the possibility of differentiating specific substitution costs according, for example, to gender. The fact that trained costs provide a clustering that is more closely associated with a sociologically unequivocal reference classification than either identity costs or knowledge-based costs illustrates the ability of the training procedure to discover structural features of the data that are sociologically relevant.
Second, based on likelihood ratio tests of multinomial logistic regressions, we compared the associations between cluster solutions (response variables) and a set of relevant sociostructural variables (intervening variables) for the three cost-setting strategies across the three data sets at hand. Here again, the training procedure led to better results than the identity and the knowledge-based costs did. That is, the data-driven costs of substitution contributed to classifications that fit better with widely recognized sociological models of women's labor market participation than the two other strategies. Taking into account the actual structure of the data provides models that fit better with external factors than undifferentiated or knowledge-based cost schemes. Finally, the ability of the training procedure to discover actual internal relationships in the data, and therefore to offer an efficient and empirically grounded way to determine costs of substitution, is demonstrated in another way: it is able to accurately identify the closeness of two formally identical, but artificially differentiated, substitution costs (here, between two occupational statuses). Moreover, the degree of closeness between the substitution costs is also informative about the relative proximity of the symbols and the sociological reality they represent. The training procedure offers significant improvements compared with the methods generally used until now in the social sciences. By revealing every symmetric relationship among those symbols, this procedure avoids assigning a cost based on prior knowledge that would later appear to be erroneous when compared with the actual data.
The results show that for any pair of symbols of a given alphabet, the trained costs of substitution produced remain remarkably similar from one data set to another. This means that those costs do reflect some important information concerning the actual (in this case, social) significance of the symbols constituting the sequences and do not represent just abstract values varying from data set to data set (or from one training session to another). Therefore, these costs also constitute a predictive feature, in the sense that two different symbols with low substitution costs can be predicted to substitute easily for one another in real life. Identification of these low substitution costs therefore makes it possible to predict situations likely to occur in similar contexts, at similar ages, and at similar frequencies. In comparison with approaches based on transition costs, which are computed within each single sequence taken separately, the proposed method aims to determine substitution costs by looking for a match or mismatch at each specific position throughout all pairs of sequences. In this sense, the latter method is based on richer information and grants a higher importance to time (i.e., to age and social age) and to the relations between sequences than cost schemes based on transition rates. There is, on one hand, a constant and clear similarity between the results stemming from the three cost-setting strategies (identity, knowledge based, and training) and, on the other hand, a significant improvement in the tests of internal and external validity of the results provided by the training procedure. The conditions under which the method is most appropriate remain to be systematically tested. The experiments presented in this article point in several directions.
First, the method provides strong leverage when no or few theoretical arguments can be brought to bear in support of a cost solution, or when contradictory theories propose different cost schemes. In other words, it is best suited for an exploratory research design. Second, this method is ideal whenever too many statuses have been used to characterize the data. We show, for instance, in this study that the proposed procedure reveals the identity between two statuses that may have been coded separately. Finally, the cost estimation provides a means for quantifying the relationships among symbols; as such, it can be used to identify and discover equivalences among categories. In itself, this means of quantification may prove to be a useful investigative tool for the social sciences. There are several limitations to the solution proposed in this article. First, the method deals poorly with symbols occurring rarely in sequences. Whenever this happens, the estimations of substitution costs are less accurate and more variable. Second, a key property of the optimal matching algorithm is its reliance on the assumption that the events defining a life trajectory are chronologically ordered and collinear among the considered sequences. This is, of course, a simplification, but it seems to hold reasonably well when considering sequences with a high percentage of identity. However, if recurrent subsequences were to be found scattered in different periods of life, they could probably be recovered using techniques related to the one that we describe here, such as Gibbs sampling (Lawrence et al. 1993; Abbott and Barman 1997) or the local alignment algorithm (Smith and Waterman 1981). Third, this algorithm, like other optimal matching algorithms, assumes the independence of each position constituting a sequence. This may be an oversimplification, as one can argue that life trajectories are not homogeneous.
They may be substructured into smaller units (life stages, transitions, turning points, specific life events, etc.), whose sizes may vary but should be kept intact in the alignments. This issue is likely to arise when comparing very distinct sequences. When this situation occurs, it may be worthwhile to modify the proposed algorithm. Nevertheless, the issue remains of automatically identifying meaningful borders defining those subsequences. In biology, multiple sequence alignments have been used successfully to identify the exact extent of subsequences conserved across related sequences (Notredame, Higgins, and Heringa 2000). It is certainly worthwhile to explore the potential of this method in the social sciences.

Notes

1. This freeware is available from the Ruhr-Universität Bochum Web site at http://steinhaus.stat.ruhr-uni-bochum.de/tda.html.
2. This freeware is available from the University of Chicago Web site at http://home.uchicago.edu/~aabbott/om.html.
3. This freeware is available from the Strasbourg Bioinformatics Platform Web site at ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX.
4. "Thus, while substitution must be carefully handled, it is not a supersensitive task whose errors will be compounded by later stages in the analysis" (Abbott and Hrycak 1990:176).
5. Student's t tests performed on the 10 values generated by the training procedures for each cost of substitution reveal that those values do not differ from the mean (p < .0001, df = 9).
6. Hotelling's T2 is a statistical measure of the multivariate distance of each observation from the center of the data set. This is an analytical way to find the most extreme points in the data.
7. This is the ratio between interclass variance and total variance.
8. This data set is for public use.
Access to the data is provided by the Swiss Household Panel (SHP) Web site at http://www.swisspanel.ch.
9. Following the availability of the data, the range considered is 16 to 65 years old for the Social Stratification, Cohesion, and Conflict in Contemporary Families and SHP data, and 20 to 56 years old for the Wisconsin Longitudinal Study data.
10. Spearman correlation coefficient = .734 (p < .01).
11. We used PROC CATMOD of the SAS software.
12. At p ≤ .05, the tested model is not statistically different from the saturated one.

References

Abbott, Andrew. 1984. "Event Sequence and Event Duration: Colligation and Measurement." Historical Methods 17:192-204.
Abbott, Andrew. 1990a. "Conceptions of Time and Events in Social Science Methods: Causal and Narrative Approaches." Historical Methods 23:140-50.
Abbott, Andrew. 1990b. "A Primer on Sequence Methods." Organization Science 1:375-92.
Abbott, Andrew. 1995a. "A Comment on 'Measuring the Agreement Between Sequences.'" Sociological Methods & Research 24:232-43.
Abbott, Andrew. 1995b. "Sequence Analysis: New Methods for Old Ideas." Annual Review of Sociology 21:93-113.
Abbott, Andrew. 2001. Time Matters: On Theory and Method. Chicago: University of Chicago Press.
Abbott, Andrew and Emily Barman. 1997. "Sequence Comparison Via Alignment and Gibbs Sampling: A Formal Analysis of the Emergence of the Modern Sociological Article." Sociological Methodology 27:47-87.
Abbott, Andrew and John Forrest. 1986. "Optimal Matching Methods for Historical Sequences." Journal of Interdisciplinary History XVI:471-94.
Abbott, Andrew and Alexandra Hrycak. 1990. "Measuring Resemblance in Sequence Data: An Optimal Matching Analysis of Musicians' Careers." American Journal of Sociology 96:144-85.
Abbott, Andrew and Angela Tsay. 2000.
"Sequence Analysis and Optimal Matching Methods in Sociology." Sociological Methods & Research 29:3-33.
Aisenbrey, Silke. 2000. Optimal Matching Analyse: Anwendungen in den Sozialwissenschaften (Optimal Matching Analysis: Applications in the Social Sciences). Opladen, Germany: Leske + Budrich.
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David Lipman. 1997. "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs." Nucleic Acids Research 25:3389-3402.
Bateman, Alex, Ewan Birney, Richard Durbin, Sean R. Eddy, Robert D. Finn, and Erik L. Sonnhammer. 1999. "Pfam 3.1: 1313 Multiple Alignments and Profile HMMs Match the Majority of Proteins." Nucleic Acids Research 27:260-62.
Billari, Francesco C. 2001. "Sequence Analysis in Demographic Research and Applications." Canadian Studies in Population 28:439-58.
Bird, Katherine and Helga Krüger. 2005. "The Secret of Transitions: The Interplay of Complexity and Reduction in Life Course Analysis." Pp. 173-94 in Towards an Interdisciplinary Perspective on the Life Course, vol. 10, edited by R. Levy, P. Ghisletta, J.-M. Le Goff, D. Spini, and E. Widmer. Amsterdam: Elsevier JAI.
Blair-Loy, Mary. 1999. "Career Patterns of Executive Women in Finance: An Optimal Matching Analysis." American Journal of Sociology 104:1346-97.
Blossfeld, Hans-Peter and Sonja Drobnic. 2001. Careers of Couples in Contemporary Society: From Male Breadwinner to Dual-Earner Families. New York: Oxford University Press.
Bock, Hans H. 1985. "On Some Significance Tests in Cluster Analysis." Journal of Classification 2:77-108.
Calinski, Tadeusz and Joachim Harabasz. 1974. "A Dendrite Method for Cluster Analysis." Communications in Statistics 3:1-27.
Chan, Tak Wing. 1995. "Optimal Matching Analysis: A Methodological Note on Studying Career Mobility." Work and Occupations 22:467-90.
Dayhoff, Margaret O., Robert M. Schwartz, and Bruce C. Orcutt. 1978.
"A Model of Evolutionary Change in Proteins." Pp. 345-52 in Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, edited by M. O. Dayhoff. Washington, DC: National Biomedical Research Foundation.
Dijkstra, Wil and Toon Taris. 1995. "Measuring the Agreement Between Sequences." Sociological Methods & Research 24:214-31.
Duda, Richard O. and Peter E. Hart. 1973. Pattern Classification and Scene Analysis. New York: John Wiley.
Durbin, Richard, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.
Elder, Glen H. 1985. Life Course Dynamics: Trajectories and Transitions, 1968-1980. Ithaca, NY: Cornell University Press.
Erzberger, Christian and Gerald Prein. 1997. "Optimal-Matching-Technik: Ein Analyseverfahren zur Vergleichbarkeit und Ordnung individuell differenter Lebensverläufe" (Optimal Matching Technique: An Analytical Procedure for Comparing and Ordering Individually Different Life Courses). ZUMA-Nachrichten 40:52-80.
Everitt, Brian S. 1979. "Unresolved Problems in Cluster Analysis." Biometrics 35:169-81.
Forrest, John and Andrew Abbott. 1989. "The Optimal Matching Method for Studying Anthropological Sequence Data: An Introduction and Reliability Analysis." Journal of Quantitative Anthropology 1:151-70.
Giddens, Anthony, Mitchell Duneier, and Richard P. Appelbaum. 2003. Introduction to Sociology. New York: W. W. Norton.
Giele, Janet Z. and Glen H. Elder. 1998. Methods of Life Course Research: Qualitative and Quantitative Approaches. Thousand Oaks, CA: Sage.
Giuffre, Katherine A. 1999. "Sandpiles of Opportunity: Success in the Art World." Social Forces 77:815-32.
Goodman, Leo A. and William H. Kruskal. 1979. Measures of Association for Cross Classifications. New York: Springer.
Graur, Dan and Wen-Hsiung Li. 2000. Fundamentals of Molecular Evolution. Sunderland, MA: Sinauer.
Halpin, Brendan and Tak Wing Chan. 1998. "Class Careers as Sequences: An Optimal Matching Analysis of Work-Life Histories." European Sociological Review 14:111-30.
Hartigan, John A. 1985. "Statistical Theory in Clustering." Journal of Classification 2:63-76.
Henikoff, Steven and Jorja G. Henikoff. 1992. "Amino Acid Substitution Matrices From Protein Blocks." Proceedings of the National Academy of Sciences 89:10915-19.
Höpflinger, François, Maria Charles, and Annelies Debrunner. 1991. Familienleben und Berufsarbeit (Family Life and Professional Work). Zurich, Switzerland: Seismo.
Hughey, Richard and Anders Krogh. 1996. "Hidden Markov Models for Sequence Analysis: Extension and Analysis of the Basic Method." Computer Applications in the Biosciences 12:95-107.
Kohli, Martin. 1986. "The World We Forgot: A Historical Review of the Life Course." Pp. 271-303 in Later Life: The Social Psychology of Aging, edited by V. W. Marshall. London: Sage.
Krüger, Helga and René Levy. 2001. "Linking Life Courses, Work and the Family: Theorizing a Not So Visible Nexus Between Women and Men." Canadian Journal of Sociology 26:145-66.
Kruskal, Joseph. 1983. "An Overview of Sequence Comparison." Pp. 1-44 in Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, edited by D. Sankoff and J. Kruskal. Toronto, Canada: Addison-Wesley.
Lawrence, Charles E., Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, and John C. Wootton. 1993. "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment." Science 262:208-14.
Levine, Joel H. 2000. "But What Have You Done for Us Lately?" Sociological Methods & Research 29:34-40.
Levitt, Barbara and Clifford Nass. 1989. ‘‘The Lid on the Garbage Can: Institutional Constraints on Decision Making in the Technical Core of College-Text Publishers.’’ Administrative Science Quarterly 34:190-207.
Lévy, René. 1977. Der Lebenslauf als Statusbiographie. Die weibliche Normalbiographie in makrosoziologischer Perspektive. [The life course as a sequence of statuses. The female standard biography in a macrosociological perspective]. Stuttgart, Germany: Enke.
Lévy, René, Jacques-Antoine Gauthier, and Eric Widmer. 2006. ‘‘Entre contraintes institutionnelle et domestique: Les parcours de vie masculins et féminins en Suisse.’’ [Between institutional and domestic constraints: the life courses of women and men in Switzerland] Revue canadienne de sociologie 31:461-89.
Lévy, René, Eric Widmer, and Jean Kellerhals. 2002. ‘‘Modern Family or Modernized Family Traditionalism? Master Status and the Gender Order in Switzerland.’’ Electronic Journal of Sociology 6(4).
Manning, Gerard, David B. Whyte, Ricardo Martinez, Tony Hunter, and Sucha Sudarsanam. 2002. ‘‘The Protein Kinase Complement of the Human Genome.’’ Science 298:1912-34.
Milligan, Glenn W. and Martha C. Cooper. 1985. ‘‘An Examination of Procedures for Determining the Number of Clusters in a Data Set.’’ Psychometrika 50:159-79.
Milligan, Glenn W. and Martha C. Cooper. 1987. ‘‘Methodology Review: Clustering Methods.’’ Applied Psychological Measurement 11:329-54.
Moen, Phyllis. 1985. ‘‘Continuities and Discontinuities in Women’s Labor Force Activity.’’ Pp. 113-55 in Life Course Dynamics: Trajectories and Transitions, 1968-1980, edited by G. H. Elder. Ithaca, NY: Cornell University Press.
Moen, Phyllis. 2003. It’s About Time: Couples and Careers. Ithaca, NY: Cornell University Press.
Moen, Phyllis and Yan Yu. 2000. ‘‘Effective Work/Life Strategies: Working Couples, Work Conditions, Gender, and Life Quality.’’ Social Problems 47:291-326.
Mott, Frank L. 1978. Women, Work and Family.
Lexington, MA: Lexington Books.
Müller, Tobias and Martin Vingron. 2000. ‘‘Modeling Amino Acid Replacement.’’ Journal of Computational Biology 7:761-76.
Myrdal, Alva and Viola Klein. 1956. Women’s Two Roles: Home and Work. London: Routledge.
Nargundkar, Satish and Timothy J. Olzer. 1998. ‘‘An Application of Cluster Analysis in the Financial Services Industry.’’ Presented at the sixth annual conference of the South East SAS Users Group, September 13-15, Norfolk, VA.
Needleman, Saul B. and Christian D. Wunsch. 1970. ‘‘A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.’’ Journal of Molecular Biology 48:443-53.
Ng, Pauline C., Jorja G. Henikoff, and Steven Henikoff. 2000. ‘‘PHAT: A Transmembrane-Specific Substitution Matrix. Predicted Hydrophobic and Transmembrane.’’ Bioinformatics 16:760-66.
Notredame, Cédric, Philipp Bucher, Jacques-Antoine Gauthier, and Eric Widmer. 2005. T-Coffee/saltt: User Guide and Reference Manual. Lausanne: Swiss Institute of Bioinformatics. Retrieved from http://www.tcoffee.org/saltt.
Notredame, Cédric, Desmond G. Higgins, and Jaap Heringa. 2000. ‘‘T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment.’’ Journal of Molecular Biology 302:205-17.
Notredame, Cédric, Liisa Holm, and Desmond G. Higgins. 1998. ‘‘Coffee: An Objective Function for Multiple Sequence Alignments.’’ Bioinformatics 14:407-22.
Olszak, Michael and Gilbert Ritschard. 1995. ‘‘The Behavior of Nominal and Ordinal Partial Association Measures.’’ Statistician 44:195-212.
Pentland, Brian T., Malu Roldan, Ahmed A. Shabana, Louise L. Soe, and Sidne G. Ward. 1998. ‘‘Lexical and Sequential Variety in Organizational Processes.’’ School of Labor and Industrial Relations, Michigan State University, East Lansing. Unpublished manuscript.
Punj, Girish and David W.
Stewart. 1983. ‘‘Cluster Analysis in Marketing Research: Review and Suggestions for Application.’’ Journal of Marketing Research 20:134-48.
Rapkin, Bruce D. and Douglas A. Luke. 1993. ‘‘Cluster Analysis in Community Research: Epistemology and Practice.’’ American Journal of Community Psychology 21:247-77.
Rohwer, Götz and Ulrich Pötter. 2002. TDA User’s Manual. Bochum, Germany: Ruhr-Universität Bochum. Retrieved from http://www.stat.ruhr-uni-bochum.de/pub/tda/doc/tman63/tman-pdf.zip.
Rohwer, Götz and Heike Trappe. 1997. ‘‘Describing Life Courses. An Illustration Based on NLSY Data.’’ Pp. 30 in POLIS Project Conference. Florence, Italy: European University Institute.
SAS Institute, Inc. 2004. SAS/STAT User’s Guide. Cary, NC: Author.
Schaeper, Hildegard. 1999. ‘‘Erwerbsverläufe von Ausbildungsabsolventinnen und -absolventen: Eine Anwendung der Optimal-Matching-Technik.’’ [Employment history of girls and boys after completion of vocational education and training: an application of optimal matching technique]. Sonderforschungsbereich 186, Universität Bremen, Germany.
Scherer, Stefani. 2001. ‘‘Early Career Patterns: A Comparison of Great Britain and West Germany.’’ European Sociological Review 17:119-44.
Sheridan, Jennifer T. 1997. ‘‘The Effects of the Determinants of Women’s Movement Into and Out of Male-Dominated Occupations on Occupational Sex Segregation.’’ CDE Working Paper 97-07, Department of Sociology, University of Wisconsin, Madison.
Smith, Temple F. and Michael S. Waterman. 1981. ‘‘Identification of Common Molecular Subsequences.’’ Journal of Molecular Biology 147:195-97.
Stovel, Katherine and Marc Bolan. 2004. ‘‘Residential Trajectories: Using Optimal Alignment to Reveal the Structure of Residential Mobility.’’ Sociological Methods & Research 32:559-98.
Stovel, Katherine, Michael Savage, and Peter Bearman. 1996. ‘‘Ascription Into Achievement: Models of Career Systems at Lloyds Bank, 1890-1970.’’ American Journal of Sociology 102:358-99.
Thompson, Julie, Desmond G. Higgins, and Toby Gibson. 1994. ‘‘Clustal W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice.’’ Nucleic Acids Research 22:4673-80.
Turner, Jonathan H. 2001. ‘‘Sociological Theory Today.’’ Pp. 1-17 in Handbook of Sociological Theory, edited by J. H. Turner. New York: Kluwer Academic.
Widmer, Eric, Jean Kellerhals, and René Lévy. 2003. Couples contemporains: Cohésion, régulation et conflits. [Contemporary couples: cohesion, regulation, conflicts] Zürich: Seismo.
Widmer, Eric, Jean Kellerhals, and René Lévy. 2004. ‘‘Quelle pluralisation des relations familiales?’’ [What pluralization of family relations?] Revue française de sociologie 45:37-67.
Widmer, Eric, René Lévy, Alexandre Pollien, Raphaël Hammer, and Jacques-Antoine Gauthier. 2003. ‘‘Entre standardisation, individualisation et sexuation: une analyse des trajectoires personnelles en Suisse’’ [Between standardization, individualization and gendering: an analysis of personal life courses in Switzerland] Revue suisse de sociologie 29:35-67.
Wilson, W. Clarke. 1998. ‘‘Activity Pattern Analysis by Means of Sequence-Alignment Methods.’’ Environment and Planning A 30:1017-38.
Wu, Lawrence L. 2000. ‘‘Some Comments on ‘Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect.’ ’’ Sociological Methods & Research 29:41-64.
Yu, Yi-Kuo and Stephen F. Altschul. 2005. ‘‘The Construction of Amino Acid Substitution Matrices for the Comparison of Proteins With Non-Standard Compositions.’’ Bioinformatics 21:902-11.

Jacques-Antoine Gauthier is a senior lecturer in sociology at the University of Lausanne and a member of the Center for Life Course and Lifestyle Studies (Pavie).
He has worked in the fields of health, addiction, and family sociology. His latest publications have appeared in the Canadian Journal of Sociology, European Journal of Operational Research, and the Swiss Journal of Sociology.

Eric D. Widmer is a professor of sociology at the University of Geneva, with an appointment at the Center for Life Course and Lifestyle Studies (Pavie). His long-term interests include life course research, family research, and social networks. His latest publications have appeared in the Journal of Social and Personal Relationships, European Sociological Review, and Journal of Marriage and Family.

Philipp Bucher is a group leader at the Swiss Institute for Experimental Cancer Research and a founding member of the Swiss Institute of Bioinformatics. His long-term interests include the development of algorithms for the analysis of molecular sequences and the application of these algorithms in various areas of biomedical research. His latest publications have appeared in PLoS Computational Biology and Nucleic Acids Research.

Cédric Notredame is a group leader at the Centre for Genomic Regulation in Barcelona (Spain) and a research investigator for the Centre National de la Recherche Scientifique (France). The focus of his work is the development and improvement of multiple sequence alignment algorithms. His latest publications have appeared in the Journal of Molecular Biology and Nucleic Acids Research. He is also the coauthor, with J. M. Claverie, of a popular introductory textbook in bioinformatics, Bioinformatics for Dummies (New York: Wiley, 2003).

BIOINFORMATICS Vol. 17 no. 0 2001 Pages 1–3

Mocca: semi-automatic method for domain hunting
Cédric Notredame
Information Génétique et Structurale, CNRS-UMR 1889, 31 Ch.
Joseph Aiguier, 13 402 Marseille, France
Received on Month xx, 2000; revised and accepted on Month xx, 2000

ABSTRACT
Motivation: Multiple OCCurrences Analysis (Mocca) is a new method for repeat extraction. It is based on the T-Coffee package (Notredame et al., JMB, 302, 205–217, 2000). Given a sequence or a set of sequences, and a library of local alignments, Mocca extracts every segment of sequence homologous to a pre-specified master. The implementation is meant for domain hunting and makes it fast and easy to test new boundaries or extend known repeats in an interactive manner. Mocca is designed to deal with highly divergent protein repeats (less than 30% amino acid identity) of more than 30 amino acids.
Availability: Mocca is available on request (cedric.notredame@gmail.com). The software is free of charge and comes along with complete documentation.

INTRODUCTION
Many proteins consist of separately evolved, independent structural units called modules or domains. The great diversity of protein functions is partly due to the vast number of possibilities to arrange a finite number of those basic units (Chothia, 1992). It is generally agreed that a domain is a self-folding unit made of a minimum of 25 amino acids (Bairoch et al., 1997; Corpet et al., 1998). Many of these domains appear as homologous subsequences repeated within a sequence or within a set of sequences, hence the importance of repeat identification in the course of domain hunting. Many tools exist for discovering and extracting these repeats; without being exhaustive, one can cite PSI-BLAST (Altschul et al., 1997), dot matrices (Junier and Pagni, 2000), Repro (Heringa and Argos, 1993) and the Gibbs sampler (Lawrence et al., 1993). More recently, Heger and Holm developed a method meant to scan databases for repeats without manual intervention (Heger and Holm, 2000). These automatic methods all share the same drawback: while none of them is 100% accurate, they give the user little scope for testing their own hypotheses in a seamless manner. Multiple OCCurrences Alignment (Mocca) addresses that specific problem. Given some approximate information concerning the whereabouts of one of the repeats (the master repeat), it allows the user to tune the parameters describing the repeat family (i.e. start position, length of the master repeat and stringency of the search) and to extract the other occurrences of that repeat within the dataset. The procedure is fast and simple.

© Oxford University Press 2001

METHODS
Mocca uses a pair-wise sequence alignment algorithm (Durbin et al., 1998). The cost associated with the alignment of each pair of residues uses the ‘library extension’ developed for T-Coffee (Notredame et al., 1998, 2000). Figure 1 outlines the strategy used to generate the T-Coffee scoring scheme. Firstly, a primary library is compiled; it contains a series of local alignments obtained using Lalign, an implementation of the Sim algorithm (Huang and Miller, 1991). Given two sequences, Lalign extracts the N top-scoring non-overlapping local alignments. We used a modified version that compares two sequences (or a sequence with itself) and extracts every top-scoring alignment longer than ten residues with an average level of identity higher than 30%. Lalign reports each alignment along with a score that indicates its statistical significance. In our primary library, such local alignments appear as a series of pairs of residues, where each pair receives a weight equal to the score of the alignment it comes from. Given a set of N sequences, the library contains the result of all the possible pair-wise comparisons (including the self-comparisons).
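The primary-library construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: a plain Smith-Waterman local aligner with identity scoring and a linear gap penalty stands in for Lalign, only the single best alignment per sequence pair is kept (Lalign reports the N best non-overlapping ones, and the real method also filters on length and percent identity and includes self-comparisons), and the raw alignment score stands in for Lalign's statistical significance score.

```python
# Toy primary library in the spirit of Mocca/T-Coffee: residue pairs
# taken from local alignments, each weighted by the score of the
# alignment they come from.  Simplifications are noted in the lead-in.

def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment of a and b; returns (score, residue pairs)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_ij = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_ij = H[i][j], (i, j)
    pairs, (i, j) = [], best_ij          # trace back from the best cell
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, list(reversed(pairs))

def primary_library(seqs):
    """Map (seq_x, pos_x, seq_y, pos_y) -> weight over all sequence pairs."""
    library = {}
    for x in range(len(seqs)):
        for y in range(x + 1, len(seqs)):
            score, pairs = smith_waterman(seqs[x], seqs[y])
            for i, j in pairs:
                library[(x, i, y, j)] = score
    return library
```

Each stored pair records that position i of sequence x aligned with position j of sequence y, with a weight expressing how much support that pairing has; the library extension then turns these pairwise constraints into the position-specific scores used during extraction.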
This library is fed into T-Coffee to generate the position-specific scoring scheme using the ‘library extension’ algorithm (Notredame et al., 2000). In Mocca, a pre-requisite to repeat extraction is the estimation of at least one basic repeat unit among the sequences being analysed (the master repeat). In the context of this work, we made the estimation using Dotlet, a Java-based dot matrix method (Junier and Pagni, 2000). The master repeat is a sub-string selected within the sequence(s) used to build the library. Mocca extracts every sub-string homologous to the master in a single pass over the target sequences. It is the library extension that makes it possible for a single repeat to ‘recognize’ each of its homologues (even the distant ones). The extraction process relies on a very efficient dynamic programming procedure known as repeated matches (Durbin et al., 1998). This algorithm reports a series of non-overlapping sub-strings, each of them having an alignment to the master associated with a score higher than some pre-specified threshold Th. Th is empirically set to be a function of the master repeat length L:

Th = S × L

where S has a value between 0 and 1. By default, S = 0.05, but its value can be modified interactively. Two other parameters can also be modified to increase sensitivity and accuracy: the gap opening penalty and the gap extension penalty.

Mocca is part of the T-Coffee package. It is written in Perl and ANSI C. It runs on any UNIX or Linux platform. It is available free of charge along with documentation. Copies can be obtained on request by sending an e-mail to cedric.notredame@gmail.com. The main computational requirement is the Lalign library, O(N2 L2); the motif extraction itself only requires little time (12 s on an IRIX O2 station for 20 sequences totalling 5000 residues). If the position of one of the repeats is known, the procedure can also be run automatically from the command line. It is recommended to use Mocca in conjunction with other means for the initial estimation of the repeat boundaries (PSI-BLAST, Altschul et al., 1997; Dotlet, Junier and Pagni, 2000; Dotter, Sonnhammer and Durbin, 1995; . . . ). Our tests show that Mocca can properly deal with sets of repeats whose multiple alignment indicates less than 15% average identity. While we currently use Lalign as a source of local information, any other sensible source could be considered. For instance, structural information could easily be added to our procedure, using off-the-shelf libraries of local structural similarities such as the Dali Domain Dictionary (Holm and Sander, 1998). The input format of Mocca is straightforward and well documented. Mocca is a refinement tool for the discovery and the establishment of new domains. If the master repeat is replaced with a profile or a collection of known characterized repeats, Mocca could also be used to improve the model of a given repeat family and extend the predictive power of its profiles.

Fig. 1. Layout of the Mocca strategy. The main steps required to extract a repeat with the Mocca method are shown. Square blocks designate procedures while rounded blocks indicate data structures.

ACKNOWLEDGEMENTS
The author wishes to thank the following people: Des Higgins for very helpful comments; Jaap Heringa, Philipp Bucher and Kay Hoffmann for useful discussions and advice at an early stage of the project; Hiroyuki Ogata for helpful comments on the program.

REFERENCES
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bairoch,A., Bucher,P. and Hofmann,K. (1997) The PROSITE database, its status in 1997. Nucleic Acids Res., 25, 217–221.
Chothia,C. (1992) Proteins: 1000 families for the molecular biologist. Nature, 357, 543–544.
Corpet,F., Gouzy,J. and Kahn,D. (1998) The ProDom database of protein domain families. Nucleic Acids Res., 26, 323–326.
Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge.
Heger,A. and Holm,L. (2000) Rapid automatic detection and alignment of repeats in protein sequences. Proteins, 41, 224–237.
Heringa,J. and Argos,P. (1993) A method to recognise distant repeats in protein sequences. Proteins: Struct. Funct. Genet., 17, 391–411.
Holm,L. and Sander,C. (1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96.
Huang,X. and Miller,W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math., 12, 337–357.
Junier,T. and Pagni,M. (2000) Dotlet: diagonal plots in a web browser. Bioinformatics, 16, 178–179.
Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S., Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214.
Notredame,C., Holm,L. and Higgins,D.G. (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14, 407–422.
Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel algorithm for multiple sequence alignment. JMB, 302, 205–217.
Sonnhammer,E.L. and Durbin,R. (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene, 167, GC1–10.

Optimization of ribosomal RNA profile alignments
E. A. O’Brien, C. Notredame and D. G. Higgins

Motivation: Large alignments of ribosomal RNA sequences are maintained at various sites. New sequences are added to these alignments using a combination of manual and automatic methods. We examine the use of profile alignment methods for rRNA alignment and try to optimize the choice of parameters and sequence weights.
Results: Using a large alignment of eukaryotic SSU rRNA sequences as a test case, we empirically compared the performance of various sequence weighting schemes over a range of gap penalties. We developed a new weighting scheme which gives most weight to the sequences in the profile that are most similar to the new sequence. We show that it gives the most accurate alignments when combined with a more traditional sequence weighting scheme.
Availability: The source code of all software is freely available by anonymous ftp from chah.ucc.ie in the directory /home/ftp/pub/emmet, in the compressed file PRNAA.tar.
Contact: emmet@chah.ucc.ie, des@chah.ucc.ie

Introduction

Ribosomal RNA sequences (rRNA) are widely used to estimate the phylogenetic relatedness of groups of organisms (e.g. Sogin et al., 1986; Pawlowski et al., 1996), especially that of the small subunit (SSU rRNA). The SSU rRNA has been sequenced from thousands of different species, and large alignments are maintained at several sites (Maidak et al., 1997; Van de Peer et al., 1997). The alignments are large and complex, and the addition of new sequences is a demanding task, either for the alignment curators or for individuals who wish to align new sequences with existing aligned sequences. In simple cases, automatic alignment programs such as Clustal W (Thompson et al., 1994a) may be used to align groups of closely related sequences or as a prelude to manual refinement. There may be large stretches of unambiguous alignment with high sequence identity which may be useful for phylogenetic purposes. The fully automated, accurate alignment of rRNA sequences remains a difficult problem, however. In principle, one can use profile alignment methods (Gribskov et al., 1987) which use dynamic programming algorithms (Needleman and Wunsch, 1970; Gotoh, 1982) to align a new sequence against an existing ‘expert’ alignment.
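The core of such a profile alignment can be sketched as follows. This is a deliberately simplified illustration, not the paper's implementation: each column is summarised by its residue frequencies (with gaps counted as a residue class), a match scores the frequency of the matched residue, and a flat linear gap penalty replaces Gotoh's affine, position-specific gap model; residues that fall between profile columns are simply skipped rather than opening a new column.

```python
# Minimal profile alignment: a reference alignment becomes per-column
# residue frequencies, and a new sequence is threaded onto the columns
# with Needleman-Wunsch-style dynamic programming.

def build_profile(alignment):
    """Per-column relative frequencies; gaps count as a residue class."""
    profile = []
    for c in range(len(alignment[0])):
        col = [row[c] for row in alignment]
        profile.append({r: col.count(r) / len(col) for r in set(col)})
    return profile

def align_to_profile(seq, profile, gap=-0.5):
    """Globally align seq to the profile; returns one char per column."""
    n, m = len(seq), len(profile)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = F[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0)
            F[i][j] = max(match, F[i - 1][j] + gap, F[i][j - 1] + gap)
    out, i, j = [], n, m                  # traceback
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                F[i][j] == F[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0)):
            out.append(seq[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            i -= 1                        # unplaced residue (simplification)
        else:
            out.append('-'); j -= 1       # gap against this column
    return ''.join(reversed(out))
```

Sequence weighting plugs into this scheme by replacing the raw counts in build_profile with weighted counts.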
For example, one could take an alignment of all SSU rRNA sequences from one of the rRNA collections and use it as a guide, aligning each new sequence in turn and treating the large alignment as a profile. This approach has the advantage of simplicity and speed, but the final accuracy may be limited by the lack of any ability to use secondary structure information. The RNALIGN approach (Corpet and Michot, 1994) and the stochastic context-free grammar approach (Eddy and Durbin, 1994; Sakakibara et al., 1994) provide elegant methods for the alignment of rRNA sequences taking both primary sequence and secondary structure into account. These methods, however, are very demanding in computer resources and cannot deal easily with pseudoknots, so their immediate application to the alignment of SSU rRNA sequences is not trivial. In this paper, we examine, empirically, the effectiveness of profile alignment methods for the alignment of RNA sequences. We remove test sequences from existing ‘expert’ alignments and measure the extent to which they can be realigned with the original alignment, automatically. We use the eukaryotic SSU rRNA sequences from Van de Peer et al. (1997) as a test case. For a range of test sequences, we measure the number of positions that can be correctly realigned over a range of different parameters (gap opening and gap extension penalties). Sequence weighting has been shown to increase the reliability of profile alignments using amino acid sequences (Thompson et al., 1994b). This can be used to give less weight to clusters of closely related sequences and increased weight to sequences with no close relatives, in order to counteract the effect of unequal sampling across a phylogenetic tree of possible sequences. We examine the effectiveness of one commonly used scheme (Thompson et al., 1994b).
We also propose a new weighting scheme which is designed to give increased weight to those sequences in the profile (reference alignment) which are closest (highest sequence identity) to the new sequence being aligned. If a new mammalian sequence is being aligned, for example, it makes most sense to give a high weight to other mammalian sequences and decreasing weights to sequences that are more and more distantly related. Some sections of SSU rRNA sequences are from regions whose secondary structure is conserved across many species. These conserved, ‘core’, regions are relatively easy to align with high accuracy but are interspersed with less conserved regions that may be very difficult to align. We empirically determine which regions of the eukaryotic reference alignment can be aligned with high accuracy by a simple jack-knife experiment. We remove each sequence, one at a time, and try to realign it with the rest. It is then a simple matter to count how often each nucleotide of each sequence is correctly realigned. This gives a definition of conserved core regions that is purely empirical and which can be used by users to delimit regions of alignment which can be safely used in phylogenetic research. Finally, we examine the effect of G+C content of each sequence on the accuracy of alignment. Sequences of high or low G+C may be expected to be more difficult to align than those with more balanced nucleotide compositions.

332 Oxford University Press
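The jack-knife bookkeeping described above reduces to comparing column indices. A minimal sketch, assuming both the reference row and the realigned row are gapped strings over the same underlying sequence (the realignment step itself is done elsewhere, e.g. by profile alignment):

```python
# A residue is "correctly realigned" when it occupies the same column
# as in the reference alignment; per-column tallies across all
# jack-knifed sequences expose the empirically alignable core regions.

def residue_columns(aligned_row):
    """Column index of each residue, in sequence order."""
    return [c for c, ch in enumerate(aligned_row) if ch != '-']

def correctly_placed(reference_row, realigned_row):
    """Columns where a residue sits in the same place in both rows."""
    ref = residue_columns(reference_row)
    new = residue_columns(realigned_row)
    assert len(ref) == len(new), "rows must carry the same sequence"
    return [r for r, n in zip(ref, new) if r == n]

def core_column_counts(row_pairs, ncol):
    """How often each column is correctly realigned across sequences."""
    counts = [0] * ncol
    for ref_row, new_row in row_pairs:
        for c in correctly_placed(ref_row, new_row):
            counts[c] += 1
    return counts
```

Columns with consistently high counts are candidate 'core' regions; the per-sequence accuracy is len(correctly_placed(...)) divided by the number of residues in the sequence.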
System and methods

Small subunit ribosomal RNA

An alignment of eukaryotic, nuclear SSU rRNA sequences (dated May 6, 1997) was obtained from the World Wide Web server at http://www-rrna.uia.ac.be/ssu/index.html (Van de Peer et al., 1997). After removal of the columns which consist only of gaps, the two incomplete sequences of Butomus umbellatus and the unaligned sequence of Babesia bovis, the alignment contains 1517 sequences and is 5370 characters long. Individual sequences vary widely in length, from <1300 nucleotides to >2500. Sixteen test sequences were removed from and realigned with the reference alignment in order to measure the accuracy with which it was possible to recreate their original alignment. The sequences used were Drosophila melanogaster, Xenopus laevis, Homo sapiens, Caenorhabditis elegans, Saccharomyces cerevisiae, Oryza sativa, Dictyostelium discoideum, Euglena gracilis, Ammonia beccarii, Physarum polycephalum, Entamoeba histolytica 1, Vahlkampfia lobospinosa, Giardia sp., Naegleria gruberi, Hexamita sp. and Trypanosoma brucei. These sequences were chosen based on a phylogenetic tree of all the sequences in the alignment, in order to give a spread of test cases over a wide range of different positions in the tree. Re-alignment was carried out over a range of gap penalties and using a number of sequence weighting schemes, as described below.

Dynamic programming

The reference alignment was converted into a profile (Gribskov et al., 1987) which contains information on the frequency of each residue and of gaps at each position. The test sequences were aligned with this using a dynamic programming algorithm (Needleman and Wunsch, 1970). We used Gotoh’s algorithm (Gotoh, 1982) and maximized the similarity between the sequence and the profile. A homogeneous column in the profile (just one of the four residues, with no gaps) will get a score of 1.0 when aligned with the same residue in the test sequence and a score of 0 otherwise. Other columns score in proportion to the frequency of each of the four residue types. In positions in the profile where one or more of the sequences has a gap, gaps were treated as a class of residue for frequency calculations. Other methods have been proposed for generating profiles using the natural logarithms of residue frequencies, which may be normalized by overall residue frequencies to give log-odds scores (see Henikoff and Henikoff, 1996 for a review). We carried out some tests using the latter scheme and found that performance was comparable, although slightly inferior to that using simple frequencies. Therefore we only present results obtained using the frequencies.

Gap penalties

A range of gap opening and extension penalties were used in alignment generation. For each test sequence and each weighting scheme, a total of 81 alignments were carried out. Gap opening penalties ranged from 1 to 9 in increments of 1, and gap extension penalties from 0.1 to 0.9 in increments of 0.1. This range of ratios between gap penalties and residue match scores was chosen as it encompasses values empirically shown to give alignments of biological relevance. Terminal gaps were penalized solely with an extension penalty. Position-specific gap opening penalties were derived from the frequency of gaps at each position along the alignment. At each position, a value equal to the number of residues (non-gap characters) in the column divided by the number of sequences in the alignment was derived. This value was then multiplied by the gap opening penalty, as taken from the range above, to give a specific gap opening penalty at each position. This gives gap opening penalties which are higher at positions mostly occupied by residues than at positions mostly occupied by gaps.

Sequence weighting

By default, each sequence in the existing alignment will have an equal effect on the alignment of new sequences with the profile. If additional information is available concerning the relationships of the sequences within the alignment to each other and to the sequence being aligned, this may not be optimal. For example, if a new sequence is identical to a sequence already in the alignment, the correctly aligned position of each residue in the new sequence could be deduced solely from that one identical sequence, and no information concerning the other sequences is necessary. Further, sampling bias can lead to an unequal representation of taxa within the alignment (e.g. there might be very many sequences from some taxa and very few from others), and it is possible to use sequence weighting to correct for this also. Three different weighting schemes were applied to the sequences in the SSU rRNA alignment, and compared with the default of equal weights.

The first weighting scheme, referred to as tree-based weights, is based on a phylogenetic tree of the sequences in the alignment. A neighbour-joining tree (Saitou and Nei, 1987) of all the sequences in the profile was generated using the DNADIST and NEIGHBOR programs of the PHYLIP package (Felsenstein, 1989). Weights were then derived from the branch lengths as described by Thompson et al. (1994b). These weights are then normalized to have a mean of 1.0. This gives a total weight for the profile equal to that where each sequence is weighted equally, which is necessary in order to keep the effects of changing gap penalties congruent across the different schemes. The general effect of these tree-based weights is to downweight sequences with many close relatives, in order to prevent the more densely populated regions of the tree exerting a disproportionate effect on the alignment of sequences from other regions of the tree.

The second weighting scheme is based on the level of similarity between the sequence being aligned and each individual sequence in the alignment, and is referred to as identity-based weighting. The new sequence is first aligned with the profile using equal weights. A distance is then calculated between the new sequence and each other sequence in the alignment, equal to the mean number of differences per site in this initial approximate alignment. This is the percent difference divided by 100, and there is no correction for multiple hits or unequal rates of transition and transversion. The reciprocal of this distance is used as a weight for each sequence, and these are again normalized to give a mean of 1.0. This weighting scheme has the effect of upweighting sequences more similar to the sequence being added relative to those that are more distantly related. The upweighting effect increases as the sequences become more similar to the sequence being aligned.

The third scheme is a combination of these weighting schemes, in which the weight derived for each sequence based on branch lengths is multiplied by the weight derived from sequence identities, and the values are again renormalized. This scheme is referred to as combination weights. Table 1 shows the values given by the various weighting schemes for the case shown in the example tree in Figure 1. The tree-based weights are independent of the new sequence that is to be added, being derived wholly from the structure of the existing data. Weights are calculated using the method of Thompson et al. (1994b) and then renormalized to give a mean of 1, leaving the values shown.

333 E.A.O’Brien, C.Notredame and D.G.Higgins

Fig. 1. Tree of the sequences that were used as test cases. The weights for these sequences under different weighting schemes are given in Table 1.
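The identity-based and combination weights described above are straightforward to compute. A minimal sketch under the paper's definitions (distance = mean differences per site over gap-free column pairs, weight = reciprocal distance, weights renormalised to a mean of 1.0); note that it does not guard against a zero distance, which would need special-casing when the new sequence is identical to a profile sequence:

```python
# Identity-based weights upweight profile sequences similar to the new
# sequence; combination weights multiply them with tree-based weights.

def fraction_different(row_a, row_b):
    """Mean differences per site where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != '-' and b != '-']
    return sum(a != b for a, b in pairs) / len(pairs)

def normalise(weights):
    """Rescale so the weights have a mean of 1.0 (total weight kept)."""
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]

def identity_weights(new_row, alignment):
    """Reciprocal-distance weight of each profile row vs the new sequence."""
    return normalise([1.0 / fraction_different(new_row, row)
                      for row in alignment])

def combination_weights(tree_weights, id_weights):
    """Product of tree-based and identity-based weights, renormalised."""
    return normalise([t * i for t, i in zip(tree_weights, id_weights)])
```

Because every scheme is renormalised to a mean weight of 1.0, the total weight of the profile, and hence the effective gap-penalty-to-match-score ratio, is the same under all of them.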
The identity-based weights are derived by taking the distance of each sequence in the tree from the new sequence, defined as the mean number of differences per aligned pair of residues, ignoring any pairs with a gap in either sequence. The reciprocals of these values are renormalized around 1 to give the figures shown. For the final set of combination weights, the product is taken of the weights in each of the preceding columns and again renormalized to give a mean of 1.

rRNA profiles

Table 1. The weights assigned to the sequences in the test tree shown in Figure 1 when the sequences Mus musculus and Plasmodium gallinaceae were added

                            (a)     (b)     (c)     (d)     (e)     (f)
Ammonia beccarii          1.000   0.746   0.273   1.256   0.379   0.991
Caenorhabditis elegans    1.000   0.974   0.289   1.008   0.522   1.038
Dictyostelium discoideum  1.000   0.875   0.250   1.049   0.406   0.968
Drosophila melanogaster   1.000   0.727   0.349   1.054   0.470   0.809
Entamoeba histolytica     1.000   1.194   0.225   0.984   0.500   1.241
Euglena gracilis          1.000   1.519   0.198   0.809   0.557   1.298
Giardia sp.               1.000   1.340   0.193   0.773   0.481   1.094
Hexamita sp.              1.000   1.266   0.206   0.854   0.484   1.141
Homo sapiens              1.000   0.411  10.628   1.053   8.088   0.456
Naegleria gruberi         1.000   1.212   0.204   0.942   0.459   1.205
Oryza sativa              1.000   0.511   0.390   1.235   0.370   0.667
Physarum polycephalum     1.000   1.435   0.205   0.856   0.547   1.298
Saccharomyces cerevisiae  1.000   0.516   0.377   1.302   0.361   0.708
Trypanosoma brucei        1.000   1.488   0.211   0.846   0.583   1.329
Vahlkampfia lobospinosa   1.000   1.398   0.196   0.889   0.508   1.313
Xenopus laevis            1.000   0.383   1.798   1.082   1.278   0.438

Columns represent the following schemes: (a) equal sequence weights; (b) tree-based sequence weights; (c) identity-derived weights for each sequence for the alignment of Mus musculus; (d) identity-derived sequence weights for each sequence for the alignment of Plasmodium gallinaceae; (e) combination of tree and identity-derived weights for Mus musculus; (f) combination of tree and identity-derived weights for Plasmodium gallinaceae.

For each of the three
defined weighting schemes and the default of equal weights, alignments were generated using position-specific gap-opening penalties across the range of gap extension penalties and base gap opening penalties described above. This procedure was repeated for each of the test sequences. The number of residues correctly placed in each alignment was determined by comparison with the sequence as originally aligned in the reference alignment; this was then divided by the total number of residues in the sequence to give a percentage score for the alignment. From the scores for the alignments across the range of gap opening and gap extension penalties for each test case, the gap penalties giving the best performance across all or most of the test cases were obtained.

Results

The performance of a set of weights was judged by its efficacy across the range of gap opening and gap extension penalties used. Both the peak score and the range of gap penalties giving a comparable score were taken into account in making this judgement (Table 2). For scoring purposes, each residue is counted as distinct, and is only considered correctly aligned if it is in the same position as the same residue in the reference sequence. The score for a sequence is the percentage of the total number of residues in the sequence that have been correctly realigned.

The main results are presented in Table 2. In the first column, the percentage alignment accuracy is given for each of the 16 test cases. These scores are the best obtained across the range of gap opening and extension penalties with no sequence weights. The scores are low, ranging from 43% (Euglena) up to 88% (Oryza). The addition of position-specific gap penalties has a dramatic effect: the scores all increase by about 10–15 percentage points, which represents several hundred additional correctly aligned residues per sequence.
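The generate-and-score sweep described in the procedure above can be sketched as follows. `realign` is a hypothetical stand-in for the real profile aligner (it just returns fixed column placements so the loop runs), while the scoring rule — the percentage of residues returned to their reference column — is the one defined in the text.

```python
def percent_correct(test_columns, reference_columns):
    """% of residues whose column index matches the reference placement."""
    correct = sum(1 for t, r in zip(test_columns, reference_columns) if t == r)
    return 100.0 * correct / len(reference_columns)

def realign(sequence, gap_open, gap_extend):
    # Placeholder: a real implementation would realign `sequence` to the
    # profile under the given penalties and return each residue's column.
    return [0, 1, 2, 4, 5]

# Column index of each residue of the test sequence in the reference.
reference_columns = [0, 1, 2, 4, 6]

# Sweep the penalty grid and keep the best-scoring parameter pair.
best = max(
    (percent_correct(realign("ACGUA", go, ge), reference_columns), go, ge)
    for go in [5.0, 6.0, 7.0]
    for ge in [0.1, 0.2]
)
```

With the real aligner plugged in, `best` would report the peak score together with the gap-opening and gap-extension penalties that produced it, which is how the optima in Table 4 are obtained.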
The use of sequence weights yields further improvements, although not as dramatic. It should be noted that an improvement in score of just 1% is the equivalent of 20 residues in a molecule of 2000 nucleotides. We only give the peak scores from across the full range of gap opening and extension penalties. These were all obtained with a gap opening penalty of between 5.0 and 7.0 and a gap extension penalty of either 0.1 or 0.2.

Implementation

Programs were developed and/or run on DEC Alpha workstations running DEC UNIX. All new code was written in the C programming language and is freely available by anonymous FTP (log in as anonymous to chah.ucc.ie and transfer the compressed tar archive PRNAA.tar). The code is not designed for portability, and users will have to download their own rRNA alignments and build their own profiles; a Java version of the programs is being developed which will be used to provide future access to all the methods via the Internet.

Table 2. The highest percentage identity between the reference alignment and the realigned sequence obtained using each of the weighting schemes

                   (a)      (b)      (c)      (d)      (e)
A.beccarii        71.65   84.19*   83.66    84.05    83.96
C.elegans         69.26   83.98    83.98    86.99    87.84*
D.discoideum      64.42   78.95    78.95    79.59*   79.06
D.melanogaster    70.14   82.72    82.97    81.11    84.02*
E.histolytica     55.68   73.50    74.83    75.04    78.17*
E.gracilis        43.12   60.22    60.22    60.22    61.08*
Giardia sp.       55.00   73.89    73.96    76.81    77.29*
Hexamita sp.      56.13   73.10    73.61    78.39*   77.16
H.sapiens         79.88   91.01    92.88*   91.49    92.30
N.gruberi         50.37   63.60    63.74    67.81    67.86*
O.sativa          88.85   97.08    97.13    96.69    97.35*
P.polycephalum    53.62   65.02    64.66    68.64*   67.52
S.cerevisiae      86.71   93.94    94.55*   93.38    94.10
T.brucei          47.62   62.86    63.39    64.77    65.04*
V.lobospinosa     46.23   56.20    55.69    56.20    58.96*
X.laevis          82.47   93.59    95.18*   94.25    95.07

(a) Fixed gap penalties and equal sequence weights; (b) position-specific gap penalties and equal sequence weights; (c) position-specific gap penalties and identity-based weights; (d) position-specific gap penalties and tree-based weights; (e) position-specific gap penalties and combination weights. Values marked with an asterisk (underlined in the original) are the absolute maximum scores obtained for each sequence.

Table 3. Alignment percentage accuracy scores for various weighting schemes and gap penalties. In each panel, rows give the gap opening penalty (1–9) and columns give the gap extension penalty (0.1–0.9); Trypanosoma brucei and Vahlkampfia lobospinosa are shown separately.

(a) Trypanosoma brucei
1: 46 32 16 13  4  2  2  1  1
2: 47 32 17 12  4  2  2  1  1
3: 47 33 17 12  5  2  2  1  1
4: 47 33 17 12  5  2  2  1  1
5: 47 33 16 12  5  2  2  1  1
6: 47 33 16 12  5  2  2  1  1
7: 47 33 16 12  5  2  2  1  1
8: 47 33 16 12  5  2  2  1  1
9: 47 33 16 12  5  2  2  1  1

(a) Vahlkampfia lobospinosa
1: 45 31 17 10  6  5  4  3  3
2: 46 33 16 12  6  5  4  3  3
3: 46 32 16 12  6  5  4  3  3
4: 46 32 16 12  6  5  4  3  3
5: 46 32 16 12  6  5  4  3  3
6: 46 32 16 12  6  5  4  3  3
7: 46 32 16 12  6  5  4  3  3
8: 46 32 16 12  6  5  4  3  3
9: 46 32 16 12  6  5  4  3  3

(b) Trypanosoma brucei
1: 58 58 59 58 58 58 58 57 57
2: 59 59 60 59 59 59 60 60 60
3: 60 61 62 60 60 61 61 61 61
4: 62 62 63 61 61 62 62 62 62
5: 62 62 63 63 63 63 62 62 61
6: 61 62 63 63 63 63 62 62 61
7: 61 62 63 63 63 63 62 62 61
8: 61 62 63 63 63 63 62 62 61
9: 61 62 63 63 63 63 62 62 61

(b) Vahlkampfia lobospinosa
1: 51 50 51 51 51 51 52 51 51
2: 53 53 53 54 53 53 53 54 54
3: 55 54 55 54 54 54 54 54 54
4: 56 55 55 55 55 55 55 55 55
5: 56 55 55 55 55 55 55 55 55
6: 56 55 55 55 55 55 55 55 55
7: 56 55 55 55 55 55 55 55 55
8: 56 55 55 55 55 55 55 55 55
9: 56 55 55 55 55 55 55 55 55

(c) Trypanosoma brucei
1: 58 59 59 58 58 58 58 57 57
2: 59 59 60 58 59 59 60 60 60
3: 60 61 62 60 60 61 61 61 61
4: 62 62 63 61 61 61 62 62 62
5: 62 62 63 63 63 63 62 62 62
6: 61 62 63 63 63 63 62 62 62
7: 61 62 62 62 62 63 62 61 61
8: 61 62 62 62 62 63 62 61 61
9: 61 62 62 62 62 63 62 61 61

(c) Vahlkampfia lobospinosa
1: 51 50 51 51 51 51 52 51 51
2: 53 53 53 54 54 53 53 54 54
3: 55 54 54 54 54 54 54 54 54
4: 56 55 55 55 55 55 55 55 55
5: 56 55 55 55 55 55 55 55 55
6: 56 55 55 55 55 55 55 55 55
7: 56 55 55 55 55 55 55 55 55
8: 56 55 55 55 55 55 55 55 55
9: 56 55 55 55 55 55 55 55 55

(d) Trypanosoma brucei
1: 62 61 60 61 60 59 59 59 59
2: 63 63 63 62 61 61 61 61 61
3: 65 64 63 63 62 62 61 61 61
4: 65 64 64 64 64 63 62 62 61
5: 64 64 64 64 64 63 61 61 61
6: 64 64 64 64 64 63 61 61 61
7: 64 64 64 64 64 63 61 61 61
8: 64 64 64 64 64 63 61 61 61
9: 64 64 64 64 64 63 61 61 61

(d) Vahlkampfia lobospinosa
1: 51 50 51 51 51 51 52 51 51
2: 53 53 53 54 53 53 53 54 54
3: 55 54 55 55 54 54 54 54 54
4: 56 55 55 55 55 55 55 55 55
5: 56 55 55 55 55 55 55 55 55
6: 56 55 55 55 55 55 55 55 55
7: 56 55 55 55 55 55 55 55 55
8: 56 55 55 55 55 55 55 55 55
9: 56 55 55 55 55 55 55 55 55

(e) Trypanosoma brucei
1: 62 60 61 60 60 59 59 59 59
2: 64 62 62 62 60 61 61 61 62
3: 64 63 63 62 62 63 62 62 62
4: 65 64 64 64 63 63 63 62 62
5: 65 64 64 64 64 63 63 62 62
6: 65 64 64 64 64 62 63 62 62
7: 65 64 64 64 64 62 63 62 62
8: 65 63 64 64 64 63 63 62 62
9: 65 63 64 64 64 63 63 62 62

(e) Vahlkampfia lobospinosa
1: 56 56 55 55 55 55 55 55 55
2: 58 57 57 57 57 57 58 57 57
3: 58 58 57 57 58 57 58 57 57
4: 58 58 57 57 58 57 58 57 57
5: 58 58 57 57 58 57 58 57 57
6: 58 58 57 57 58 57 58 57 57
7: 58 58 57 57 58 57 58 57 57
8: 58 58 57 57 58 57 58 57 57
9: 58 58 57 57 58 57 58 57 57

Italics in the original represent those regions at or above the highest score attainable with equal sequence weights; underlining represents the highest score attained across all the different parameters. Parameter sets are: (a) fixed gap penalties and equal sequence weights; (b) position-specific gap penalties and equal sequence weights; (c) position-specific gap penalties and identity-based sequence weights; (d) position-specific gap penalties and tree-derived sequence weights; (e) position-specific gap penalties and weights derived from the combination of tree-based and identity-based weights.

In nine out of the 16 test cases, the single best alignment score generated across the ranges of gap penalties was obtained using the combined weights (column (e) of Table 2). In three of the remaining cases, tree-based weights give the best performance (column (d)). The identity-based weights give the highest score in three cases (column (c)), and Ammonia beccarii is aligned most accurately with equal weights. Both identity-based and tree-based sequence weighting are shown to improve over equal weights in most cases, with the combination of the two weights giving the best overall performance.

Two examples are shown in detail in Table 3. Here the scores for all values of the gap opening and gap extension penalties are given for each weighting scheme for just two of the test cases: Vahlkampfia lobospinosa and Trypanosoma brucei. In both cases, the results with uniform gap penalties, shown in panel (a), are very poor and depend strongly on the exact value of the parameters. There is a huge improvement in panel (b), where the values for position-specific gap penalties are shown.
Here, the values are much higher than in panel (a) and there is almost no dependence on the exact values chosen for the gap penalties. In the case of Vahlkampfia there is no noticeable difference between the use of tree-based or identity-based weights [the results are shown in panels (c), (d) and (b)]. Use of the combined weighting scheme, shown in panel (e), gives a consistent improvement of 2% across the entire range of gap penalties. In the case of Trypanosoma the relative performance of each weighting scheme is more distinct. Comparing identity weights to equal weights in this case, there is improvement for some values of gap penalty. The effect of using tree-based weights is to produce improvement across a larger range of gap penalties, particularly for gap extension penalties <0.3. The combination of the two weighting schemes again shows a synergistic effect, with a further increase visible across the range of gap penalties.

The values of the gap opening and gap extension penalties giving the maximum scores for each test case are given in Table 4. These are the optimum parameters when using the combined weighting scheme with position-specific gap penalties. They all fall in a very narrow range.

Table 4. Gap opening and extension penalties giving optimum alignment scores for each test case using combined weights

                 Gap opening   Gap extension
A.beccarii           6.0           0.2
C.elegans            6.0           0.1
D.discoideum         6.0           0.1
D.melanogaster       5.0           0.2
E.histolytica        6.0           0.1
E.gracilis           6.0           0.1
Giardia sp.          7.0           0.1
Hexamita sp.         5.0           0.2
H.sapiens            6.0           0.1
N.gruberii           6.0           0.1
O.sativa             6.0           0.1
P.polycephalum       6.0           0.2
S.cerevisiae         6.0           0.2
T.brucei             6.0           0.1
V.lobospinosa        6.0           0.1
X.laevis             6.0           0.2

In order to tell which sections of the reference alignment may be reliably aligned, each of the 1517 sequences in turn was removed from the alignment and realigned with the remaining sequences. Each column of the original reference alignment was scored according to the percentage of its residues that can be realigned in the correct positions. Figure 2 shows the estimated secondary structure of the Saccharomyces cerevisiae nuclear SSU rRNA, with those positions from the full alignment which can be realigned with ≥95% accuracy marked in black and those which realign with <95% accuracy in grey. Stems forming pseudoknots are not displayed in this representation. This is a conservative estimate of the regions that may be reliably aligned, as there are some positions that are not found in this molecule, and sequences from some taxonomic groupings may be aligned almost perfectly.

Fig. 2. Secondary structure of Saccharomyces cerevisiae SSU rRNA with stable regions indicated in black, generated using the ESSA program (Chetouani et al., 1997).

Figure 3 shows the accuracy with which each sequence can be realigned, compared to its original alignment, as a function of G+C content. The realignment accuracy is greatest for sequences with average G+C contents (~50%). As expected, sequences with extreme nucleotide compositions (very high or very low G+C content) tend to be less easy to align accurately. High levels of a particular nucleotide increase the chance that a residue in the sequence being aligned may align with the wrong column in the profile. The test cases cover a range of G+C content from 38.4% (Entamoeba histolytica) to 68.5% (Giardia sp.).

Fig. 3. Graph of percentage of sequence correctly re-aligned against G+C content for each of the 1517 sequences in the reference alignment.

Discussion

The generation of alignments under various parameters shows that position-specific gap-opening penalties have a very strong positive effect on the accuracy with which alignments can be generated. Fixed gap penalties perform extremely poorly, particularly at high values of the gap extension penalty. This corresponds to situations in which the long gaps that occur in certain regions of virtually all sequences, corresponding to long insertions in a few sequences, are penalized so heavily that they do not occur in the optimum-scoring alignment. Experimentation with position-specific gap extension penalties did not give any further improvement.

Sequence weighting can have a further positive effect on alignment quality. Both weighting schemes based on sequence identity and those based on the tree structure and branch lengths are seen to have generally positive effects. As expected, the tree-based weights perform at their best for sequences which are quite distant from the main taxa, with few or no close relatives, such as Hexamita, and are of least benefit to alignment quality for sequences which have many close relatives, such as O.sativa. With identity-based weights the greatest positive effects are seen in sequences within highly represented taxa, such as S.cerevisiae. These two weighting schemes have opposing effects on the values of the sequence weights in the case of sequences aligning into densely populated regions of the tree, and so the net result of combining them, in cases such as S.cerevisiae, may not perform any better than either of the weighting schemes used individually. The examples given (Table 3) indicate that there are cases where tree-based and identity-based weights show a synergistic effect when combined, the combination outperforming either of the schemes applied individually. The combined weights give the best result in more than half of the test cases, and the average difference between the score generated with the combined weights and the overall best score is substantially less than the corresponding difference for any of the other weighting schemes. This synergy occurs most strongly in sequences which are distant from the main bulk of the alignment and therefore more difficult to align correctly. Those which are located in highly represented taxa do not show such strong effects from any of the weighting schemes, but these tend to be the sequences which have the best alignments initially.

We have shown how to improve the accuracy of alignment of rRNA sequences using some simple methods. It is quite possible that alignments of 100% accuracy will not be achievable, owing to errors introduced manually into the reference alignment. Nonetheless, we can already see that some sequences may be aligned with >95% accuracy (Oryza and Xenopus), and across the entirety of the alignment 89.84% of all residues can be realigned correctly. Some sequences are still disappointing, and this can partly be explained by very biased G+C content (e.g. Giardia). Others come from poorly sampled parts of the overall eukaryote phylogenetic tree, and these will become easier to align as new sequences are added. Nonetheless, it may be difficult for users to evaluate the quality of a new alignment. We provide one extremely simple method for choosing regions of the overall alignment that can be reliably aligned in almost all cases. This covers about half of the positions in any given molecule and provides a selection of sites which can be reliably chosen for phylogenetic purposes. This site selection can be fine-tuned by looking at regions which may be reliably aligned in specific taxa. Finally, it is very obvious that these methods could benefit from some consideration of secondary structure, which could be used for evaluation of alignments or as part of the alignment process. We are investigating the use of genetic algorithms to optimize the quality of profile alignments where secondary structure is considered (Notredame et al., 1997).
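The leave-one-out reliability screen described earlier can be sketched as follows. The toy alignment and the "realignment" results are invented, and column membership is compared by character rather than by tracked residue indices — a simplification of the real bookkeeping.

```python
def column_reliability(ref_alignment, realigned):
    """For each column, % of non-gap residues realigned to that column."""
    ncols = len(next(iter(ref_alignment.values())))
    scores = []
    for c in range(ncols):
        total = correct = 0
        for name, row in ref_alignment.items():
            if row[c] == "-":            # gap in the reference: not scored
                continue
            total += 1
            if realigned[name][c] == row[c]:
                correct += 1
        scores.append(100.0 * correct / total if total else None)
    return scores

ref = {"s1": "AC-GU", "s2": "ACAGU", "s3": "AC-GU"}
# Hypothetical result of removing and realigning each sequence in turn.
out = {"s1": "AC-GU", "s2": "AC-AG", "s3": "AC-GU"}

scores = column_reliability(ref, out)
# Columns realigned with >= 95% accuracy would be marked black in Fig. 2.
reliable = [c for c, s in enumerate(scores) if s is not None and s >= 95.0]
```

Applied to the full 1517-sequence alignment, the `reliable` set corresponds to the conservatively alignable positions used for phylogenetic site selection.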
We will use a genetic algorithm to optimize the quality function of Corpet and Michot (1994), but based on profiles rather than pairs of sequences.

Acknowledgements

The authors thank Richard Durbin for suggesting the use of the 1/d weights. We also thank Manolo Gouy for his help with rRNA sequences in general. This work was supported by a grant (BIO4-CT95-0130) from the EU Biotechnology programme.

References

Chetouani,F., Monestie,P., Thebault,P., Gaspin,C. and Michot,B. (1997) ESSA: an integrated and interactive computer tool for analysing RNA secondary structure. Nucleic Acids Res., 25, 3514–3522.
Corpet,F. and Michot,B. (1994) RNAlign program: alignment of RNA sequences using both primary and secondary structures. Comput. Applic. Biosci., 10, 389–399.
Eddy,S. and Durbin,R. (1994) RNA sequence analysis using covariance models. Nucleic Acids Res., 22, 2079–2088.
Felsenstein,J. (1989) Cladistics, 5, 164–166.
Gotoh,O. (1982) J. Mol. Biol., 162, 705.
Gotoh,O. (1995) A weighting system and algorithm for aligning many phylogenetically related sequences. Comput. Applic. Biosci., 11, 543–551.
Gribskov,M., McLachlan,A. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Henikoff,J. and Henikoff,S. (1996) Using substitution probabilities to improve position-specific scoring matrices. Comput. Applic. Biosci., 12, 135–143.
Luthy,R., Xenarios,I. and Bucher,P. (1994) Improving the sensitivity of the sequence profile method. Protein Sci., 3, 139–146.
Maidak,B., Olsen,G., Larsen,N., Overbeek,R., McCaughey,M. and Woese,C. (1997) The Ribosomal Database Project (RDP). Nucleic Acids Res., 25, 109–111.
Needleman,S. and Wunsch,C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Neefs,J.-M., Van de Peer,Y., Hendriks,L. and De Wachter,R. (1990) Database on the structure of small subunit ribosomal RNA.
Nucleic Acids Res., 18, 2237–2317.
Notredame,C., O'Brien,E.A. and Higgins,D.G. (1997) RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res., 25, 4570–4580.
Pawlowski,J., Bolivar,I., Fahrni,J.F., Cavalier-Smith,T. and Gouy,M. (1996) Early origin of Foraminifera suggested by SSU rRNA gene sequences. Mol. Biol. Evol., 13, 445–450.
Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.
Sakakibara,Y., Brown,M., Hughey,R., Mian,I.S., Sjolander,K., Underwood,R.C. and Haussler,D. (1994) Stochastic context-free grammars for tRNA modelling. Nucleic Acids Res., 22, 5112–5120.
Sogin,M., Elwood,H. and Gunderson,J. (1986) Evolutionary diversity of eukaryotic small-subunit rRNA genes. Proc. Natl Acad. Sci. USA, 83, 1383–1387.
Thompson,J., Higgins,D. and Gibson,T. (1994a) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Thompson,J., Higgins,D. and Gibson,T. (1994b) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Applic. Biosci., 10, 19–29.
Van de Peer,Y., Jansen,J., De Rijk,P. and De Wachter,R. (1997) Database on the structure of small ribosomal subunit RNA. Nucleic Acids Res., 25, 111–116.

REVIEW

Recent progresses in multiple sequence alignment: a survey

Cédric Notredame
Information Génétique et Structurale, UMR 1889, 31 Chemin Joseph Aiguier, 13006 Marseille, France
Tel.: +33 (0)4 911 646 06
Fax: +33 (0)4 911 645 49
E-mail: cedric.notredame@igs.cnrs-mrs.fr

The assembly of a multiple sequence alignment (MSA) has become one of the most common tasks when dealing with sequence analysis. Unfortunately, the wide range of available methods and the differences in the results given by these methods make it hard for a non-specialist to decide which program is best suited for a given purpose.
In this review we briefly describe existing techniques and expose the potential strengths and weaknesses of the most widely used multiple alignment packages.

Introduction

Sequence alignment is by far the most common task in bioinformatics. Procedures relying on sequence comparison are diverse and range from database searches [1] to secondary structure prediction [2]. Sequences can be compared two by two to scour databases for homologues, or they can be multiply aligned to visualize the effect of evolution across a whole protein family. In this study we will focus on the latter methods, dedicated to the global simultaneous comparison of more than two sequences. Special emphasis will be given to the most recently described techniques.

The many uses of MSAs

Multiple alignments constitute an extremely powerful means of revealing the constraints imposed by structure and function on the evolution of a protein family. They make it possible to ask a wide range of important biological questions, and their main uses will each be discussed in turn.

Phylogenetic analyses

Phylogenetic trees are instrumental in elucidating the evolutionary relationships that exist among various organisms. Nowadays, highly accurate phylogenetic trees rely on molecular data. Their computation typically involves four steps:

• collection of a set of orthologous sequences in a database
• multiple alignment of the sequences
• measurement of pair-wise phylogenetic distances on the multiple alignment and computation of a distance matrix
• computation of the tree by applying a clustering algorithm [3] to the distance matrix

As an alternative to the last two steps, the tree may also be computed using maximum likelihood [4]. In both cases, the role of multiple alignment is to provide a very accurate estimation of pair-wise distances and to make it possible to estimate the reliability of each branch by bootstrapping [5].

Identification of conserved motifs and domains

MSAs make it possible to identify motifs preserved by evolution that play an important role in the structure and function of a group of related proteins. Within a multiple alignment, these elements often appear as columns with a lower level of variation than their surroundings. When coupled with experimental data, these motifs constitute a very powerful means of characterizing sequences of unknown function. Important databases like PROSITE [6] or PRINTS [7] rely on this principle. When a motif is too subtle to be defined with a standard pattern, one may use another type of descriptor known as a profile [8] or a hidden Markov model (HMM) [9]. These are meant to exhaustively summarize (column by column) the properties of a protein family or a domain. Profiles and HMMs make it possible to identify very distant members of a protein family when searching a database. Their sensitivity and specificity are much higher than those provided by a single sequence or a pattern. In practice, one can derive a profile from a multiple alignment using packages such as the PFTOOLS [10], use pre-established collections like Pfam [11], or compute profiles on the fly with PSI-BLAST [12], the position-specific version of BLAST. The specificity and sensitivity of a profile are tightly correlated to the biological quality of the multiple alignment it was derived from.

Structure prediction

Pharmacogenomics (2002) 3 (1) © 2001 Ashley Publications Ltd ISSN 1462-2416 www.ashley-pub.com

Structure prediction is another important use of multiple alignments. Secondary and tertiary structure prediction aim at predicting the role a residue plays in a protein structure (buried or exposed, helix or strand etc.).
Secondary structure predictions based on a single sequence yield a low accuracy (in the order of 60%) [13], while predictions based on an MSA reach much higher accuracy (in the order of 75%) [2,14,15]. The rationale behind such improvements is that the pattern of substitutions observed in a column directly reflects the type of constraints imposed on that position in the course of evolution. In the context of tertiary structure determination, or when predicting non-local contacts, multiple alignments can also help to identify correlated mutations. This approach has given only limited results when applied to proteins [16]; it has been much more successful in RNA analysis, where it allows highly accurate predictions [17] well confirmed by structural analysis.

Altogether, these very important applications explain the amount of attention dedicated to the MSA problem, and any biologist should be aware that very few bioinformatics protocols bypass the multiple alignment stage. Unfortunately, available tools are only heuristics providing an approximate solution to a problem that remains largely open. These many heuristics are based on different paradigms, each well suited to a limited range of situations.

A complicated problem

MSA is a complicated problem. It stands at the crossroads of three distinct technical difficulties:

• the choice of the sequences
• the choice of an objective function (i.e., a comparison model)
• the optimization of that function

Altogether, properly solving these three problems would require an understanding of statistics, biology and computer science that lies far beyond our grasp.

The choice of the sequences

The methods reviewed here (i.e., global MSA methods) only make sense if they are assumed to be dealing with a set of homologous sequences, i.e., sequences sharing a common ancestor. Furthermore, with the exception of DiAlign [18], global methods require the sequences to be related over their whole length (or at least most of it). When that condition is not met, one should consider the use of local MSA methods such as the Gibbs sampler [19], Match-Box [20] or MACAW [21]. In any case, one should always be aware that, given inappropriate sequences, most multiple alignment routines will nonetheless produce an alignment. It will be the responsibility of the biologist to realize that this alignment is meaningless. This is not an easy task, and a few years ago Henikoff reviewed a series of problems that can occur when one forces multiple alignments on unrelated sequences [22].

In order to recruit a set of homologous sequences, it is common practice to use one of the BLAST programs (WU-BLAST, PSI-BLAST, GAPPED BLAST etc.) [12] for searching within a database for all the sequences similar to some query sequence. When doing so, an observed similarity is considered good when it is unlikely to arise by chance (given the database and the amino-acid frequencies). To make this estimation, BLAST uses powerful statistical models developed by Altschul and Karlin [23]. Of course, these statistical models merely approximate the biological reality, and homology may be misrepresented by similarity, leading to the incorporation of improper sequences within a multiple alignment.

The choice of an objective function

This is purely a biological problem that lies in the definition of correctness. What should a biologically correct alignment look like? Can we define its expected properties, and will we recognize it when we see it? These intricate questions can only be answered by means of a mathematical function able to measure an alignment's biological quality. We name this function an objective function (OF) because it defines the mathematical objective of the search. Given a perfect function, the mathematically optimal alignment would also be biologically optimal. Yet this is rarely the case: while the function defines a mathematical optimum, we rarely have an argument that this optimum will also be biologically optimal.
Defining a proper objective function is a highly non-trivial task and an active research field in its own right. In theory, an OF should incorporate everything that is known about the sequences, including their structure, function and evolutionary history. This information is rarely at hand and is hard to use, so it is usually replaced with sequence similarity. Thus, a very simple general function is often used: the weighted sums-of-pairs with affine gap penalties [24]. Under this model, each sequence receives a weight proportional to the amount of independent information it contains [25], and the cost of the multiple alignment is equal to the sum of the costs of all the weighted pair-wise substitutions. The substitution costs are evaluated using a predefined evolutionary model known as a substitution matrix [26], in which a score is assigned to every possible substitution or conservation according to its biological likelihood (i.e., mutations observed more rarely than would be expected by chance receive a negative score, while those observed more often receive a positive score). Insertions and deletions are scored using affine gap penalties, which penalize a gap once for opening and then proportionally to its length. This penalty scheme is a major source of concern because it requires two parameters:

• the gap opening penalty
• the gap extension penalty

whose adequate values can only be set empirically and may vary from one set of sequences to the next [27]. Although this function is clearly wrong from an evolutionary point of view [24], because it assumes every sequence within the set to be an ancestor of every other sequence, the ease of its implementation has made it popular with the most widely used MSA packages [28-30].
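A minimal version of this weighted sums-of-pairs score with affine gap penalties might look as follows. The two-valued match/mismatch table stands in for a real substitution matrix such as BLOSUM62, the gap costs are arbitrary, and the alignment and weights are invented for the example.

```python
def pair_score(a, b, subst, gap_open=-10.0, gap_extend=-1.0):
    """Score one aligned pair of rows with affine gap penalties."""
    score, in_gap = 0.0, False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if x == "-" and y == "-":          # shared gap column: ignored
                continue
            # a new gap pays opening + extension; a continued gap, extension
            score += gap_extend if in_gap else gap_open + gap_extend
            in_gap = True
        else:
            score += subst[(x, y)]
            in_gap = False
    return score

def weighted_sum_of_pairs(msa, weights, subst):
    """Sum of all pairwise scores, each weighted by its sequence weights."""
    names = sorted(msa)
    return sum(weights[p] * weights[q] * pair_score(msa[p], msa[q], subst)
               for i, p in enumerate(names) for q in names[i + 1:])

# Toy substitution table: +2 for a conservation, -1 for a substitution.
subst = {(a, b): (2.0 if a == b else -1.0) for a in "ACGT" for b in "ACGT"}
msa = {"s1": "ACG-T", "s2": "ACGAT", "s3": "ACG-T"}
weights = {"s1": 1.0, "s2": 1.0, "s3": 1.0}
total = weighted_sum_of_pairs(msa, weights, subst)
```

The two gap parameters feed directly into `pair_score`, which is exactly why their empirical tuning matters so much: every pairwise term, and hence the whole objective, shifts with them.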
This choice was recently validated by a more thorough benchmarking [31], which indicated that packages relying on the sums-of-pairs are reasonable performers as judged by the biological quality of the alignments they produce. Very recently, a new variant of the sums-of-pairs function has been introduced that seems less likely to over-estimate evolutionary events [32]. Over the last few years, new OFs have been described that seem to be less sensitive to gap penalty estimation thanks to the incorporation of local information. These include the segment-based evaluation of DiAlign [33] and the consistency objective function of T-Coffee [34]. HMMs [9,35] constitute another recently explored line of thought. HMMs describe the multiple alignment in a statistical context, using a Bayesian approach. Although from a formal point of view they provide us with the most attractive solution, their performance for ab initio alignments has so far been disappointing, and recent work shows that carefully tuned HMM packages barely outperform ClustalW [36]. Other statistically based methods that attempt to associate a P-value with the multiple alignment have been described [19,37]. Unfortunately, these measures are restricted to ungapped MSAs.

All things considered, one should be well aware that there is no such thing as the ideal OF, and every available scheme suffers from major drawbacks. In an ideal world, a perfect OF would be available for every situation. In practice, this is not the case, and the user is always left to make a decision when choosing the method most suitable to the problem.

Computational

The third problem associated with MSAs is computational. Assuming we have at our disposal an adequate set of sequences and a biologically perfect objective function, the computation of a mathematically optimal alignment is too complex a task for an exact method to be used [38].
Even if the function we were interested in was as simple as maximizing the number of perfect identities within each column, the problem would already be out of reach for more than three sequences. This is why all current implementations of multiple alignment algorithms are heuristics, none of which guarantees full optimization. Considering their most obvious properties, it is convenient to classify existing algorithms into three main categories: exact, progressive and iterative. Exact algorithms are high-quality heuristics that deliver an alignment usually very close to optimality [28,39], sometimes but not always within well-defined boundaries. They can only handle a small number of sequences (<20) and are limited to the sums-of-pairs objective function. Progressive alignments are by far the most widely used [34,40,41]. They depend on a progressive assembly of the multiple alignment [42-44], where sequences or alignments are added one by one, so that never more than two sequences (or multiple alignments) are simultaneously aligned using dynamic programming [45]. This approach has the great advantage of speed and simplicity combined with reasonable sensitivity, even if it is by nature a heuristic that does not guarantee any level of optimization. Other progressive alignment methods exist, such as DiAlign [18] or Match-Box [20], which assemble the alignment in a sequence-independent manner by combining segment pairs in an order dictated by their score, until every residue of every sequence has been incorporated in the multiple alignment. Iterative alignment methods depend on algorithms able to produce an alignment and to refine it through a series of cycles (iterations) until no more improvements can be made. Iterative methods can be deterministic or stochastic, depending on the strategy used to improve the alignment. The simplest iterative strategies are deterministic.
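To make the computational argument concrete: a generalized Needleman & Wunsch procedure over k sequences of length n fills a dynamic programming table of (n + 1)^k cells, which is why exact simultaneous alignment stalls beyond a handful of sequences. A minimal sketch of that growth:

```python
# Why exact simultaneous alignment does not scale: a generalized
# Needleman & Wunsch procedure over k sequences of length n fills a
# dynamic programming table of (n + 1)**k cells, each cell looking
# back at up to 2**k - 1 predecessors.

def dp_cells(n, k):
    return (n + 1) ** k

for k in (2, 3, 5, 10):
    print(f"{k} sequences of length 300: {dp_cells(300, k):.3e} cells")
```

For ten sequences of length 300, the table already holds roughly 6 x 10^24 cells, far beyond any practical memory.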
Deterministic iterative strategies involve extracting sequences one by one from a multiple alignment and re-aligning them to the remaining sequences [46,47]; some of these methods can even be a mixture of progressive and iterative strategies [48]. The procedure is terminated when no more improvement can be made (convergence). Stochastic iterative methods include HMM training [49] and simulated annealing or genetic algorithms [50-56]. Their main advantage is to allow a good conceptual separation between the optimization process and the OF. Recent examples of algorithms belonging to these three categories are reviewed in the next section.

Review

The number of available MSA methods has steadily increased over the last 20 years. Being exhaustive is not possible within the scope of this work, and this review should be seen as complementary to another recent review [57]. Furthermore, it should be pointed out that only a minority of the methods described in the literature have found their way into regular usage. There are many reasons for failure, but the main one stems from a simple fact: there is no satisfactory theoretical framework in sequence analysis, and in this context an algorithm is only as good as it is useful. Improvements are driven by results rather than theory, so that programs with badly designed interfaces or poor portability have been discarded by natural selection, leaving their algorithms to be reinvented by later generations. Over the last few years, the field of MSA has undergone drastic evolutionary changes with the introduction of several new algorithms and new evaluation methods. Some of the methods used for multiple sequence alignments are listed in Table 1. Among all this, two new trends have emerged:
• the increasing use of iterative optimization strategies (stochastic or non-stochastic)
• the use of consistency-based scoring schemes
In this section, we review some of these new algorithms, their main characteristics and potential shortcomings.
Another major trend, which will not be extensively covered here, has been the introduction of HMM methods [9,35]. A very detailed account of HMM-based methods for MSAs may be found in [58].

Table 1. Some recent and less recent available methods for MSAs.
Name                Algorithm                       URL                                                      Ref.
MSA                 Exact                           http://www.ibc.wustl.edu/ibc/msa.html                    [28]
DCA                 Exact (requires MSA)            http://bibiserv.techfak.uni-bielefeld.de/dca             [39]
OMA                 Iterative DCA                   http://bibiserv.techfak.uni-bielefeld.de/oma             [61]
ClustalW, ClustalX  Progressive                     ftp://ftp-igbmc.u-strasbg.fr/pub/clustalW or clustalX    [29]
MultAlin            Progressive                     http://www.toulouse.inra.fr/multalin.html                [41]
DiAlign             Consistency-based               http://www.gsf.de/biodv/dialign.html                     [18]
ComAlign            Consistency-based               http://www.daimi.au.dk/~ocaprani                         [75]
T-Coffee            Consistency-based/progressive   http://igs-server.cnrs-mrs.fr/~cnotred                   [66]
Praline             Iterative/progressive           jhering@nimr.mrc.ac.uk                                   [48]
IterAlign           Iterative                       http://giotto.Stanford.edu/~luciano/iteralign.html       [70]
Prrp                Iterative/Stochastic            ftp://ftp.genome.ad.jp/pub/genome/saitama-cc/            [47]
SAM                 Iterative/Stochastic/HMM        rph@cse.ucsc.edu                                         [84]
HMMER               Iterative/Stochastic/HMM        http://hmmer.wustl.edu/                                  [68]
SAGA                Iterative/Stochastic/GA         http://igs-server.cnrs-mrs.fr/~cnotred                   [51]
GA                  Iterative/Stochastic/GA         czhang@watnow.uwaterloo.ca                               [52]

The progressive algorithms

Progressive alignment constitutes one of the simplest and most effective ways of multiply aligning a set of sequences in little time and with little memory. This algorithm was initially described by Hogeweg [42] and later re-invented by Feng [43] and Taylor [44]. The most widely used MSA packages are based on an implementation of this algorithm; these include Pileup, a part of the GCG package [59], MultAlin [41] and ClustalW [29], which has become the standard method for multiple alignments. ClustalW is a non-iterative, deterministic algorithm that attempts to optimize the weighted sums-of-pairs with affine gap penalties. It is a straightforward progressive alignment strategy where sequences are added one by one to the multiple alignment according to the order indicated by a pre-computed dendrogram. Sequence addition is made using a pair-wise sequence alignment algorithm [45]. The main shortcoming of this strategy is that once a sequence has been aligned, its alignment will never be modified, even if it conflicts with sequences added later in the process, as shown in Figure 1. ClustalW also includes many highly specialized heuristics meant to maximally exploit sequence information:
• local gap penalties
• automatic substitution matrix choice
• automatic gap penalty adjustment
• the delaying of the alignment of distantly related sequences
Benchmarking tests have been carried out on BAliBASE [31], a database of reference multiple sequence alignments. In general, ClustalW performs better when the phylogenetic tree is relatively dense, without any obvious outlier. It does not matter how widely the sequences are spread, as long as every sequence remains close enough to another (a bit like crossing a river stepping from stone to stone). Long insertions or deletions also cause trouble, due to the intrinsic limitation of the affine penalty scheme used by ClustalW. The latest improvement to the progressive alignment algorithm is T-Coffee, a novel strategy where sequences are aligned in a progressive manner but using a consistency-based objective function that makes it possible to minimize potential errors, especially in the early stages of the alignment assembly. T-Coffee is reviewed in more detail in the consistency-based algorithm section.

Exact algorithms

As mentioned earlier, progressive alignment is only an approximate solution. In order to use the signal contained in the sequences properly, one would like to simultaneously align them, rather than adding them one by one to a multiple alignment.
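The pair-wise dynamic programming step [45] that progressive methods apply at every stage, and that exact methods attempt to generalize to many sequences at once, can be sketched as follows (linear gap penalty for brevity; real aligners use substitution matrices and affine penalties):

```python
# Minimal Needleman & Wunsch global alignment with a linear gap
# penalty. The scores are toy values; production aligners use
# substitution matrices and affine gap penalties instead.

def needleman_wunsch(s, t, match=2, mismatch=-1, gap=-2):
    n, m = len(s), len(t)
    # F[i][j] holds the best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Traceback to recover one optimal alignment.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i and j and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); j -= 1
    return ''.join(reversed(a)), ''.join(reversed(b)), F[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))
```

Generalizing this recursion to k sequences turns the table into a k-dimensional hyperspace, which is precisely what the exact methods below try to prune.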
Simultaneous alignment would be especially useful when dealing with sets of extremely divergent sequences whose pair-wise alignments are all likely to be incorrect. Unfortunately, to align several sequences one would need to generalize the Needleman and Wunsch algorithm [45] to a multidimensional space, and for practical reasons (time and memory) this is only possible for a maximum of three sequences. That limit can be pushed a bit further if one finds a way to identify in advance the portion of the hyperspace that does not contribute to the solution and exclude it from the computation. This is achieved in the MSA program, an implementation of the Carrillo and Lipman algorithm [60] that makes it possible to align up to ten closely related sequences [28]. It should be stressed here that, contrary to a widespread belief, the MSA program is only a heuristic implementation of the Carrillo and Lipman algorithm and is not guaranteed to reach the mathematical optimum: MSA uses lower and upper bounds tighter than the guaranteed ones (Altschul, personal communication). Even so, the high memory requirements, the lengthy computation time and the limitation on the number of sequences explain why the MSA program quickly gave way to ClustalW. Yet MSA met with popularity again when Stoye described a new divide-and-conquer algorithm, DCA [39], that sits on top of MSA and extends its capabilities. The DCA algorithm cuts the sequences into subsets of segments that are small enough to be fed to MSA. The sub-alignments are later reassembled by DCA. The trick is to cut the sequences at the right points so that the produced alignment remains as close as possible to optimality. The way this is done in DCA is slightly heuristic, albeit fairly accurate. Benchmarking on BAliBASE indicated that the DCA strategy does slightly better than ClustalW, even if the four largest BAliBASE test sets could not be computed with DCA (Notredame, unpublished results).
Even when MSA is coupled to DCA, strong limitations remain on the number of sequences that can be handled (20–30) and on their phylogenetic spread. Recently, an iterative implementation of DCA [61], optimal multiple alignment (OMA), was described that is meant to speed up the DCA strategy and decrease its memory requirements.

Iterative algorithms

Iterative algorithms are based on the idea that the solution to a given problem can be computed by modifying an already existing sub-optimal solution. Each 'modification' step is an iteration. In the examples considered here, modifications can be made using dynamic programming or various random protocols. While the dynamic programming-based protocols can also include elements of randomization, we distinguish them from more traditional stochastic iterative methods such as simulated annealing (SA) [62] or genetic algorithms (GA) [63].

Figure 1. Limits of the progressive strategy. This example shows how a progressive alignment strategy can be misled. In the initial alignment of sequences 1 and 2, ClustalW has a choice between aligning CAT with CAT and making an internal gap, or making a mismatch between C and F and having a terminal gap. Since terminal gaps are much cheaper than internal ones, the ClustalW scoring scheme prefers the former. In the next stage, when the extra sequence is added, it turns out that properly aligning the two CATs in the previous stage would have led to a better-scoring sums-of-pairs multiple alignment.

Stochastic iterative algorithms

SA was the first stochastic iterative method described for simultaneously aligning a set of sequences.
Various schemes have been published [50,64], which all involve the same chain of processes: an alignment is randomly modified; its score is assessed; it is kept or discarded according to an acceptance function that gets more stringent as the iteration number increases (by analogy with a decreasing temperature during crystallization); and the process goes on until a finishing criterion such as convergence is met. In practice, despite being intellectually very attractive, SA is too slow for ab initio alignment and can only be used as an alignment improver. GAs constitute an interesting alternative to SA, as shown by SAGA [51], a GA dedicated to MSAs. Like SA, SAGA is an optimization black box in which any OF one invents can be tested. The principle of SAGA is very straightforward and follows closely the 'simple GA' [65]: randomly generated multiple alignments of a given set of sequences evolve under some selection pressure. These alignments are in competition with each other for survival (survival of the fittest) and reproduction. Within SAGA, fitness depends on the score measured by the objective function (the better the score, the fitter the multiple alignment). Over a series of cycles known as generations, alignments die or survive depending on their fitness. They can also improve and reproduce through stochastic modifications known as mutations and crossovers. Mutations randomly insert or shift gaps, while crossovers combine the content of two alignments (Figure 2). Overall, 20 operators co-exist in SAGA and compete for usage. The program does not guarantee optimality but has been shown to equal or outperform MSA from a mathematical point of view on 13 test sets (using exactly the same OF in both programs).
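The one-point crossover just mentioned (Figure 2) can be sketched as follows; this is an illustrative reconstruction, not SAGA's actual code:

```python
import random

# Sketch of a SAGA-style one-point crossover: the first parent is cut
# straight at a random column; the second parent is cut row by row so
# that each row keeps the same residues on each side, and the halves
# are swapped after padding with gaps to generate compatible ends.
# This is an illustrative reconstruction, not SAGA's actual code.

def residues_before(row, col):
    return sum(1 for c in row[:col] if c != '-')

def cut_after_residue(row, k):
    """Split a gapped row right after its k-th residue."""
    count = 0
    for pos, c in enumerate(row):
        if c != '-':
            count += 1
            if count == k:
                return row[:pos + 1], row[pos + 1:]
    return (row, '') if k else ('', row)

def one_point_crossover(parent1, parent2, rng):
    col = rng.randrange(1, len(parent1[0]))          # straight cut
    lefts1 = [row[:col] for row in parent1]
    rights1 = [row[col:] for row in parent1]
    lefts2, rights2 = [], []
    for r1, r2 in zip(parent1, parent2):
        left, right = cut_after_residue(r2, residues_before(r1, col))
        lefts2.append(left)
        rights2.append(right)

    def pad(rows, left_side):
        w = max(len(r) for r in rows)
        return [('-' * (w - len(r)) + r) if left_side
                else (r + '-' * (w - len(r))) for r in rows]

    child1 = [l + r for l, r in zip(pad(lefts1, False), pad(rights2, True))]
    child2 = [l + r for l, r in zip(pad(lefts2, False), pad(rights1, True))]
    return child1, child2

p1 = ["GAR-FIELD", "GARF-IELD"]
p2 = ["-GARFIELD", "GARFIELD-"]
c1, c2 = one_point_crossover(p1, p2, random.Random(1))
print(c1)
print(c2)
```

Whatever the cut point, each child row keeps its residues in order, so the children are always valid alignments of the same sequences.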
The complete disconnection between the operators and the original OF made it possible to seamlessly modify the original OF in order to test SAGA with a new OF named COFFEE (Consistency Objective Function For alignmEnt Evaluation) [66]. This series of studies revealed the suitability of GAs as investigation tools, but also made it clear that GAs were too slow a strategy for large-scale projects or everyday use. Another similar MSA GA was later introduced by Zhang and Wong [52]. The authors report a very high efficiency for their GA, but these results must be considered with care, since their strategy (especially the mutations) is driven by the presence of completely conserved segments that guide the assembly of the alignments. The assumption that such segments will always exist when aligning proteins is not realistic. This method appears to be appropriate when comparing very long, highly similar sequences (such as portions of genomes). SAGA was later parallelized by two independent groups [67,53], in order to improve its efficiency. The model described in SAGA has been met with considerable interest in the evolutionary programming community and, in recent years, at least three algorithms based on the SAGA principle have been published [54-56].

Figure 2. One-point crossover in SAGA. This figure illustrates the manner in which two alignments are combined into one in SAGA, a genetic algorithm that evolves a population of alignments toward optimality. The principle is to cut one of the alignments straight and to cut the second one so that compatible ends are generated.

The Gibbs sampler is another interesting stochastic iterative strategy [19]. It is a local multiple alignment method that finds ungapped motifs among a set of unaligned sequences. From a multiple alignment perspective, the most interesting feature of the Gibbs sampler is its OF. The algorithm aims to build an alignment with a good P-value (i.e., a low probability of having been generated by chance). At each iteration, segments are removed or added according to the probability that the current model (the rest of the alignment) could have generated them. If that probability is high enough, the model is updated with the new segments and the algorithm proceeds to the next iteration. The overall result is an alignment that has a good P-value and maximizes the probability of the data it contains (i.e., each sequence fits well within the alignment). This Bayesian idea of simultaneously maximizing the data and the model is also central to HMMs [9,35]; it is thus not surprising that HMMs can also be trained by expectation maximization [49,68]. However, like GAs, HMMs proved rather disappointing when it came to ab initio alignments. Today, HMMs such as those found in Pfam [11] are no longer generated from unaligned sequences. State-of-the-art protocols are much more inclined toward turning a pre-computed alignment into an HMM and further refining it using HMMER [49] or SAM [68].

Non-stochastic iterative algorithms

The first non-stochastic iterative algorithms date back to the origins of MSAs [46]. The idea is simple and attractive: since mistakes may arise in the early stages of a progressive alignment, why not correct them later by re-aligning each sequence in turn to the multiple alignment using standard dynamic programming algorithms [45]?
The procedure terminates when iterations consistently fail to improve the alignment. This very simple algorithm constitutes the core of most of the iterative strategies described in the early 1990s. The main scope for variation is the way sequences are divided into two groups before being re-aligned. In AMPS [46], sequences are chosen according to their input order and re-aligned one by one. In the algorithm of Berger and Munson [69], the choice is made in a random manner and sequences are divided into two groups that can contain more than one sequence. The element of randomization makes the algorithm more robust and improves its accuracy. Few of these early iterative methods have been properly benchmarked, making it hard to estimate their true biological significance. The most sophisticated DP-based iterative algorithm available was recently described by Gotoh [47]. It is a double-nested iterative strategy with randomization that optimizes the weighted sums-of-pairs with affine gap penalties (Figure 3). The originality of this algorithm is that the weights and the alignment are simultaneously optimized. The inner iteration optimizes the weighted sums-of-pairs, while the outer iteration optimizes the weights, which are calculated on a phylogenetic tree estimated from the current alignment [25]. The algorithm terminates when the weights have converged. Prrp was the first multiple alignment program to be extensively benchmarked, using JOY, a database of structural alignments. The results were confirmed on BAliBASE [31,34]. Prrp significantly outperforms most of the traditional progressive methods as well as some of the most recent iterative strategies (Table 2). Two other iterative alignment methods were recently described: Praline [48] and IterAlign [70]. These two methods share very similar protocols. They both start with a preprocessing of the sequences to align.
In IterAlign, sequences are 'ameliorated' (sic): each sequence is locally compared to the others, and every segment that shows high similarity with other proteins is replaced by a consensus. One round of 'amelioration' constitutes one iteration. Further iterations are run on the new set of 'ameliorated' sequences, until the collection of consensus sequences converges. Consistent blocks are then extracted from the consensus collection, and these blocks are chained in order to produce the final alignment. Praline uses a very similar protocol: sequences are replaced with a complete profile made from a multiple alignment that only includes their closest relatives. That profile step is iterated until the collection of profiles converges. This collection of profiles is conceptually similar to the 'ameliorated' set of sequences used by IterAlign. The multiple alignment is then assembled using a straightforward progressive algorithm where sequences are replaced with profiles. One of the most interesting consequences of the protocol used in Praline is the possibility of measuring the consistency between the final alignment and the collection of profiles used for its assembly. There may be some correlation between this measure and the true accuracy of the alignment. Regardless of the potential performance of these two methods (neither has been properly benchmarked), some emphasis should be given to the novel concepts they incorporate:
• the first is the use of local information in IterAlign, in order to decrease sensitivity to the gap penalty parameterization
• the second key concept is consistency
Sequences are preprocessed so that the regions consistently conserved across the family see their signal enhanced and become more likely to drive the alignment. This search for consistency has been one of the strongest trends in recent developments of MSA. It is also central to the non-iterative methods.
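The leave-one-out scheme shared by the deterministic iterative methods above can be sketched as follows, with toy scores and a simple sequence-against-block dynamic programming step standing in for the real algorithms:

```python
# Sketch of leave-one-out iterative refinement: each sequence is
# removed in turn and re-aligned, by dynamic programming, against the
# block of remaining rows; the result is kept when it improves a
# sums-of-pairs-like score. All scores here are toy values.

def sp_score(aln, match=1, mismatch=0):
    total = 0
    for col in zip(*aln):
        res = [c for c in col if c != '-']
        for i in range(len(res)):
            for j in range(i + 1, len(res)):
                total += match if res[i] == res[j] else mismatch
    return total

def realign(seq, rows, gap=-1):
    """Globally align a gap-free sequence against a fixed block of rows."""
    cols = list(zip(*rows))
    def col_score(c, col):
        return sum(1 if c == x else -1 for x in col if x != '-')
    n, m = len(seq), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_score(seq[i - 1], cols[j - 1]),
                          F[i - 1][j] + gap,   # residue against a new all-gap column
                          F[i][j - 1] + gap)   # gap inserted in the re-aligned row
    new_rows, new_seq, i, j = [''] * len(rows), '', n, m
    while i > 0 or j > 0:           # traceback rebuilds the whole block
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_score(seq[i - 1], cols[j - 1]):
            new_seq = seq[i - 1] + new_seq
            new_rows = [rows[r][j - 1] + new_rows[r] for r in range(len(rows))]
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            new_seq = seq[i - 1] + new_seq
            new_rows = ['-' + new_rows[r] for r in range(len(rows))]
            i -= 1
        else:
            new_seq = '-' + new_seq
            new_rows = [rows[r][j - 1] + new_rows[r] for r in range(len(rows))]
            j -= 1
    return new_rows, new_seq

def refine(aln, max_rounds=5):
    best = list(aln)
    for _ in range(max_rounds):
        improved = False
        for k in range(len(best)):
            rows = [r for i, r in enumerate(best) if i != k]
            # Drop the columns left entirely gapped by the removal.
            keep = [j for j in range(len(rows[0]))
                    if any(r[j] != '-' for r in rows)]
            rows = [''.join(r[j] for j in keep) for r in rows]
            new_rows, new_seq = realign(best[k].replace('-', ''), rows)
            cand = new_rows[:k] + [new_seq] + new_rows[k:]
            if sp_score(cand) > sp_score(best):
                best, improved = cand, True
        if not improved:            # convergence
            break
    return best

aln = ["ACG-T", "AC-GT", "A-CGT"]
refined = refine(aln)
print(refined, sp_score(refined))
```

The score can only go up or stay put at each acceptance, which is both the strength (guaranteed convergence) and the weakness (local optima) of the deterministic variants.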
Figure 3. Layout of Prrp. This figure shows the layout of Prrp, a double-nested strategy for optimizing multiple alignments. When the inner iteration has converged, new sequence weights are estimated. The convergence of these weights is the criterion for the outer iteration to stop.

Table 2. Some elements of validation on BAliBASE.
Method     Ref1   Ref2   Ref3   Ref4   Ref5   Total
DiAlign    71.0   25.2   35.1   74.7   80.4   57.3
ClustalW   78.5   32.2   42.5   65.7   74.3   58.7
Prrp       78.6   32.5   50.2   51.1   82.7   59.0
T-Coffee   80.7   37.3   52.9   83.2   88.7   68.7
Each method in the Method column was used to align the 141 test sets contained in BAliBASE. The alignments were then compared with the reference BAliBASE alignments using aln_compare [34]. Ref1–5 indicate the five BAliBASE categories; results obtained in each category were averaged. All the observed differences are statistically significant, as assessed by the Wilcoxon rank-based test [34,47]. Ref1 contains homogeneous sets of sequences; ref2 contains a homogeneous group of sequences and an outlier; ref3 contains two distantly related groups of sequences; ref4 contains sequences that require long internal gaps to be properly aligned; and ref5 contains sequences that require long terminal gaps to be properly aligned. Total is the average of ref1–5.

Consistency-based algorithm

The first consistency-based MSA method was described by Kececioglu in the 1980s [71]. Given a set of sequences, the optimal MSA is defined as the one that agrees the most with all the possible optimal pair-wise alignments. Computing that alignment is an NP-complete problem that can only be solved for a small number of related sequences, using an MSA-like algorithm. Nonetheless, there are at least three good reasons that make consistency-based OFs very attractive:
• firstly, they do not depend on a specific substitution matrix but rather on any method, or collection of methods, able to align two sequences at a time
• secondly, the consistency-based scheme is position dependent, given the collection of pair-wise alignments. This means that the score associated with the alignment of two residues depends on their indexes (positions within the protein sequences) rather than their individual nature
• the third reason is more general and has to do with consistency itself. Experience shows that, given a set of independent observations, the most consistent ones are often closer to the truth
This principle generally holds well in biology and can be loosely connected to the observation that, given a series of measurements, noise spreads while signal accumulates. Although the first consistency-based OF was described in 1983, it took several more years to develop heuristic algorithms able to deal with its optimization, and it is only recently that a GA (SAGA [51]) was used to show the biological advantages of such a function, COFFEE [66], which emulates the maximum weight trace problem. In SAGA-COFFEE, the collection of weighted pair-wise alignments is named a library, and SAGA is used to compute the alignment that has the highest level of consistency with the library. In practice, the library may contain more than one alignment for each pair of sequences; the information it contains may be redundant or conflicting and may originate from sources as various as one wishes (structure analysis, sequence comparison, database searches, experimental knowledge etc.). Although SAGA-COFFEE yielded interesting results, the GA was too slow for everyday use. This prompted the development of a new heuristic algorithm to optimize the COFFEE function in a time-efficient manner: T-Coffee (Figure 4). In T-Coffee, the COFFEE library is turned into a so-called 'extended library', a position-specific substitution matrix where the score associated with each pair of residues depends on the compatibility of that pair with the rest of the library. T-Coffee uses a procedure reminiscent of Vingron's dot matrix multiplication [72] and Morgenstern's overlapping weights [73]. The multiple alignment is assembled using a progressive alignment algorithm similar to the one used in ClustalW:
• pair-wise distances are computed
• a neighbour-joining tree is estimated [3]
• the sequences are aligned one by one following the topology of the tree
The main difference between T-Coffee and ClustalW is that in T-Coffee the extended library replaces the substitution matrix. Another important characteristic of T-Coffee is that its primary library is made of a mixture of global alignments (produced with ClustalW) and local alignments (produced with Lalign [74]). The benchmarking carried out on BAliBASE shows that this combination of local and global information makes the T-Coffee implementation able to outperform Prrp, ClustalW and DiAlign on the five categories of test sets contained in this reference database [34]. These results were obtained without tuning, since T-Coffee does not have any parameters of its own. Due to the library extension, T-Coffee does more than simply compute a consensus alignment. Nonetheless, given a collection of multiple alignments, it can be interesting to combine them into a single consensus multiple alignment. This is what the ComAlign program does [75], by combining several multiple alignments into a single, often improved, multiple alignment.
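A COFFEE-style consistency evaluation can be sketched as follows; pair weights and the library extension step are omitted, and the hand-written library stands in for real pair-wise alignments:

```python
# Simplified sketch of a consistency (COFFEE-like) evaluation: the
# library is a set of residue pairs produced by pair-wise alignments,
# and an MSA is scored by the fraction of library pairs it realizes.
# Pair weights and the library extension step are omitted.

def residue_pairs(msa):
    """All (seq_i, pos_i, seq_j, pos_j) residue pairs realized by an MSA."""
    pairs = set()
    counters = [0] * len(msa)       # position within each ungapped sequence
    for col in zip(*msa):
        idx = [(i, counters[i]) for i, c in enumerate(col) if c != '-']
        for i, c in enumerate(col):
            if c != '-':
                counters[i] += 1
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                (si, pi), (sj, pj) = idx[a], idx[b]
                pairs.add((si, pi, sj, pj))
    return pairs

def consistency(msa, library):
    """Fraction of library residue pairs present in the MSA."""
    return len(library & residue_pairs(msa)) / len(library)

# Library written out by hand; in a real system it would be produced by
# local and global pair-wise aligners.
library = {(0, 0, 1, 0), (0, 1, 1, 1), (0, 2, 1, 2),
           (0, 0, 2, 0), (1, 0, 2, 0)}
print(consistency(["ACG", "ACG", "A--"], library))   # all library pairs realized
```

The progressive optimizer then searches for the MSA maximizing this kind of agreement, rather than a matrix-based substitution score.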
Although the details differ, T-Coffee bears some similarity to DiAlign [73], another consistency-based algorithm that attempts to use local information in order to guide a global multiple alignment. DiAlign starts with an identification of highly homologous segment pairs. The weight of each of these pairs is defined by a P-value comparable to the P-values used in BLAST. Each of these segment pairs receives another score proportional to its compatibility with the complete set of segment pairs. This score is named an overlapping weight, and segment pairs weighted this way are very reminiscent of the extended library. The multiple alignment is then progressively assembled by adding the pairs of segments according to their weight. Assembly is made in a sequence-independent order, as opposed to the ClustalW-style progressive alignment strategy. Non-compatible segment pairs are discarded, hence the importance of the order induced by the weights. According to the authors, DiAlign is especially good at properly aligning sequences where local homology is the driving signal. This has been confirmed by BAliBASE benchmarking [31,34]. Overall, DiAlign is not as accurate as ClustalW or Prrp, but it does very well in categories 4 and 5 of BAliBASE, which require very long insertions to be properly aligned. Over the past few years, the DiAlign algorithm has been modified on numerous occasions for improved efficiency [76].

Figure 4. Layout of T-Coffee. This figure indicates the layout of T-Coffee. Local and global pairwise alignments are first computed and then combined into a primary library that is extended in order to be used for computing the multiple sequence alignment in a progressive manner.

Conclusion and expert opinion

Ten years ago, when schemes such as MSA were developed, there was very little data available and the main problem was to use every bit of available information properly. Today the situation has dramatically changed. We are overwhelmed by 'relevant' information, and in fact there is so much of it that, by choosing the data, one can suit the needs of almost any method (progressive, iterative etc.). Ironically, one could be tempted to say that data has improved faster than multiple alignment methods. As a consequence, the real challenge is not so much the multiple alignment itself but rather the choice of a subset of sequences that will yield the most biologically correct and informative alignment, given one method or another. There are two good reasons for not using all the available sequences:
• Alignments with a large number of sequences are slow to compute and hard to analyze. Whenever possible, an alignment should fit on a single sheet of A4 paper.
• Limitations of existing programs. Although they all use weighting schemes meant to minimize the effect of similar or highly correlated sequences, none of these schemes is entirely satisfactory, and over-represented sub-groups always end up dominating the alignment or the profile. This can prevent the proper alignment of less well represented sequence sub-groups that may be just as important.
Careful trimming by the user is still the best available way around that effect. Unfortunately, the increased sensitivity of database search tools, coupled with the increase in database size, has rendered this process very tedious. The second major change that has occurred over the last years is the increasing number of available 3D structures. Although the proportion of protein sequences with a known 3D structure is getting smaller and smaller, the situation is very different from a protein family perspective, and the proportion of protein families where at least one member has a known 3D structure increases regularly. This means that in most cases, multiple alignment modelling could benefit from the incorporation of 3D structural information, in order to enhance very remote homologies or to guide the choice of local penalties [77]. Very few of the available packages are able to mix structures and sequences within a multiple alignment. While ClustalW is able to use SwissProt secondary structure information for gap penalty estimation, a proper tool is still lacking for the simultaneous alignment of sequences and structures. Two of the methods introduced here are good candidates for such a combination. The consistency-based algorithms have the advantage of having few requirements on the origin of their libraries. For instance, DALI, the database of structural multiple alignments [78], relies on T-Coffee to assemble the collection of pair-wise alignments produced by the DALI algorithm into a multiple alignment. The double dynamic programming algorithm introduced by Taylor [79] is also a good candidate for that purpose. While it has been shown that this algorithm is suitable for structure-to-structure alignments [80], recent results indicate that it could also be used in the context of MSAs, and possibly as a means to mix sequences and structures [81]. The third major obstacle on the road toward an informative multiple alignment is the processing of repeats. Repeated sequences (in tandem or not) are renowned for confusing all existing MSA methods. When dealing with sequences that contain such repeats, the only solution is to pre-process the sequences, extract the repeats and only align homologous regions. This extraction can be made using any local multiple alignment tool, such as the Gibbs sampler [19], Mocca [82] or Repro [83].
Unfortunately, none of these repeat-extraction tools is well integrated within a global multiple alignment procedure. The Gibbs sampler and Mocca have the advantage of providing the user with some estimate of the biological relevance of their output.
The fourth point that needs to be raised here is computation. While elegant solutions have been found for parallelizing database searches, the parallelization of an MSA algorithm remains a difficult task. The operations involved in these algorithms require complex memory-sharing schemes that are not well suited to Linux farms and other clusters. When dealing with large sets of long sequences, supercomputers are still required for multiple alignment programs.
The last important point is the estimation of local accuracy. A common property of all the methods introduced here is that no single one is the best: each may be out-performed by the others on one protein family or another. For that reason, we feel that it is more important to be able to assess the exact level of accuracy of an alignment than to improve the average performance of each method. To our knowledge, only four packages incorporate an estimation of local alignment quality: ClustalX (the X-Window interface of ClustalW), Praline [48], T-Coffee [34] and Match-Box [20]. None of these estimators has been thoroughly benchmarked or properly validated.
Highlights
• MSAs are essential bioinformatics tools. They are required for phylogenetic analysis, for scanning databases for remote members of a protein family, and for structure prediction.
• No perfect method exists for assembling a multiple sequence alignment; all the available methods rely on approximations (heuristics).
• The most commonly used methods for building multiple sequence alignments use a progressive alignment algorithm (ClustalW [31]).
• Recent progress has focused on the design of iterative (Prrp [30], SAGA [51]) and consistency-based methods (DiAlign [33], T-Coffee [34], Praline [48] and IterAlign [70]).
• Benchmarking on a collection of reference alignments (BAliBASE [31]) indicates that ClustalW [31] performs reasonably well over a wide range of situations, while DiAlign is more appropriate for sequences with long insertions/deletions. These tests also indicate that T-Coffee [34] is, on average, the best of the methods evaluated that way.
• Future methods should be able to integrate structural information within multiple alignments and to provide some estimate of their local reliability.
To conclude, a multiple alignment is merely a very constrained model. It is a powerful way to spot inconsistencies within a data set and to visualize relationships that may exist among seemingly independent pieces of information. A multiple alignment may be driven by any available source of information: structure, sequence, experimental knowledge and so on.
Outlook
Are multiple sequence alignments here to stay? The answer is yes, without any doubt. As we enter the era of comparative genomics, the simultaneous comparison of large numbers of homologous biological objects will become more and more important in our understanding of biology, and there is no doubt in my mind that in 5 to 10 years MSAs will remain as central to sequence-based biology as they are now. This being said, MSA methods will also need to evolve. They will need to integrate heterogeneous information such as structures, results of database searches, experimental data and, in general, anything that may come from expression data and proteomic analysis, including regulatory information. Integrating such heterogeneous information is a complex task: when the data are heterogeneous, knowing who is right and who is wrong becomes an art. Addressing this type of question will be both difficult and essential. The appropriate method will have to do so in a transparent way, letting the user control every bit of extra information that goes into the alignment.
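Returning to the estimation of local accuracy discussed above, the idea behind such estimators can be illustrated with a toy consistency score, in the spirit of consistency-based evaluation but not the actual implementation of any of the four packages cited: each column of the alignment is scored by the fraction of its residue pairs that are also supported by a library of pairwise alignments.

```python
def column_reliability(msa, library):
    """msa: list of equal-length gapped strings.
    library: set of ((seq_index, ungapped_pos), (seq_index, ungapped_pos))
    pairs taken from independent pairwise alignments.
    Returns one reliability value per column."""
    pos = [0] * len(msa)                       # ungapped position per sequence
    scores = []
    for col in zip(*msa):
        residues = []
        for i, c in enumerate(col):
            if c != '-':
                residues.append((i, pos[i]))
                pos[i] += 1
        pairs = [(a, b) for k, a in enumerate(residues) for b in residues[k + 1:]]
        hits = sum((a, b) in library or (b, a) in library for a, b in pairs)
        scores.append(hits / len(pairs) if pairs else 0.0)
    return scores
```

Columns whose pairs disagree with the library get low scores and can be flagged as unreliable before any downstream analysis.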
This ideal method should also allow users to inject some of their own knowledge into the model, and doing so should be made easy. These ideas have been central to the development of the underlying philosophy of the T-Coffee package [34]. In any case, these future methods are bound to be memory- and CPU-hungry. Compared with database searches, multiple sequence alignment protocols are hard to optimize: special hardware may need to be adapted and the code may have to be redesigned. Several computer manufacturers are currently looking at this problem. One can easily imagine that a powerful multiple sequence alignment server will soon be a feature of most laboratories, just as PCR machines made their appearance in the 1990s.
Acknowledgements
The author wishes to thank Dr Jean-Michel Claverie for helpful advice and comments. He also wishes to thank the two referees for their interesting comments and for bringing to his attention several of the most recent references included in this work.
Bibliography
Papers of special note have been highlighted as either of interest (•) or of considerable interest (••) to readers.
1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol. 215, 403-410 (1990).
2. Rost B, Sander C, Schneider R: PHD - an automatic server for protein secondary structure prediction. CABIOS 10, 53-60 (1994).
3. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406-425 (1987).
4. Saitou N: Maximum likelihood methods. Meth. Enzymol. 183, 584-598 (1990).
5. Felsenstein J: Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783-791 (1985).
6. Bairoch A, Bucher P, Hofmann K: The PROSITE database, its status in 1997. Nucleic Acids Res. 25, 217-221 (1997).
7. Attwood TK, Croning MD, Flower DR et al.: PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28(1), 225-227 (2000).
8. Gribskov M, Luethy R, Eisenberg D: Profile analysis. Meth. Enzymol. 183, 146-159 (1990).
9. Haussler D, Krogh A, Mian IS, Sjölander K: Protein modeling using hidden Markov models: analysis of globins. In: Proceedings of the 26th Hawaii International Conference on System Sciences, Wailea, HI, USA. Los Alamitos, CA: IEEE Computer Society Press (1993).
10. Luthy R, Xenarios I, Bucher P: Improving the sensitivity of the sequence profile method. Protein Sci. 3(1), 139-146 (1994).
11. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer E: The Pfam protein families database. Nucleic Acids Res. 28(1), 263-266 (2000).
12. Altschul SF, Madden TL, Schaffer AA et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389-3402 (1997).
13. Garnier J, Gibrat J-F, Robson B: GOR method for predicting protein secondary structure from amino acid sequence. Meth. Enzymol. 266, 540-553 (1996).
14. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292(2), 195-202 (1999).
15. Rost B: Review: protein secondary structure prediction continues to rise. J. Struct. Biol. 134(2-3), 204-218 (2001).
16. Goebel U, Sander C, Schneider R, Valencia A: Correlated mutations and residue contacts in proteins. Proteins: Structure, Function and Genetics 18(4), 309-317 (1994).
17. Gutell RR, Weiser B, Woese CR, Noller HF: Comparative anatomy of 16S-like ribosomal RNA. Prog. Nucleic Acid Res. Mol. Biol. 32, 155-216 (1985).
18. Morgenstern B, Dress A, Werner T: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA 93, 12098-12103 (1996).
•• The first method described that does not require arbitrary gap penalties.
19. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208-214 (1993).
20. Depiereux E, Baudoux G, Briffeuil P et al.: Match-Box server: a multiple sequence alignment tool placing emphasis on reliability. Comput. Appl. Biosci. 13(3), 249-256 (1997).
21. Schuler GD, Altschul SF, Lipman DJ: A workbench for multiple alignment construction and analysis. Proteins 9(3), 180-190 (1991).
22. Henikoff S: Playing with blocks: some pitfalls of forcing multiple alignments. The New Biologist 3(12), 1148-1154 (1991).
23. Karlin S, Bucher P, Brendel V, Altschul SF: Statistical methods and insights for protein and DNA sequences. Annu. Rev. Biophys. Biophys. Chem. 20, 175-203 (1991).
24. Altschul SF, Lipman DJ: Trees, stars, and multiple biological sequence alignment. SIAM J. Appl. Math. 49, 197-209 (1989).
25. Altschul SF, Carroll RJ, Lipman DJ: Weights for data related by a tree. J. Mol. Biol. 207, 647-653 (1989).
26. Dayhoff MO, Schwarz RM, Orcutt BC: A model of evolutionary change in proteins. Detecting distant relationships: computer methods and results. In: Atlas of Protein Sequence and Structure. Dayhoff MO (Ed.), National Biomedical Research Foundation: Washington, DC, USA, 353-358 (1979).
27. Vingron M, Waterman MS: Sequence alignment and penalty choice. J. Mol. Biol. 235, 1-12 (1994).
28. Lipman DJ, Altschul SF, Kececioglu JD: A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA 86, 4412-4415 (1989).
29. Thompson J, Higgins D, Gibson T: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4690 (1994). •• The most widely used method for making multiple sequence alignments.
30. Gotoh O: Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Comput. Appl. Biosci. 10(4), 379-387 (1994). •• The first attempt to systematically assess the accuracy of an MSA method by comparison with a reference structural alignment; also the most complex dynamic-programming-based iterative method.
31. Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27(13), 2682-2690 (1999).
32. Gonnet GH, Korostensky C, Benner S: Evaluation measures of multiple sequence alignments. J. Comput. Biol. 7(1-2), 261-276 (2000).
33. Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15(3), 211-218 (1999).
34. Notredame C, Higgins DG, Heringa J: T-Coffee: a novel algorithm for multiple sequence alignment. J. Mol. Biol. 302, 205-217 (2000).
35. Baldi P, Chauvin Y, Hunkapiller T, McClure MA: Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. USA 91, 1059-1063 (1994).
36. Karplus K, Hu B: Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics 17(8), 713-720 (2001).
37. Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15(7-8), 563-577 (1999).
38. Wang L, Jiang T: On the complexity of multiple sequence alignment. J. Comput. Biol. 1(4), 337-348 (1994).
39. Stoye J, Moulton V, Dress AW: DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment. Comput. Appl. Biosci. 13(6), 625-626 (1997).
40. Higgins DG, Sharp PM: Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-244 (1988).
41. Corpet F: Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 16, 10881-10890 (1988).
42. Hogeweg P, Hesper B: The alignment of sets of sequences and the construction of phylogenetic trees: an integrated method. J. Mol. Evol. 20, 175-186 (1984). • The first description of the progressive algorithm.
43. Feng D-F, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351-360 (1987).
44. Taylor WR: A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28, 161-169 (1988).
45. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453 (1970).
46. Barton GJ, Sternberg MJE: A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 198, 327-337 (1987).
47. Gotoh O: Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264(4), 823-838 (1996).
48. Heringa J: Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Computers and Chemistry 23, 341-364 (1999).
49. Krogh A, Brown M, Mian IS, Sjölander K, Haussler D: Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235, 1501-1531 (1994).
50. Kim J, Pramanik S, Chung MJ: Multiple sequence alignment using simulated annealing. Comput. Appl. Biosci. 10(4), 419-426 (1994).
51. Notredame C, Higgins DG: SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res. 24, 1515-1524 (1996). • One of the first attempts to apply genetic algorithms to sequence analysis.
52. Zhang C, Wong AK: A genetic algorithm for multiple molecular sequence alignment. Comput. Appl. Biosci. 13(6), 565-581 (1997).
53. Anabarasu LA: Multiple sequence alignment using parallel genetic algorithms.
In: The Second Asia-Pacific Conference on Simulated Evolution and Learning (SEAL-98), Canberra, Australia (1998).
54. Gonzalez RR: Multiple protein sequence comparison by genetic algorithms. In: SPIE-98 (1999).
55. Chellapilla K, Fogel GB: Multiple sequence alignment using evolutionary programming. In: Congress on Evolutionary Computation (1999).
56. Cai L, Juedes D, Liakhovitch E: Evolutionary computation techniques for multiple sequence alignment. In: Congress on Evolutionary Computation (2000).
57. Duret L, Abdeddaim S: Multiple alignment for structural, functional or phylogenetic analyses of homologous sequences. In: Bioinformatics: Sequence, Structure and Databanks. Higgins D, Taylor W (Eds), Oxford University Press: Oxford, UK (2000).
58. Durbin R et al.: Biological Sequence Analysis. Cambridge University Press: Cambridge, UK (1998). •• One of the most comprehensive textbooks on the algorithms dedicated to sequence analysis.
59. Devereux J, Haeberli P, Smithies O: GCG package. Nucleic Acids Res. 12, 387-395 (1984).
60. Carrillo H, Lipman DJ: The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48, 1073-1082 (1988).
61. Reinert K, Stoye J, Will T: An iterative method for faster sum-of-pairs multiple sequence alignment. Bioinformatics 16(9), 808-814 (2000).
62. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1092 (1953).
63. Holland JH: Adaptation in Natural and Artificial Systems. University of Michigan Press: Ann Arbor, MI (1975).
64. Ishikawa M, Toya T, Hoshida M, Nitta K, Ogiwara A, Kanehisa M: Multiple sequence alignment by parallel simulated annealing. Comput. Appl. Biosci. 9, 267-273 (1993).
65. Goldberg DE: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley: New York, USA (1989).
66. Notredame C, Holm L, Higgins DG: COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14(5), 407-422 (1998).
67. Notredame C, O'Brien EA, Higgins DG: RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res. 25(22), 4570-4580 (1997).
68. Eddy SR: Multiple alignment using hidden Markov models. In: Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England. Menlo Park, CA: AAAI Press (1995).
69. Berger MP, Munson PJ: A novel randomized iterative strategy for aligning multiple protein sequences. Comput. Appl. Biosci. 7, 479-484 (1991).
70. Brocchieri L, Karlin S: Asymmetric-iterated multiple alignment of protein sequences. J. Mol. Biol. 276, 249-264 (1998).
71. Kececioglu JD: The maximum weight trace problem in multiple sequence alignment. Lecture Notes in Computer Science 684, 106-119 (1993).
72. Vingron M, Argos P: Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol. 218, 33-43 (1991).
73. Morgenstern B, Frech K, Dress A, Werner T: DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14(3), 290-294 (1998).
74. Huang X, Miller W: A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. 12, 337-357 (1991).
75. Bucka-Lassen K, Caprani O, Hein J: Combining many multiple alignments in one improved alignment. Bioinformatics 15(2), 122-130 (1999).
76. Lenhof HP, Morgenstern B, Reinert K: An exact solution for the segment-to-segment multiple sequence alignment problem. Bioinformatics 15(3), 203-210 (1999).
77. Jennings AJ, Edge CM, Sternberg MJ: An approach to improving multiple alignments of protein sequences using predicted secondary structure. Protein Eng. 14(4), 227-231 (2001).
78. Dietmann S, Park J, Notredame C, Heger A, Lappe M, Holm L: A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res. 29(1), 55-57 (2001).
79. Taylor WR, Saelensminde G, Eidhammer I: Multiple protein sequence alignment using double-dynamic programming. Comput. Chem. 24(1), 3-12 (2000).
80. Orengo CA, Taylor WR: A rapid method of protein structure alignment. J. Theor. Biol. 147, 517-551 (1990).
81. Eidhammer I, Jonassen I, Taylor WR: Structure comparison and structure patterns. J. Comput. Biol. 7(5), 685-716 (2000).
82. Notredame C: Mocca: semi-automatic method for domain hunting. Bioinformatics 17(4), 373-374 (2001).
83. Heringa J, Argos P: A method to recognise distant repeats in protein sequences. Proteins: Structure, Function and Genetics 17, 391-411 (1993).
84. Hughey R, Krogh A: Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci. 12, 95-107 (1996).

USE OF GENETIC ALGORITHMS FOR THE ANALYSIS OF BIOLOGICAL SEQUENCES
Cédric Notredame
PhD in Bioinformatics, February 1998
Université Paul Sabatier, France
Thesis supervisor: Prof. François Amalric

Acknowledgements
I wish to thank the EMBL for funding this work through an EMBL grant. The work was carried out in the lab of Des Higgins, first at the European Molecular Biology Laboratory in Heidelberg, Germany, and later at the EMBL outstation, the European Bioinformatics Institute, in Hinxton, UK. Des has been a constant source of support; I wish to thank him and express all my gratitude for teaching me most of what I know in bioinformatics, and so much more about being a scientist. I also wish to thank Thure Etzold, who gave me a place in his team when Des had to leave for Ireland; thanks to his expertise in the field, Thure has had a lot of influence on my work. A special thanks to the system managers, namely Roy Omond, Rodrigo Lopez and Petteri Jokinen, who were always supportive, allowing me to overload the machines whenever I needed to. Without their help, this work could not have been achieved. I also wish to thank Miguel Andrade, Inge Jonassen and Burkhard Rost for stimulating discussions and friendship.
Chris Sander and Liisa Holm have always been available to share their extensive experience of the field with me; I wish to express my gratitude to them. Michelle Magraine has proved an invaluable ally in the fight against my weaknesses in English grammar, and I wish to thank her for that. I also wish to thank Rob Harper for joining me in my everyday fight against PostScript files. My stay at the EBI has been extremely enjoyable. This work is dedicated to my friends and family, for their constant support over all these years.

THESIS SUMMARY
A large part of fundamental research in molecular biology relies on the study of proteins and nucleic acids. These extremely complex molecules result from the combination of simpler elements: amino acids and nucleotides. Twenty amino acids make up the great majority of proteins; five nucleotides make up most nucleic acids. The term "sequence" designates the chain of nucleotides constituting a nucleic acid, or the chain of amino acids constituting a protein. Most proteins are encoded by the genes contained in chromosomal DNA. In recent years, numerous technical advances have made large-scale sequencing of the genomes of several bacterial and eukaryotic species possible. These DNA sequences are stored in specialized databases (SwissProt, GenBank, the EMBL nucleotide database...) whose growth in size is now exponential. Bioinformatics is the sub-discipline of biology devoted to the analysis of these data by computational means. The basic principle of such an approach is the notion of a relationship between sequence and function: the goal is to extrapolate data obtained experimentally on some sequences to other sequences for which no experimental data are available.
While it is clear that two proteins (or nucleic acids) with the same sequence probably have the same function, the correlation becomes harder to establish when the sequences show only partial homology. The need to exploit this type of information is the main motivation behind the development of the comparison methods in use today, and the work presented in this thesis is essentially devoted to this aspect of bioinformatics. One of the most widely used means of comparing sequences is alignment. An alignment identifies the conserved regions between two sequences. These regions may correspond to structural or functional motifs whose identification allows predictions to be made about the putative function of the sequences under analysis. More generally, an alignment identifies regions subject to various constraints that impose the maintenance of certain properties. A good alignment also allows the evolutionary distance separating two organisms, or two proteins, to be estimated. In complex cases, however, the amount of information contained in two sequences is not sufficient, and it becomes necessary to extend the comparison to several sequences. This is the purpose of multiple sequence alignments. The problem they raise is twofold. It is first a biological problem: given a group of sequences, the properties of the optimal alignment must be defined. The simplest rule is to try to generate as many identities as possible in the columns while limiting the number of insertions/deletions (gaps). In practice, however, the rules used are more complex and may take into account the nature of the aligned amino acids (proteins) or the secondary structure of the sequences (RNA).
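The pairwise alignments discussed above are classically computed by dynamic programming. Below, a score-only sketch of the Needleman & Wunsch recurrence, with toy match/mismatch/gap parameters chosen for illustration (real programs use substitution matrices and affine gap penalties):

```python
def nw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, computed row by row
    with O(len(b)) memory (score only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]    # first row: all gaps
    for i, x in enumerate(a, 1):
        cur = [i * gap]                            # first column: all gaps
        for j, y in enumerate(b, 1):
            cur.append(max(prev[j - 1] + (match if x == y else mismatch),
                           prev[j] + gap,          # gap in b
                           cur[j - 1] + gap))      # gap in a
        prev = cur
    return prev[-1]
```

Adding a traceback over the same matrix recovers the alignment itself rather than just its score.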
The custom is to give this list of rules a mathematical form that associates a score with each alignment; one then speaks of an objective function. A large number of such functions have been described over recent years. Broadly, they can be divided into two groups: functions based on substitution matrices and insertion/deletion penalties, and global functions such as HMMs (hidden Markov models). One of the most important properties of an objective function is its biological meaning: ideally, a function should assign to an optimal alignment a score reflecting the biological interest of the information it contains. The second aspect is purely computational. It is not enough to have an objective function; one must also be able to optimize its score (i.e. to produce the alignment with the best score). This problem is far from trivial: the optimization of most objective functions belongs to the class of so-called NP-complete problems. Consequently, the optimization can only be carried out with heuristic methods, which do not guarantee an optimal solution. The work presented in this thesis covers all of these questions. In the first part, a global optimization method based on a genetic algorithm is proposed. This method is implemented in a program named SAGA (Sequence Alignment by Genetic Algorithm). Genetic algorithms are optimization strategies based on an analogy with natural selection. In theory, this method can be applied to any type of objective function.
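The first family of objective functions can be illustrated with a minimal sums-of-pairs score; toy match/mismatch values and a linear gap cost stand in here for a real substitution matrix and the affine penalties used in practice:

```python
def sum_of_pairs(msa, match=2, mismatch=-1, gap=-2):
    """Sums-of-pairs score of a multiple alignment: every pair of rows is
    scored column by column.  A residue against a gap costs `gap`;
    two gaps facing each other cost nothing."""
    total = 0
    for col in zip(*msa):                       # iterate over columns
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                a, b = col[i], col[j]
                if a == '-' and b == '-':
                    continue
                total += gap if '-' in (a, b) else (match if a == b else mismatch)
    return total
```

Maximizing this quantity over all possible gap placements is exactly the NP-complete optimization problem mentioned above.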
The second part of the work consisted of defining a new objective function (COFFEE: Consistency based Objective Function For alignmEnt Evaluation) and optimizing it with SAGA, in order to show that COFFEE can induce better alignments than alternative methods. The third application focused on the alignment of ribosomal RNAs, with the definition of an objective function suited to taking secondary interactions into account; this program, adapted from SAGA, was named RAGA (RNA Alignment by Genetic Algorithm). One of the main limitations of RAGA lies in the simplicity of the objective function used. To remedy this, reference alignments were analysed in order to determine the parameters that could help define an objective function that models the alignment constraints more realistically. This work constitutes the fourth application presented in this thesis. Overall, this work establishes the usefulness of genetic algorithms in the context of multiple sequence alignment problems. SAGA is currently the best-performing algorithm for optimizing the objective functions commonly used for multiple sequence alignments. For protein sequences, SAGA is the only algorithm capable of globally aligning more than ten sequences. For ribosomal RNA, RAGA is the only program able to align sequences longer than two thousand base-pairs while taking pseudo-knots into account. In addition, the COFFEE function is one of the few able to induce alignments that are biologically more accurate than those obtained with ClustalW (one of the most popular alignment methods).
SUMMARY OF THE APPENDICES
Document 1. SAGA: Sequence Alignment by Genetic Algorithm
In this article, a new approach is proposed for solving the multiple sequence alignment problem. A genetic algorithm was designed and implemented in a program named SAGA. The method involves the evolution of a population of alignments. In this context, evolution means that the quality of the alignments is gradually improved over a succession of cycles (generations) comprising steps of random modification (operators) and steps of score-based selection. The degree of improvement is judged by evaluating the score of each alignment with the objective function. SAGA uses an automatic control scheme to regulate the simultaneous use of twenty operators designed either to recombine alignments with one another (crossovers) or to modify them individually (mutations). To test SAGA, we used as a reference the program MSA (Multiple Sequence Alignment), which can optimize one of the most commonly used objective functions (sums-of-pairs with affine insertion/deletion penalties). Used in this context, SAGA gives better results than MSA in terms of optimization (score of the resulting alignment). Moreover, the alignments produced by SAGA are biologically more accurate, judging by their similarity to alignments of the same sequences obtained by structure comparison. In total, SAGA was tested on thirteen groups of sequences for which a structure-based reference alignment is available in the 3D_ali database.
Document 2. COFFEE: A New Objective Function for Multiple Sequence Alignments
In this work, we present a new way of evaluating multiple sequence alignments. This function is named COFFEE.
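The evolutionary scheme just described can be caricatured in a few lines: a single toy gap-insertion operator stands in for SAGA's twenty operators, and plain truncation selection for its score-based selection. Everything here (function names, the objective, the parameters) is illustrative, not SAGA's actual implementation.

```python
import random

def match_count(aln):
    """Toy objective: number of identical residue pairs per column."""
    total = 0
    for col in zip(*aln):
        res = [c for c in col if c != '-']
        total += sum(res[i] == res[j] for i in range(len(res))
                     for j in range(i + 1, len(res)))
    return total

def evolve(seqs, objective, generations=200, pop_size=20, seed=0):
    """Evolve a population of alignments by random modification and
    score-based selection; returns the best alignment found."""
    rng = random.Random(seed)

    def pad(rows):
        width = max(len(r) for r in rows)
        return tuple(r + '-' * (width - len(r)) for r in rows)

    def mutate(aln):
        rows = list(aln)
        i = rng.randrange(len(rows))
        r = rows[i].replace('-', '')            # strip, then reinsert one gap
        k = rng.randrange(len(r) + 1)
        rows[i] = r[:k] + '-' + r[k:]
        return pad(rows)

    pop = [pad(seqs)] * pop_size
    for _ in range(generations):
        pop = pop + [mutate(rng.choice(pop)) for _ in range(pop_size)]
        pop.sort(key=objective, reverse=True)
        pop = pop[:pop_size]                    # truncation selection
    return pop[0]
```

Because parents are retained before selection, the best score never decreases from one generation to the next, so the result is always at least as good as the naive right-padded alignment it starts from.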
COFFEE measures the degree of consistency between a multiple sequence alignment and a reference library containing the same sequences aligned two by two. It is shown that the COFFEE score can be efficiently optimized by SAGA. The function was used on eleven groups of sequences for which a reference alignment is available in the 3D_ali database. In nine cases out of eleven, SAGA used with COFFEE produces better alignments than those obtained with ClustalW (judging by their similarity to the reference alignments). We also showed that the score assigned by COFFEE can be used to evaluate the quality of a multiple alignment, either locally or globally. Finally, the reference library can be built from pairwise alignments obtained by structure comparison (for instance, alignments extracted from FSSP); in that case, SAGA-COFFEE is able to produce multiple structural alignments of very high quality. In theory, COFFEE should make it possible to apply to multiple alignments any method suited to the alignment of pairs of sequences.
Document 3. RAGA: RNA Sequence Alignment by Genetic Algorithm
This article describes a new approach for aligning two homologous RNA sequences when the secondary structure of one of the two molecules is known. To this end, two programs were developed: RAGA (RNA sequence Alignment by Genetic Algorithm) and PRAGA (Parallel RAGA). Both are essentially based on SAGA. Parallelization is achieved by synchronizing a defined number of active copies of RAGA, which exchange part of their populations following a topology defined as a tree with multiple branches and variable depth. This method makes it possible to optimize an objective function that takes into account the primary and secondary information contained in the two sequences.
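The consistency measure can be written down in a few lines. This sketch scores an alignment by the unweighted fraction of its aligned residue pairs that are present in the pairwise library; the published COFFEE function additionally weights each pair (e.g. by sequence identity), which is omitted here:

```python
def coffee_score(msa, library):
    """msa: list of equal-length gapped strings.
    library: set of ((seq_index, ungapped_pos), (seq_index, ungapped_pos))
    pairs taken from a reference collection of pairwise alignments.
    Returns the fraction of aligned pairs supported by the library."""
    pos = [0] * len(msa)                 # ungapped position per sequence
    aligned = supported = 0
    for col in zip(*msa):
        residues = []
        for i, c in enumerate(col):
            if c != '-':
                residues.append((i, pos[i]))
                pos[i] += 1
        for k, a in enumerate(residues):
            for b in residues[k + 1:]:
                aligned += 1
                supported += (a, b) in library or (b, a) in library
    return supported / aligned if aligned else 0.0
```

A score of 1.0 means every pair of residues placed in the same column is also aligned in the library; structure-derived libraries (e.g. built from FSSP pairs) simply change the content of `library`, not the scoring.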
One of the most interesting properties of RAGA is that it can handle both classical stem-loops and the pseudo-knots present in ribosomal RNA. RAGA was tested against nine reference alignments extracted from expert alignments; these alignments, made of small-subunit ribosomal RNAs, served as references. In every case, PRAGA outperforms, in accuracy, the alternative methods based on dynamic programming. This holds even when the phylogenetic distance separating the two sequences to be aligned is very large (as between human and Saccharomyces cerevisiae).
Document 4. Optimisation of RNA Profile Alignments
This project complements Document 3 and is part of a broader effort to create the tools needed for the automatic maintenance of ribosomal RNA databases. Several databases of this type exist around the world; they are essentially maintained by hand, and in the long term the creation of automatic methods will become an absolute necessity. In Document 3, we proposed a procedure for aligning two sequences. Using only two sequences amounts to ignoring the vast amount of information contained in the multiple alignments established by expert groups. The goal of this project was to define some of the ways this information can be used. Different weighting methods were tested and implemented in a dynamic programming framework. The methods were evaluated by testing the quality of the alignment obtained when a sequence is extracted from, then realigned to, a multiple alignment. This strategy takes only primary constraints into account.
The results show that, with an adequate weighting scheme and a suitable system of insertion/deletion penalties, the quality of the alignment between a profile and a sequence can be improved considerably.

1 INTRODUCTION
2 THE SCOPE OF SEQUENCE ALIGNMENTS
3 EVALUATING ALIGNMENTS
4 MAKING MULTIPLE SEQUENCE ALIGNMENT
5 COMPARISON OF THE METHODS
6 SUMMARY OF THE CONTRIBUTIONS
CONCLUSION

TABLE OF CONTENTS

1 INTRODUCTION
1.1 BIOINFORMATICS AND BIOLOGY
1.2 COMPARING SEQUENCES
1.3 OUR APPROACH
2 THE SCOPE OF SEQUENCE ALIGNMENTS
2.1 WHAT IS A SEQUENCE ALIGNMENT?
2.2 WHAT IS THE USE OF A SEQUENCE ALIGNMENT?
2.3 WHAT IS A 'GOOD' ALIGNMENT?
2.4 HOW TO BUILD A 'GOOD' SEQUENCE ALIGNMENT
3 EVALUATING ALIGNMENTS
3.1 PROTEIN PAIRWISE ALIGNMENTS
3.1.1 Substitution Matrices
3.1.2 Gap Penalties
3.1.3 Database Searches
3.2 EVALUATING PROTEIN MULTIPLE SEQUENCE ALIGNMENT
3.2.1 Using Substitution Matrices and Gap Penalties: SP Alignments
3.2.2 Weights
3.2.3 Profiles
3.2.4 Hidden Markov Models
3.3 RNA ALIGNMENTS: TAKING INTO ACCOUNT NON-LOCAL INTERACTIONS
4 MAKING MULTIPLE SEQUENCE ALIGNMENT
4.1 COMPLEXITY OF THE PROBLEM
4.2 DETERMINISTIC GREEDY APPROACHES
4.2.1 Aligning Two Sequences
4.2.2 Aligning Two Alignments: Progressive Alignment Methods
4.3 DETERMINISTIC APPROACHES FOR NON-PROGRESSIVE MULTIPLE ALIGNMENTS
4.3.1 The Carrillo and Lipman Algorithm
4.3.2 Other Approximation Techniques
4.4 STOCHASTIC HEURISTICS
4.4.1 What is a Stochastic Method?
4.4.2 Iterative Alignments and Expectation-Maximization Strategies
4.4.3 Simulated Annealing
4.5 GENETIC ALGORITHMS
4.5.1 What is a Genetic Algorithm?
4.5.2 Applications of GAs in Sequence Analysis
5 COMPARISON OF THE METHODS
6 SUMMARY OF THE CONTRIBUTIONS
6.1 SAGA: MAKING MULTIPLE SEQUENCE ALIGNMENT BY GENETIC ALGORITHM
6.2 COFFEE: IMPROVING ON EXISTING OBJECTIVE FUNCTIONS
6.3 RAGA: THREADING RNA SECONDARY STRUCTURES
6.4 OPTIMIZING RIBOSOMAL RNA PROFILE ALIGNMENTS
CONCLUSION
APPENDIX: RESEARCH PAPERS

Annex 1: "SAGA: Sequence Alignment by Genetic Algorithm", Notredame and Higgins, 1996
Annex 2: "COFFEE: A new Objective Function for Multiple Sequence Alignments", Notredame, Holm and Higgins, 1998
Annex 3: "RAGA: RNA Alignment by Genetic Algorithm", Notredame, O'Brien and Higgins, 1997
Annex 4: "Optimisation of Ribosomal RNA Profile Alignments", O'Brien, Notredame and Higgins, 1998

1 INTRODUCTION

1.1 BIOINFORMATICS AND BIOLOGY

Life as we know it is a complex arrangement of biological structures designed to interact with each other. As complex as they may appear, even the most elaborate living structures can be described as arrangements of smaller, less complex building blocks (such as cells), which are themselves the result of the combination of even smaller basic blocks (such as metabolites, proteins and nucleic acids). Identifying these structures and characterizing their functions is a major aim of biology. Questions can be addressed at any level of organization one may wish to study (from populations down to atoms, in fact). In such a top-to-bottom approach, molecular biology is almost at the bottom. It deals with biological structures at the molecular level, trying to understand how they are created and how they interact with one another to perform the basic functions of life.
The search for ordered systems is an important part of the biological methodology. Ordered systems usually make it possible to establish general rules allowing a global understanding of otherwise disparate collections of facts. In this respect, the discovery of the structure of DNA and the understanding of RNA and protein synthesis have been two of the most important milestones of modern molecular biology. They have allowed a deep and precise understanding of some of the most central cellular mechanisms. We now know that proteins and RNA molecules are involved in virtually all steps of biological processes. We also know that these key components of cellular life are encoded in DNA sequences, in an almost universal manner.

These DNA sequences are contained in the genomes of living organisms. At the lowest level, a genome can be described as a long string of nucleotides. It could be compared to a very long text made of four letters. As in a text, the letters are not distributed at random but organized in words. In the context of a genome, a word is any sequence having a function. One of the difficulties in trying to identify these 'words' stems from the fact that nature uses spaces and punctuation in a very personal way. This means that not only do we not know beforehand the function of the 'words', we also do not know where they start and finish. Add to this the fact that there are many different classes of 'words'. Some allow the binding of other molecules to the DNA; others are transcribed into RNAs, which can in turn be translated into proteins. A protein or an RNA molecule may contain motifs that have functions of their own (binding sites, catalytic sites...). The genome also contains some very specific combinations of words, such as genes (enhancers, promoters, introns, exons...).
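The 'word search' described above can be made concrete with a toy example. The motif and genome string below are invented for illustration; real motif discovery is far harder precisely because the words and their boundaries are not known in advance:

```python
# Toy illustration of finding a known 'word' in a genome 'text'.
# The motif and the sequence are invented examples, not real
# annotated data.

def find_motif(genome: str, motif: str) -> list:
    """Return every 0-based position where the motif occurs."""
    hits = []
    for i in range(len(genome) - len(motif) + 1):
        if genome[i:i + len(motif)] == motif:
            hits.append(i)
    return hits

genome = "GGCTATAATGCGTATAATCC"
print(find_motif(genome, "TATAAT"))  # exact occurrences of the motif
```

Exact string matching of this kind is only the degenerate, easy case of the problem sketched in the text above.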
Bioinformatics and molecular biology are two complementary techniques with similar aims: the identification of biological structures and sub-structures at the molecular level, and the characterization of their function. 'Function' is a very general concept. If we look at it from an experimental point of view, such as genetics, a function can be defined with respect to a gene and will often be deduced from what happens when this gene is inactivated or modified by a mutation. Often, this is not enough to gain a truly deep understanding of the mechanisms involved. To do so, one will have to know whether this function is performed by a protein, an RNA, or a regulatory sequence. If it is a protein, then the next question is 'how does the protein perform its function?'. If it is an enzyme, we will want to know where the catalytic site is, whether it looks like any other known site, which residues are involved in the site and how they perform their function. The protein (enzyme or not) may also interact with other proteins, nucleic acids or metabolites. Here again, we will want to know which portions of the protein are involved in these interactions and what the potential partners are. In most cases, this information will be much easier to understand and predict when a 3D model of the protein is available. In a broad sense, the function of a protein (or of a nucleic acid or a DNA/RNA regulatory element) is defined by the sum of all these elements.

Until recently, the only way to gather these pieces of information together was to use wet-lab techniques: genetic analysis, cloning, sequencing, interaction experiments and so on. Although the results obtained in this way can usually be regarded as strong biological evidence, they suffer from a major drawback: the cost of wet-lab techniques is extremely high, in terms of both time and money. This means that there is a limit on the number of functions that can be thoroughly investigated through such techniques.
The problem has become especially severe now that the improvement of sequencing techniques gives us access to far more sequences than it will ever be possible to analyse in the wet lab. It is this situation that has prompted the massive development of bioinformatics techniques over the last few years. Bioinformatics could be regarded as an approach diametrically opposed to the traditional experimental ones: instead of starting from a phenotype, one starts with a sequence and tries to gather as much information as possible by comparing this sequence to others for which experimental evidence is available. But the difference between the two approaches is much less acute than it seems. Bioinformatics relies on the same basic assumptions as classic biology. It is a method of inquiry based on a series of comparisons that lead to classifications and predictions. The main paradigm of bioinformatics is that sequence conservation is correlated with function conservation. Under this framework, the aim is to extrapolate, as far as possible, the information acquired experimentally. The process follows a traditional feedback scheme where models are built and then validated or invalidated by experiments made in the wet lab or in silico.

Darwinian laws of evolution and the notion of parsimony often underlie the bioinformatics approach. The assumption is that biological systems have evolved from the same origin, constantly reusing some basic building blocks (such as metabolic pathways) and adapting them to respond to their environmental constraints. If, each time a new constraint appeared, a new biological system were created from scratch, the bioinformatics approach would probably be bound to fail. Fortunately, in most cases, this is not what happens.
Through the cycles of mutation and selection that constitute evolution, new functions have been created by reusing pieces of already existing machinery, and existing functions have evolved to become better adapted to the environment in which they are needed. In terms of sequences, this means that two sequences responsible for similar functions may differ, depending on how long they have been diverging (i.e. how long ago the original sequence was duplicated, or how long ago the two organisms containing these sequences started diverging). Nevertheless, if the distance separating them is small enough, an evolutionary scenario can be reconstructed that shows how related these sequences are. Depending on what is known for one of these sequences (or for other sequences of the same category), it then becomes possible to make assumptions about the function. On the other hand, if the sequences are evolutionarily too far apart, it may prove difficult to analyse their relationship accurately by comparing the sequences alone. The signal they contain may have to be enhanced using other techniques, such as structure prediction.

Sequences are only conceptual objects. As such, they have no function in a cell. In fact, even the distinction between RNA, proteins and DNA is artificial. As far as the cell is concerned, all these elements only exist as complex 3D arrangements of atoms. It is because of its precise 3D structure that a molecule has the mechanical and chemical properties it needs to perform its function (catalytic activity, interactions...). The relation between structure and function is probably one of the oldest paradigms of molecular biology. We also accept that, broadly speaking, structure is induced by sequence, although we know for a fact that very different amino acid sequences can code for similar 3D folds.
This last point helps to explain why proteins with different sequences can have similar function and structure: natural selection acts on the active 3D structures rather than on the sequences (i.e. evolution gives more freedom to the sequence than to the structure). As a consequence, relationships between proteins (or RNAs) are usually easier to analyse when the structures are known. Unfortunately, structures are difficult to determine experimentally, and prediction from sequence alone (ab initio folding, threading) is still one of the main challenges of computational biology. It is true, however, that useful tools exist that can help supplement weak sequence identity.

Developing new techniques for automatically analysing sequences is one of the main purposes of research in bioinformatics. It is a point of crucial importance. Today, all the major databases of nucleotide or protein sequences are growing in size at an exponential rate (doubling every year or so). This means that the proportion of sequences for which experimental data are available is decreasing. For this reason, targeting the points at which experiments are needed has become more important than ever. Such a goal will only be achieved by gaining a better understanding of the ways in which information can be extrapolated from one sequence (or a set of sequences) to another. This is the only way to make any use of the DNA sequencing results (at least in a reasonable amount of time). It is for this reason that sequence comparison tools are at the heart of the bioinformatics approach.

Figure 1. Growth of the EMBL Nucleotide database over the 1985-1997 period. The last release of the EMBL nucleotide sequence database (Rel. 52, October 1997) contained 1,181,167,498 nucleotides. The last release of the Swiss-Prot database (Rel. 32, October 1996) contained 21,210,389 amino acids in 59,021 entries.
For comparison, the last PDB release (proteins with known structures) of December 1997 contained a total of 6,731 entries.

1.2 COMPARING SEQUENCES

In most cases, the problem facing the user takes the following form: a new sequence is available, and it is desirable to search the database to find out whether one or more close relatives of this sequence have already been reported. If so, one may wish to extrapolate some of the experimental data gathered in this way to the new sequence. In such a case, the solution is to compare the sequence of interest to all the sequences contained in the database, keeping track of the most similar. Two very popular tools are used to perform basic database similarity searches: FASTA (1) and BLAST (2).

Sequence comparison can also be much more complex. For instance, by combining experimental data contained in the databases and sequence analysis, one may want to know whether a specific motif is sufficient for a protein to bind zinc. If several types of alternative motifs emerge from such an analysis, one may want to build a classification. Further experimental data may also be available that allow the establishment of functional differences between the zinc-binding motifs (some may be associated with RNA-binding proteins and others with DNA-binding proteins). This, for instance, is one of the ideas developed in the Prosite database (3). Once such results have been established, new sequences can be scanned for the known motifs they contain.

As simple and trivial as they may seem, such strategies raise various very complex difficulties that need to be overcome. First of all, one has to define what 'comparing' means. This is typically a biological problem. The features one is interested in when looking at two or more sequences will obviously depend on the aim and the scope of the comparison.
Do we want to compare two proteins because they have the same sequence, the same function, related functions, because they are expressed in the same circumstances, or because they have similar folds? Are these questions equivalent? When it comes to making a detailed analysis, sequence alignments appear as one of the most powerful solutions. In the simplest case, they only involve two sequences (pairwise alignments), but they can be extended to a larger number (multiple sequence alignments).

Having decided that we want to use sequence alignments, the problem of defining what a 'good' sequence alignment is remains. It is a difficult question that requires a deep understanding of the biological information one wishes to extract from such alignments. In most cases, the criterion that allows evaluation of the quality of an alignment will take the form of a mathematical function (an objective function) associating a value with an alignment. But even so, the problem is not solved. Having a criterion for alignment quality is not enough: one also needs to be able to build the best-scoring alignment according to this quality criterion. In most cases, this is far from easy. Many of the problems in bioinformatics, and more specifically in sequence alignment, are NP-complete. This means that the number of potential solutions rises exponentially with the number of sequences and their length, and that no algorithm is known that finds the optimal solution in polynomial time and space. The need to overcome such severe limitations requires the development of powerful algorithms.

1.3 OUR APPROACH

In the work presented here, the problem of sequence alignment was approached through the two aspects mentioned above:
- defining new objective functions for sequence alignment;
- developing new ways to optimize these functions.
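To make the notion of an objective function concrete, the sketch below scores a multiple alignment with a simple sum-of-pairs scheme. The match/mismatch/gap values are arbitrary illustrations, not the scoring scheme proposed or evaluated in this work:

```python
# Minimal sum-of-pairs objective function for a multiple alignment.
# Scheme (illustrative only): match +1, mismatch -1, residue against
# gap -2, gap against gap 0.

def sum_of_pairs(alignment):
    """Map an alignment (list of equal-length gapped strings) to a score."""
    score = 0
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        # score every pair of residues in the column
        for i in range(len(column)):
            for j in range(i + 1, len(column)):
                a, b = column[i], column[j]
                if a == "-" and b == "-":
                    continue            # gap/gap: 0
                elif a == "-" or b == "-":
                    score -= 2          # residue/gap
                elif a == b:
                    score += 1          # match
                else:
                    score -= 1          # mismatch
    return score

print(sum_of_pairs(["GARFIELD-",
                    "GARF-ELDS",
                    "GARFIELDS"]))
```

Once such a function exists, 'aligning' becomes an optimization problem: find, among the astronomically many candidate alignments, one that maximizes the score.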
One of the main concerns of this approach was the fact that there is no use in defining new sequence comparison schemes if no tool is available to use them and to allow a judgment to be made on their potential relevance. To test an objective function, one must be able to optimize it and to compare the quality of the alignments it provides with that of other methods. The new optimization scheme proposed here is a genetic algorithm named SAGA, for Sequence Alignment by Genetic Algorithm (Annex 1). This algorithm was used to evaluate the biological relevance of COFFEE (Consistency Based Objective Function for alignmEnt Evaluation), a new objective function designed for protein multiple sequence alignment (Annex 2). SAGA was also adapted to RNA alignments (RAGA, RNA Alignment by Genetic Algorithm) using an objective function that takes into account secondary-structure interactions in RNA (Annex 3). In order to improve RAGA's objective function, a new function was designed for aligning an RNA sequence to a large multiple RNA sequence alignment (Annex 4). This function was only tested using a traditional optimization method (dynamic programming).

The following sections deal with the three main concepts associated with sequence alignment: what it is useful for, how to define a sequence alignment and, finally, how to build a sequence alignment. The last section puts the four contributions in their relative context.

2 THE SCOPE OF SEQUENCE ALIGNMENTS

2.1 WHAT IS A SEQUENCE ALIGNMENT?

A sequence alignment is the representation of two sequences in a way that reflects their relationship. If the alignment is correct, two residues aligned with one another are homologous. The definition of homology depends on the criterion used for the alignment. For instance, if the aim is to identify the relationship between two structures, two residues will be aligned because they are equivalent in the 3D structures.
If the alignment is designed to reflect phylogenetic relationships, two residues will be aligned when they originate from the same residue in the common ancestor. The definition of a pairwise alignment can be extended to multiple sequence alignments. In this case, several sequences are aligned together and each column contains homologous residues. However, homologous residues do not necessarily exist in each sequence for each position of the alignment. If a given sequence lacks a residue, a gap will be inserted in its place at the corresponding position. Gaps usually take the form of strings of null signs. In an evolutionary context, a null sign means that a residue was inserted in one of the sequences, or deleted in the other, while the sequences were diverging from their common ancestor.

There are two types of alignments: global and local. In a local alignment, the only portions that are aligned are those which are clearly homologous; the rest of the sequence is ignored. In a global alignment, the whole sequences are aligned, regardless of the local level of similarity. The scope of global and local alignments is usually different. Local alignments are more appropriate when the sequences analysed are remotely related and may share only a few domains. Global alignments are mostly designed to analyse sequences that are known to be homologous to one another. In this thesis, I will mostly consider global sequence alignments.

It is also important to realize that, given a set of sequences, there are a great many alternative alignments. For instance, given two sequences of 1000 residues each, there are about 10^764 different possible alignments. This rules out any naive enumeration strategy for identifying the correct one! Instead, we will see that several strategies have been developed that allow more or less efficient computation in polynomial time.

2.2 WHAT IS THE USE OF A SEQUENCE ALIGNMENT?
Quantitatively, the most widely used application of sequence alignment is database searching, where the aim is to find, for a given query sequence, all the related sequences contained in a database. The principle is very straightforward: the query sequence is aligned in turn with every member of the database, and the results are ranked according to some similarity criterion. FASTA(1) and BLAST(2) are the most popular tools of this type. They rely on local rather than global alignments. However important the results obtained in this way, there is a clear limit to the quality of the alignments that can be deduced from these searches. Firstly, the sequence alignment algorithms implemented in these programs are only crude approximations of the standard sequence alignment algorithm; this is necessary in order to search very large databases in a reasonable amount of time. Secondly, in most cases the searches are based on pairwise alignments, which means that they only contain a limited amount of information.

Although database searching is at the heart of many approaches, pushing the analysis further may require important refinements, such as structure prediction, identification of new motifs or domains, generalization of the family properties (i.e. combining the information contained in the known sequences in order to identify distant members), or phylogenetic analysis. For all these applications, pairwise alignments are of limited use: a way to simultaneously combine the information contained in several sequences is needed. Such a need is the main motivation for building multiple sequence alignments.

Multiple alignments are very important for phylogenetic analyses because they provide a way to compute evolutionary distances and phylogenetic trees. Trees are computed from sets of pairwise distances using clustering algorithms such as the neighbor-joining method(4).
When computing a tree, it is very important to have accurate pairwise distances, hence the use of a multiple alignment, in which the pairwise alignment of two sequences depends on the information contained in all the sequences of the set.

Another fundamental application of multiple sequence alignments is the identification of motifs or domains. In a multiple alignment, these elements often appear as portions under constraints that limit divergence. If some of the sequences are experimentally characterized, these motifs can be used for function prediction. This is, for instance, one of the aims of the Prosite(3) and ProDom(5) databases. The information contained in a multiple sequence alignment can also be generalized in order to produce a profile(6) or a hidden Markov model(7) that can be used for identifying new family members.

The other important use of multiple alignments is structure prediction. In a given protein, residues do not all evolve in the same fashion, depending on their role in the structure (buried/exposed, helix/beta strand/loop...). It is very hard to extract such information from a sequence alone, while it can be accessed through the analysis of multiple alignments by looking at the distribution of the substitutions. Using multiple sequence alignments instead of sequences alone has had a dramatic effect on this area of sequence analysis, boosting the accuracy of protein secondary structure predictions from 55%(8, 9) to 75%(10). One can also go further and try to identify correlated mutations in multiple sequence alignments. This has been attempted on many occasions in proteins, with limited success(11-13). In contrast, in RNA analysis, the identification of correlated mutations has been of great help, allowing accurate prediction of secondary structures and even tertiary structures(14-16).
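The per-column distribution of substitutions mentioned above can be summarized numerically, for instance with Shannon entropy, a common conservation measure. A toy sketch (the function name and the decision to count gaps as a 21st symbol are my choices, not a published predictor):

```python
import math
from collections import Counter

def column_entropies(alignment):
    """Shannon entropy (in bits) of each column of a multiple alignment.

    Low entropy = conserved column; high entropy = freely substituting
    position.  Gaps, if present, are counted as one more symbol here
    for simplicity.
    """
    ncol = len(alignment[0])
    entropies = []
    for j in range(ncol):
        counts = Counter(seq[j] for seq in alignment)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

msa = ["MKLV", "MKIV", "MRLV"]   # toy alignment: 3 sequences x 4 columns
print(column_entropies(msa))     # columns 0 and 3 are fully conserved (0 bits)
```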
Finally, a less challenging but very important application of multiple sequence alignments is the localization of highly conserved areas for the design of efficient PCR primers, in order to clone new members of a family. All these examples reflect the importance of multiple sequence alignments in the domain of sequence analysis. We will show here that making good multiple sequence alignments is a multi-step task.

2.3 WHAT IS A 'GOOD' ALIGNMENT?

A scoring function associates a score with an alignment. Ideally, the better the score, the more biologically accurate the alignment. An alignment with the best possible score is said to be optimal, whether it is biologically relevant or not; an optimal alignment always exists. Being able to distinguish between biologically relevant and non-relevant alignments is an important issue, especially when analyzing databases. Powerful statistical tools have been developed for this purpose, allowing hits to be assessed for their potential biological relevance(2). The problem remains when aligning sequences known to be related. For instance, two homologous domains may surround a loop that is different in the two structures. In this case, any alignment of the residues contained in these loops will be meaningless, even though a mathematical optimum exists.

Another important problem, common to many areas of computational biology, has to do with the choice of parameters. Most objective functions come with complex sets of parameters. In many cases, one has to rely on empirical values known to lack robustness (i.e. small changes of the parameter values may induce very different alignments), which may lead to inaccurate alignments. This explains why a large amount of the work dedicated to objective function definition has focused on parameter elimination. Ironically, so much work has been done in this field that the choice of a scoring scheme can almost be regarded as one more parameter requiring optimization.
Among the countless existing methods, we will only describe some of the most important schemes, focusing on those related to the work carried out for this thesis. There are two types of alignments: sequence and structure alignments. Sequence alignments do not require any non-local interactions to be taken into account and are therefore less complex (algorithmically speaking) than structural alignments. For the problem of sequence alignment we will mostly talk about protein sequences, while the problem of structure alignment will be addressed through the example of RNA secondary structures. The problem of protein structure alignment will not be analysed in depth: its complexity and the amount of literature available on the subject put it beyond the scope of this dissertation, which is mostly oriented toward sequence analysis rather than structure.

2.4 HOW TO BUILD A 'GOOD' SEQUENCE ALIGNMENT

In many cases, building a sequence alignment takes the form of a compromise between biological relevance, mathematical optimality and efficiency. Given an objective function, it may be very hard to produce the mathematically optimal alignment. We already mentioned that, done naively through enumeration, sequence alignment is beyond the reach of any computer. For this reason, trade-offs need to be made on both sides: objective functions need to be defined in such a way that they fit existing optimization techniques, and optimization techniques need to be improved in order to accommodate the complexity of the problem. For instance, in its most general form, the problem of aligning two sequences is NP-complete(17). However, if formulated under certain constraints, it can be solved with a technique known as dynamic programming(18), but it becomes NP-complete again with the number of sequences (i.e. when trying to align more than two sequences simultaneously)(19). Structure alignments (i.e.
sequence alignments taking into account non-local interactions) are also NP-complete, even for two sequences, and can only be addressed in a simplified form (i.e. using an objective function that does not reflect all the known constraints)(20). Because of this NP-completeness, most of the algorithms developed in the context of sequence alignment are heuristics(21-24). This means that they do not guarantee a mathematically optimal solution, but rather a good approximation. In many cases, this trade-off is reasonable and allows the computation of multiple sequence alignments in an efficient manner. In this thesis, I will describe some of the optimization methods currently used, with a special emphasis on genetic algorithms.

3 EVALUATING ALIGNMENTS

3.1 PROTEIN PAIRWISE ALIGNMENTS

3.1.1 Substitution Matrices

The twenty amino acids commonly found in proteins have very specific physico-chemical properties such as size, charge and hydrophobicity. The role of a residue in a protein mostly depends on these properties. For this reason, substitutions do not occur at random but in a way that reflects physico-chemical constraints in the 3D structure. It is therefore a very intuitive idea to try to associate with each possible substitution a cost depending on its probability. This information can be stored in what is known as a substitution matrix, a 20 by 20 table giving the relative cost, or the probability, of each possible amino acid substitution. Although many types of matrices have been proposed, the most successful are those derived empirically(25, 26). The principle often involves the statistical analysis of a large set of alignments. Interestingly, these matrices tend to be in general agreement with what would be expected from the physico-chemical properties of the residues (e.g. substitutions conserving charge, size or hydrophobicity have lower costs). We will review here the most popular of these matrices and their relative strengths and weaknesses.
The simplest possible substitution matrices are those that only reward identities in an alignment. Considering their simplicity, they do remarkably well in a variety of cases, probably because they put a very drastic threshold on background noise, only allowing the identification of very strong signals(27). On the other hand, these matrices disregard a large part of the information, which proves a big disadvantage when making pairwise alignments between sequences with a low level of identity but a clear homology. The need to show that two sequences with a low level of identity can still be significantly similar has been one of the main motivations behind the development of more complex substitution matrices that take into account similarity as well as identity.

The Dayhoff matrices(25), also known as PAM matrices, are among the most widely used. The principle on which they are built is quite straightforward. Alignments of closely related proteins (more than 85% identity) are made; at such a high level of identity, alignments are usually straightforward and accurate. In these alignments, the frequency of each possible substitution is measured. The table of frequencies obtained in this way is turned into a probability model (a log-odds matrix). This model can be used to define weight matrices appropriate for comparing sequences of any degree of divergence. The distance is measured in PAM (point accepted mutations per 100 residues), and one can have matrices from 1 PAM up to 500 PAM.
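The extrapolation step can be illustrated on a toy alphabet: the short-distance mutation probability matrix is raised to the n-th power (this is exactly the independence assumption the PAM model relies on), and scores are then taken as log-odds against background frequencies. All numbers below are invented for illustration; real PAM matrices are 20 by 20 and derived from counted substitutions:

```python
import math

# Toy 3-letter "1 PAM"-style matrix: M1[i][j] is the probability that
# residue i is replaced by residue j over one PAM unit (invented values).
M1 = [[0.99, 0.006, 0.004],
      [0.008, 0.99, 0.002],
      [0.005, 0.005, 0.99]]
freqs = [0.4, 0.35, 0.25]        # background residue frequencies (invented)

def mat_mult(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam(n):
    """Extrapolate the 1-PAM model to n PAM units: M1 ** n.

    Each PAM unit is applied as an independent Markov step, which is the
    assumption criticized in the text.
    """
    result = M1
    for _ in range(n - 1):
        result = mat_mult(result, M1)
    return result

def log_odds(mat):
    """Score s(i, j) = log2( P(j | i) / f(j) ), the usual log-odds form."""
    return [[math.log2(mat[i][j] / freqs[j]) for j in range(len(mat))]
            for i in range(len(mat))]

scores250 = log_odds(pam(250))
# Even at 250 PAM, identities still score above substitutions.
print(scores250[0][0] > scores250[0][1])
```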
[Figure 2. A log-odds matrix computed from mutation data. This is a PAM 250 matrix, extrapolated from the original PAM 15. Each entry indicates the cost for aligning two residues. The worst substitution costs are usually associated with tryptophan (W).]

Originally, the Dayhoff matrices were established using 71 sets of aligned protein sequence pairs with 1572 point mutations (amino acid substitutions). The main limit of this approach is the fact that the information content of alignments at 85% sequence identity is low. It therefore appears risky to extrapolate such limited information to large evolutionary distances such as PAM 250. Furthermore, the extrapolation of the PAM model to an arbitrary PAM distance rests on the assumption that mutations are independent events. This hypothesis was challenged by several alternative methods. The most popular alternative to PAM-based scoring functions is the BLOSUM scheme(26).
It is based on a library of blocks extracted from sequences of related proteins. A block is a local multiple alignment that does not contain any gaps. About 2000 blocks were used for establishing the matrices. Given a set of substitution frequencies, BLOSUM and PAM matrices are computed in a similar fashion. The main difference is that in BLOSUM the frequencies are measured in a way that takes sequence identity into account. This way, using the collection of blocks, several sets of matrices can be generated, ranging from 80 to 45% average identity, without any extrapolation. BLOSUM matrices have been shown on various occasions to outperform PAM and other matrices(26-28).

The main reproach that can be made to these two types of matrices is that they attempt to be general while relying only on alignments with a low information content (sequences more than 85% identical) or on domains that can be aligned without gaps (blocks). The question of the existence of biases in these matrices has often been discussed. For instance, consider alpha helices and beta strands. The types of substitutions observed in these two types of structural elements are known to be slightly different (this is the basis of some efficient secondary structure prediction algorithms(29)). As a consequence, if a matrix is built using a data set that contains more helices than beta strands, it will be biased and will perform poorly when aligning portions of sequences forming beta strands. On the other hand, if the data set is balanced between the two types of structures, the matrix will simply be an average, and will not be as good as it could be in either case. In an attempt to compensate for this type of potential problem, several alternative matrices were built for helices(30), beta strands(30) or transmembrane domains(31). The problem is that the way in which such matrices should be used is far from obvious.
Solving this problem often amounts to solving structure prediction, unless a structure is available for some of the homologous sequences of interest. Overall, about 40 different substitution matrices have been proposed, using no fewer than 20 different methods, with different training sets in most cases. Recently, two general studies attempted to understand the fundamental differences between these schemes(27, 32). The main motivation in the work of Vogt et al. was to assess these matrices through the accuracy of the alignments they induce using dynamic programming (see Section 4.2.1). Correctness was judged using the structural alignments contained in 3D_ali(33). Their work shows that there is very little difference between the best matrices (Gonnet(34), BLOSUM(26) and Benner(35)). They also concluded that the matrices able to identify remote homologues in database searches were the ones leading to the most correct alignments. Finally, from an algorithmic point of view (see 4.2.1), they concluded that these matrices are better suited to global alignments than to local alignments.

The second study(32) focused on understanding the way these matrices reflect amino acid properties. The authors found that PAM units are significantly correlated with volume and hydrophobicity, while other matrices are much more biased toward identity. Interestingly, their results indicate that, despite the different methods and data sets used for their construction, most of the substitution matrices based on sequence analysis are highly correlated with Dayhoff's. When these matrices are grouped according to their level of correlation, the global matrices fall into the same cluster (i.e. are very similar), while structure-specific matrices fall into separate clusters. This is further evidence supporting the idea that matrices should be applied in a way that takes structural information into account.
We will see later that Dirichlet mixtures(36) provide an interesting alternative to this problem (see Section 3.2.4). The matrix comparison problem was also addressed in a different context, and with a smaller set of matrices, by Henikoff et al., who compared matrices for their ability to discriminate sequences in a database search using FASTA or BLAST (i.e. local alignments). The results obtained that way confirm that matrices based on the alignment of distantly related sequences (such as BLOSUM) or structures(37) perform much better than PAM matrices. Finally, a point on which all these studies agree is the necessity of using appropriate gap penalties when scoring an alignment with a matrix. Most of the results obtained using substitution matrices can be dramatically affected by the set of penalties used to score gaps. In the next section, we review some of the concepts underlying the definition and scoring of gaps when aligning two sequences.

3.1.2 Gap penalties

Substitutions are not the only events affecting sequences while they diverge: insertions and deletions also occur. This means that two sequences may contain unrelated portions that should not be aligned. Such an event is represented by a gap in the sequence that did not receive the insertion, or in which the deletion occurred:

               Deletion                Terminal gap
    XXXXXXX-----------------XXXXXXXXXX--------
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
             or Insertion

It is obvious that a cost should be assigned to these events when scoring an alignment. Unfortunately, the choice of a scoring scheme often implies some evolutionary model, and we still lack a good understanding of the underlying biology of insertions and deletions. For instance, unless a very reliable phylogenetic scenario is available, it is often very hard to distinguish between insertion and deletion(38, 39), hence the word indel, which describes a position where an insertion OR a deletion occurred.
Some useful information about indels can be gathered by looking at structural alignments(39). As one would expect, indels do not happen at random, but tend to concentrate in the portions of the structure with fewer steric constraints, such as loops. It is in theory possible to extrapolate this information to sequences of unknown tertiary structure. For instance, Chou and Fasman(40) proposed using secondary structure propensity to derive local gap penalties. Another method, proposed by Pascarella and Argos(39), involves measuring, on reference structural alignments, the probability of having a gap after a given residue. This scoring scheme has been implemented in ClustalW(21), where a gap can be given a cost depending on the residue after which it occurs. Similarly, some heuristics implemented in ClustalW attempt to locate areas more prone to gap insertion, such as stretches of hydrophilic residues, usually exposed in loops. However, these methods are both empirical and very general; they do not take into account the specificity of the sequences one is interested in aligning (for instance, owing to structural constraints, some positions may be more conserved than others). We will later discuss the way information derived from a multiple sequence alignment can be used to establish more reliable local gap propensities (cf. the profile section, 3.2.3).

The question "where do gaps occur?" is only one part of the problem. When scoring a gap, one also has to ask: "is this gap long enough?". There is clear evidence that gaps should be penalized according to their sizes. Analyses made on mammals(41) suggest that a logarithmic scheme could be quite appropriate, with a gap opening penalty and an extension penalty that is a function of the length of the gap (i.e. a penalty per residue decreasing with the length of the gap). These results confirm those obtained by Benner et al.
(38) and suggest that linear gap penalties (a penalty cost increasing linearly with the gap length) are less than realistic. Results also suggest that insertions and deletions should be distinguished from one another(38, 41). However, even if there is wide agreement that non-linear schemes would probably be biologically more relevant(42), alternative solutions suffer from a major drawback: their implementation poses significant algorithmic problems when it comes to optimizing alignments(43, 44). For this reason, in practice, gap penalties are usually optimized under a simplified form known as the "affine gap penalty". It can be formalised as follows:

    cost = gap opening penalty + gap extension penalty * length    (eq. 1)

This gives penalties as a linear function of gap length, and an efficient algorithm for optimizing this scheme was introduced by Gotoh(45). There is no real justification for using this type of model, apart from the fact that it performs reasonably well, especially when the gaps are small (fewer than 20 residues). Since the size of indels is known to be a function of the evolutionary distance(38), linear gap costs will be acceptable when aligning closely related sequences. However, since affine gap penalties are not empirically based, they raise the problem of defining the values of the two parameters: the gap opening penalty (GOP) and the gap extension penalty (GEP). There is no guaranteed way to choose these values so that they fit the sequences one is interested in. A popular practice is to give the opening penalty a value equal to the average of the values contained in the substitution matrix used for comparison (excluding the main diagonal), and to set the extension penalty to a tenth of the opening value. It is also common practice not to penalize terminal gaps, at least for opening. When making an alignment, there is a competition between gap insertion and substitution.
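Gotoh's method keeps three dynamic programming matrices (substitution, gap in one sequence, gap in the other) so that affine costs of the form of eq. 1 can be optimized in O(mn) time. A score-only sketch under a simple match/mismatch scoring (all parameter values are invented for illustration; a real implementation would use a substitution matrix):

```python
NEG = float("-inf")

def gotoh_score(a, b, match=2, mismatch=-1, gop=10, gep=1):
    """Score-only global alignment with affine gap penalties,
    where a gap of length L costs gop + gep * L (as in eq. 1),
    using Gotoh's three-state recurrence.

    M: alignment ends with a substitution column;
    X: ends with a gap in b;  Y: ends with a gap in a.
    """
    m, n = len(a), len(b)
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    X = [[NEG] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = -(gop + gep * i)        # leading gap in b
    for j in range(1, n + 1):
        Y[0][j] = -(gop + gep * j)        # leading gap in a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] - gop - gep, X[i - 1][j] - gep)
            Y[i][j] = max(M[i][j - 1] - gop - gep, Y[i][j - 1] - gep)
    return max(M[m][n], X[m][n], Y[m][n])

# One long gap (one opening) is cheaper than several short ones:
print(gotoh_score("ACGTACGT", "ACGT"))   # 4 matches - (10 + 4) = -6
```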
To some extent, gap penalties can be regarded as thresholds used to decide whether a stretch of residues has a homologue in the other sequence. In this context, it makes sense to modulate the penalties with some local information (secondary structure propensity, profile information...). But even so, a major problem remains: when the aligned sequences are only remotely related, the gap penalties lack robustness. A study by Vingron and Waterman showed that slight variations of the GOP and GEP values can induce very different optimal alignments(46). Under such conditions, it may be hard to decide which alignment is biologically the most relevant. An attempt to increase the robustness of the penalty parameters has been proposed by Taylor: "score run-length enhancement"(47). It originates from the observation that in biologically relevant alignments, gaps are usually clustered in a few parts of the sequences and separated by long uninterrupted blocks. The technique proposed by Taylor involves enhancing the score of long ungapped portions in order to prevent them from being interrupted by gaps. This work shows that under this scoring scheme, a correct guess for the penalty values becomes much less critical than previously reported.

There is little doubt that the correct treatment of gaps, and a deep understanding of their biology, are critical to making accurate sequence alignments. However, as shown here, the problem is mostly algorithmic: it is possible to define gap penalties that describe sequence relations in a realistic way, it is simply not practical to optimize them. The problem of practicality becomes even more acute when it comes to scanning databases containing hundreds of thousands of sequences.

3.1.3 Database Searches

An exhaustive treatment of database search methods is beyond the scope of this thesis.
The most commonly used methods are briefly described here because they involve specific scoring schemes designed for evaluating the statistical significance of an alignment. The principle of a database search is quite straightforward: a query sequence is aligned in turn with all the other sequences of the database and, depending on their score, alignments are kept as relevant or discarded.

BLAST(2) is probably the most popular method for database searches. Given two proteins, the method involves finding the high-scoring pairs of residues (i.e. short stretches of aligned residues, with no gaps, that have high scores). The scores are evaluated using a substitution matrix (PAM120, for instance). These segments are found by looking for words of a specified size(48) and extending these words. Since the method does not allow gaps, it will in many cases be restricted to relatively small segments. In such a context, the raw score alone is not informative enough to decide whether or not a high-scoring pair is significant; scores need to be normalized in order to become comparable from a statistical point of view. This normalization takes into account the size of the database and the size of the sequences, and yields the number of matches of equal or better score expected by chance. This value is called the E value and is used to rank the hits.

The other popular tool for database searches is FASTA(49). As in BLAST, FASTA starts by looking for high-scoring segments using the Wilbur and Lipman method(48). The segments the method is interested in are those having a high proportion of identities. They are scored using the main diagonal of a substitution matrix such as PAM 250 (i.e. only considering identities, but giving them a score that depends on the amino acids). The ten best diagonals found in this way are then re-scored using a full substitution matrix.
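The word-based seeding shared by both tools can be sketched as follows: index every word of the subject sequence, look up the query's words, and extend each hit without gaps until the score drops too far below the best seen. Everything here (function name, scoring values, single-direction extension) is a simplification for illustration; real BLAST extends in both directions, uses a substitution matrix, and adds further heuristics:

```python
def ungapped_hits(query, subject, w=3, match=1, mismatch=-2, dropoff=3):
    """Toy seed-and-extend search: exact word seeds of length w,
    then rightward ungapped extension with an X-drop-style cutoff.
    Returns (query_pos, subject_pos, best_score, hit_length) tuples.
    """
    # Index every word of the subject.
    words = {}
    for j in range(len(subject) - w + 1):
        words.setdefault(subject[j:j + w], []).append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in words.get(query[i:i + w], []):
            score = best = w * match      # seed score
            right = 0                     # extension past the seed
            qi, sj = i + w, j + w
            while qi < len(query) and sj < len(subject):
                score += match if query[qi] == subject[sj] else mismatch
                if score > best:
                    best, right = score, qi - (i + w) + 1
                elif best - score > dropoff:
                    break                 # score dropped too far: stop
                qi, sj = qi + 1, sj + 1
            hits.append((i, j, best, w + right))
    return hits

print(ungapped_hits("ACGTTTT", "ACGTTAA"))
```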
In a second step, non-collinear segments are joined by dynamic programming, using a segment-joining penalty (analogous to a gap penalty). The resulting scores are then used to rank the matches. In order to prevent this chaining step from decreasing selectivity, it is only applied when the best-scoring segment is above some empirical threshold. The mean and the standard deviation of the score distribution are then evaluated and used to decide on a final threshold separating spurious hits from real ones. Of course, both these methods lead to false negatives and false positives, partly because they do not use the most accurate method for local alignments(50), which has been shown to significantly outperform FASTA(51), and also because some background noise is difficult to avoid, especially when dealing with very large databases. The reason why the statistics behind FASTA are less developed than those behind BLAST is that assessing the statistical significance of a gapped alignment is much harder than that of an ungapped segment, as in BLAST. This may change soon. Vingron and Waterman have recently proposed a scheme that allows the estimation of probabilities for gapped alignments(52). Furthermore, very recently, a new, gapped version of BLAST(53) has been published that incorporates some of these results and allows the statistical ranking of gapped alignments when scanning a database. Other statistical scoring systems include the one described by Bucher(54).

3.2 EVALUATING PROTEIN MULTIPLE SEQUENCE ALIGNMENTS

3.2.1 Using Substitution Matrices and Gap Penalties: SP Alignments

Extending the definition of pairwise costs to multiple sequence alignments can become complicated. To keep within the framework used for pairwise alignments, we seek a multiple alignment cost that is the sum of the substitution costs (costs based on substitution matrices and gap penalties). Nonetheless, alternative costs can be defined, based on different evolutionary scenarios.
Because genetic mutations are binary events that change one protein or nucleic acid sequence into another, substitution costs for multiple alignments are generally defined in terms of those for pairwise alignments(42). Two different approaches have been described. The first is to define the substitution cost for a set of elements as the sum of the substitution costs for all pairs of elements chosen from the set(55-57). Thus, for three sequences i, j and k, the cost of the alignment of i, j and k will be equal to the sum of the costs of each pairwise alignment induced by the multiple alignment (cost(i,j) + cost(i,k) + cost(j,k)). These induced pairwise alignments are also called pairwise projections. An alignment defined in this way is called an SP ("sum-of-pairs") alignment. In such a context, the evaluation of a multiple alignment amounts to measuring the dissimilarities within a set of letters. Although this approach has the advantage of being straightforward and intuitive, its main disadvantage is that it has no clear foundation in the theory of molecular evolution.

Sankoff(58) proposed an approach in closer agreement with biological intuition. In his model, an evolutionary tree is assumed in which each sequence is a leaf; the internal nodes are occupied by reconstructed sequences. If the tree has k nodes, then substitution costs are defined on k-tuples and equal the sum of the pairwise substitution costs associated with each edge of the tree. An alignment defined in this way is a tree alignment. It must not be confused with progressive alignment (see Section 4.2.2), which often relies on estimated phylogenetic trees for the computation of an approximate SP alignment. In the case where the tree has only one central node, the result is named a star alignment, being based on a star phylogeny.
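The sum-of-pairs evaluation can be written directly from the definition: score every pairwise projection and add the results. A minimal sketch with a simple column-wise gap cost rather than affine penalties (function name and scoring values are mine):

```python
def sp_score(alignment, match=1, mismatch=0, gap=-1):
    """Sum-of-pairs score of a multiple alignment: the sum, over every
    pair of sequences, of the score of the pairwise alignment the MSA
    induces (its pairwise projection).  Columns where both sequences
    carry a null are dropped from the projection.
    """
    total = 0
    nseq = len(alignment)
    for p in range(nseq):
        for q in range(p + 1, nseq):
            for x, y in zip(alignment[p], alignment[q]):
                if x == "-" and y == "-":
                    continue              # null-null columns are removed
                elif x == "-" or y == "-":
                    total += gap
                else:
                    total += match if x == y else mismatch
    return total

msa = ["ACG-T",
       "ACGGT",
       "AC--T"]
print(sp_score(msa))   # projections score 3, 2 and 1: total 6
```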
Despite the fact that tree alignments are biologically more realistic than SP alignments, they have not become very popular, mostly because their construction presents serious algorithmic difficulties. A majority of the multiple alignment methods based on pairwise substitutions attempt to produce optimal SP alignments. In an SP context, defining gap costs is not necessarily straightforward, as can be gathered from the variety of alternative schemes(55, 56, 59, 60).

[Figure 3. SP, tree and star alignment substitution costs for five one-letter sequences (from Altschul(42)). The reconstructed sequences are indicated by circles at the tree nodes. Dashed lines indicate a substitution cost of 0 while plain lines indicate a cost of 1.]

The simple implementation of pairwise costs in SP evaluation is known as 'natural gap costs'. The idea is to consider the costs of the gaps in each pairwise projection of a multiple alignment (any column of nulls is removed from the pairwise projection). Although these gap costs seem to be the obvious companions of an SP alignment, Altschul was the first to formally propose them(42). Most of the alternative schemes were introduced for simultaneously aligning three sequences. For instance, Gotoh(56) proposed defining a gap as a set of columns having a null in identical positions; for Murata(55), a gap is a set of adjacent columns, each containing at least one null. The main motivation behind these schemes was algorithmic: these approximations were made in order to make gap cost evaluation easier when computing an SP alignment. It is for this same reason that Altschul proposed a simplified version of the natural gap costs named 'quasi-natural gap costs'(42). Quasi-natural gap costs are very similar to natural ones.
The main difference is that when a pairwise projection is considered, columns of nulls are not removed, and an extra gap is counted as opening when a null run in one sequence starts after and finishes before a null run in the other sequence, such as:

Sequence 1 XXXXXX----------XXXXXXX
Sequence 2 XXXXXXXX-----XXXXXXXXXX

This leads to counting two gaps when in practice there is only one between the two sequences. The motivations behind this approximation are purely algorithmic and have to do with efficiency requirements when implementing the Carrillo and Lipman algorithm for multiple sequence alignment(57) (see Section 4.3.1). This scheme induces a bias that favors similar gaps in aligned sequences. Nonetheless, the approximation is reasonable because indels seem to be rare events that tend to be kept unchanged through evolution(38). As a consequence, in most multiple alignments, the number of cases where the quasi natural scheme induces alignments different from the natural one should be fairly limited. Natural and quasi natural gap costs can also be applied to tree or star alignments. This type of gap penalty in the context of an SP alignment constitutes one of the most widely used objective functions for multiple sequence alignments. It is often referred to as "sums of pairs with affine gap penalties". Some of the drawbacks of this method have already been discussed in the previous section. The main one stems from the fact that substitution matrices are general descriptions that do not take into account local constraints; the same is true of gap penalties, which should incorporate local information. An attempt to do so has been made in ClustalW, where penalties are locally reassessed using some information from the other sequences being aligned(21). Another weakness of the SP function has to do with the fact that sequences are considered in pairs only.
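The natural gap cost of a single pairwise projection can be sketched as follows. The affine open/extend values are illustrative assumptions, not parameters taken from the literature:

```python
def natural_gap_cost(s1, s2, open_cost=10, extend_cost=1):
    """Natural affine gap cost of a pairwise projection: columns of
    nulls (a gap in both sequences) are removed first, then each
    maximal gap run costs open_cost + (run length - 1) * extend_cost."""
    # Drop null columns, i.e. positions gapped in both sequences.
    pairs = [(a, b) for a, b in zip(s1, s2) if not (a == "-" and b == "-")]
    cost = 0
    for seq in (0, 1):
        in_gap = False
        for pair in pairs:
            if pair[seq] == "-":
                cost += extend_cost if in_gap else open_cost
                in_gap = True
            else:
                in_gap = False
    return cost

# The null column at position 3 is removed, leaving one length-1 gap
# in each sequence: 10 + 10 = 20.
print(natural_gap_cost("AC--GT", "A--CGT"))  # 20
```

The quasi natural variant would skip the null-column removal and add an extra opening cost whenever one null run is strictly contained within the other, which is exactly the two-gaps-for-one situation illustrated above.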
It probably makes more sense to consider a column in a multiple alignment as a distribution of amino acids generated by evolution. In the next sections about profiles and hidden Markov models (Sections 3.2.3 and 3.2.4), we will present some methods that attempt to take this fact into account when scoring multiple sequence alignments. Finally, a potential weakness inherent in any scoring scheme is the problem of non-representative information. The sequences used for building an alignment rarely constitute a representative set. They are often biased by the composition of the database used for collecting them. In such a case, the alignment of the sequences that are in an isolated minority will suffer from the fact that the information they contain may be buried among the rest. We will see that several weighting schemes have been designed in order to overcome this problem.

3.2.2 Weights

With most of the multiple sequence scoring systems, weighting of the sequences is necessary. Weights are designed to correct for unequal representation among a set of sequences. For instance, when aligning globin sequences found by querying a database like Swiss-Prot(61), large numbers of sequences will be identical, while others will be quite different from the rest of the set and will have no close relative. However, if we want our alignment to be representative of the globin family in general, it is important to avoid a complete domination by the vertebrate myoglobin and hemoglobin sequences, simply because the database contains far more of them. Such an alignment, made by giving each sequence the same weight, would be biased. Weights are used to avoid this type of bias. Several methods have been proposed; they can be separated into two groups, depending on whether they rely on an alignment or on an estimated phylogenetic tree. Alignment-based weighting methods do not require the sequences to be related at all.
Therefore, complex issues of tree topology and root placement are avoided. These methods can be based on pairwise distances(62) or on the distances from some average generalized sequence(63). Whatever the method, the general trend is similar and results in an up-weighting of the sequences which are poorly related to the rest of the set, while sequences which, on average, are more similar to the other sequences have their weights accordingly lowered. The tree-based weights assume that the sequences are related through evolution and that a reasonably correct tree can be deduced from pairwise distances(64). Two schemes of this type have been proposed: branch-length proportional weights(65) and the Altschul-Carroll-Lipman (ACL) weights, which are based on a statistical analysis of the tree topology(66). In the ACL scheme, a sequence receives a low weight if it is far from the tree root or if it has close neighbors in the tree. ACL weights have the advantage of correcting for duplicated information without biasing the alignment toward very divergent sequences. The underlying assumption is that although a very divergent sequence contains a lot of information, it is hard to exploit that information without bringing in too much extra noise. The main weakness of the ACL method is that when the topology of the tree is hard to establish, mistakes can be made regarding the position of the root. The weights proposed by Thompson et al.(65) provide an alternative solution. They also rely on a phylogenetic tree, but under this scheme, sequences only get down-weighted for having closely related neighbors. In an SP context, applying pairwise or sequence weights to score a multiple alignment is straightforward. The weighted sum-of-pairs alignment score can be formulated as follows. Given N sequences S1..SN, a weight W(i,j) can be estimated for each possible pair of sequences Si, Sj.
This pairwise weight will be obtained directly through computation, or will result from the combination of individual sequence weights. Each pairwise projection of the sequences Si and Sj in the alignment has a cost COST(Ai,j) evaluated using a substitution matrix and a set of gap penalties. Given these definitions, the global weighted SP score is equal to:

SCORE = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{i,j} \cdot COST(A_{i,j})    (eq. 2)

As with matrices, an important issue for weights is to decide which scheme should be applied. Each of these weights has desirable properties and unwanted side effects. Vingron and Sibbald proposed a systematic way of comparing five different methods(62). Their conclusion was that when sequences are related through a robust phylogenetic tree, the ACL weights do better than alignment-based methods. On the other hand, when the relationship between the sequences is harder to estimate, leading to an inaccurate phylogenetic tree, the Sibbald and Argos(63) or related methods(67) are preferable. Similar results were more recently established by Henikoff and Henikoff(68) using an empirical evaluation method. These authors found that for phylogenetically related sequences, tree-based methods are preferable and that Thompson's scheme slightly outperforms ACL. It is not clear, however, to what extent these findings are method dependent, especially if one considers that most of these weights are used in different heuristic alignment strategies. Gotoh pointed out that weighting schemes are very likely to be method dependent(69). For instance, an important difference between the ACL weights and Thompson's is that the ACL method produces pairwise weights while the other gives individual sequence weights. As a consequence, the ACL weights contain more information, since the pairwise weights they rely on are not necessarily correlated (i.e. there is not always a set of sequence weights corresponding to a given set of pairwise weights).
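Eq. 2 translates directly into code. In the sketch below, the identity-based pairwise cost is a stand-in for a real COST built from a substitution matrix and gap penalties, and the pairwise weights are assumed to be supplied as a matrix W:

```python
def pair_cost(s1, s2):
    """Toy pairwise projection cost: +1 per identical aligned residue
    (a real COST would use a substitution matrix and gap penalties)."""
    return sum(1 for a, b in zip(s1, s2) if a == b and a != "-")

def weighted_sp_score(alignment, W):
    """Weighted SP score of eq. 2: sum over all pairs i < j of
    W[i][j] times the cost of the pairwise projection."""
    n = len(alignment)
    return sum(W[i][j] * pair_cost(alignment[i], alignment[j])
               for i in range(n - 1) for j in range(i + 1, n))

aln = ["ACGT", "AC-T", "ACGA"]
W = [[0, 1.0, 0.5], [0, 0, 0.5], [0, 0, 0]]  # only i < j entries are used
print(weighted_sp_score(aln, W))  # 1.0*3 + 0.5*3 + 0.5*2 = 5.5
```

Setting all weights to 1 recovers the plain SP score, which makes the role of the weights easy to see: a pair of near-identical, over-represented sequences simply contributes less to the total.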
On the other hand, as we will see later, these two types of weights are used in very different contexts. Thompson's weights are mostly used for progressive alignments, where they are probably very appropriate since remotely related sequences usually have little effect on the overall multiple alignment(21). ACL weights are used in the program MSA (Multiple Sequence Alignment(22)), which does global simultaneous alignments where each sequence is given a chance to affect the overall alignment. We will now see that weights can also be useful for the construction and the use of sequence profiles (i.e. generalized alignments used to describe a protein family or a domain).

3.2.3 Profiles

Multiple sequence alignments can be used to provide position-specific scoring matrices known as profiles(6). The procedure of turning an alignment into a profile is fairly straightforward. It involves counting the residue frequencies in each column of the multiple alignment and deducing from these measures a table of substitution costs for each position of the profile. A local cost is also evaluated for gap insertion and extension. The term profile refers to the collection of costs associated with each position of the alignment. A profile can be treated as a single sequence and aligned to any other sequence (or profile), using the profile substitution costs and penalties instead of a single matrix.

[Figure 4. Example of a profile (adapted from Gribskov et al.(6)).]
For each position (POS) of a multiple alignment (ALN, presented in a vertical format with each line corresponding to a column), a substitution cost is calculated for any amino acid that would be aligned to this position. A gap penalty is also evaluated. Note that at positions 21 and 22 of the profile, the gap penalty is lowered because the alignment used for the profile contains gaps at these positions. A profile is specific for a family (or a domain). One of the main uses of profiles is to search databases for new members of a family. In such a context, the desirable properties are sensitivity and selectivity (i.e. recognizing very remote homologues and discriminating against false positives). Such a result can only be achieved if the profile induces very good alignments. This in turn depends on the quality of the profile itself and can be affected by several factors: (i) the choice of the sequences, (ii) the method used for building the multiple alignment, (iii) the method used to turn the multiple alignment into a profile (especially the treatment of the gaps and the method used to describe background frequencies). In many cases, the choice of the sequences is directly imposed by the database, and one of the best ways to remove this type of bias is to use a weighting scheme when aligning the sequences (see previous section). However, if one wishes to build a very specific profile, it is also possible to select the appropriate sequences, using techniques such as the one described by Neuwald et al.(70). Weights also need to be applied when the alignment is turned into a profile. On various occasions, it has been shown that many of the schemes used for sequence alignments do as well when used to build profiles, and help to increase the level of generalization(65, 71). Accumulation of gaps is another side effect that appears when many sequences are used to build a profile.
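The core of the alignment-to-profile procedure can be sketched as follows. The tiny two-letter substitution matrix, the averaging rule and the gap-penalty scaling are all illustrative assumptions; published profile methods use full 20-letter matrices and more careful frequency estimates:

```python
def build_profile(alignment, matrix, alphabet, base_gap_penalty=10):
    """Turn a multiple alignment into a profile: for each column,
    average the matrix score of every alphabet letter against the
    observed residues, and scale the gap penalty down where the
    column already contains gaps (as at positions 21-22 of Figure 4)."""
    nseq = len(alignment)
    profile = []
    for column in zip(*alignment):
        scores = {a: sum(matrix[a, r] for r in column if r != "-") / nseq
                  for a in alphabet}
        scores["gap"] = base_gap_penalty * (1 - column.count("-") / nseq)
        profile.append(scores)
    return profile

# Hypothetical two-letter matrix for illustration.
matrix = {("A", "A"): 2, ("A", "C"): -1, ("C", "A"): -1, ("C", "C"): 2}
profile = build_profile(["AC", "A-"], matrix, "AC")
print(profile[1]["gap"])  # 5.0: the penalty is halved where one of two rows is gapped
```

A sequence can then be aligned against such a profile exactly as against another sequence, looking up the position-specific scores instead of a single matrix entry.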
Because the number of gaps tends to increase with the number of sequences, schemes have to be used for down-weighting the effect of their occurrence(65, 72, 73). Profiles are not only important for database searches; their computation is also a crucial step for some multiple sequence alignment strategies based on a progressive approach(67, 74, 75). In this context, to build the full multiple sequence alignment, partial multiple alignments need to be aligned in intermediate steps. The best way to do so is often to turn these alignments into profiles and to align these profiles with one another. Methods for doing so have been extensively described by Higgins(76, 77) and Gotoh(69, 73). Finally, since profiles are involved in database searches, some significant work has been done on establishing the statistical meaning of alignment scores(78, 79). This important aspect of the problem has received much more attention in the context of the hidden Markov model based approaches that will now briefly be reviewed.

3.2.4 Hidden Markov Models

A hidden Markov model (HMM) describes a series of observations generated by a "hidden" stochastic process (a Markov process)(80). HMMs have been used extensively in speech recognition. HMMs designed for sequence alignments are related to profiles. Their aim is to provide a statistical model representative of a given family of proteins(7). In theory, one of the main advantages of HMMs (as opposed to profiles) is that they provide a way to estimate a model directly from unaligned sequences. However, in practice, most of the methods available for HMM optimization require the computation of multiple alignments. Nevertheless, HMMs have some interesting features that distinguish them from profiles. A key concept in HMMs is the notion of states. An HMM is a chain of elements with different possible states. The number of possible states is arbitrarily defined. In SAM(81), for instance, there are three states: align, insert and delete.
When going through a model, probabilities are given to each possible transition. The values of these transition probabilities are evaluated by training the model with known members of a protein family. Sequences can be aligned to a trained model using a variant of dynamic programming(18) known as the Viterbi algorithm(80). An alignment between a sequence and an HMM is called a path, in the sense that it joins different states in order to produce the path with the highest probability. To a large extent, aligning a sequence to a model can be regarded as equivalent to aligning a sequence to a profile. There is, however, a fundamental conceptual difference. A new sequence is not 'aligned' to an HMM. What is measured is the probability for a given HMM to generate the optimally aligned sequence (i.e. the sequence with the right pattern of gaps/unaligned residues/aligned residues).

[Figure 5. A linear hidden Markov model (from Hughey and Krogh(82)). This model has three different states (M, I, D). Each state is connected to the others by a transition probability (arrows). Assigning the weights to each transition is the purpose of the training.]

The number of sequences and their range of identities are critical factors that will influence the model. In their simplest expression, HMMs do not require any prior information (as opposed to profiles, which require a multiple alignment made using a substitution matrix). In HMMs, residues are described as 'letters' and the training relies only on identities to establish the parameters. If there are enough sequences in the training set, this will result in a sensible model, since the substitution constraints will be 'discovered' by the model on a position per position basis. The accuracy and sensitivity of a model trained that way will be highly dependent on the number of sequences and their information content.
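The Viterbi recursion itself is compact. The sketch below runs on a generic toy HMM with two hypothetical states and illustrative probabilities, not on SAM's align/insert/delete architecture, but the path-finding logic is the same:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for an observation sequence."""
    # V[t][s] = (best probability of ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p][0] * trans_p[p][s])
            row[s] = (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o], prev)
        V.append(row)
    # Backtrack through the stored predecessors from the best final state.
    path = [max(states, key=lambda s: V[-1][s][0])]
    for row in reversed(V[1:]):
        path.append(row[path[-1]][1])
    return path[::-1]

states = ("H", "C")  # two hypothetical hidden states
start_p = {"H": 0.6, "C": 0.4}
trans_p = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.3, "C": 0.7}}
emit_p = {"H": {"a": 0.9, "b": 0.1}, "C": {"a": 0.1, "b": 0.9}}
print(viterbi("aab", states, start_p, trans_p, emit_p))  # ['H', 'H', 'C']
```

The returned path is exactly the "alignment" of the observations to the model: each observed symbol is attributed to the hidden state most likely to have generated it, given the whole sequence.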
To overcome a possible lack of information (missing data), pseudocounts are incorporated in order to simulate background frequencies(82, 83). Their influence on the model decreases with the amount of information present in the sequences. The actual values of the pseudocounts can be estimated using various methods. One can measure the probability of each amino acid in the training set, or derive background probabilities from a standard substitution matrix. Generally speaking, the smaller the number of sequences used for the training, the more critical the values of the pseudocounts. In this context, Dirichlet mixtures(36) have proved extremely useful. A Dirichlet mixture is a mathematical tool that, given an observed amino acid distribution and a set of reference distributions, allows the computation of a probability for the observed distribution. In a hidden Markov model context, these mixtures can be regarded as the equivalent of a substitution matrix. They have been shown to be more sensitive to sequence conservation or variation than traditional substitution matrices(36). As with profiles, HMMs can be used to generate multiple sequence alignments or to scan databases. Scoring can be made by combining the probabilities of all the different alignments of a sequence to a model, which is equivalent to calculating the total probability of the sequence given the model. This can be done efficiently using the forward algorithm(80). Such a score is called the NLL score, for Negative Log Likelihood score. The NLL score measures how far a sequence is from its model (in other words, the statistical cost of forcing a given model to produce an aligned sequence). The problem with NLL scores is that they depend on the size of both the sequence and the model. One way to overcome this is to measure the Z score, the number of standard deviations an NLL score lies from the average NLL score of unrelated sequences of the same length.
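The pseudocount idea can be sketched as follows. The mixing weight and the background frequencies are illustrative; a Dirichlet mixture would replace the single background distribution with several weighted reference distributions:

```python
def smoothed_frequencies(counts, background, pseudo_weight=5.0):
    """Blend observed residue counts at one model position with
    background frequencies: the pseudocounts dominate when few
    sequences have been observed and fade as real counts accumulate."""
    n = sum(counts.values())
    return {a: (counts.get(a, 0) + pseudo_weight * background[a])
               / (n + pseudo_weight)
            for a in background}

background = {"A": 0.5, "C": 0.5}
# With only 5 observations, the estimate stays pulled toward background:
print(smoothed_frequencies({"A": 5}, background))  # {'A': 0.75, 'C': 0.25}
```

With 500 observations of A instead of 5, the same call would return frequencies very close to the raw observed ones, illustrating why pseudocount values matter most for small training sets.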
A complete algorithm for the computation of Z scores is described in (82). This study of multiple sequence alignment scoring systems is far from exhaustive. A large variety of alternative methods have been described; they roughly fall into two distinct categories: those relying on SP evaluation and those (like HMMs) that consider distributions of amino acids rather than pairs. Although considering distributions seems a more realistic approach, the main reason why SP schemes have so far been more popular has mostly to do with the algorithmic problems associated with distribution-based methods. Both types of method only deal with one aspect of the problem: the use of local information. We know that since proteins fold into active 3D structures, there must be more information in the sequences (i.e. tertiary structure interactions...) than what we have discussed so far. If used in an appropriate manner, there is no doubt it could help to improve the quality of the alignments. Few methods have been proposed to deal with these non-local interactions. There are two good reasons for that: this type of signal is usually very weak in proteins, and the algorithmic problem is even more complex than when taking into account primary sequences only. We will now see that the problem is different with RNA, hence the use of more complex types of objective functions.

3.3 RNA ALIGNMENTS: TAKING INTO ACCOUNT NON LOCAL INTERACTIONS

When reliable non-local interaction information is available, it makes sense to incorporate it into the scoring scheme. This is rarely the case with proteins, hence the difficulties encountered when doing structure threading (threading a protein sequence onto a known structure)(84). Fortunately, in the case of RNA the problem is different, and the rules that govern the formation of non-local interactions are better understood(85, 86). A large part of the secondary structures encountered in RNAs are due to Watson-Crick interactions.
The existence of mathematical models that describe these interactions on a thermodynamic basis makes RNAs good candidates for 'ab initio' folding predictions. However, so many parameters influence the structure of an RNA molecule (local conditions, interacting proteins, unknown tertiary interactions...) that accurate predictions based on thermodynamic models are still very imperfect, despite some recent improvements(87, 88). This does not mean that the thermodynamic approach is wrong, but it may mean that the information it uses is not sufficiently detailed considering our knowledge of the RNA folding process. However, when combined with phylogenetic information (such as can be taken from multiple sequence alignments), predictions can reach a high level of accuracy(16).

[Figure 6. Some of the motifs commonly found in RNA secondary structure. Base pairings are usually made through Watson-Crick interactions, although non-canonical pairs have often been reported.]

Multiple sequence alignments have the advantage of revealing constraints without the need for hypotheses on their origin. Such an analysis, performed on ribosomal RNA sequences(16), made it possible to predict the secondary structures of these molecules with an accuracy outperforming traditional energy minimization schemes like that of Zuker(89). Unfortunately, these types of RNA alignments are difficult to build automatically due to complex algorithmic problems. Several functions have been proposed that incorporate RNA secondary structure information into their evaluation of multiple sequence alignment quality(90-92). Kim et al.(90) proposed a function that takes into account the probability of all the potential secondary structures contained in a multiple sequence alignment.
It is a scheme that has the advantage of being very flexible (for instance, it allows pseudo-knots) and of not requiring the actual computation of a secondary structure (structures are not assessed for compatibility). Its main drawback is that evaluation time is quadratic in the length of the alignment. This can prove a severe limitation when dealing with very long sequences (like ribosomal RNAs) while using an iterative or stochastic optimization method such as simulated annealing. Eddy and Durbin(91) proposed a different type of approach. In their method, the secondary structure is expressed as a binary tree in which each node stands for a column in a multiple sequence alignment. This tree can be seen as a path through a generalized HMM (an HMM with bifurcations) named a Covariance Model (CV). A CV can be trained like an HMM. Once this has been done, the sequences are aligned to the model in order to produce the multiple sequence alignment. This approach is very similar to the stochastic context free grammar (SCFG) methods(93, 94), where the aim is to express the structure using a special type of regular expression. The alignment is made by parsing the sequences through the proper expression, which has been obtained by training on the sequences. CV and SCFG methods suffer from the same drawback: they can only allow nested structures and cannot take into account pseudoknots(95, 96), as opposed to the method proposed by Kim. Furthermore, their computation is very expensive, which restricts these methods to small sequences (<200 nucleotides). An option for decreasing the complexity of the problem is to do threading: assume that some master structure is known and thread the sequences onto it. In several important cases, like ribosomal RNAs, this is a realistic assumption. This approach can of course be taken using an SCFG-based objective function. Alternatively, one can consider a simpler function such as the one described by Corpet and Michot(92).
In this case, the evaluation of the alignment is split into two steps: (i) evaluating the primary sequence alignment; (ii) evaluating the quality of the fold induced by the sequence of known structure onto the sequence of unknown structure. To generate the overall score, these two terms are combined with one another. Although this scheme has no real theoretical justification, as opposed to those previously described, it has the merit of being conceptually simple. It can also accommodate a range of interactions such as pseudo-knots and other non-nested structures. In this case, the alignment problem is known to be NP-complete(17). It is in order to provide a reasonable heuristic solution for long sequences (>1000 nucleotides) that we developed the package RAGA(97) (see Section 5.3 and Annex 3).

4 MAKING MULTIPLE SEQUENCE ALIGNMENTS

4.1 COMPLEXITY OF THE PROBLEM

So far, we have focused on reviewing some of the objective functions needed for evaluating the quality of sequence alignments. However, as pointed out earlier, this is only one side of the coin. The other one, which constitutes in fact the main bottleneck, is optimization. In other words, given an objective function, is it possible to optimize it by producing the best scoring alignment? There are at least two good reasons for designing efficient and accurate optimization strategies. The first one is obvious: making the alignments that are needed for whatever purpose. The second reason is less direct but of extreme importance. The evaluation schemes described above are only theoretically justified using phylogenetic or structural criteria. As such, they do not constitute any proof and must therefore be validated through empirical analysis (i.e. how well they perform). The optimization methods required for these two purposes do not necessarily need to be equivalent. One can, for instance, use a very robust but expensive (in computer time and memory) method to compare and validate alternative scoring schemes.
If one of these schemes proves useful, it may later become appropriate to develop a very specific heuristic method that approximates the optimization reasonably well while being efficient enough for production purposes. Needless to say, whatever direction one wishes to take, the design of an optimization technique will always prove to be a very demanding problem. We already mentioned that, even for two sequences of moderate length, a naive approach to alignment computation can lead to impractical enumeration problems. Fortunately, the situation does not have to be that bad, and will depend on the scoring scheme one wishes to optimize. In many cases, there are shortcuts that allow efficient computation of an alignment, given some specific objective functions. We will see in Section 4.2.1 that dynamic programming(18) is one of these techniques; it allows the computation of pairwise alignments in time proportional to the product of the lengths of the two sequences. This essential technique constitutes the core of many alignment methods. In theory, it is not restricted to two sequences, but since its complexity is a function of the product of the lengths of the sequences to align, it can hardly be used for more than three sequences at a time(55). This does not mean that multiple sequence alignments cannot be computed automatically with dynamic programming, but it means that, to do so, one has to rely on heuristic algorithms. Heuristic methods do not guarantee an optimal solution but may perform well and sometimes even guarantee the solution to be within given boundaries. Generally speaking, multiple sequence alignment algorithms can be divided into two classes: (i) the greedy algorithms, which usually rely on sequence clustering and dynamic programming for making progressive alignments; (ii) the non-progressive algorithms, which attempt to align all the sequences simultaneously.
These non-progressive algorithms themselves fall into two distinct sub-categories: deterministic heuristics and stochastic heuristics. In the following sections, the underlying principles of these algorithms and their main differences will be briefly explained. More emphasis will be given to the genetic algorithm techniques on which the SAGA package is based(98) (see Section 5.1 and Annex 1).

4.2 DETERMINISTIC GREEDY APPROACHES

4.2.1 Aligning Two Sequences

The main algorithm for aligning two sequences, often referred to as the Needleman and Wunsch(18) or dynamic programming (DP) algorithm, is one of the oldest and most important tools in bioinformatics. Over the last 30 years, it has been used in one form or another in most of the methods developed for sequence comparison. When applying dynamic programming to two sequences, it is possible to compute the best scoring alignment between them using an amount of memory and time proportional to the product of the lengths of the two sequences. This is a dramatic improvement over the naive approach, which would require enumerating all the possible alignments. An important advantage of dynamic programming is that it is a very general scheme. Given substitution costs (e.g. a matrix or a profile...) and a scheme for scoring gaps, the algorithm can compute the alignment with the best score. In practice, DP can accommodate any context-independent scoring scheme. The algorithm is based on recursively extending the best scoring alignment until all the residues of each sequence have been aligned. In practice, this means finding the best path through a matrix constructed from the scores of all pairs of elements between the two sequences. Let us consider two sequences, A of length m and B of length n, a matrix that assigns a score di,j to the substitution of residue i in A by residue j in B, and a gap penalty g.
Computation of the optimal score is achieved by incrementally extending each path with a locally optimal step. For example, element di,j can extend any path terminating in the preceding row (i−1, m:m
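The recursion just described can be sketched as follows, with a linear gap penalty g and illustrative match/mismatch scores standing in for the substitution matrix d:

```python
def needleman_wunsch_score(A, B, match=1, mismatch=-1, g=-2):
    """Needleman & Wunsch DP: best global alignment score of A and B
    in O(m * n) time and memory, using a linear gap penalty g."""
    m, n = len(A), len(B)
    # F[i][j] = best score aligning A[:i] with B[:j].
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * g          # A's prefix aligned against gaps
    for j in range(1, n + 1):
        F[0][j] = j * g          # B's prefix aligned against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if A[i - 1] == B[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + d,  # substitute A[i] with B[j]
                          F[i - 1][j] + g,      # gap in B
                          F[i][j - 1] + g)      # gap in A
    return F[m][n]

# Three matches plus one gap: 3 * 1 + (-2) = 1.
print(needleman_wunsch_score("GATT", "GAT"))  # 1
```

Keeping backpointers alongside F and tracing them from cell (m, n) back to (0, 0) recovers the alignment itself, not just its score; the score-only version above is enough to show the O(m * n) structure of the recursion.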