Analysis of Two Sequences
1. Given a random ORF, what probability of appearance do you expect for each amino acid?
2. What are the two residues that are the most likely to mutate by chance?
3. Does this prediction correlate well with the observation?
The simplest form of multiple sequence analysis occurs when you have only two sequences. Analyses of two sequences can be carried out with the programs compare, gap and bestfit.
Dotplot makes a dot-plot with the output file from compare, foldrna, or stemloop.
Exercise: To do this exercise you will first need to fetch the following two sequences from the GCG database. These are the E. coli and Mycobacterium tuberculosis recA sequences: em:ecreca (v00328) and em:mtreca (x58485).
The default filenames are ecreca.em_ba and mtreca.em_ba.
To compare these enter the following:
% compare

Compare compares two protein or nucleic acid sequences and creates a file of the points of similarity between them for plotting with DotPlot. Compare finds the points using either a window/stringency or a word match criterion. The word comparison is 1,000 times faster than the window/stringency comparison, but somewhat less sensitive.

COMPARE what horizontal sequence ? em:ecreca
Begin (* 1 *) ?
End (* 1391 *) ?
Reverse (* No *) ?
to what vertical sequence (* ecreca.em_ba *) ? em:mtreca
Begin (* 1 *) ?
End (* 2762 *) ?
Reverse (* No *) ?
What comparison window size (* 21 *) ?
What stringency (* 14.0 *) ?
What should I call the output file (* ecreca.pnt *) ?
Number of points: 947
Writing .......

To view this comparison, dotplot is used as follows:
NB! You must set your GCG graphics environment before using this or any other GCG graphics program (see Appendix I of this document).
% dotplot

DotPlot makes a dot-plot with the output file from Compare, FoldRNA, or StemLoop.

DOTPLOT what point file ? ecreca.pnt

ecreca.pnt contains COMPARE results of

Axis        Name          Check  Start  End   Dir
Horizontal  ecreca.em_ba  3229   1      1391  for
Vertical    mtreca.em_ba  8528   1      2762  for

Window . . . . . . . . . 21
Stringency . . . . . . . 14.0
Number of points . . . . 947
Percent of possible . .  0.023

The minimum density for a one-page plot is 5118.3 bases/100 platen units on each axis.
What point density would you like (* 2303.3 *) ?
DOTPLOT will take 1 pages.

Would you like to:
P)lot the points
D)ifferent density
G)et another point file to plot
Q)uit
Please select one (* P *):

P)lot the points
D)ifferent density
G)et another point file to plot
Q)uit
Please select one (* Q *):

Ask yourself why there is a big discontinuity in the sequence comparison. Clue: Mycobacterium has inteins.
This graph can be improved by changing the settings for the compare command. The previous example used the default settings, and the plot contains a lot of confounding information (the small dots). These can be removed by making the window size greater and raising the stringency to match.
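The window/stringency idea can be illustrated with a minimal Python sketch. It is not the GCG implementation: the function name dot_plot_points is hypothetical, and it assumes a plain count of identical positions per window, whereas Compare uses its own scoring tables. It only shows why a stricter criterion leaves fewer spurious dots.

```python
def dot_plot_points(seq_h, seq_v, window=21, stringency=14):
    """Return (i, j) window start positions that pass the criterion.

    Assumption: a window "scores" one point per identical position;
    the real Compare program uses scoring tables instead.
    """
    points = []
    for i in range(len(seq_h) - window + 1):
        for j in range(len(seq_v) - window + 1):
            # Count identical positions in the two aligned windows.
            matches = sum(
                1 for k in range(window) if seq_h[i + k] == seq_v[j + k]
            )
            if matches >= stringency:
                points.append((i, j))
    return points


# A window with 3 identities out of 4 is plotted at stringency 3
# but disappears when all 4 positions are required to match.
noisy = dot_plot_points("AAAA", "AAAT", window=4, stringency=3)
strict = dot_plot_points("AAAA", "AAAT", window=4, stringency=4)
print(noisy, strict)  # [(0, 0)] []
```

With real sequences, short chance matches pass a small window easily; a longer window with a proportionally higher stringency keeps only extended diagonals.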
Exercise: First fetch your sequences. Use the Haemophilus influenzae rec1 sequence and the Escherichia coli recA gene: em:hearec (L07521) and em:ecreca (v00328).
To do the gap comparison enter the following:
% gap

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

GAP of what sequence 1 ? em:hearec
Begin (* 1 *) ?
End (* 1484 *) ?
Reverse (* No *) ?
to what sequence 2 (* hearec.em_ba *) ? em:ecreca
Begin (* 1 *) ?
End (* 1391 *) ?
Reverse (* No *) ?
What is the gap creation penalty (* 50.00 *) ?
What is the gap extension penalty (* 3.0 *) ?
What should I call the paired output display file (* hearec.pair *) ?
Aligning .................................................. ...................-.
Aligning .................................................. ...................-.......
Gaps: 4
Quality: 7888
Quality Ratio: 5.671
% Similarity: 58.531
Length: 1486

To view the alignment, type:
% more hearec.pair
It is much quicker to put all the instructions on the command line:
% gap em:hearec em:ecreca -def
but the "-def" (default) option assumes that you want to align the whole sequence (and not, for example, just the coding sequence) and that you are happy with the default gap penalties. This will not always be true, especially as you move into multiple sequence alignments.
Gap is a global alignment program which attempts to align two complete sequences. Bestfit offers another way of aligning two sequences: it tries to find the best local alignment. Each is suited to a different sort of problem.
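The difference between global and local alignment can be sketched in a few lines of Python. This is an illustration only, not the code behind gap or bestfit: the function name alignment_score and the scoring values (match, mismatch, gap) are hypothetical, and it assumes a single linear gap penalty rather than the separate creation and extension penalties that the GCG programs ask for.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False: global score (Needleman-Wunsch style, whole sequences).
    local=True:  local score (Smith-Waterman style, best matching region).
    Assumption: one linear gap penalty per gapped position.
    """
    # First row: leading gaps are penalised globally, free locally.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere at score 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


# The two sequences share only the region ACGT.
seq1, seq2 = "GGGGACGT", "ACGTCCCC"
print(alignment_score(seq1, seq2, local=True))   # 4, the shared ACGT
print(alignment_score(seq1, seq2, local=False))  # lower: the flanks must be aligned too
```

When two sequences are related over only part of their length, the local score picks out the shared region, while the global score is dragged down by the unrelated flanks.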
Exercise: Compare the results you get from bestfit and gap under various gap-penalty regimes.
1. PAPA_CARPA - SERA_PLAFG
What biological phenomenon do your observations reflect?
Lalign. Explain the results and propose a list of repeats.