Analysis of Two Sequences

1-MUTATIONS AND GENETIC CODE

1. Given a random ORF, what probability of appearance do you expect for each amino acid?

2. What are the two residues that are the most likely to mutate by chance?

3. Does this prediction correlate well with the observation?

MATERIAL: Pam250 and the Genetic Code
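
As a starting point for question 1, the expected frequency of each amino acid can be estimated by counting codons in the standard genetic code. The sketch below assumes that the four bases are equally likely and independent at every position, and that the ORF contains no internal stop codon, so each of the 61 sense codons has probability 1/61. The names CODON_TABLE and expected_frequencies are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its one-letter amino acid (or '*').
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expected_frequencies():
    """Expected amino-acid frequencies in a random ORF.

    Assumption: equiprobable, independent bases and no internal stop,
    so the frequency is simply (number of codons) / (number of sense codons).
    """
    counts = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
    sense_codons = sum(counts.values())  # 61 in the standard code
    return {aa: n / sense_codons for aa, n in counts.items()}


freqs = expected_frequencies()
for aa, p in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{aa}  {p:.4f}")
```

Under these assumptions Leu, Ser and Arg (six codons each) are expected at 6/61, about 9.8%, while Met and Trp (one codon each) are expected at 1/61, about 1.6%. A real genome with a biased base composition would shift these values.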

SOLUTION

The simplest form of multiple sequence analysis occurs when you have only two sequences. Analyses of two sequences can be carried out with the programs compare, gap and bestfit.

2-COMPARE & DOTPLOT

Compare compares two protein or nucleic acid sequences and creates a file of the points of similarity between them for plotting with dotplot. Compare finds the points using either a window/stringency or a word match criterion. The window/stringency comparison is the slower but more sensitive of the two.

Dotplot makes a dot-plot with the output file from compare, foldrna, or stemloop.
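
The window/stringency criterion can be illustrated with a short sketch. It is not the GCG implementation: it assumes that a position scores 1 for an identical symbol and 0 otherwise, and it records a point at the centre of every window whose score reaches the stringency. The function name window_points and the 0-based coordinates are choices made for this example.

```python
def window_points(horizontal, vertical, window=21, stringency=14):
    """Return (i, j) window centres whose identity count reaches `stringency`.

    Assumption: simple identity scoring (1 per match), 0-based coordinates.
    """
    points = []
    for i in range(len(horizontal) - window + 1):
        for j in range(len(vertical) - window + 1):
            # Count identical symbols in the two windows starting at i and j.
            matches = sum(
                a == b
                for a, b in zip(horizontal[i:i + window], vertical[j:j + window])
            )
            if matches >= stringency:
                points.append((i + window // 2, j + window // 2))
    return points


seq = "ACGTACGTAC"
strict = window_points(seq, seq, window=4, stringency=4)
loose = window_points(seq, seq, window=4, stringency=2)
print(len(strict), "points at stringency 4;", len(loose), "points at stringency 2")
```

Every pair of windows is compared, which is why this kind of search is slow on long sequences. Lowering the stringency for a fixed window can only add points, which is the source of the background noise in a dot-plot.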

Exercise:

To do this exercise, first fetch the following two sequences from the GCG database: the E. coli and Mycobacterium tuberculosis recA sequences, em:ecreca (v00328) and em:mtreca (x58485).

The default filenames are ecreca.em_ba and mtreca.em_ba.

To compare these, enter the following:

% compare

Compare compares two protein or nucleic acid sequences and creates a
file of the points of similarity between them for plotting with
DotPlot. Compare finds the points using either a
window/stringency or a word match criterion. The word comparison
is 1,000 times faster than the window/stringency comparison, but
somewhat less sensitive.

COMPARE what horizontal sequence ? em:ecreca

                  Begin (* 1 *) ?
                End (*  1391 *) ?
               Reverse (* No *) ?

to what vertical sequence (* ecreca.em_ba *) ? em:mtreca

                  Begin (* 1 *) ?
                End (*  2762 *) ?
               Reverse (* No *) ?
 What comparison window size (* 21 *) ?
 What stringency (* 14.0 *) ?

 What should I call the output file (* ecreca.pnt *) ?
 Number of points: 947
 Writing .......

To view this comparison, use dotplot as follows:

NB: You must set up your GCG graphics environment before using this or any other GCG graphics program (see Appendix I of this document).

% dotplot

DotPlot makes a dot-plot with the output file from Compare,
FoldRNA, or StemLoop.

DOTPLOT what point file ? ecreca.pnt

 ecreca.pnt contains COMPARE results of
    Axis                 Name   Check   Start     End   Dir
 Horizontal     ecreca.em_ba     3229       1    1391   for
 Vertical       mtreca.em_ba     8528       1    2762   for
             Window . . . . . . . . . 21
             Stringency . . . . . . . 14.0
             Number of points . . . . 947
             Percent of possible  . . 0.023
 The minimum density for a one-page plot is
 5118.3 bases/100 platen units on each axis.
 What point density would you like (* 2303.3 *) ?
 DOTPLOT will take 1 pages.  Would you like to:
       P)lot the points
       D)ifferent density
       G)et another point file to plot
       Q)uit
 Please select one (* P *):
       P)lot the points
       D)ifferent density
       G)et another point file to plot
       Q)uit
 Please select one (* Q *):

Ask yourself why there is a big discontinuity in the sequence comparison. Clue: Mycobacterium has inteins.

This graph can be improved by changing the settings for the compare command. The previous example used the default settings, and the plot contains a lot of confounding information (the small dots). This noise can be removed by increasing the window size, with the stringency raised along with it.

3-Comparison Using Dynamic Programming: GAP and Bestfit

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximises the number of matches and minimises the number of gaps.
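
The principle of the Needleman and Wunsch algorithm can be sketched in a few lines. This is a simplified illustration, not the Gap program: it assumes a single linear gap penalty rather than separate gap creation and gap extension penalties, and the scoring values (match=1, mismatch=-1, gap=-2) are arbitrary choices for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    Assumption: linear gap penalty (every gap position costs `gap`).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, line1, line2 = needleman_wunsch("ACGT", "AGT")
print(best)
print(line1)
print(line2)
```

Because the whole of both sequences must appear in the result, an unrelated region at either end is forced into the alignment and lowers the score.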

Exercise:

First fetch your sequences. Use the Haemophilus influenzae rec1 sequence and the Escherichia coli recA gene: em:hearec (L07521) and em:ecreca (v00328).

To do the gap comparison enter the following:

% gap

Gap uses the algorithm of Needleman and Wunsch to find the alignment of
two complete sequences that maximizes the number of matches and
minimizes the number of gaps.

GAP of what sequence 1 ? em:hearec

                  Begin (* 1 *) ?
                End (*  1484 *) ?
               Reverse (* No *) ?

to what sequence 2 (* hearec.em_ba *) ? em:ecreca

                  Begin (* 1 *) ?
                End (*  1391 *) ?
               Reverse (* No *) ?
 What is the gap creation penalty (* 50.00 *) ?
 What is the gap extension penalty (* 3.0 *) ?
 What should I call the paired output display file (* hearec.pair *) ?
 Aligning ..................................................
          ...................-.
 Aligning ..................................................
          ...................-.......
          Gaps:     4
       Quality: 7888
 Quality Ratio: 5.671
  % Similarity: 58.531
        Length:  1486

To view the alignment, type:

% more hearec.pair

It is much quicker to put all the instructions on the command line:

% gap em:hearec em:ecreca -def

Note, however, that "-def" (default) assumes that you want to align the whole of each sequence (and not, for example, just the coding sequence) and that you are happy with the default gap penalties. This will not always be true, especially as you move on to multiple sequence alignments.

Gap is a global alignment program, which attempts to align two complete sequences. Bestfit offers another way of aligning two sequences: it tries to find the best local alignment. Each is suited to a different sort of problem.
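
The difference between a global and a local alignment can be seen in a small sketch of the Smith and Waterman local alignment scheme. This is an illustration only, not the Bestfit program: it assumes a linear gap penalty and arbitrary scoring values, and the function name smith_waterman is chosen for this example.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment of a and b; returns (score, aligned_a, aligned_b).

    Assumption: linear gap penalty; cell scores are never allowed below 0.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a new local alignment here
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


local = smith_waterman("CCCGATTACACCC", "TTGATTACATT")
print(local)
```

Here only the shared GATTACA segment is reported and the unrelated flanking residues are ignored, whereas a global alignment would have to include them.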

Exercise:

Compare the results you get from bestfit and gap under various gap-penalty regimes.

4-Looking at the Structure of Genes

Using the same tools as in the above exercises, examine the following sequence pairs:

                           1. PAPA_CARPA - SERA_PLAFG
                           2. ANCALM - ANCALM_5
                           3. HS058362 - HS05836

What biological phenomenon do your observations reflect?

[SOLUTION]

5-Identification of Multiple Repeats

Do the same analysis as before using the sequence TF3A_XENLA. Make a dot-plot and then use Lalign. Explain the results and propose a list of repeats.

[SOLUTIONS]