The ABC of Bioinformatics

11. Significance Testing of Trees


Whatever data you put into Phylip, you can usually get out a tree of some kind, so you will want some assessment of how reliable that tree is. One of the standard methods of assessing the reliability of a tree generated from a dataset is bootstrapping. This involves random resampling, with replacement, of the sites (alignment columns) from which the original tree was derived. A new tree is drawn from each resampled dataset, and the procedure is repeated 100, 1000 or 10,000 times. When the resampling is finished you can see how often each particular branch is recovered among the bootstrap trees.
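To make the resampling step concrete, here is a minimal Python sketch of how one bootstrap replicate could be drawn from an alignment. It is an illustration only, not the code used by clustalw or Phylip; the function name bootstrap_alignment and the toy sequences are made up for this example.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns drawn at random with replacement."""
    n_sites = len(next(iter(alignment.values())))
    # Pick n_sites column indices independently, so some columns repeat
    # and others are left out of this replicate.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

# Toy alignment (made-up sequences, for illustration only).
alignment = {
    "seq_a": "MKTAYIAKQR",
    "seq_b": "MKTAYLAKQR",
    "seq_c": "MRTAYIGKQK",
}

rng = random.Random(559)  # seeded so that the run can be repeated
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate has the same number of sites as the original alignment, and every column in it is a whole column of the original, so the sequences stay aligned. A tree would then be built from each replicate in turn.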

With clustalw a bootstrapping module is incorporated into the neighbour joining tree drawing part of the program. Use option 4 from the main menu to get:

****** PHYLOGENETIC TREE MENU ******


    1.  Input an alignment
    2.  Exclude positions with gaps?        = OFF
    3.  Correct for multiple substitutions? = OFF
    4.  Draw tree now
    5.  Bootstrap tree
    6.  Output format options

    S.  Execute a system command
    H.  HELP
    or press [RETURN] to go back to main menu

Then take option 5 (almost certainly after first toggling options 2 and 3 to ON!)

Enter name for bootstrap output file   [reca6.phb]:

Enter seed no. for random number generator  (1..1000) [111]: 559

Enter number of bootstrap trials  (1..10000)    [1000]:

Each dot represents 10 trials

..........
..........
..........
..........
..........
..........
..........
..........
..........
..........

Bootstrap output file completed       [reca6.phb]

Note that the random number generator is a pseudorandom algorithm, so the numbers it produces are completely determined by the seed. Thus you will get exactly the same bootstrap values every time if you persist in taking the default seed of 111.
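The effect of the seed is easy to demonstrate with any pseudorandom generator. The sketch below uses Python's own generator, not the one inside clustalw, and the helper resample_sites is a made-up name for this example.

```python
import random

def resample_sites(n_sites, seed):
    """Draw n_sites column indices with replacement using a seeded generator."""
    rng = random.Random(seed)
    return [rng.randrange(n_sites) for _ in range(n_sites)]

same_a = resample_sites(20, seed=111)
same_b = resample_sites(20, seed=111)
other = resample_sites(20, seed=559)

print(same_a == same_b)  # the same seed always gives the same resampling
print(same_a == other)   # a different seed gives a different stream of numbers
```

If you want to check that your bootstrap values are not an accident of one particular resampling, repeat the run with a different seed.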

With Phylip it is a bit more complicated: bootstrapping is a multistep process. Let us suppose that you wish to bootstrap a tree of somatotropin genes drawn with PROTPARS. The first step is to run SEQBOOT:

% seqboot

/phylip/bin_phylip/seqboot:  can't read infile
Please enter a new filename>  soma.phy

Random number seed (must be odd)?
579

The program will then present you with:

Bootstrapped sequences algorithm, version 3.53c

Settings for this run:
  D   Sequence, Morph, Rest., Gene Freqs?  Molecular sequences
  J     Bootstrap, Jackknife, or Permute?  Bootstrap
  R                  How many replicates?  100
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)

Y

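In fact it might be better to enter R and reduce the number of replicates to 50 or 20 in a classroom setting, and to raise it to 1000 for real work. Whatever number you choose here should also be the number of data sets you later give to PROTPARS.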

completed replicate number   10
completed replicate number   20
completed replicate number   30
completed replicate number   40
completed replicate number   50
completed replicate number   60
completed replicate number   70
completed replicate number   80
completed replicate number   90
completed replicate number  100

Output written to output file called outfile !

% mv outfile soma.sqb

% protpars

/phylip/bin_phylip/protpars:  can't read infile
Please enter a new filename> soma.sqb

Protein parsimony algorithm, version 3.53c

Setting for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)

M

How many data sets?
50

Y (to indicate that all parameters are set and the analysis can begin)

protpars will clank through the 50 bootstrapped data sets in the input file, printing out its progress for each one thus:

Data set # 8:

Adding species:
   SOMA_BOVIN
   SOMA_SHEEP
   SOMA_MOUSE
   SOMA_RAT  
   SOMA_RABIT
   SOMA_PIG  
   SOMA_HUMAN
doing global rearrangements
  !-------------!
   .............

until eventually it declares:

Output written to output file

Trees also written onto file

% mv treefile soma.sqbtree
% mv outfile soma.sqbout

You will now need to use Phylip's CONSENSE with the New Hampshire format multiple tree file as input to determine how many times each branch of the most parsimonious tree is replicated in the bootstrapped dataset.
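The counting that CONSENSE does can be sketched in a few lines of Python. This is a simplified illustration, not the CONSENSE algorithm: it counts groups of species exactly as they are bracketed in each tree (that is, it treats the trees as rooted), and the species names and trees below are made up.

```python
import re
from collections import Counter

def clades(newick):
    """Return the set of bracketed groups (frozensets of names) in one tree."""
    tokens = re.findall(r"[(),;]|:[^(),;]+|[^(),;:\s]+", newick)
    stack, found, prev = [], set(), None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            group = frozenset(stack.pop())
            found.add(group)
            if stack:
                stack[-1] |= group  # members also belong to the enclosing group
        elif tok not in (",", ";") and not tok.startswith(":") and prev != ")":
            stack[-1].add(tok)      # a species name (internal labels are skipped)
        prev = tok
    return found

def clade_support(trees):
    """Count in how many trees each group appears, ignoring the all-species group."""
    counts = Counter()
    for tree in trees:
        groups = clades(tree)
        everything = max(groups, key=len)
        counts.update(g for g in groups if g != everything)
    return counts

# Four made-up bootstrap trees in New Hampshire format.
trees = [
    "((BOVIN,SHEEP),(MOUSE,RAT),HUMAN);",
    "((BOVIN,SHEEP),((MOUSE,RAT),HUMAN));",
    "(((BOVIN,SHEEP),HUMAN),(MOUSE,RAT));",
    "((BOVIN,HUMAN),(MOUSE,RAT),SHEEP);",
]
support = clade_support(trees)
for group, n in support.most_common():
    print(sorted(group), n)
```

A group that turns up in most or all of the bootstrap trees is well supported; one that turns up only occasionally is not.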

incbi@acer>consense
phylip/bin_phylip/consense: can't read infile
Please enter a new filename>
soma.sqbtree

Majority-rule and strict consensus tree program, version 3.53c

Settings for this run:
  O                        Outgroup root?  No, use as outgroup species  1
  R        Trees to be treated as Rooted?  No
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1         Print out the sets of species  Yes
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)

Y

There are two output files from consense. The file outfile has the most accessible information, although it is hardly of camera-ready quality. Each branch in this file is labelled with the number of bootstrap replicates that supported it. Don't draw the consense treefile expecting to see the bootstrap values printed at the branches: consense writes them into the New Hampshire format tree in such a way that tree-drawing programs interpret them as branch lengths.
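Because the bootstrap counts sit where branch lengths normally go, a short script can read them back out of the treefile. The sketch below assumes a tree string of the general shape described above; the string itself, the species names and the function internal_support are made up for illustration, so check them against your own treefile before relying on this.

```python
import re

def internal_support(newick, n_replicates):
    """Read the number after each ')' as a bootstrap count; return percentages."""
    counts = [float(x) for x in re.findall(r"\):([0-9.]+)", newick)]
    return [100.0 * c / n_replicates for c in counts]

# Hypothetical consensus tree from 50 replicates (names and numbers invented).
tree = "((BOVIN:50.0,SHEEP:50.0):47.0,(MOUSE:50.0,RAT:50.0):50.0,HUMAN:50.0);"
print(internal_support(tree, n_replicates=50))
```

Only the numbers that follow a closing bracket are read, since these belong to internal branches; the values are returned in the order in which the brackets close.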

You should now practice the use of these programs with a real dataset of your own or use SRS at the EBI to download a family or partial family of proteins, such as:

mammalian protein sequences: cytochrome C, alcohol dehydrogenase, actin alpha, actin beta, enolase, catalase, cathepsin D, HSP70 (heat shock protein), hexokinase, histone H3, lactate dehydrogenase, osteonectin/SPARC, pyruvate kinase, somatotropin, spectrin alpha, spectrin beta, Thy-1 (membrane) glycoprotein, triose phosphate isomerase, tubulin alpha, tubulin beta.

within prokaryotes: flagellin, recA, glnA/glutamine synthetase


Appendix I Graphics Output

Much of the strength of GCG lies in its graphical display capabilities. However, graphics are very much dependent on the terminal you are using, so it is not possible to set one correct set of parameters for everyone; rather, each user must determine which graphics display mode best suits their local situation.

To set up the graphics output interface you must run one of the following programs. It may be necessary to switch between graphics output modes within one session: 'postscript' can be used for hard copy and final printouts, while 'tektronix' and 'xwindows' are better for instant views on the screen.

% postscript
% tektronix      for PC and Mac terminals
% xwindows       for X-terminals, including Macs running eXodus or MacX

In the first instance try the defaults (hit <return>) for these set-up programs: this will show you the available options.

The following command lines cover several of the more common situations.

To create a colour display Xwindow called 'screen'
% xwindows color screen

For running eXodus on a Mac with memory limitations try
% xwindows mono picture

Note: if you have Xwindows capability, you may want to use the Wisconsin Package (X-)Interface called seqlab. To run this type:
% seqlab -small

All the GCG programs are then available from a window with menus.

For Macs running NCSA-Telnet:
% tektronix tek4107 term

For Macs running versaterm:
% tektronix versaterm term

For PC running kermit:
% tektronix tek4014 term

Several other options are available. To create a postscript file, which you can transfer to a postscript printer later:
% postscript laserwriter output.PS


Appendix II

Ways of representing sequences for findpatterns and motifs.

SYMBOL               MEANING
Round brackets ()    Enclose one or more symbols that can be repeated some number of times
Curly brackets {}    Enclose numbers that tell how many times the symbols within the preceding round brackets must be found. One or both of the numbers within the curly brackets may be missing.

Expressions:

TYPE   EXAMPLE              MEANING
OR     RGF(Q,A)S            RGF followed by either Q or A followed by S.
       GAT(TG,T,G){1,4}A    GAT followed by any combination of TG, T and G 1 to 4 times, followed by A; for example, GATTGGA matches this pattern.
NOT    GC~CAT               GC followed by any symbol except C, followed by AT.
       GC~(A,T)CC           GC followed by any symbol except A or T, followed by CC.
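If you know regular expressions from other tools, it may help to see the four example patterns above written that way. The translations below were made by hand into Python's regular expression syntax for comparison only; they are not how findpatterns works internally, and the helper name matches is made up.

```python
import re

# Hand-translated Python equivalents of the four example patterns.
patterns = {
    "RGF(Q,A)S": r"RGF[QA]S",
    "GAT(TG,T,G){1,4}A": r"GAT(?:TG|T|G){1,4}A",
    "GC~CAT": r"GC[^C]AT",
    "GC~(A,T)CC": r"GC[^AT]CC",
}

def matches(gcg_pattern, sequence):
    """True if the whole sequence matches the translated pattern."""
    return re.fullmatch(patterns[gcg_pattern], sequence) is not None

print(matches("GAT(TG,T,G){1,4}A", "GATTGGA"))  # the example given in the table
print(matches("RGF(Q,A)S", "RGFQS"))
print(matches("GC~CAT", "GCCAT"))               # C is excluded at the third position
```

The round brackets with commas become a choice between alternatives, the curly brackets carry over unchanged as a repeat count, and the ~ (NOT) becomes a negated character class.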


Appendix III Sequence Symbols

GCG uses the letter codes for amino acid codes and nucleotide ambiguity proposed by IUB (Nomenclature Committee, 1985, Eur. J. Biochem. 150; 1-5). These codes are compatible with the codes used by the EMBL, GenBank, and PIR databases.

Nucleotides

IUB/GCG   MEANING                 COMPLEMENT
A         A                       T
C         C                       G
G         G                       C
T/U       T                       A
M         A or C                  K
R         A or G                  Y
W         A or T                  W
S         C or G                  S
Y         C or T                  R
K         G or T                  M
V         A or C or G             B
H         A or C or T             D
D         A or G or T             H
B         C or G or T             V
X/N       G or A or T or C        X
.         not G or A or T or C    .
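The COMPLEMENT values listed above are all that is needed to reverse-complement a sequence containing ambiguity codes. Here is a minimal Python sketch; the dictionary simply restates the table, and the function name reverse_complement is chosen for this example.

```python
# Complement of each symbol, copied from the table above.
# The table gives X as the complement of both X and N.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S", "Y": "R", "K": "M",
    "V": "B", "H": "D", "D": "H", "B": "V",
    "X": "X", "N": "X", ".": ".",
}

def reverse_complement(seq):
    """Complement every symbol and read the result in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("GATTRYM"))  # → KRYAATC
```

Note that W and S are their own complements, which is why a site such as ACGT reads the same on both strands.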


Amino Acids

SYMBOL   MEANING      CODONS                          IUB DEPICTION
A        Ala          GCT, GCC, GCA, GCG              !GCX
B        Asp, Asn     GAT, GAC, AAT, AAC              !RAY
C        Cys          TGT, TGC                        !TGY
D        Asp          GAT, GAC                        !GAY
E        Glu          GAA, GAG                        !GAR
F        Phe          TTT, TTC                        !TTY
G        Gly          GGT, GGC, GGA, GGG              !GGX
H        His          CAT, CAC                        !CAY
I        Ile          ATT, ATC, ATA                   !ATH
K        Lys          AAA, AAG                        !AAR
L        Leu          TTG, TTA, CTT, CTC, CTA, CTG    !TTR, CTX, YTR
M        Met          ATG                             !ATG
N        Asn          AAT, AAC                        !AAY
P        Pro          CCT, CCC, CCA, CCG              !CCX
Q        Gln          CAA, CAG                        !CAR
R        Arg          CGT, CGC, CGA, CGG, AGA, AGG    !CGX, AGR, MGR
S        Ser          TCT, TCC, TCA, TCG, AGT, AGC    !TCX, AGY
T        Thr          ACT, ACC, ACA, ACG              !ACX
V        Val          GTT, GTC, GTA, GTG              !GTX
W        Trp          TGG                             !TGG
X        Unknown                                      !XXX
Y        Tyr          TAT, TAC                        !TAY
Z        Glu, Gln     GAA, GAG, CAA, CAG              !SAR
*        Terminator   TAA, TAG, TGA                   !TAR, TRA
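The IUB DEPICTION entries are just the CODONS entries written compactly with the nucleotide ambiguity codes. The Python sketch below expands a depiction (written without its leading "!") back into the codons it stands for; the function name expand is made up for this example.

```python
from itertools import product

# Bases covered by each nucleotide symbol (from the Nucleotides table).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}

def expand(depiction):
    """List every unambiguous codon covered by an ambiguous one such as 'RAY'."""
    return ["".join(codon) for codon in product(*(IUB[s] for s in depiction))]

print(expand("RAY"))  # the codons for B (Asp or Asn)
print(expand("ATH"))  # the codons for I (Ile)
```

Trying this on the third depiction for Leu (YTR) or Arg (MGR) shows that it only re-describes codons already covered by the first two.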


The Universal Genetic Code.

Phe  UUU    Ser  UCU    Tyr  UAU    Cys  UGU
     UUC         UCC         UAC         UGC
Leu  UUA         UCA    ter  UAA    ter  UGA
     UUG         UCG    ter  UAG    Trp  UGG

Leu  CUU    Pro  CCU    His  CAU    Arg  CGU
     CUC         CCC         CAC         CGC
     CUA         CCA    Gln  CAA         CGA
     CUG         CCG         CAG         CGG

Ile  AUU    Thr  ACU    Asn  AAU    Ser  AGU
     AUC         ACC         AAC         AGC
     AUA         ACA    Lys  AAA    Arg  AGA
Met  AUG         ACG         AAG         AGG

Val  GUU    Ala  GCU    Asp  GAU    Gly  GGU
     GUC         GCC         GAC         GGC
     GUA         GCA    Glu  GAA         GGA
     GUG         GCG         GAG         GGG
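As a small worked example of using the code, the Python sketch below builds the same 64-codon table and translates a short RNA sequence. The one-letter symbols are those of the Amino Acids table in Appendix III, with * for a terminator; the function name translate and the test sequence are made up.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids for UUU, UUC, UUA, UUG, UCU, ... GGG, in that order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(rna):
    """Translate an RNA string codon by codon, stopping at the first terminator."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGUUUGGCUGGUAA"))  # → MFGW
```

The sketch reads only the first frame and expects U rather than T; a DNA sequence would first need its T's replaced by U's.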

Appendix IV

Biochemically meaningful grouping of Amino Acids

"strong groups"                  "weak groups"
Marked with a ":" in clustalW    Marked with a "." in clustalW
STA                              CSA
NEQK                             ATV
NHQK                             SAG
NDEQ                             STNK
QHRK                             SGND
MILV                             SNDEQK
MILF                             NDEQHK
HY                               NEQHRK
FYW                              FVLIM
                                 HFY
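The way these groups turn into the marks under an alignment can be sketched as follows. This Python fragment is an illustration using only the groups listed above, not clustalW's own code; it also assumes the usual clustalW convention that a fully identical column is marked with "*", and the function name conservation_symbol is made up.

```python
# Groups copied from the table above.
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "SGND", "SNDEQK", "NDEQHK", "NEQHRK",
        "FVLIM", "HFY"]

def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column (one residue per sequence)."""
    residues = set(column.upper())
    if "-" in residues:
        return " "   # a column containing a gap gets no mark
    if len(residues) == 1:
        return "*"   # every sequence has the same residue
    if any(residues <= set(group) for group in STRONG):
        return ":"   # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."   # all residues fall within one weak group
    return " "

for column in ["IIII", "ILVM", "SAGS", "KDWK"]:
    print(column, repr(conservation_symbol(column)))
```

Strong groups are tested before weak ones, so a column that fits both is given the stronger mark.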