The ABC of Bioinformatics

11. Significance Testing of Trees

Whatever data you put into Phylip, you can usually get out a tree of some kind. Obviously you will want some assessment of how reliable such a tree is. One of the standard methods of determining the reliability of a tree generated from a dataset is bootstrapping. This involves random resampling, with replacement, of the sites (alignment columns) from which the original tree was derived. A new tree is drawn from each resampled dataset, and this procedure is repeated 100, 1000 or 10,000 times. After all the resampling you can see how often each particular branch is supported under the bootstrapping regime.
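The resampling step can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used by clustalw or Phylip; the function name bootstrap_alignment and the three toy sequences are made up for the example.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns drawn at random, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    # Pick n_sites column indices; the same column may be picked more than once.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

# Toy alignment (hypothetical sequences, for illustration only).
alignment = {
    "seq_a": "MKTAYIAKQR",
    "seq_b": "MKTAHIAKQR",
    "seq_c": "MRTAYLAKQK",
}

rng = random.Random(559)  # fixed seed so the run can be repeated
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate has the same number of sites as the original alignment, but some columns appear several times and others not at all. A tree would then be drawn from every replicate.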

With clustalw a bootstrapping module is incorporated into the neighbour joining tree drawing part of the program. Use option 4 from the main menu to get:

****** PHYLOGENETIC TREE MENU ******


    1.  Input an alignment
    2.  Exclude positions with gaps?        = OFF
    3.  Correct for multiple substitutions? = OFF
    4.  Draw tree now
    5.  Bootstrap tree
    6.  Output format options

    S.  Execute a system command
    H.  HELP
    or press [RETURN] to go back to main menu

Then take option 5 (almost certainly after first toggling options 2 and 3 to ON):

Enter name for bootstrap output file   [reca6.phb]:

Enter seed no. for random number generator  (1..1000) [111]: 559

Enter number of bootstrap trials  (1..10000)    [1000]:

Each dot represents 10 trials

..........
..........
..........
..........
..........
..........
..........
..........
..........
..........

Bootstrap output file completed       [reca6.phb]

Note that the random number generator is one of those pseudorandom algorithms. Thus you will get exactly the same bootstrap values if you persist in taking the default 111 seed.
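The effect of the seed is easy to demonstrate. The sketch below uses Python's own pseudorandom generator, which is not the algorithm used by clustalw; the helper resample_sites is hypothetical and simply draws site indices with replacement.

```python
import random

def resample_sites(n_sites, seed):
    """Draw n_sites site indices with replacement, using a generator started from seed."""
    rng = random.Random(seed)
    return [rng.randrange(n_sites) for _ in range(n_sites)]

same_a = resample_sites(20, seed=111)
same_b = resample_sites(20, seed=111)
other = resample_sites(20, seed=559)

print(same_a == same_b)  # True: identical seed, identical resampling
print(same_a == other)   # a different seed gives a different series of draws
```

This is why it is worth entering your own seed, and noting it down if you want to be able to repeat the run exactly.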

With Phylip, bootstrapping is a bit more complicated and is a multistep process. Let us suppose that you wish to bootstrap a tree of somatotropin genes drawn with PROTPARS. The first step is to run SEQBOOT:

% seqboot

/phylip/bin_phylip/seqboot:  can't read infile
Please enter a new filename>  soma.phy

Random number seed (must be odd)?
579

The program will then present you with:

Bootstrapped sequences algorithm, version 3.53c

Settings for this run:
  D   Sequence, Morph, Rest., Gene Freqs?  Molecular sequences
  J     Bootstrap, Jackknife, or Permute?  Bootstrap
  R                  How many replicates?  100
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)

In fact it might be better to enter R and reduce the number of replicates to 20 or 50 in a classroom setting, and to raise it to 1000 for real work.

completed replicate number   10
completed replicate number   20
completed replicate number   30
completed replicate number   40
completed replicate number   50
completed replicate number   60
completed replicate number   70
completed replicate number   80
completed replicate number   90
completed replicate number  100

Output written to output file called outfile !

% mv outfile soma.sqb

% protpars

/phylip/bin_phylip/protpars:  can't read infile
Please enter a new filename> soma.sqb

Protein parsimony algorithm, version 3.53c

Setting for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)

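Type M to analyse multiple data sets. The program then asks how many; answer with the number of replicates written by seqboot:

How many data sets?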
50

Y (to indicate that all parameters are set and the analysis can begin)

protpars will clank through the 50 data sets in the input file, printing its progress for each one thus:

Data set # 8:

Adding species:
   SOMA_BOVIN
   SOMA_SHEEP
   SOMA_MOUSE
   SOMA_RAT  
   SOMA_RABIT
   SOMA_PIG  
   SOMA_HUMAN
doing global rearrangements
  !-------------!
   .............

until eventually it declares:

Output written to output file

Trees also written onto file

% mv treefile soma.sqbtree
% mv outfile soma.sqbout

You will now need to use Phylip's CONSENSE with the New Hampshire format multiple tree file as input to determine how many times each branch of the most parsimonious tree is replicated in the bootstrapped dataset.

% consense

/phylip/bin_phylip/consense:  can't read infile
Please enter a new filename> soma.sqbtree

Majority-rule and strict consensus tree program, version 3.53c

Settings for this run:
  O                        Outgroup root?  No, use as outgroup species  1
  R        Trees to be treated as Rooted?  No
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1         Print out the sets of species  Yes
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)

There are two output files from consense. The file outfile has the most accessible information, although it is hardly of camera-ready quality. Each branch in this file is labelled with the number of bootstrap replicates in which that branch appeared. Don't draw the consense treefile expecting to see the bootstrap values printed at the branches: consense writes them into the New Hampshire format tree in such a way that they are interpreted as branch lengths.
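The counting that lies behind these numbers can be sketched in Python. This is a simplified illustration and not the consense algorithm itself: it assumes New Hampshire trees without internal labels, describes each internal branch by the group of species on the side away from a reference species, and counts how many trees contain each group. The three topologies below are made up for the example and are not real protpars output.

```python
import re
from collections import Counter

def splits(newick, reference):
    """Return the internal branches of one tree, each as the species set lacking `reference`."""
    # Drop any branch lengths so that only names, brackets and commas remain.
    text = re.sub(r":[^,()\s;]+", "", newick)
    tokens = re.findall(r"[(),]|[^(),;\s]+", text)
    stack, groups = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            group = stack.pop()
            groups.append(frozenset(group))
            if stack:
                stack[-1] |= group   # pass the species up to the enclosing bracket
        elif tok != ",":
            stack[-1].add(tok)       # a species name
    everything = max(groups, key=len)
    result = set()
    for group in groups:
        side = group if reference not in group else everything - group
        # Skip trivial splits (a single species, or the whole tree).
        if 1 < len(side) < len(everything) - 1:
            result.add(side)
    return result

# Hypothetical replicate trees, for illustration only.
trees = [
    "((SOMA_BOVIN,SOMA_SHEEP),(SOMA_MOUSE,SOMA_RAT),SOMA_HUMAN);",
    "((SOMA_BOVIN,SOMA_SHEEP),SOMA_MOUSE,(SOMA_RAT,SOMA_HUMAN));",
    "(SOMA_BOVIN,SOMA_SHEEP,(SOMA_HUMAN,(SOMA_MOUSE,SOMA_RAT)));",
]

counts = Counter()
for tree in trees:
    counts.update(splits(tree, reference="SOMA_BOVIN"))

for group, n in counts.most_common():
    print(n, "of", len(trees), sorted(group))
```

In this toy example the branch separating mouse, rat and human from cow and sheep is found in all three trees, while the mouse plus rat grouping is found in only two.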

You should now practise using these programs, either with a real dataset of your own or by using SRS at the EBI to download a family or partial family of proteins, such as:

mammalian protein seqs: cytochrome C, alcohol dehydrogenase, actin alpha, actin beta, enolase, catalase, cathepsin D, HSP70 (heat shock protein), hexokinase, histone H3, lactate dehydrogenase, osteonectin/SPARC, pyruvate kinase, somatotropin, spectrin alpha, spectrin beta, Thy-1 (membrane) glycoprotein, triose phosphate isomerase, tubulin alpha, tubulin beta.

within prokaryotes: flagellin, recA, glnA/glutamine synthetase

Appendix I Graphics Output

Much of the strength of GCG lies in its graphical display capabilities. However, graphics are very much dependent on the terminal you are using. It is therefore not possible to set parameters that are correct for every user; rather, each user must determine which graphics display mode best suits their local situation.

To set up the graphics output interface you must run one of the following programs. It may be necessary to switch between graphics output modes in one session. 'postscript' can be used for hard copy and final print outs, while 'tektronix' and 'xwindows' are better for instant views on the screen.

% postscript
% tektronix    for PC and Mac terminals
% xwindows     for X-terminals, including Macs running eXodus or MacX

In the first instance try the defaults (hit <return>) for these set-up programs: this will show you the available options.

The following command lines cover several of the more common situations.

To create a colour display Xwindow called 'screen'
% xwindows color screen

For running eXodus on a Mac with memory limitations try
% xwindows mono picture

Note: if you have Xwindows capability, you may want to use the Wisconsin Package (X-)Interface called seqlab. To run this type:
% seqlab -small

All the GCG programs are then available from a window with menus.

For Macs running NCSA-Telnet:
% tektronix tek4107 term

For Macs running versaterm:
% tektronix versaterm term

For PC running kermit:
% tektronix tek4014 term

Several other options are available. To create a postscript file, which you can transfer to a postscript printer later:
% postscript laserwriter output.PS

Appendix II

Ways of representing sequences for findpatterns and motifs.

SYMBOL	MEANING
Round brackets ()	Enclose one or more symbols that can be repeated some number of times.
Curly brackets {}	Enclose numbers that tell how many times the symbols within the preceding round brackets must be found. One or both of the numbers within the curly brackets may be missing.

Expressions:

TYPE	EXAMPLE	MEANING
OR	RGF(Q,A)S	RGF followed by either Q or A followed by S.
	GAT(TG,T,G){1,4}A	GAT followed by any combination of TG, T and G 1 to 4 times, followed by A, for example GATTGGA matches this pattern.
NOT	GC~CAT	GC followed by any symbol except C, followed by AT.
	GC~(A,T)CC	GC followed by any symbol except A or T, followed by CC.
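Readers who know regular expressions may find it helpful to see how the examples in this table map onto them. The Python sketch below is not part of GCG; the function findpatterns_to_regex is hypothetical and handles only the OR, repeat and NOT forms shown above, with both repeat numbers present. It does not expand ambiguity symbols.

```python
import re

def findpatterns_to_regex(pattern):
    """Translate the OR, repeat and NOT forms shown above into a Python regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "~":
            # NOT: applies to the next symbol, or to a bracketed list of single symbols.
            if pattern[i + 1] == "(":
                close = pattern.index(")", i)
                excluded = pattern[i + 2:close].replace(",", "")
                i = close + 1
            else:
                excluded = pattern[i + 1]
                i += 2
            out.append("[^" + excluded + "]")
        elif ch == "(":
            # OR: comma-separated alternatives become a non-capturing group.
            close = pattern.index(")", i)
            out.append("(?:" + "|".join(pattern[i + 1:close].split(",")) + ")")
            i = close + 1
        elif ch == "{":
            # Repeat counts are written {min,max} in both notations.
            close = pattern.index("}", i)
            out.append(pattern[i:close + 1])
            i = close + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

for gcg in ["RGF(Q,A)S", "GAT(TG,T,G){1,4}A", "GC~CAT", "GC~(A,T)CC"]:
    print(gcg, "->", findpatterns_to_regex(gcg))

# The example from the table: GATTGGA is GAT, then TG and G, then A.
print(bool(re.fullmatch(findpatterns_to_regex("GAT(TG,T,G){1,4}A"), "GATTGGA")))  # True
```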

Appendix III Sequence Symbols

GCG uses the letter codes for amino acids and nucleotide ambiguity proposed by the IUB (Nomenclature Committee, 1985, Eur. J. Biochem. 150: 1-5). These codes are compatible with the codes used by the EMBL, GenBank, and PIR databases.

Nucleotides

IUB/GCG	MEANING	COMPLEMENT
A	A	T
C	C	G
G	G	C
T/U	T	A
M	A or C	K
R	A or G	Y
W	A or T	W
S	C or G	S
Y	C or T	R
K	G or T	M
V	A or C or G	B
H	A or C or T	D
D	A or G or T	H
B	C or G or T	V
X/N	G or A or T or C	X
.	not G or A or T or C	.
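The COMPLEMENT column is all that is needed to reverse-complement a sequence that contains ambiguity symbols. The Python sketch below copies the table into a dictionary; it is an illustration only and not a GCG program, and the function name reverse_complement is made up. Following the table, U is complemented to A and N to X.

```python
# Complement of each symbol, copied from the table above.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S", "Y": "R", "K": "M",
    "V": "B", "H": "D", "D": "H", "B": "V",
    "X": "X", "N": "X", ".": ".",
}

def reverse_complement(seq):
    """Read the sequence backwards and replace each symbol by its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("GATTACA"))  # TGTAATC
print(reverse_complement("AMR"))      # YKT
```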

Amino Acids

SYMBOL	MEANING	CODONS	IUB DEPICTION
A	Ala	GCT, GCC, GCA, GCG	!GCX
B	Asp, Asn	GAT, GAC, AAT, AAC	!RAY
C	Cys	TGT, TGC	!TGY
D	Asp	GAT, GAC	!GAY
E	Glu	GAA, GAG	!GAR
F	Phe	TTT, TTC	!TTY
G	Gly	GGT, GGC, GGA, GGG	!GGX
H	His	CAT, CAC	!CAY
I	Ile	ATT, ATC, ATA	!ATH
K	Lys	AAA, AAG	!AAR
L	Leu	TTG, TTA, CTT, CTC, CTA, CTG	!TTR, CTX, YTR
M	Met	ATG	!ATG
N	Asn	AAT, AAC	!AAY
P	Pro	CCT, CCC, CCA, CCG	!CCX
Q	Gln	CAA, CAG	!CAR
R	Arg	CGT, CGC, CGA, CGG, AGA, AGG	!CGX, AGR, MGR
S	Ser	TCT, TCC, TCA, TCG, AGT, AGC	!TCX, AGY
T	Thr	ACT, ACC, ACA, ACG	!ACX
V	Val	GTT, GTC, GTA, GTG	!GTX
W	Trp	TGG	!TGG
X	Unknown		!XXX
Y	Tyr	TAT, TAC	!TAY
Z	Glu, Gln	GAA, GAG, CAA, CAG	!SAR
*	Terminator	TAA, TAG, TGA	!TAR, TRA
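The IUB DEPICTION column is simply the CODONS column written with ambiguity symbols, and the correspondence can be checked by expanding each symbol. The Python sketch below is an illustration only; the function name expand is made up, and the dictionary repeats the nucleotide meanings listed earlier in this appendix.

```python
from itertools import product

# Bases represented by each nucleotide symbol (from the Nucleotides table).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}

def expand(depiction):
    """List every codon covered by an ambiguous depiction such as 'ATH'."""
    choices = (IUB[symbol] for symbol in depiction)
    return sorted("".join(codon) for codon in product(*choices))

print(expand("ATH"))  # ['ATA', 'ATC', 'ATT']  (Ile)
print(expand("RAY"))  # ['AAC', 'AAT', 'GAC', 'GAT']  (Asp or Asn)
print(sorted(set(expand("TAR") + expand("TRA"))))  # ['TAA', 'TAG', 'TGA']  (Terminator)
```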

The Universal Genetic Code.

Phe	UUU	Ser	UCU	Tyr	UAU	Cys	UGU
	UUC		UCC		UAC		UGC
Leu	UUA		UCA	ter	UAA	ter	UGA
	UUG		UCG	ter	UAG	Trp	UGG

Leu	CUU	Pro	CCU	His	CAU	Arg	CGU
	CUC		CCC		CAC		CGC
	CUA		CCA	Gln	CAA		CGA
	CUG		CCG		CAG		CGG

Ile	AUU	Thr	ACU	Asn	AAU	Ser	AGU
	AUC		ACC		AAC		AGC
	AUA		ACA	Lys	AAA	Arg	AGA
Met	AUG		ACG		AAG		AGG

Val	GUU	Ala	GCU	Asp	GAU	Gly	GGU
	GUC		GCC		GAC		GGC
	GUA		GCA	Glu	GAA		GGA
	GUG		GCG		GAG		GGG
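The table can be held very compactly in a program. The Python sketch below is one common way of doing it, not taken from GCG or Phylip: the 64 codons are generated in the order T, C, A, G at each position and paired with a 64-letter string of one-letter amino acid symbols, with * for a terminator. The names CODON_TABLE and translate are made up for the example.

```python
BASES = "TCAG"
# One-letter symbols for the 64 codons in the order TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}

def translate(seq):
    """Translate a DNA or RNA sequence in frame 1; a trailing partial codon is ignored."""
    seq = seq.upper().replace("U", "T")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

print(translate("AUGGCUUGGUAA"))  # MAW*
```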

Appendix IV

Biochemically meaningful grouping of Amino Acids

Marked with a ":" in clustalW	Marked with a "." in clustalW
"strong groups"	"weak groups"
STA	CSA
NEQK	ATV
NHQK	SAG
NDEQ	STNK
QHRK	SGND
MILV	SNDEQK
MILF	NDEQHK
HY	NEQHRK
FYW	FVLIM
	HFY
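The way these groups turn into marks under an alignment can be sketched in Python. This is an illustration and not the clustalW source code: the groups are copied from the table above, the function name conservation_mark is made up, and the "*" for a fully identical column is clustalW's usual mark for identity rather than something taken from this table.

```python
# Groups copied from the table above.
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "SGND", "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def conservation_mark(column):
    """Return '*', ':', '.' or ' ' for one alignment column given as a string of residues."""
    residues = set(column.upper())
    if len(residues) == 1 and "-" not in residues:
        return "*"   # every sequence has the same residue
    if any(residues <= set(group) for group in STRONG):
        return ":"   # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."   # all residues fall within one weak group
    return " "       # no conservation by these criteria (includes columns with gaps)

for column in ["KKK", "MIL", "SAG", "WDK", "A-A"]:
    print(column, repr(conservation_mark(column)))
```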