Contrary to widely held belief, "homology searching", "Blast" and "the
NCBI basic blast server" are not all synonyms ! The GCG programs
that you will use for homology searching are blast netblast and
fasta. We will also be using some WWW blast servers.
Blast finds sequences in a database that are similar to a query sequence.
It uses the method of Altschul et al.(J. Mol. Biol. 215; 403-410
(1990)). The query sequence and the database to be searched can be either
peptide or nucleotide in any combination
So, for example, you can search your peptide sequence against the Genbank
or EMBL DNA database translated in all six reading frames. This might be
particularly appropriate if you wanted to find homologues in the EST (Expressed
Sequence Tags) database.
blastp; compares a protein sequence against a protein
blastn; compares a NA sequence against a NA DB
blastx; dynamically translates a NA query in all
6 reading frames and compares it to a protein DB
tblastn; compares a protein query against a NA database
translated in all six reading frames
tblastx; translates both input NA and NA database
6x and compares !
Blast finds regions of relative similarity between your query sequence
and a given database. These regions of similarity are known as segment
pairs. The segment pairs found have the property that the sum of the scoring
matrix of their constituent symbol pairs is above what you may expect to
occur by chance.
Homology searching is one of the most computationally intensive things
you can do in bioinformatics. Care should be taken to ensure whether local
computers are able to deal with your search in an effective time-scale.
You can use such UNIX based protocols as uptime or top(if
installed) to try to determine if running a blast search to the current
system load is likely to slow everything down unacceptably. GCG 'netblast'
enables you to carry out your searches on a dedicated remote blast server,
this could be recommended if:
Alternatively, there are a variety of WWW based servers which will do homology
searches for you. You will compare and contrast the following web servers:
as well as GCG blast and GCG netblast programs.
you have good network connections to the USA,
if your local server is small or underpowered,
if you need access to daily updated information and
if you choose not to work when America is "awake".
Options and parameters
All the homology searching programs come with several options and choices
to tailor a search to yield the results you want. Some of these parameters
change the quality of the search while others merely change the amount
of information (eg number of hits) that is returned to you.
It is perfectly legitimate to run a homology search taking the default
parameters: those defaults have been chosen to achieve the "greatest good
for the greatest number". If your query sequence turns out to be a member
of a well established and widely sequenced family and you merely want to
establish this fact, then you will almost certainly get the same result
regardless of what options you choose. For less trivial searches/questions,
however, such as finding distant relationships, one homology search is
almost certainly inadequate.
You may wish to change these either to avoid a deluge of output or to get
beyond obvious matches to those with lower scores that are potentially
Running searches with different substitution matrices
will help you find different sorts of hits: PAM30 will preferentially find
homologues that are evolutionarily close - strong short matching segments;
on the other hand PAM250 will tend to find long, weak, diffuse matches
typical of more distantly related proteins. GCG option: -mat=PAM30
You will possibly find it useful to run searches
both with/without filtering out repeated sequence or employing low-complexity
masking: these options XXXX out regions of the protein which are particularly
biassed - proline or histidine rich for example. The defaults on the blast
servers at NCBI and the EBI for masking are not the same! GCG option: -fil
=xs. GCG 9 incorporates two self-standing programs xnu and seg
which allow you to carry out two sorts of filtering on protein sequences.
For very short sequences blast may not return any
hits at all unless the Expectation Cutoff is set very high (E=1000 is the
maximum). GCG option: -exp=1000. Alternatively you may wish to exclude
hits other than those that a highly significant biologically. GCG option:
Defaults are also set to determine the amount of
the number of hits reported. GCG option:-lis=500
the number of alignments (segments) based on these
matches that are shown. GCG option:-seg=100
For the WWW servers you should be able to find the equivalent switches,
using the hypertext-linked help information for each option. For obvious
reasons the number of available options is limited on the Basic Blast page.
For reproducibility you should note the time, date and database version
and size against which you carried out the homology search, this information
should appear at the head of the output
Hints and suggestions
If your DNA sequence is coding, then translate it
and use blast. For one thing, mutational biases at silent third positions
could give you spuriously good matches with sequences that have a similar
mutational bias but not particularly close evolutionary relationship.
In general 'fasta' is better for DNA - DNA searches
(ie for non-coding DNA sequence) than blast.
Fasta accesses the databases directly while blast
requires them to be converted into special compressed binary encoded files
- this is one reason why blast tends to be faster than fasta for comparable
searches. This has the advantage of allowing you to search specific segments
or divisions of the DNA databases. Only Fungi or only prokaryotes, for
Use the GCG version of fasta to find potential yeast homologs of the E.
coli recA gene using the DNA sequence.
FastA does a Pearson and Lipman search for similarity between a
query sequence and a group of sequences of the same type (nucleic acid
or protein)? For nucleotide searches, FastA may be more sensitive than
FASTA with what query sequence ?em:v00328 Begin
(* 1 *) ?
End (* 1391 *) ?
Search for query in what sequence(s) (* GenEMBL:* *) ? em_fun:sc*
What word size (* 6 *) ?
Don't show scores whose E() value exceeds: (* 2.0 *): 10
What should I call the output file (* v00328.fasta *)?
1 Sequences 2,167bp searched EM_FUN:SC
101 Sequences 569,762bp searched EM_FUN:SC13615
201 Sequences 821,511bp searched EM_FUN:SC20641
301 Sequences 1,175,052bp searched EM_FUN:SC30613
Unfortunately, this does not find the known (but distant) homologue
of recA the RAD51 gene in yeast (sw:ra51_yeast) because there are so many
gaps in the alignment of the two sequences.
If you were to do a blast search against, say, SwissProt to find this
yeast homologue you'd have to ignore the first 80+ hits because recA is
amongst the most widely sequenced of prokaryotic proteins. This can be
a real problem when you already know of the obvious relationships with
your protein of choice. In this particular case we are in better shape
because the NCBI has created a blast-searchable DB specifically for Saccharomyces
cerevisiae (yeast) presumably because it is a) a standard genetic organism
and b) the first completely sequenced eukaryotic genome. See exercise 5
below for another approach.
GCG - netblast
Use the GCG internet version of blast to find homologs of the Bordetella
pertussis recA gene using the protein sequence. Or some other
more interesting gene of your own choice !
% netblast sw:reca_borpe
NetBLAST searches for seqs similar to a query sequence. The query
and the database searched can be either peptide or nucleic acid in any
combination. NetBLAST can search only databases maintained at the National
Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.
Search for query in what sequence database:
p Non-redundant GB CDS translations+PDB+SwissProt+PIR
2) pdb p
PDB protein sequences
3) swissprot p SwissProt sequences
4) yeast p Saccharomyces
cerevisiae protein sequences
5) kabat p Kabat Sequences
of Proteins of Immunological Interest
6) alu p
Translations of Select Alu Repeats from REPBASE
7) month p All new or
revised GB CDS translation+PDB+SwissProt+PI
n Non-redundant GenBank+EMBL+DDBJ+PDB seqs (but no EST's
9) pdb n
PDB nucleotide sequences
10) vector n Vector subset of GenBank
11) yeast n Saccharomyces cerevisiae
genomic nucleotide sequences
12) est n Non-redundant
Database of GenBank+EMBL+DDBJ ESTs
13) sts n Non-redundant
Database of GenBank+EMBL+DDBJ STSs
14) htgs n High Throughput
15) mito n Database of
mitochondrial seqs, Rel. 1.0, July 1995
16) kabat n Kabat Seqs of Nucleic
Acid of Immunological Interest
17) epd n Eukaryotic
18) alu n Select
Alu Repeats from REPBASE
19) month n All new or revised
GB+EMBL+DDBJ+PDB sequences released
Please choose one (* 1 *): 3
Ignore hits expected to occur by chance more than (* 10.0 *) times?
Limit the number of sequences in my output to (* 250 *) ?
What should I call the output file (* reca_borpe.blastp *) ?
Done. Wrote search results to reca_yeast.blastp
% more reca.blastp
BLASTP 1.4.11 [24-Nov-97] [Build 24-Nov-97] Reference: Altschul,
Stephen F? Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990). Basic local alignment search tool.
J. Mol. Biol.
215:403-10. Query= reca_borpe.sw ID RECA_BORPE STANDARD;
PRT; 352 AA.
Database: Non-redundant SwissProt sequences
sequences; 26,848,718 total letters? Searching...................................................done
Sequences producing High-scoring Segment Pairs:
Score P(N) N
sp|P17740|RECA_BORPE RECA PROTEIN
1744 5.5e-229 1
sp|P19690|RECA_BURCE RECA PROTEIN
1193 8.7e-180 2
sp|P18076|RECA_METFL RECA PROTEIN
1115 3.8e-167 2
sp|P24542|RECA_METCL RECA PROTEIN
1094 7.7e-165 2
sp|O32377|RECA_CHRVI RECA PROTEIN
1102 2.6e-161 2
sp|Q05358|RECA_LEGPN RECA PROTEIN
1056 1.7e-157 2
sp|P27865|RECA_RHIME RECA PROTEIN
1072 1.1e-156 2
sp|P24543|RECA_RHILP RECA PROTEIN
1057 3.8e-156 2
sp|P33156|RECA_AGRTU RECA PROTEIN
1063 5.2e-156 2
sp|P24074|RECA_RHILV RECA PROTEIN
1054 1.3e-155 2
sp|P16238|RECA_THIFE RECA PROTEIN
1039 1.1e-153 2
sp|P21152|RECA_NEIGO RECA PROTEIN
1033 3.3e-152 2
The first part of the output displays in order of statistical significance
the 'hits' each line having the following information: database|accession
number|database_name a brief description, an absolute score, the probability
that you would get a match as good as this by chance - in all these cases
this probability is vanishingly small so the 'hits' are likely to be statistically
(and biologically) significant, and finally the number of high scoring
segment pairs. GCG netblast is using an outmoded gap-free blast algorithm,
so most of the hits come in two chunks:
>SP|P19690|RECA-BURCE RECA PROTEIN
Length = 347
Score = 232 (105.0 bits), Expect = 8.7e-180, Sum P(2) = 8.7e-180
Identities = 48/55 (87%), Positives = 50/55 (90%)
Query: 11 AEKAKALAAALSQIEKQFGKGSIMRYGDNEVEHDIQVVSTGSLGLDIALGVGGLP
AEK+KALAAAL+QIEKQFGKGSIMR GD E DIQVVSTGSLGLDIALGVGGLP
Sbjct: 3 AEKSKALAAALAQIEKQFGKGSIMRMGDGEAAEDIQVVSTGSLGLDIALGVGGLP
Score = 1193 (539.7 bits), Expect = 8.7e-180, Sum P(2) = 8.7e-180
Identities = 228/271 (84%), Positives = 258/271 (95%)
Query: 68 VIEVYGPESSGKTTLTLQVIAEMQKLGGTCAFVDAEHALDVQYASKLGVNLTDLLISQPD
V+E+YGPESSGKTTLTLQVIAE+QKLGGT AF+DAEHALDVQYA+KLGVN+ +LLISQPD
Sbjct: 61 VVEIYGPESSGKTTLTLQVIAELQKLGGTAAFIDAEHALDVQYAAKLGVNVPELLISQPD
Query: 128 TGEQALEITDALVRSGSVDLIVIDSVAALVPKAEIEGEMGDSLPGLQARLMSQALRKLTA
Sbjct: 121 TGEQALEITDALVRSGSIDMIVIDSVAALVPKAEIEGEMGDSLPGLQARLMSQALRKLTG
Query: 188 TIKRTNCMVIFINQIRMKIAVMFGNPETTTGGNALKFYSSVRLDIRRIGAIKKGDEVVGN
TIKRTNC+VIFINQIRMKI VMFGNPETTTGGNALKFYSSVRLDIRRIG+IKK DEV+GN
Sbjct: 181 TIKRTNCLVIFINQIRMKIGVMFGNPETTTGGNALKFYSSVRLDIRRIGSIKKNDEVIGN
Query: 248 ETRVKVVKNKVAPPFKQAEFDIMYGSGISREGEIIDLGVQANVVDKSGAWYSYSGNRIGQ
ETRVKVVKNKV+PPF++A FDI+YG GISR+GEIIDLGVQA +VDK+GAWYSY+G +IGQ
Sbjct: 241 ETRVKVVKNKVSPPFREAIFDILYGEGISRQGEIIDLGVQAKIVDKAGAWYSYNGEKIGQ
Query: 308 GKDNVREYLKEHKEMAIEIENKVRENQGIVS 338
GKDN RE+L+E+ E+A EIEN++RE+ G+V+
Sbjct: 301 GKDNAREFLRENPEIAREIENRIRESLGVVA 331
After the list of hits there is a display of the alignments on which
the hits are based. As an aside you may notice that, for the Query sequence,
the first chunk above ends with residue 65 and the second begins with residue
68, while for the Sbjct sequence the first chunk ends with residue 57 and
the second begins with residue 61. The reca_borpe.sw sequence is one residue
shorter than most other recA sequences at this point and the old blastp
is unable to bridge the gap.
Careful scrutiny of the output will tell you that GCG Netblast is pointing
at an NCBI server which is still running the 'gapless' Blast 1.4. This
server is quite unreliable and may become intermittently unavailable after
the USA wakes up.
Output from http://www.ncbi.nlm.nih.gov/BLAST
Here below is the first part of the output from a search for the yeast
homologue of the prokaryotic recaA gene, using sw:reca_ecoli as query sequence.
Altschul, Stephen F. Thomas L. Madden, Alejandro A. Schäffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs", Nucleic Acids Res. 25:3389-3402.
Sequences producing significant alignments:
gi|603333 (U18839) Rad51p: RecA-like protein [Saccharomyces
cer.] 52 1e-07
gi|603420 (U18922) Dmc1p: DNA repair protein [Saccharomyces
cer.] 40 8e-04
gi|642809 (Z48008) Rad57p [Saccharomyces cerevisiae]
gnl|PID|e252186 (Z75270) ORF YOR362c [Saccharomyces cerevisiae]
gi|500681 (U10398) Yhr131cp [Saccharomyces cerevisiae]
The hits are reported as gi numbers - a unique identifier for the Genbank
seqience, followed by a Genbank/EMBL accession number, then a brief diescriptor,
then a score in bits (which should be independent of the scoring matrix)
and finally an E value. This lat is the same as the -exp parameter
and tellsd you the number of hitys you would expect in the searched database
by chance alone. 1e-07 is a small number so the match is likely to be statistically
(and biologically!) significant.
Try running GCG netblast entirely from the command line, explicitly choosing
all appropriate parameters.
% netblast sw:cas1_bovin swissprot -lis=50 -seg=20 -fil=xs
-mat=pam40 -exp=0.001 -out=bovin.blastp
You can thus more easily run the several necessary blast searches and
keep track of the combination of options that you have used.
Do a blast search on the NCBI Basic Blast server and on the EBI Blast2
server with using the default settings and swissprot as the searched database.
Use any (fasta format) protein sequence? Do you get the same results? Can
you account for any obvious differences? Hints: are the database versions
the same? Are the default parameters the same?
Do an explicit comparison on one server
with low complexity masking (the default at NCBI) and
Use Wheat Gamma Gliadin (sw:P04726). Why do you get many more hits under
the latter search? What feature do many of these hits have in common?