Homology Searching

Contrary to widely held belief, "homology searching", "Blast" and "the NCBI basic blast server" are not all synonyms ! The GCG programs that you will use for homology searching are blast netblast and fasta. We will also be using some WWW blast servers. 

Blast finds sequences in a database that are similar to a query sequence. It uses the method of Altschul et al.(J. Mol. Biol. 215; 403-410 (1990)). The query sequence and the database to be searched can be either peptide or nucleotide in any combination

  • blastp; compares a protein sequence against a protein DB 
  • blastn; compares a NA sequence against a NA DB 
  • blastx; dynamically translates a NA query in all 6 reading frames and compares it to a protein DB 
  • tblastn; compares a protein query against a NA database translated in all six reading frames 
  • tblastx; translates both input NA and NA database 6x and compares ! 
So, for example, you can search your peptide sequence against the Genbank or EMBL DNA database translated in all six reading frames. This might be particularly appropriate if you wanted to find homologues in the EST (Expressed Sequence Tags) database. 

Blast finds regions of relative similarity between your query sequence and a given database. These regions of similarity are known as segment pairs. The segment pairs found have the property that the sum of the scoring matrix of their constituent symbol pairs is above what you may expect to occur by chance. 

Homology searching is one of the most computationally intensive things you can do in bioinformatics. Care should be taken to ensure whether local computers are able to deal with your search in an effective time-scale. You can use such UNIX based protocols as uptime or top(if installed) to try to determine if running a blast search to the current system load is likely to slow everything down unacceptably. GCG 'netblast' enables you to carry out your searches on a dedicated remote blast server, this could be recommended if: 

  1. you have good network connections to the USA, 
  2. if your local server is small or underpowered, 
  3. if you need access to daily updated information and 
  4. if you choose not to work when America is "awake". 
Alternatively, there are a variety of WWW based servers which will do homology searches for you. You will compare and contrast the following web servers:  as well as GCG blast and GCG netblast programs. 

Options and parameters

All the homology searching programs come with several options and choices to tailor a search to yield the results you want. Some of these parameters change the quality of the search while others merely change the amount of information (eg number of hits) that is returned to you. 

It is perfectly legitimate to run a homology search taking the default parameters: those defaults have been chosen to achieve the "greatest good for the greatest number". If your query sequence turns out to be a member of a well established and widely sequenced family and you merely want to establish this fact, then you will almost certainly get the same result regardless of what options you choose. For less trivial searches/questions, however, such as finding distant relationships, one homology search is almost certainly inadequate. 

  1. Running searches with different substitution matrices will help you find different sorts of hits: PAM30 will preferentially find homologues that are evolutionarily close - strong short matching segments; on the other hand PAM250 will tend to find long, weak, diffuse matches typical of more distantly related proteins. GCG option: -mat=PAM30 
  2. You will possibly find it useful to run searches both with/without filtering out repeated sequence or employing low-complexity masking: these options XXXX out regions of the protein which are particularly biassed - proline or histidine rich for example. The defaults on the blast servers at NCBI and the EBI for masking are not the same! GCG option: -fil =xs. GCG 9 incorporates two self-standing programs xnu and seg which allow you to carry out two sorts of filtering on protein sequences. 
  3. For very short sequences blast may not return any hits at all unless the Expectation Cutoff is set very high (E=1000 is the maximum). GCG option: -exp=1000. Alternatively you may wish to exclude hits other than those that a highly significant biologically. GCG option: -exp=0.001 
  4. Defaults are also set to determine the amount of output delivered: 
    1. the number of hits reported. GCG option:-lis=500 
    2. the number of alignments (segments) based on these matches that are shown. GCG option:-seg=100 
You may wish to change these either to avoid a deluge of output or to get beyond obvious matches to those with lower scores that are potentially exciting. 

For the WWW servers you should be able to find the equivalent switches, using the hypertext-linked help information for each option. For obvious reasons the number of available options is limited on the Basic Blast page. 

For reproducibility you should note the time, date and database version and size against which you carried out the homology search, this information should appear at the head of the output 

Hints and suggestions

  1. If your DNA sequence is coding, then translate it and use blast. For one thing, mutational biases at silent third positions could give you spuriously good matches with sequences that have a similar mutational bias but not particularly close evolutionary relationship. 
  2. In general 'fasta' is better for DNA - DNA searches (ie for non-coding DNA sequence) than blast. 
  3. Fasta accesses the databases directly while blast requires them to be converted into special compressed binary encoded files - this is one reason why blast tends to be faster than fasta for comparable searches. This has the advantage of allowing you to search specific segments or divisions of the DNA databases. Only Fungi or only prokaryotes, for example 

GCG -Fasta

Use the GCG version of fasta to find potential yeast homologs of the E. coli recA gene using the DNA sequence. 
% fasta

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein)? For nucleotide searches, FastA may be more sensitive than BLAST 

FASTA with what query sequence ?em:v00328 Begin (* 1 *) ?
End (* 1391 *) ?
Search for query in what sequence(s) (* GenEMBL:* *) ? em_fun:sc*
What word size (* 6 *) ?
Don't show scores whose E() value exceeds: (* 2.0 *): 10
What should I call the output file (* v00328.fasta *)?

  1 Sequences 2,167bp searched EM_FUN:SC
101 Sequences 569,762bp searched EM_FUN:SC13615
201 Sequences 821,511bp searched EM_FUN:SC20641
301 Sequences 1,175,052bp searched EM_FUN:SC30613
Etc etc 

Unfortunately, this does not find the known (but distant) homologue of recA the RAD51 gene in yeast (sw:ra51_yeast) because there are so many gaps in the alignment of the two sequences. 

If you were to do a blast search against, say, SwissProt to find this yeast homologue you'd have to ignore the first 80+ hits because recA is amongst the most widely sequenced of prokaryotic proteins. This can be a real problem when you already know of the obvious relationships with your protein of choice. In this particular case we are in better shape because the NCBI has created a blast-searchable DB specifically for Saccharomyces cerevisiae (yeast) presumably because it is a) a standard genetic organism and b) the first completely sequenced eukaryotic genome. See exercise 5 below for another approach. 

GCG - netblast

Use the GCG internet version of blast to find homologs of the Bordetella pertussis recA gene using the protein sequence. Or some other more interesting gene of your own choice ! 

% netblast sw:reca_borpe 

NetBLAST searches for seqs similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA. 

Search for query in what sequence database: 

 1) nr          p Non-redundant GB CDS translations+PDB+SwissProt+PIR
 2)   pdb       p PDB protein sequences
 3)   swissprot p SwissProt sequences
 4) yeast       p Saccharomyces cerevisiae protein sequences
 5) kabat       p Kabat Sequences of Proteins of Immunological Interest
 6) alu         p Translations of Select Alu Repeats from REPBASE
 7) month       p All new or revised GB CDS translation+PDB+SwissProt+PI
 8) nr          n Non-redundant GenBank+EMBL+DDBJ+PDB seqs (but no EST's
 9)   pdb       n PDB nucleotide sequences
10)   vector    n Vector subset of GenBank
11) yeast       n Saccharomyces cerevisiae genomic nucleotide sequences
12) est         n Non-redundant Database of GenBank+EMBL+DDBJ ESTs
13) sts         n Non-redundant Database of GenBank+EMBL+DDBJ STSs
14) htgs        n High Throughput Genomic Sequences
15) mito        n Database of mitochondrial seqs, Rel. 1.0, July 1995
16) kabat       n Kabat Seqs of Nucleic Acid of Immunological Interest
17) epd         n Eukaryotic Promotor Database
18) alu         n Select Alu Repeats from REPBASE
19) month       n All new or revised GB+EMBL+DDBJ+PDB sequences released

Please choose one (* 1 *): 3 

Ignore hits expected to occur by chance more than (* 10.0 *) times? 10 

Limit the number of sequences in my output to (* 250 *) ? 

What should I call the output file (* reca_borpe.blastp *) ? reca.blastp 

Sending query...
Awaiting results...
Done. Wrote search results to reca_yeast.blastp

% more reca.blastp

BLASTP 1.4.11 [24-Nov-97] [Build 24-Nov-97] Reference: Altschul, Stephen F? Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 
215:403-10. Query=  reca_borpe.sw ID RECA_BORPE STANDARD; PRT; 352 AA.
        (352 letters) 

Database:  Non-redundant SwissProt sequences
           74,596 sequences; 26,848,718 total letters? Searching...................................................done 

                                                         High  Probability
Sequences producing High-scoring Segment Pairs:          Score  P(N)     N 

sp|P17740|RECA_BORPE RECA PROTEIN                        1744  5.5e-229  1
sp|P19690|RECA_BURCE RECA PROTEIN                        1193  8.7e-180  2
sp|P18076|RECA_METFL RECA PROTEIN                        1115  3.8e-167  2
sp|P24542|RECA_METCL RECA PROTEIN                        1094  7.7e-165  2
sp|O32377|RECA_CHRVI RECA PROTEIN                        1102  2.6e-161  2
sp|Q05358|RECA_LEGPN RECA PROTEIN                        1056  1.7e-157  2
sp|P27865|RECA_RHIME RECA PROTEIN                        1072  1.1e-156  2
sp|P24543|RECA_RHILP RECA PROTEIN                        1057  3.8e-156  2
sp|P33156|RECA_AGRTU RECA PROTEIN                        1063  5.2e-156  2
sp|P24074|RECA_RHILV RECA PROTEIN                        1054  1.3e-155  2
sp|P16238|RECA_THIFE RECA PROTEIN                        1039  1.1e-153  2
sp|P21152|RECA_NEIGO RECA PROTEIN                        1033  3.3e-152  2

etc. etc. 

The first part of the output displays in order of statistical significance the 'hits' each line having the following information: database|accession number|database_name a brief description, an absolute score, the probability that you would get a match as good as this by chance - in all these cases this probability is vanishingly small so the 'hits' are likely to be statistically (and biologically) significant, and finally the number of high scoring segment pairs. GCG netblast is using an outmoded gap-free blast algorithm, so most of the hits come in two chunks: 


            Length = 347 

 Score = 232 (105.0 bits), Expect = 8.7e-180, Sum P(2) = 8.7e-180
 Identities = 48/55 (87%), Positives = 50/55 (90%) 


 Score = 1193 (539.7 bits), Expect = 8.7e-180, Sum P(2) = 8.7e-180
 Identities = 228/271 (84%), Positives = 258/271 (95%) 





            GKDN RE+L+E+ E+A EIEN++RE+ G+V+

After the list of hits there is a display of the alignments on which the hits are based. As an aside you may notice that, for the Query sequence, the first chunk above ends with residue 65 and the second begins with residue 68, while for the Sbjct sequence the first chunk ends with residue 57 and the second begins with residue 61. The reca_borpe.sw sequence is one residue shorter than most other recA sequences at this point and the old blastp is unable to bridge the gap. 

Careful scrutiny of the output will tell you that GCG Netblast is pointing at an NCBI server which is still running the 'gapless' Blast 1.4. This server is quite unreliable and may become intermittently unavailable after the USA wakes up. 

Output from http://www.ncbi.nlm.nih.gov/BLAST

Here below is the first part of the output from a search for the yeast homologue of the prokaryotic recaA gene, using sw:reca_ecoli as query sequence. 

Altschul, Stephen F. Thomas L. Madden, Alejandro A. Schäffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. 

                                                                  Score     E
Sequences producing significant alignments:                      (bits)  Value
gi|603333  (U18839) Rad51p: RecA-like protein [Saccharomyces cer.] 52    1e-07
gi|603420  (U18922) Dmc1p: DNA repair protein [Saccharomyces cer.] 40    8e-04
gi|642809  (Z48008) Rad57p [Saccharomyces cerevisiae]              34    0.061
gnl|PID|e252186  (Z75270) ORF YOR362c [Saccharomyces cerevisiae]   32    0.14
gi|500681  (U10398) Yhr131cp [Saccharomyces cerevisiae]            28    2.7 

The hits are reported as gi numbers - a unique identifier for the Genbank seqience, followed by a Genbank/EMBL accession number, then a brief diescriptor, then a score in bits (which should be independent of the scoring matrix) and finally an E value. This lat is the same as the -exp parameter and tellsd you the number of hitys you would expect in the searched database by chance alone. 1e-07 is a small number so the match is likely to be statistically (and biologically!) significant. 

  1. Try running GCG netblast entirely from the command line, explicitly choosing all appropriate parameters.
  2. % netblast sw:cas1_bovin swissprot -lis=50 -seg=20 -fil=xs -mat=pam40 -exp=0.001 -out=bovin.blastp

    You can thus more easily run the several necessary blast searches and keep track of the combination of options that you have used. 

  3. Do a blast search on the NCBI Basic Blast server and on the EBI Blast2 server with using the default settings and swissprot as the searched database. Use any (fasta format) protein sequence? Do you get the same results? Can you account for any obvious differences? Hints: are the database versions the same? Are the default parameters the same? 
  4. Do an explicit comparison on one server 
    1. with low complexity masking (the default at NCBI) and 
    2. without. 
  5. Use Wheat Gamma Gliadin (sw:P04726). Why do you get many more hits under the latter search? What feature do many of these hits have in common?