2nd PROSITE Workshop,

27-29 June 2001

IGS-Marseille




Workshop Project

!!! MORE RULES IN PROSITE !!!

Purpose

Below are the criteria used to accept or reject matches, for each type of motif descriptor, when producing the InterPro hit list:

Prosite Profiles - Only matches whose score exceeds the so-called LEVEL=0 cutoff are accepted. Each match is evaluated on its own: the fact that a domain matches a protein several times is not taken into account.
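As an illustration only, the per-match rule described above can be sketched in a few lines of Python. The function name, the scores and the cutoff value are hypothetical, and treating "exceeding" as a strict inequality is an assumption; this is not the code of any Prosite tool.

```python
from typing import List


def accept_profile_matches(scores: List[float], level0_cutoff: float) -> List[float]:
    """Keep each match whose score exceeds the cutoff, independently of the others.

    Assumption: "exceeding" is read as strictly greater than the cutoff.
    """
    return [s for s in scores if s > level0_cutoff]


# Hypothetical scores for two proteins and a hypothetical LEVEL=0 cutoff of 8.5.
single_strong_match = [9.2]
three_weak_repeats = [6.1, 5.8, 6.4]

print(accept_profile_matches(single_strong_match, 8.5))  # [9.2]
print(accept_profile_matches(three_weak_repeats, 8.5))   # []
```

The second call shows the limitation at stake: three sub-threshold repeats are all rejected, although their co-occurrence on one protein could be considered as evidence.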

Pfam HMMs - The computation that decides whether the match(es) of a motif on a protein are accepted is quite intricate. The raw score of every match and the sum of the scores of the individual matches are first re-evaluated by the search program to take into account the composition of the matched protein (this step can be skipped with the --null2 option). Then, the acceptance of matches relies upon the TWO cutoffs declared on the GA line of every Pfam entry: the first one applies to the sum of all matches (the per-protein cumulated score), the second one to the highest-scoring individual match (the per-match score). According to the HMMER2.0 manual, the cutoff for accepting the remaining individual matches is set to zero by default... but the actual behaviour of hmmsearch might be different!
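The two-cutoff decision can be sketched as follows, strictly as described in the paragraph above. This is not HMMER code: the function name, parameter names and scores are hypothetical, the composition correction is not modelled (scores are assumed to be already corrected), and the zero default for the remaining matches is the behaviour attributed to the manual, which may differ from what hmmsearch actually does.

```python
from typing import List


def accept_pfam_matches(
    scores: List[float],
    ga_protein: float,              # first GA cutoff: sum of all match scores
    ga_match: float,                # second GA cutoff: highest-scoring match
    remaining_cutoff: float = 0.0,  # assumed default for the other matches
) -> List[float]:
    """Return the accepted match scores of one motif on one protein."""
    if not scores:
        return []
    # Index of the highest-scoring individual match.
    best = max(range(len(scores)), key=scores.__getitem__)
    # Both GA cutoffs must be reached, otherwise nothing is accepted.
    if sum(scores) < ga_protein or scores[best] < ga_match:
        return []
    # The best match is kept; the others need only reach the remaining cutoff.
    return [s for i, s in enumerate(scores) if i == best or s >= remaining_cutoff]


# Hypothetical example: one strong match rescues a weaker repeat.
print(accept_pfam_matches([12.0, 3.5, -1.0], ga_protein=10.0, ga_match=8.0))
```

With these hypothetical values the match scoring 3.5 is accepted because the protein as a whole passes both GA cutoffs, whereas a per-match cutoff alone would have rejected it.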

The Pfam strategy is clearly advantageous for the detection of domains that usually occur repeated in a protein. The goal of this workshop is to write the specifications (part of the user manual) of a new type of Prosite rule that would improve the detection of multiple domains in a protein. There are three points to work out:

- The envisaged modification to the Prosite format.
- The strategy used to combine the matches.
- Extensive case studies to evaluate the appropriateness of the envisaged strategy.

The following miscellaneous points might fuel the debate:

The Pfam strategy is actually problematic on one point. "Scanning" or "searching" several proteins with several HMMs may result in different lists of matches. This is basically an implementation problem, but new rules that suffer from this kind of ambiguity have to be rejected.

Another weakness of the Pfam strategy is that it can only deal with multiple occurrences of the same motif. A domain described by two sub-domains would strongly benefit from a co-evaluation of the scores.

It must be stressed that all these considerations are "proteinocentric". The concept of multiple matches on a DNA sequence is not very relevant, because the size of the molecules is usually fixed quite arbitrarily in databases. The "proximity" of two matches should be taken into account.

Contact

Marco.Pagni