This page centralizes all the code information related to the PAVIE/Bioinformatics project, a collaboration between PAVIE (Jacques-Antoine Gauthier and Eric Widmer) and the Swiss Institute of Bioinformatics (Philipp Bucher, Cedric Notredame).

## Project #1: Matrix Training

### Accessing and running the algorithms

- Install the latest T-Coffee distribution in which saltt is incorporated
- Use the documentation along with the following sample sequences

### Training Matrices

- Your sequences must be in FASTA format. The general form of the command is:
  seq_reformat -in *your sequences* -action +pavie_seq2pavie_mat [_IDXX_TWEXX[THRid]_[CHANNELn]
- The program runs iteratively. It starts with the ID matrix and updates its matrix until convergence is reached.

**Weight: _TWEXX_**

- _TWE00_: No weighting
- _TWE01_: Default, n_id_pairs/n_match
- _TWE02_: n_id_pairs/aln_length
- _TWE03_: n_id_pairs/MIN(length seq1, length seq2) **Default**
- _TWE04_: n_id_pairs/MAX(length seq1, length seq2) **Default**
- _TWE05_: score as measured with the matrix
- simweight: **NOT SUPPORTED ANYMORE**

**_THRid_**

- _THRid_: id is a threshold that filters alignments. For instance, with THR50, the program will only extract the counts from alignments with more than 50% ID or SIM (depending on the weight mode). The purpose of this filter is to discard bad alignments that should not contribute at all to the final matrix

**_SAMPLEn_**

- _SAMPLEn_: n is the number of random pairs sampled (without removal) at each training round. The pairs change at each round
- _SAMPLE100_: will randomly select 100 pairs (different at each round)
- _SAMPLE0_: will select all the pairs (Default)

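The pair sampling and the threshold filter can be pictured with a short Python sketch. It is not the T-Coffee code: the function names are hypothetical, and it assumes that "without removal" means the pool of pairs is left intact, so the same pair may be drawn more than once in a round.

```python
import itertools
import random


def all_pairs(n_sequences):
    """Every unordered pair (i, j) of sequence indices with i < j."""
    return list(itertools.combinations(range(n_sequences), 2))


def sample_pairs(n_sequences, n, rng):
    """Pairs used for one training round.

    n == 0 mimics _SAMPLE0_ (all the pairs); n > 0 mimics _SAMPLEn_.
    Assumption: drawn pairs stay in the pool, so repeats are possible.
    """
    pool = all_pairs(n_sequences)
    if n == 0:
        return pool
    return [rng.choice(pool) for _ in range(n)]


def passes_threshold(percent_id, threshold):
    """Mimic _THRid_: keep alignments with MORE than `threshold` percent ID."""
    return percent_id > threshold


rng = random.Random(0)
round_1 = sample_pairs(5, 10, rng)  # 10 pairs for the first round
round_2 = sample_pairs(5, 10, rng)  # a fresh draw for the next round
```

Calling `sample_pairs` once per round with the same `rng` gives a different draw each time, which matches the description that the pairs change at each round.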
**Identity Measure: _IDX_**

- _ID01_: Default, n_id_pairs/n_match
- _ID02_: n_id_pairs/aln_length
- _ID03_: n_id_pairs/MIN(length seq1, length seq2)
- _ID04_: n_id_pairs/MAX(length seq1, length seq2)
- _ID05_: score as measured with the matrix **[Not implemented]**

**_MCHSCOREn_: Multichannel scoring scheme**

- MCHSCORE0: average over all the channels [Default]
- MCHSCORE1: Minimum over all the channels
- MCHSCORE2: Maximum over all the channels

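As an illustration of the ratios listed above, the following Python sketch computes the _ID01_ to _ID04_ measures for one pairwise alignment and combines per-channel scores in the three MCHSCORE modes. It is a sketch under assumptions, not the T-Coffee implementation: the function names are hypothetical, gaps are assumed to be written `-`, and n_match is assumed to count the columns in which neither string has a gap.

```python
def identity_measures(aln1, aln2, gap="-"):
    """Return the ID01..ID04 ratios for two aligned strings of equal length."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned strings must have the same length")
    columns = list(zip(aln1, aln2))
    # Assumption: a "match" is a column where neither string has a gap.
    n_match = sum(1 for a, b in columns if a != gap and b != gap)
    n_id_pairs = sum(1 for a, b in columns if a == b and a != gap)
    len1 = sum(1 for a in aln1 if a != gap)  # ungapped length of seq1
    len2 = sum(1 for b in aln2 if b != gap)  # ungapped length of seq2
    shortest, longest = min(len1, len2), max(len1, len2)
    return {
        "ID01": n_id_pairs / n_match if n_match else 0.0,
        "ID02": n_id_pairs / len(columns) if columns else 0.0,
        "ID03": n_id_pairs / shortest if shortest else 0.0,
        "ID04": n_id_pairs / longest if longest else 0.0,
    }


def combine_channels(scores, mode=0):
    """MCHSCORE0: average, MCHSCORE1: minimum, MCHSCORE2: maximum."""
    if not scores:
        raise ValueError("at least one channel score is required")
    if mode == 0:
        return sum(scores) / len(scores)
    if mode == 1:
        return min(scores)
    if mode == 2:
        return max(scores)
    raise ValueError("mode must be 0, 1 or 2")


measures = identity_measures("abcd-f", "abzdef")
```

In this example the alignment has 6 columns, 5 ungapped columns and 4 identical pairs, so _ID01_ and _ID03_ are 4/5 while _ID02_ and _ID04_ are 4/6.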
**_CHANNELn_: multiple channel strings**

- _CHANNELn_: n is the number of channels in the FASTA sequences.
- _CHANNELn_: By default, n is set to 1

**Principle for multiple channel matrix training**

- One can use as many channels as required.
- Alphabets are totally independent (i.e. each channel can use the same alphabet)
- All the channels are compared simultaneously and one matrix per channel is estimated
- Using pavie_seq2pavie_aln, it is possible to use multiple channel matrices to align multiple channel strings

**Rules for multiple channel strings**

- All the channels are in the same FASTA file
- All the strings of channel X are grouped
- All the channels must contain EXACTLY the same number of strings
- String N of channel X CORRESPONDS to string N of channel Y
- String N of channel X must have the same length as string N of channel Y

**Adding the age in the multichanneling**

- The age can be used as a channel
- 36 symbols (A-Z0-9) are available

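To give an idea of what an age channel could look like, here is a rough Python sketch based on the recoding this page describes for the age channels (decades in one channel, years in another, with A=0...J=9). The function names are hypothetical. The sketch assumes that each position of a sequence is one year, counted from a first-year offset, and it only covers ages 0 to 99. The actual recoding is done by seq_reformat and may differ.

```python
import string

DIGIT_SYMBOLS = string.ascii_uppercase[:10]  # "ABCDEFGHIJ": A=0 ... J=9


def encode_age(age):
    """Return (decade symbol, year symbol) for an age between 0 and 99."""
    if not 0 <= age <= 99:
        raise ValueError("this sketch only handles ages 0..99")
    decade, year = divmod(age, 10)
    return DIGIT_SYMBOLS[decade], DIGIT_SYMBOLS[year]


def age_channels(length, first_year=0):
    """Build the decade and year strings for `length` yearly positions.

    Assumption: position i of the sequence corresponds to first_year + i.
    """
    pairs = [encode_age(first_year + i) for i in range(length)]
    decades = "".join(d for d, _ in pairs)
    years = "".join(y for _, y in pairs)
    return decades, years


decades, years = age_channels(4, first_year=18)  # ages 18, 19, 20, 21
```

Both strings have the same length as the sequence they annotate, which is what the multiple channel rules require.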
**Example of Multi-Channel FASTA file (names are arbitrary): File: myseq.fasta**

    >string1.channel1
    abcdef
    >string2.channel1
    ab
    >string3.channel1
    abc
    >string1.channel2
    abzeff
    >string2.channel2
    ef
    >string3.channel2
    fxx
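The rules for multiple channel strings can be checked before training with a small script. The following is a minimal Python sketch, independent of T-Coffee, with hypothetical helper names. It assumes the number of channels is known, as with the _CHANNELn_ option, and that the strings of each channel are grouped in file order.

```python
def parse_fasta(text):
    """Return the (name, sequence) records of a FASTA text, in file order."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def split_channels(records, n_channels):
    """Split grouped records into channels and check the rules."""
    if n_channels < 1 or len(records) % n_channels != 0:
        raise ValueError("all channels must contain the same number of strings")
    per_channel = len(records) // n_channels
    channels = [
        records[i * per_channel:(i + 1) * per_channel]
        for i in range(n_channels)
    ]
    # String N of every channel must have the same length.
    for n in range(per_channel):
        lengths = {len(channel[n][1]) for channel in channels}
        if len(lengths) != 1:
            raise ValueError(f"string {n + 1} differs in length across channels")
    return channels


MYSEQ = """>string1.channel1
abcdef
>string2.channel1
ab
>string3.channel1
abc
>string1.channel2
abzeff
>string2.channel2
ef
>string3.channel2
fxx
"""

channels = split_channels(parse_fasta(MYSEQ), 2)
```

With the example file, `channels` holds two groups of three strings, and string N of channel 1 has the same length as string N of channel 2.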

**Examples of command lines**

- Defaults: without any extra parameter THR0_SAMPLE0 is assumed. The weight mode must be specified

**EXAMPLE:** seq_reformat -in myseq.fasta -action +pavie_seq2pavie_mat _SAMPLE10_TWE00_CHANNEL2_

**EXAMPLE:** seq_reformat -in myseq.fasta -action +pavie_seq2pavie_mat _SAMPLE10_TWE00_

**EXAMPLE:** seq_reformat -in myseq.fasta -action +pavie_seq2pavie_mat _TWE01_THR40_SAMPLE20_

**EXAMPLE:** seq_reformat -in myseq.fasta -action +pavie_seq2pavie_mat _THR40_SAMPLE20_TWE02_

**EXAMPLE:** seq_reformat -in myseq.fasta -action +pavie_seq2pavie_mat _THR40_SAMPLE20_TWE02_MCHSCORE1_

- The training procedure outputs two series of files:
  - Matrix file: *pavie_matrix.ch_#.cy_#.pavie_mat*, which contains a matrix describing the specified channel
  - Matrix list file: *pavie_matrix.cy_#.mat_list*, which contains a list of matrix files
  - The matrix list file can be used to compute alignments (next section)

### Using the age as a channel

It is possible to use the age as a channel. This simply requires generating two extra channels that will be used to encode the age, along with the associated substitution matrices.

**EXAMPLE:** seq_reformat -in myseq.fasta -output pavie_age_channel -out xyz

- xyz_pavie_age_matrix.mat_list is a mat list that must be concatenated to the other mat list (cf. multi-channeling)
- xyz_pavie_age_matrix.mat_list is a mat list that must be concatenated to the other sequence file (cf. multi-channeling)
- xyz_age_channel.fasta contains the recoded sequences (decades: channel1, years: channel2). These sequences must be concatenated to the other channels, as indicated in the channel section

**Note:** In the age_sequences, A=0...J=9

**Note:** It is possible to set the year corresponding to the first symbol of a sequence in the header:

    >name _FIRSTYEARXX_

where XX will be used as the offset of the first year

**Note:** Gaps are ignored

### Validation of the Training Procedure

Validation is made by replacing a symbol (a, for instance) with two other arbitrarily chosen symbols (c and d) that are otherwise absent from the sequences. The substitution is made across the entire sequence set.

The new dataset should then be used to train a matrix. If the training procedure is adequate, the matrix should have the following properties, where x is any other symbol:

- cost[c][d]=cost[d][c] ~ 0
- cost[c][x]~cost[d][x]

The random sequences can be generated as follows:

**EXAMPLE:** seq_reformat -in myseq.fasta -action +pavie_seq2random_seq axw > outseq

In this case, axw indicates that a will be replaced with x OR w.

Various modes of identity measure are implemented (Training WEighting).
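The validation check of this section can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the trained matrix is represented as a plain dictionary of dictionaries, the tolerance is an arbitrary choice, and each occurrence of the source symbol is assumed to be replaced independently at random, which may not be exactly what pavie_seq2random_seq does.

```python
import random


def split_symbol(sequences, spec, rng):
    """Replace each occurrence of spec[0] by one of spec[1:], chosen at random.

    With spec "axw", every a becomes x OR w.
    """
    if len(spec) < 2:
        raise ValueError("spec needs a source symbol and at least one target")
    source, targets = spec[0], spec[1:]
    for seq in sequences:
        if any(t in seq for t in targets):
            raise ValueError("replacement symbols must be absent from the input")
    return [
        "".join(rng.choice(targets) if s == source else s for s in seq)
        for seq in sequences
    ]


def check_split(cost, c, d, others, tol):
    """True if cost[c][d] = cost[d][c] ~ 0 and cost[c][x] ~ cost[d][x]."""
    if abs(cost[c][d] - cost[d][c]) > tol or abs(cost[c][d]) > tol:
        return False
    return all(abs(cost[c][x] - cost[d][x]) <= tol for x in others)


recoded = split_symbol(["abca", "bab"], "axw", random.Random(1))

# Toy matrix standing in for a trained one (made-up values).
toy_cost = {
    "x": {"x": 0.0, "w": 0.1, "b": 2.0},
    "w": {"x": 0.1, "w": 0.0, "b": 2.05},
}
looks_valid = check_split(toy_cost, "x", "w", ["b"], tol=0.2)
```

The tolerance decides how close to 0 and how close to each other the costs must be; a suitable value depends on the scale of the trained matrix.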

- Given a trained matrix (previous section), it is possible to compute all the alignments and the associated distance matrix between all the sequences (SA format). It is also possible to specify the measure that will be used to output the distances.
- Parameters
  - _MATDIST_: a distance matrix, where the distance measure mode can be specified with _IDX_
  - _MATSIM_: a similarity matrix, where the distance measure mode can be specified with _IDX_
  - _IDXX_: the default is _ID01_

- Example

**EXAMPLE:** seq_reformat -in myseq.fasta -action +pavie_seq2pavie_aln pavie_matrix.cycle_1.mat_list _MATDIST_ID01_

- matrix_list
  - file containing the names of valid matrix files.
  - The number of matrices defines the number of channels in the sequences (cf. CHANNEL option).
  - A matrix_list file is output automatically when training matrices.

- Given a trained matrix (previous section), it is possible to compute all the alignments:

**EXAMPLE:** seq_reformat -in myseq -action +pavie_seq2pavie_aln pavie_matrix_.cy_0.mat_list


- Example

**EXAMPLE:** seq_reformat -in myseq.fasta -action +pavie_seq2pavie_aln pavie_matrix.cycle_1.mat_list _ID02_

**EXAMPLE:** seq_reformat -in myseq.fasta -action +pavie_seq2pavie_aln pavie_matrix.cycle_1.mat_list _ID02_MCHSCORE1_

- Given a series of sequences, one can compute the log odds associated with transitions:

**EXAMPLE:** seq_reformat -in myseq.fasta -output transitions
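The exact formula used by seq_reformat is not given here. As a rough illustration, one common way to define the log odds of a transition from symbol a to symbol b is to compare how often b directly follows a with what would be expected if the two positions were independent. The Python sketch below follows that definition, which is an assumption, and its function name is hypothetical.

```python
import math
from collections import Counter


def transition_log_odds(sequences):
    """Log odds of each observed transition a -> b between adjacent symbols.

    Assumed definition: log( f(a, b) / (f_from(a) * f_to(b)) ), where f(a, b)
    is the fraction of all transitions that go from a to b, f_from(a) the
    fraction that start at a, and f_to(b) the fraction that end at b.
    """
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))
    total = sum(pair_counts.values())
    if total == 0:
        return {}
    from_counts, to_counts = Counter(), Counter()
    for (a, b), n in pair_counts.items():
        from_counts[a] += n
        to_counts[b] += n
    return {
        (a, b): math.log(
            (n / total) / ((from_counts[a] / total) * (to_counts[b] / total))
        )
        for (a, b), n in pair_counts.items()
    }


odds = transition_log_odds(["abab"])
```

A positive value means the transition is seen more often than expected under independence, and a negative value means it is seen less often.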