Choosing protein weight matrix


For protein alignments, you use a weight matrix to determine the similarity of non-identical amino acids. For example, Tyr aligned with Phe is usually judged to be 'better' than Tyr aligned with Pro.

There are three 'in-built' series of weight matrices offered. Each consists of several matrices which work differently at different evolutionary distances. To see the exact details, read the documentation. Crudely, we store several matrices in memory, spanning the full range of amino acid distance (from almost identical sequences to highly divergent ones). For very similar sequences, it is best to use a strict weight matrix which only gives a high score to identities and the most favoured conservative substitutions. For more divergent sequences, it is appropriate to use "softer" matrices which give a high score to many other frequent substitutions.

1) BLOSUM (Henikoff). These matrices appear to be the best available for carrying out data base similarity (homology searches). The matrices used are: Blosum80, 62, 45 and 30.

2) PAM (Dayhoff). These have been extremely widely used since the late '70s. We use the PAM 120, 160, 250 and 350 matrices.

3) GONNET . These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger data set. They appear to be more sensitive than the Dayhoff series. We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.

We also supply an identity matrix which gives a score of 1.0 to two identical amino acids and a score of zero otherwise. This matrix is not very useful. Alternatively, you can read in your own (just one matrix, not a series).

A new matrix can be read from a file on disk, if the filename consists only of lower case characters. The values in the new weight matrix must be integers and the scores should be similarities. You can use negative as well as positive values if you wish, although the matrix will be automatically adjusted to all positive scores.

For DNA, a single matrix (not a series) is used. Two hard-coded matrices are available:

1) IUB. This is the default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.

2) CLUSTALW(1.6). The previous system used by ClustalW, in which matches score 1.0 and mismatches score 0. All matches for IUB symbols also score 0.

INPUT FORMAT The format used for a new matrix is the same as the BLAST program. Any lines beginning with a # character are assumed to be comments. The first non-comment line should contain a list of amino acids in any order, using the 1 letter code, followed by a * character. This should be followed by a square matrix of integer scores, with one row and one column for each amino acid. The last row and column of the matrix (corresponding to the * character) contain the minimum score over the whole matrix.