For protein alignments, you use a weight matrix to determine the similarity of
non-identical amino acids. For example, Tyr aligned with Phe is usually judged
to be 'better' than Tyr aligned with Pro.
There are three 'in-built' series of weight matrices offered. Each consists
of several matrices which work differently at different evolutionary distances.
To see the exact details, read the documentation. Crudely, we store several
matrices in memory, spanning the full range of amino acid distance (from
almost identical sequences to highly divergent ones). For very similar
sequences, it is best to use a strict weight matrix which only gives a high
score to identities and the most favoured conservative substitutions. For
more divergent sequences, it is appropriate to use "softer" matrices which
give a high score to many other frequent substitutions.
1) BLOSUM (Henikoff). These matrices appear to be the best available for
carrying out database similarity (homology) searches. The matrices used are:
Blosum80, 62, 45 and 30.
2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
We use the PAM 120, 160, 250 and 350 matrices.
3) GONNET. These matrices were derived using almost the same
procedure as the Dayhoff one (above) but are much more up to date and are based
on a far larger data set. They appear to be more sensitive than the Dayhoff
series. We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.
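As an illustration of how a series works, the sketch below picks one matrix
from the BLOSUM series according to how similar two sequences are. The
function name pick_matrix and the percent-identity cut-offs are hypothetical,
chosen only to show the idea; the cut-offs the program actually uses are given
in the documentation.

```python
# Hypothetical cut-offs (percent identity), strictest matrix first.
# These numbers are illustrative assumptions, not the program's real values.
BLOSUM_SERIES = [
    (80.0, "Blosum80"),
    (60.0, "Blosum62"),
    (30.0, "Blosum45"),
    (0.0, "Blosum30"),
]


def pick_matrix(percent_identity, series=BLOSUM_SERIES):
    """Return the name of the first matrix whose cut-off is met."""
    for cutoff, name in series:
        if percent_identity >= cutoff:
            return name
    # Fall back to the softest matrix for anything below every cut-off.
    return series[-1][1]


if __name__ == "__main__":
    for identity in (95.0, 65.0, 10.0):
        print(identity, pick_matrix(identity))
```

Closely related sequences get the strict matrix and divergent ones get a
softer matrix, which is the behaviour described above.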
We also supply an identity matrix which gives a score of 1.0 to two identical
amino acids and a score of zero otherwise. This matrix is not very useful.
Alternatively, you can read in your own (just one matrix, not a series).
A new matrix can be read from a file on disk, if the filename consists only
of lower case characters. The values in the new weight matrix must be integers
and the scores should be similarities. You can use negative as well as positive
values if you wish, although the matrix will be automatically adjusted to all
positive scores.
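The exact adjustment applied to a user matrix is not spelled out here. One
simple possibility, shown as a sketch only, is to add a constant offset so
that the lowest score becomes zero. The function name shift_to_non_negative
is hypothetical.

```python
def shift_to_non_negative(matrix):
    """Add a constant to every score so that no entry is below zero.

    This is an assumed adjustment for illustration; the program may use a
    different offset or rescale the values in some other way.
    """
    lowest = min(min(row) for row in matrix)
    offset = -lowest if lowest < 0 else 0
    return [[value + offset for value in row] for row in matrix]


if __name__ == "__main__":
    toy = [[4, -1], [-1, 9]]  # invented values, not a real matrix
    print(shift_to_non_negative(toy))
```

A constant offset keeps the differences between scores unchanged, so the
relative ranking of substitutions is preserved.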
For DNA, a single matrix (not a series) is used. Two hard-coded matrices are
available:
1) IUB. This is the default scoring matrix used by BESTFIT for the comparison
of nucleic acid sequences. X's and N's are treated as matches to any IUB
ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
2) CLUSTALW(1.6). The previous system used by ClustalW, in which matches score
1.0 and mismatches score 0. All matches for IUB symbols also score 0.
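The two DNA schemes can be sketched as scoring functions. This is one reading
of the descriptions above, not the program's own tables: the function names,
the treatment of X and N as matching every symbol, and the choice of A, C, G,
T and U as the unambiguous bases are all assumptions.

```python
WILDCARDS = {"N", "X"}      # assumed to match any symbol under IUB
PLAIN_BASES = set("ACGTU")  # assumed set of unambiguous bases


def iub_score(a, b):
    """IUB scheme as read here: matches score 1.9, mismatches score 0."""
    a, b = a.upper(), b.upper()
    if a == b or a in WILDCARDS or b in WILDCARDS:
        return 1.9
    return 0.0


def clustalw16_score(a, b):
    """CLUSTALW(1.6) scheme as read here: only identical plain bases score."""
    a, b = a.upper(), b.upper()
    if a == b and a in PLAIN_BASES:
        return 1.0
    # Mismatches, and matches between IUB ambiguity symbols, score 0.
    return 0.0


if __name__ == "__main__":
    print(iub_score("A", "N"), clustalw16_score("R", "R"))
```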
INPUT FORMAT The format used for a new matrix is the same as that used by the
BLAST program. Any lines beginning with a # character are assumed to be
comments. The first non-comment line should contain a list of amino acids in
any order, using the one-letter code, followed by a * character. This should be followed by a square
matrix of integer scores, with one row and one column for each amino acid. The
last row and column of the matrix (corresponding to the * character) contain
the minimum score over the whole matrix.
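A minimal reader for this layout might look like the sketch below. The
function name parse_matrix and the toy matrix are hypothetical. Two
assumptions go beyond the description above: blank lines are skipped, and a
row may begin with its own symbol as a label, as BLAST matrix files commonly
do.

```python
def parse_matrix(text):
    """Parse a BLAST-style weight matrix into (symbols, {(a, b): score})."""
    symbols = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines (assumption) and comment lines
        fields = line.split()
        if symbols is None:
            symbols = fields  # header: one-letter codes ending with '*'
            continue
        # Assumption: drop a leading row label if the first field is not
        # an integer.
        if not fields[0].lstrip("-").isdigit():
            fields = fields[1:]
        rows.append([int(field) for field in fields])
    if symbols is None or symbols[-1] != "*":
        raise ValueError("header line must end with '*'")
    size = len(symbols)
    if len(rows) != size or any(len(row) != size for row in rows):
        raise ValueError("expected one row and one column per symbol")
    scores = {
        (a, b): rows[i][j]
        for i, a in enumerate(symbols)
        for j, b in enumerate(symbols)
    }
    return symbols, scores


# Toy matrix with two amino acids; the values are invented for illustration.
EXAMPLE = """\
# toy matrix
A  C  *
A  4 -1 -1
C -1  9 -1
* -1 -1 -1
"""

if __name__ == "__main__":
    symbols, scores = parse_matrix(EXAMPLE)
    print(symbols, scores[("A", "C")])
```

In the toy file the * row and column hold -1, the minimum score over the
whole matrix, as the format requires.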