General help for CLUSTAL W
Main Menu
Clustal W is a general purpose multiple alignment program for DNA or proteins.
SEQUENCE INPUT: all sequences must be in 1 file, one after another.
7 formats are automatically recognised: NBRF/PIR, EMBL/SWISSPROT,
Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file.
All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
except "-" which is used to indicate a GAP ("." in GCG/MSF).
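As a rough illustration of the character rule above, the following Python sketch keeps letters and the gap character and drops everything else. It is not Clustal W's own code; the function name `clean_sequence` and the `gap_chars` parameter are hypothetical, and restricting "alphabetic" to ASCII letters is an assumption.

```python
def clean_sequence(raw: str, gap_chars: str = "-") -> str:
    """Keep ASCII letters and gap characters; drop spaces, digits, punctuation.

    Assumption: "alphabetic" means ASCII letters only. Pass gap_chars="."
    for GCG/MSF input, where "." marks a gap.
    """
    return "".join(
        ch for ch in raw
        if (ch.isascii() and ch.isalpha()) or ch in gap_chars
    )


if __name__ == "__main__":
    print(clean_sequence("ACG T-12,A"))                # spaces, digits, comma removed
    print(clean_sequence("AC..GT 10", gap_chars="."))  # MSF-style gaps kept
```

The first call yields `ACGT-A` and the second yields `AC..GT`.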
To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from the main menu
to INPUT them; go to menu item 2 to do the multiple alignment.
PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to
add a new sequence to an old alignment, or to use secondary structure to guide
the alignment process. GAPS in the old alignments are indicated using the "-"
character. PROFILES can be input in ANY of the allowed formats; just
use "-" (or "." for MSF/RSF) for each gap position.
PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
with "-" characters to indicate gaps) OR after a multiple alignment while the
alignment is still in memory.
The program tries to automatically recognise the different file formats used
and to guess whether the sequences are amino acid or nucleotide. This is not
always foolproof.
FASTA and NBRF/PIR formats are recognised by having a ">" as the first
character in the file.
EMBL/Swiss Prot formats are recognised by the letters
ID at the start of the file (the token for the entry name field).
CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.
GCG/MSF format is recognised by one of the following:
- the word PileUp at the start of the file.
- the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
at the start of the file.
- the word MSF on the first line of the file, and the characters ..
at the end of this line.
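The recognition rules listed above can be sketched in Python as follows. This is only an illustration of those rules, not the program's actual detection code: the function name `guess_format`, the order of the checks, and the skipping of leading whitespace are all assumptions. FASTA and NBRF/PIR are reported together, and GCG9/RSF and GDE are not handled, because the rules above do not say how they are told apart or recognised.

```python
def guess_format(text: str) -> str:
    """Guess a sequence file format from the start of the file.

    Hypothetical helper; check order and whitespace handling are assumptions.
    """
    stripped = text.lstrip()
    lines = stripped.splitlines()
    first_line = lines[0] if lines else ""

    # ">" as the first character: FASTA or NBRF/PIR
    if stripped.startswith(">"):
        return "FASTA or NBRF/PIR"
    # "ID" at the start of the file: EMBL/Swiss-Prot entry name field
    if stripped.startswith("ID"):
        return "EMBL/SWISSPROT"
    # The word CLUSTAL at the beginning of the file
    if stripped.startswith("CLUSTAL"):
        return "CLUSTAL"
    # GCG/MSF: any one of three markers
    msf_markers = ("PileUp", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT")
    if stripped.startswith(msf_markers):
        return "GCG/MSF"
    if "MSF" in first_line and first_line.rstrip().endswith(".."):
        return "GCG/MSF"
    return "unknown"


if __name__ == "__main__":
    print(guess_format(">seq1\nACGT\n"))
    print(guess_format(" aln.msf  MSF: 2  Type: N  Check: 0  ..\n"))
```

Because the checks look only at the first characters or first line, a file with unexpected leading text can be misclassified, which matches the warning that recognition is not always foolproof.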
Note from the htmlizer (sorry): This is not the best way to input
sequences from GCG. For more details see this
additional note.
If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the
sequence will be assumed to be nucleotide. This works in 97.3% of cases
but watch out!
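A minimal Python sketch of the 85% rule is shown below. It is not taken from the Clustal W source; the name `looks_like_nucleotide` is hypothetical, and three details are assumptions: the comparison ignores case, only letters are counted (gaps and other characters are excluded from the total), and an empty sequence is treated as not nucleotide.

```python
NUCLEOTIDE_CODES = set("ACGTUN")


def looks_like_nucleotide(sequence: str, threshold: float = 0.85) -> bool:
    """Return True if at least `threshold` of the letters are A, C, G, T, U or N.

    Assumptions: case-insensitive, non-letters ignored, empty input -> False.
    """
    residues = [ch.upper() for ch in sequence if ch.isalpha()]
    if not residues:
        return False
    hits = sum(ch in NUCLEOTIDE_CODES for ch in residues)
    return hits / len(residues) >= threshold


if __name__ == "__main__":
    print(looks_like_nucleotide("ACGTACGTNN"))   # all letters are nucleotide codes
    print(looks_like_nucleotide("MKVLAAGIVG"))   # only 4 of 10 letters match
```

A short protein made up mostly of alanine, cysteine, glycine and threonine would pass this test, which is the kind of case the warning above has in mind.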