Anchored multiple alignment with DIALIGN: manual

Multiple sequence alignment with user-defined constraints

Most multi-alignment programs are fully automated. They create alignments according to a set of mathematical rules, and the only way of influencing the alignment procedure is to specify certain parameters. This approach to sequence alignment is appropriate if no additional information about the input sequences is available, or if large amounts of data are to be processed efficiently.

Often, however, the user of an alignment program has some prior knowledge about the sequences in question. He or she may know certain regions of the sequences that are functionally or evolutionarily related and should therefore be aligned to each other. If an automated alignment procedure fails to align known homologies, it is desirable to force the program to align these homologies correctly and to align the remainder of the sequences under the constraints given by this user-specified side condition.

Here, we offer a WWW-based multi-alignment tool that calculates protein and nucleic-acid alignments under a set of user-specified constraints. Our server runs the program DIALIGN on a set of input sequences using a previously proposed anchoring approach.

Our method is described in

B. Morgenstern, S.J. Prohaska, D. Pöhler, P.F. Stadler (2006)
Multiple sequence alignment with user-defined anchor points
Algorithms for Molecular Biology 1,6.

If you use our software for your research work, please cite this article.

Input sequence file:

Our software tool requires a single ASCII file containing the sequences to be aligned in FASTA format:

        >HTL2  
        LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
        SLNIFLDSKYLIKYLHSLAIGAFLGTSAHQTLQAALPPLLQGKTIYLHHVRSHTNLPDPI
        STFNEYTDSLILAPL
        >MMLV   
        PDADHTWYTDGSSLLQEGQRKAGAAVTTETEVIWAKALDAGTSAQRAELIALTQALKMAE
        GKKLNVYTDSRYAFATAHIHGEIYRRRGLLTSEGKEIKNKDEILALLKALFLPKRLSIIH
        CPGHQKGHSAEARGNRMADQAARKAAITETPDTSTLL
        >HEPB 
        RPGLCQVFADATPTGWGLVMGHQRMRGTFSAPLPIHTAELLAACFARSRSGANIIGTDNS
        VVLSRKYTSFPWLLGCAANWILRGTSFVYVPSALNPADDPSRGRLGLSRPLLRLPFRPTT
        GRTSLYADSPSVPSHLPDRVH
        >ECOL   
        MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNRMELMAAIVALEALK
        EHCEVILSTDSQYVRQGITQWIHNWKKRGWKTADKKPVKNVDLWQRLDAALGQHQIKWEW
        VKGHAGHPENERCDELARAAAMNPTLEDTGYQVEV

For each sequence, the first line starts with ">" and contains the name of the sequence.
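
To check an input file before submitting it, the format described above can be read with a few lines of Python. This is only a minimal sketch and not part of DIALIGN; the function name parse_fasta and the shortened example input are chosen here for illustration.

```python
def parse_fasta(text):
    """Return {name: sequence} for FASTA-formatted text.

    A line starting with ">" opens a new record; all following lines
    up to the next ">" are joined into one sequence string.
    """
    sequences = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            name = line[1:].strip()
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    return sequences


example = """
>seq1
WKKNADAPKR
AMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```

Sequence lines that are wrapped over several lines, as in the example file above, are joined into a single string per sequence.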

Basic program options:

Threshold T:

As described in our papers, DIALIGN constructs alignments from gap-free pairs of segments of the sequences. Such segment pairs are referred to as fragments. Every possible fragment is given a so-called weight reflecting the degree of similarity between the two segments involved. A threshold T can be applied to consider only fragments whose weight exceeds T.

Sequence Type:

The user has to decide if nucleic acid or protein sequences are to be aligned.

Fast alignment:

With this option, a number of heuristics are applied to speed up the alignment procedure: the maximum length of fragments is reduced, and threshold values are imposed for the similarity of the first pair of residues in a fragment as well as for the total weight score of a fragment. Note that this option speeds up the program, but it may also reduce the quality of the output alignment.

Anchored alignment:

The anchored-alignment procedure makes it possible to use expert knowledge for improved multiple alignment. The user can specify a list of anchor points, each of which consists of a pair of equal-length segments that are to be aligned by the program.

To be precise, the program first selects a consistent subset of anchor points based on the scores that are specified by the user for each anchor point (or based on the order of anchor points if they are entered via our online form). If some residue x from one of the input sequences is assigned to a residue y from another sequence by one of the selected anchor points, this means that y is the only residue from this latter sequence that can be aligned to x and vice versa. All residues to the left (to the right) of x are aligned to residues to the left (to the right) of y.
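
The selection of a consistent subset can be illustrated for the simplest case of anchor points between one pair of sequences. The following Python sketch is an assumption-laden illustration, not the procedure implemented in DIALIGN: it treats two anchors as consistent if they lie on the same diagonal or if one lies strictly before the other in both sequences, and it accepts anchors greedily by decreasing score. Consistency across more than two sequences is more involved and is not covered. The names Anchor, consistent and select_anchors are hypothetical, and the second and third anchors in the example are invented.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    start1: int   # 1-based start in the first sequence
    start2: int   # 1-based start in the second sequence
    length: int
    score: float


def consistent(a: Anchor, b: Anchor) -> bool:
    """Pairwise case only: can both anchors appear in one alignment?"""
    # Same diagonal: both anchors pair up residues in the same way.
    if a.start1 - a.start2 == b.start1 - b.start2:
        return True
    # Otherwise one anchor must end before the other starts in BOTH sequences.
    a_before_b = (a.start1 + a.length <= b.start1
                  and a.start2 + a.length <= b.start2)
    b_before_a = (b.start1 + b.length <= a.start1
                  and b.start2 + b.length <= a.start2)
    return a_before_b or b_before_a


def select_anchors(anchors: List[Anchor]) -> List[Anchor]:
    """Greedy selection: best score first, keep what fits with earlier picks."""
    chosen: List[Anchor] = []
    for cand in sorted(anchors, key=lambda x: x.score, reverse=True):
        if all(consistent(cand, c) for c in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda x: x.start1)


anchors = [
    Anchor(4, 6, 6, 4.5),    # anchor between two sequences
    Anchor(12, 3, 4, 1.0),   # invented: crosses the first anchor
    Anchor(15, 14, 3, 2.0),  # invented: lies to the right of the first anchor
]
selected = select_anchors(anchors)
print(selected)
```

In this example the low-scoring anchor that crosses the first one is discarded, while the other two are kept.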

Consider the following toy example.

>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK

The non-anchored default version of DIALIGN would calculate the following alignment for this input sequence set:

seq1     1   WKKNAD---- -APKRamtsf mKAAY----- -------
seq2     1   WNLDTN---- -SPEE----- -KQAYiqlaK DDriryd
seq3     1   WRMDSNqknp dSNNP----- -KAAYn---K GDanapk

Now assume that the user has some expert knowledge about a certain domain that is present in all of the input sequences; the domains in the three sequences are thought to be homologous to each other. The user therefore wants these domains to be aligned to each other. For example, the user may want to align a domain that starts at position 4 in seq1, at position 6 in seq2 and at position 11 in seq3 of the input sequence set:

>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK

The default version of the program aligns only parts of this motif. Therefore, the user wants to define this motif as an anchor and have the rest of the sequences aligned automatically under the constraints imposed by this anchor. Since anchor points are defined as pairs of equal-length segments, we need two anchor points to enforce alignment of the above motif. For example, one could choose

Anchor point 1 (positions 4-9 of seq1, NADAPK, paired with positions 6-11 of seq2, NSPEEK):

>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3 
WRMDSNQKNPDSNNPKAAYNKGDANAPK

Anchor point 2 (positions 6-13 of seq2, NSPEEKQA, paired with positions 11-18 of seq3, DSNNPKAA):

>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK

If the above motif is to be aligned by our program, these two anchor points need to be specified.

Format for user-defined anchor points:

To specify a set of anchor points, a file with the coordinates of these anchor points is needed. Since each anchor point corresponds to an equal-length segment pair involving two of the input sequences, an anchor point is defined by the following coordinates: (1) first sequence involved, (2) second sequence involved, (3) start of anchor in first sequence, (4) start of anchor in second sequence, (5) length of anchor, and (6) a score for the anchor point. The score is needed to prioritize anchor points in case they are inconsistent with each other, i.e. if not all of them can be used simultaneously for the same alignment. Each line of the file specifies one anchor point by the above coordinates, in the order given. Thus, the above two anchor points are specified as follows:

   1  2  4   6  6  4.5
   2  3  6  11  8  1.3

Here, 4.5 and 1.3 are (arbitrary) scores for anchor points 1 and 2. Note that in this small example, the two specified anchor points are consistent with each other, so both can be used and the scores at the end of the two lines have no influence on the alignment procedure. Nevertheless, these scores have to be provided, even in cases where they turn out to be practically irrelevant.

Thus, the first line corresponds to anchor 1 and specifies: (1) first sequence = 1, (2) second sequence = 2, (3) first start position = 4, (4) second start position = 6, (5) length = 6, (6) score = 4.5
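
To double-check an anchor file before uploading it, one can print the segments that each line refers to. The sketch below is a hypothetical helper, not part of DIALIGN. It assumes that sequences are numbered in the order in which they appear in the input file and that sequence numbers and positions start at 1, which agrees with the toy example above.

```python
# Toy input sequences, in the order of the input file.
SEQUENCES = [
    "WKKNADAPKRAMTSFMKAAY",
    "WNLDTNSPEEKQAYIQLAKDDRIRYD",
    "WRMDSNQKNPDSNNPKAAYNKGDANAPK",
]

ANCHOR_FILE = """
   1  2  4   6  6  4.5
   2  3  6  11  8  1.3
"""


def parse_anchors(text):
    """Read one anchor per non-empty line: seq_a seq_b start_a start_b length score."""
    anchors = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        seq_a, seq_b, start_a, start_b, length = (int(f) for f in fields[:5])
        anchors.append((seq_a, seq_b, start_a, start_b, length, float(fields[5])))
    return anchors


def anchor_segments(anchor, sequences):
    """Return the two equal-length segments an anchor refers to (1-based input)."""
    seq_a, seq_b, start_a, start_b, length, _score = anchor
    seg_a = sequences[seq_a - 1][start_a - 1:start_a - 1 + length]
    seg_b = sequences[seq_b - 1][start_b - 1:start_b - 1 + length]
    return seg_a, seg_b


pairs = [anchor_segments(a, SEQUENCES) for a in parse_anchors(ANCHOR_FILE)]
for seg_a, seg_b in pairs:
    print(seg_a, seg_b)
```

For the two lines above, this prints NADAPK with NSPEEK and NSPEEKQA with DSNNPKAA.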

Using the above anchor points, our program creates the following alignment where the specified motif is aligned and the remainder of the sequences is aligned automatically respecting the constraints given by the anchor points:

seq1     1   WKk------- NADAPKRAMT SFMKAa---Y -
seq2     1   WNLDT----- NSPEEKQAYI QLAKDDrirY d
seq3     1   WRMDSnqknp DSNNPKAAYn ---KGDanap k

If anchor points are uploaded or if they are pasted into the window on the submission page, they need to be specified in the above format.
Alternatively, they can be entered into the boxes of the pre-defined table. In this case, no scores are necessary, and anchors are prioritized in the order they are entered.

Program Output:

Our web server creates different output files containing

A full alignment of the input sequences in DIALIGN format.
The same alignment in FASTA format.

This is DIALIGN alignment format:

  
                    

dog_il4        20565   AGAGCCTGGT CTGGAGCAAA GTTGATGTCT ACCTGTGCTT TCTTTAGCAG
hum_il4        21730   AGAGCCTGGT CTGGGGCAAA GTTGATGTCT ACCTGTGCTT TCTTTAGCAG
mus_il4        31390   ---------- ---GGGCCGA GCTGATGTCT ACCTGTGCTT TCTTTAGCAG

                       1111111111 1112222222 2222222222 2222222222 2222222222


dog_il4        20615   ATCAGATAta gGAG---TAC ACCAGTCGGG CATGAGCCTC TCCAGCTCTA
hum_il4        21780   ATCAGATAta gGAG---CAC ACCAGCCGGG CATGAGCCTC TCCAGCTCTA
mus_il4        31427   ATCAGATAta gccgcagCAC AGCAGTCGGG CATGAGCCTC TCCAACTCTA

                       2222222000 0222000333 3334444444 4444444444 4444444444


dog_il4        20662   AGGTGATGAT GACCAAGGCC AGTGTGGAGC CCTTGAacTG CAGCAGCTGG
hum_il4        21827   AGGTGATGAT GACCAAGGCC AGTGTGGAGC CCTTGAACTG CAGCAGCTGG
mus_il4        31477   AGGTGATGAT GACTAAAGCA AATGTGGAGC CCTTGAACTG CAGCAGCTGG

                       4444444444 4433666666 6666666666 6666662299 9999999999

Names of the aligned sequences are shown on the left-hand side of the alignment.

Numbers on the left-hand side of the alignment denote the position of the first residue in a line within the respective sequence.

As explained in our papers, DIALIGN composes alignments from so-called fragments, i.e. from un-gapped local pair-wise alignments. Capital letters denote aligned residues, i.e. residues involved in at least one of the fragments the alignment consists of. Lower-case letters denote residues not belonging to any of these selected fragments. They are not considered to be aligned by DIALIGN. Thus, if a lower-case letter is standing in the same column with other letters, this is pure chance; these residues are not considered to be homologous.
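
The upper-case/lower-case convention makes it easy to process alignment rows by script. As a minimal sketch (the function name classify_row is made up for this illustration), the following Python code counts aligned residues, non-aligned residues and gaps in one row of an alignment; blanks between the blocks of ten columns are ignored.

```python
def classify_row(row):
    """Count (aligned, unaligned, gaps) in one alignment row.

    Upper-case letters are residues that belong to a selected fragment,
    lower-case letters are residues that are not considered aligned,
    and "-" marks a gap.
    """
    row = row.replace(" ", "")  # drop the blanks between column blocks
    aligned = sum(1 for c in row if c.isalpha() and c.isupper())
    unaligned = sum(1 for c in row if c.isalpha() and c.islower())
    gaps = row.count("-")
    return aligned, unaligned, gaps


# Row of seq1 from the anchored toy alignment above.
counts = classify_row("WKk------- NADAPKRAMT SFMKAa---Y -")
print(counts)
```

For this row, 18 of the 20 residues of seq1 are part of a selected fragment, 2 are not, and the row contains 11 gap characters.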

Numbers below the alignment roughly reflect the degree of local similarity among the sequences. More precisely, they represent the sum of the weights of the fragments connecting residues at the respective position. The numbers are normalized such that every position gets a value between 0 and 9. Thus, these numbers reflect the relative degree of similarity within an alignment, since in every alignment, the region of maximum similarity gets a score of 9.
For multiple sequence files, a heuristic tree is constructed based on the pair-wise similarities among the sequences.

This is FASTA alignment format:

>HTL2
ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA
QKGELLALICGLRAAKPWPSLNIFLDSKYLIKYLHslaigaflgtsah--
-------QT---LQAALPPLLQGKTIYLHHVRSHT------NLPDPISTF
NEYTDSLILApl--------------------------------------
----------
>MMLV
pdadhtwYTDGSSLLQEGQRKAGAAVTTETeviwaKALDAG---T---SA
QRAELIALTQALKMAEgkk-LNVYTDSRYAFATAHIHGEIYRRRGLLTSE
GKEIKNKDE---ILALLKALFLPKRLSIIHCPGHQ------KGHSAEARG
NRMADQAARKAAITETPDTStll---------------------------
----------
>HEPB
rpglcQVFADAT------PTGWGLVMGHQRMR---GTFSAPLPIHt----
--AELLAACFArsrsgan---IIGTDN-----------------------
-------------SVVLSR--------------KYTSFPWLLGCAANWI-
LRGTSFVYVPSALNPADDPSrgrlglsrpllrlpfrpttgrtslyadsps
vpshlpdrvh
>ECOL
mlkqvEIFTDGSCLGNPGPGGYGAILRYRGRE---KTFSAGytrT---TN
NRMELMAAIVALEALKEHCEVILSTDSQYVRQGITQWIHNWKKRGWKTAD
KKPVKNVDlwqrLDAALGQ--------------HQIKWEWVKGHAGHPE-
NERCDELARAAAMNPTledtgyqvev------------------------
----------

This is PHYLIP tree format:

 
((HTL2:0.111024,
(MMLV:0.078471,
ECOL:0.078471):0.032554):0.121218,
HEPB:0.232242);

Trees can be visualized using the drawtree program contained in Joe Felsenstein's PHYLIP software package.
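
If only the leaf names and their branch lengths are needed, the tree file can also be inspected without PHYLIP. The following Python sketch is an illustration under simplifying assumptions: it uses a regular expression instead of a full tree parser, ignores the tree topology, and expects plain sequence names without quotes or special characters. The function name leaf_branch_lengths is hypothetical.

```python
import re

TREE = """((HTL2:0.111024,
(MMLV:0.078471,
ECOL:0.078471):0.032554):0.121218,
HEPB:0.232242);"""


def leaf_branch_lengths(tree_text):
    """Return {leaf name: branch length} for a tree in the format shown above."""
    flat = "".join(tree_text.split())  # remove line breaks and blanks
    # A leaf is a name directly followed by ":" and a number; internal
    # branch lengths follow ")" and therefore have no name in front.
    pairs = re.findall(r"([^():,;]+):([0-9.]+)", flat)
    return {name: float(length) for name, length in pairs}


leaves = leaf_branch_lengths(TREE)
for name, length in leaves.items():
    print(name, length)
```

The two internal branch lengths (0.032554 and 0.121218) are skipped on purpose, since they do not belong to a single sequence.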
