Multiple sequence alignment with user-defined constraints
Most multi-alignment programs are fully automated. They create alignments according to a set of mathematical rules, and the only way of influencing the alignment procedure is to specify certain parameters. This approach to sequence alignment is appropriate if no additional information about the input sequences is available, or if large amounts of data are to be processed efficiently.
Often, however, the user of an alignment program has some prior knowledge about the sequences in question. He or she may know certain regions of the sequences that are functionally or evolutionarily related and should therefore be aligned to each other. If an automated alignment procedure fails to align known homologies, it would be desirable to force the program to correctly align these homologies and to align the remainder of the sequences under the constraints given by this user-specified side condition.
Here, we offer a WWW-based multi-alignment tool that calculates protein and nucleic-acid alignments under a set of user-specified constraints. Our server runs the program DIALIGN on a set of input sequences using a previously proposed anchoring approach.
Our method is described in
- B. Morgenstern, S.J. Prohaska, D. Pöhler, P.F. Stadler (2006). Multiple sequence alignment with user-defined anchor points. Algorithms for Molecular Biology 1, 6.
Input sequence file:
Our software tool requires a single ASCII file containing the sequences to be aligned in FASTA format:
>HTL2
LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
SLNIFLDSKYLIKYLHSLAIGAFLGTSAHQTLQAALPPLLQGKTIYLHHVRSHTNLPDPI
STFNEYTDSLILAPL
>MMLV
PDADHTWYTDGSSLLQEGQRKAGAAVTTETEVIWAKALDAGTSAQRAELIALTQALKMAE
GKKLNVYTDSRYAFATAHIHGEIYRRRGLLTSEGKEIKNKDEILALLKALFLPKRLSIIH
CPGHQKGHSAEARGNRMADQAARKAAITETPDTSTLL
>HEPB
RPGLCQVFADATPTGWGLVMGHQRMRGTFSAPLPIHTAELLAACFARSRSGANIIGTDNS
VVLSRKYTSFPWLLGCAANWILRGTSFVYVPSALNPADDPSRGRLGLSRPLLRLPFRPTT
GRTSLYADSPSVPSHLPDRVH
>ECOL
MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNRMELMAAIVALEALK
EHCEVILSTDSQYVRQGITQWIHNWKKRGWKTADKKPVKNVDLWQRLDAALGQHQIKWEW
VKGHAGHPENERCDELARAAAMNPTLEDTGYQVEV
For each sequence, the first line starts with ">" and contains the name of the sequence.
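To check an input file before submitting it, the FASTA layout described above can be read with a few lines of Python. This is a minimal sketch and not part of the server; the function name `parse_fasta` is made up for illustration, and the example input reuses two short sequences from the toy example further down this page.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Sequence data may be spread over several lines.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKD
DRIRYD
"""

records = parse_fasta(example)
for seq_name, seq in records:
    print(seq_name, len(seq))
```

Sequence lines that belong to the same record are simply concatenated, so the line length used in the file does not matter.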
Basic program options:
Threshold T:
As described in our papers, DIALIGN constructs alignments from gap-free pairs of segments of the sequences. Such segment pairs are referred to as fragments. Every possible fragment is given a so-called weight reflecting the degree of similarity between the two segments involved. A threshold T can be applied to consider only fragments whose weight exceeds T.
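The effect of the threshold can be pictured as a simple filter over candidate fragments. The following sketch only illustrates the idea: the `Fragment` class, its field names, and the weights are hypothetical and do not reflect how DIALIGN stores or scores fragments internally.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A gap-free segment pair (hypothetical representation)."""
    seq_a: int      # index of the first sequence
    start_a: int    # start position in the first sequence
    seq_b: int      # index of the second sequence
    start_b: int    # start position in the second sequence
    length: int     # common length of both segments
    weight: float   # similarity weight (values below are made up)


def apply_threshold(fragments, threshold):
    """Keep only fragments whose weight exceeds the threshold T."""
    return [f for f in fragments if f.weight > threshold]


candidates = [
    Fragment(1, 4, 2, 6, 6, 3.2),
    Fragment(1, 17, 2, 12, 3, 0.8),
    Fragment(2, 6, 3, 11, 8, 5.1),
]
kept = apply_threshold(candidates, threshold=2.0)
print([f.weight for f in kept])
```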
Sequence Type:
The user has to decide if nucleic acid or protein sequences are to be aligned.
Fast alignment:
With this option, a number of heuristics are applied to speed up the alignment procedure: the maximum length of fragments is reduced, and threshold values are imposed for the similarity of the first pair of residues in a fragment as well as for the total weight score of a fragment. Note that this option speeds up the program, but it may also reduce the quality of the output alignment.
Anchored alignment:
The anchored-alignment procedure makes it possible to use expert knowledge for improved multiple alignment. The user can specify a list of anchor points, each of which consists of a pair of equal-length segments that are to be aligned by the program. To be precise, the program first selects a consistent subset of anchor points based on the scores that are specified by the user for each anchor point (or based on the order of anchor points if they are entered via our online form). If some residue x from one of the input sequences is assigned to a residue y from another sequence by one of the selected anchor points, this means that y is the only residue from this latter sequence that can be aligned to x and vice versa. All residues to the left (to the right) of x are aligned to residues to the left (to the right) of y.
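The constraint imposed by a single anchored residue pair can be written as a small predicate. The sketch below assumes that residue x of one sequence is anchored to residue y of another sequence and asks whether residues i and j of the same two sequences may still be aligned. The function name `compatible` is made up, and the sketch covers only this pairwise rule, not the selection of a consistent subset of anchor points across several sequences.

```python
def compatible(i, j, x, y):
    """Return True if aligning residue i (first sequence) with residue j
    (second sequence) respects the anchored pair (x, y).

    Allowed cases: the anchored pair itself, or both residues strictly to
    the left of the anchor, or both strictly to the right of it.
    """
    if i == x or j == y:
        return i == x and j == y
    return (i < x) == (j < y)


# Hypothetical anchor: residue 4 of one sequence is tied to residue 6 of another.
print(compatible(3, 5, 4, 6))   # both to the left of the anchor
print(compatible(3, 7, 4, 6))   # crosses the anchor
print(compatible(4, 7, 4, 6))   # residue 4 is already tied to residue 6
```

An anchor point of length n corresponds to n such anchored residue pairs, one for each offset within the two segments.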
Consider the following toy example.
>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK
The non-anchored default version of DIALIGN would calculate the following alignment for this input sequence set:
seq1    1   WKKNAD---- -APKRamtsf mKAAY----- -------
seq2    1   WNLDTN---- -SPEE----- -KQAYiqlaK DDriryd
seq3    1   WRMDSNqknp dSNNP----- -KAAYn---K GDanapk
Now assume that the user has some expert knowledge about a certain domain that is present in all of the input sequences; the domains in the three sequences are thought to be homologous to each other. The user therefore wants these domains to be aligned to each other. For example, the user may want to align the following domain that is shown in red in the input sequence set:
>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK
The default version of the program aligns only parts of this motif. Therefore, the user wants to define this motif as an anchor and have the rest of the sequences aligned automatically, subject to the constraints imposed by this anchor. Since anchor points are defined as pairs of equal-length segments, we need two anchor points to enforce alignment of the above motif. For example, one could choose
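Anchor point 1 (residues 4 to 9 of seq1, NADAPK, paired with residues 6 to 11 of seq2, NSPEEK):
>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK
Anchor point 2 (residues 6 to 13 of seq2, NSPEEKQA, paired with residues 11 to 18 of seq3, DSNNPKAA):
>seq1
WKKNADAPKRAMTSFMKAAY
>seq2
WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3
WRMDSNQKNPDSNNPKAAYNKGDANAPK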
If the above motif is to be aligned by our program, these two anchor points need to be specified.
Format for user-defined anchor points:
To specify a set of anchor points, a file with the coordinates of these anchor points is needed. Since each anchor point corresponds to an equal-length segment pair involving two of the input sequences, coordinates for anchor points are defined as follows: (1) first sequence involved, (2) second sequence involved, (3) start of anchor in first sequence, (4) start of anchor in second sequence, (5) length of anchor. (6) In addition, a sixth coordinate is necessary that specifies the score of an anchor point. This score is used to prioritize anchor points in case they are inconsistent with each other, i.e. if not all of them can be used simultaneously for the same alignment. Each anchor point is specified on one line containing the above coordinates (in the order given above). Thus, the above two anchor points are specified as follows:
1 2 4 6 6 4.5
2 3 6 11 8 1.3
Here, 4.5 and 1.3 are (arbitrary) scores for anchor points 1 and 2. Note that in this small example, the two specified anchor points are consistent with each other, so both can be used and the scores at the end of the two lines have no influence on the alignment procedure. Nevertheless, these scores have to be provided, even in cases where they turn out to be practically irrelevant.
Thus, the first line corresponds to anchor 1 and specifies: (1) first sequence = 1, (2) second sequence = 2, (3) first start position = 4, (4) second start position = 6, (5) length = 6, (6) score = 4.5
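To see which residues an anchor line refers to, the six coordinates can be parsed and applied to the input sequences. The following is a minimal sketch with made-up names (`Anchor`, `parse_anchors`, `anchored_segments`). It assumes that sequences are numbered in the order in which they appear in the input file and that positions are counted from 1, which is what the example above suggests.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    seq_a: int     # (1) first sequence involved (assumed 1-based)
    seq_b: int     # (2) second sequence involved
    start_a: int   # (3) start of anchor in first sequence (assumed 1-based)
    start_b: int   # (4) start of anchor in second sequence
    length: int    # (5) length of anchor
    score: float   # (6) priority score


def parse_anchors(text):
    """Parse one anchor point per non-empty line of whitespace-separated fields."""
    anchors = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        seq_a, seq_b, start_a, start_b, length = map(int, fields[:5])
        anchors.append(Anchor(seq_a, seq_b, start_a, start_b, length, float(fields[5])))
    return anchors


def anchored_segments(anchor, sequences):
    """Return the two equal-length segments that the anchor ties together."""
    a = sequences[anchor.seq_a - 1]
    b = sequences[anchor.seq_b - 1]
    return (
        a[anchor.start_a - 1 : anchor.start_a - 1 + anchor.length],
        b[anchor.start_b - 1 : anchor.start_b - 1 + anchor.length],
    )


sequences = [
    "WKKNADAPKRAMTSFMKAAY",          # seq1
    "WNLDTNSPEEKQAYIQLAKDDRIRYD",    # seq2
    "WRMDSNQKNPDSNNPKAAYNKGDANAPK",  # seq3
]
anchors = parse_anchors("1 2 4 6 6 4.5\n2 3 6 11 8 1.3\n")
for anchor in anchors:
    print(anchored_segments(anchor, sequences))
```

For the two lines of the toy example, this yields the segment pairs NADAPK / NSPEEK and NSPEEKQA / DSNNPKAA.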
Using the above anchor points, our program creates the following alignment, in which the specified motif is aligned and the remainder of the sequences is aligned automatically, respecting the constraints given by the anchor points:
seq1    1   WKk------- NADAPKRAMT SFMKAa---Y -
seq2    1   WNLDT----- NSPEEKQAYI QLAKDDrirY d
seq3    1   WRMDSnqknp DSNNPKAAYn ---KGDanap k
- If anchor points are uploaded or if they are pasted into the window on the submission page, they need to be specified in the above format.
- Alternatively, they can be entered into the boxes of the pre-defined table. In this case, no scores are necessary, and anchors are prioritized in the order they are entered.
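Whether a given alignment respects an anchor point can be checked by mapping residue positions to alignment columns. The sketch below is an illustration only, with made-up function names; it uses the two toy alignments from this page with the blanks between blocks of ten columns removed, and again assumes 1-based sequence numbers and positions.

```python
def residue_columns(row):
    """Map 1-based residue positions of one aligned row to 0-based columns."""
    columns = {}
    position = 0
    for column, char in enumerate(row):
        if char != "-":          # gaps do not consume a residue
            position += 1
            columns[position] = column
    return columns


def anchor_respected(rows, seq_a, seq_b, start_a, start_b, length):
    """True if every anchored residue pair ends up in the same column."""
    cols_a = residue_columns(rows[seq_a - 1])
    cols_b = residue_columns(rows[seq_b - 1])
    return all(
        cols_a[start_a + k] == cols_b[start_b + k] for k in range(length)
    )


anchored = [
    "WKk-------NADAPKRAMTSFMKAa---Y-",
    "WNLDT-----NSPEEKQAYIQLAKDDrirYd",
    "WRMDSnqknpDSNNPKAAYn---KGDanapk",
]
default = [
    "WKKNAD-----APKRamtsfmKAAY------------",
    "WNLDTN-----SPEE------KQAYiqlaKDDriryd",
    "WRMDSNqknpdSNNP------KAAYn---KGDanapk",
]

print(anchor_respected(anchored, 1, 2, 4, 6, 6))    # anchor point 1
print(anchor_respected(anchored, 2, 3, 6, 11, 8))   # anchor point 2
print(anchor_respected(default, 1, 2, 4, 6, 6))     # non-anchored alignment
```

Both anchor points hold in the anchored alignment, whereas the non-anchored default alignment does not satisfy anchor point 1.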
Program Output:
Our web server creates different output files containing
- A full alignment of the input sequences in DIALIGN format.
- The same alignment in FASTA format.
This is DIALIGN alignment format:
dog_il4    20565   AGAGCCTGGT CTGGAGCAAA GTTGATGTCT ACCTGTGCTT TCTTTAGCAG
hum_il4    21730   AGAGCCTGGT CTGGGGCAAA GTTGATGTCT ACCTGTGCTT TCTTTAGCAG
mus_il4    31390   ---------- ---GGGCCGA GCTGATGTCT ACCTGTGCTT TCTTTAGCAG
                   1111111111 1112222222 2222222222 2222222222 2222222222

dog_il4    20615   ATCAGATAta gGAG---TAC ACCAGTCGGG CATGAGCCTC TCCAGCTCTA
hum_il4    21780   ATCAGATAta gGAG---CAC ACCAGCCGGG CATGAGCCTC TCCAGCTCTA
mus_il4    31427   ATCAGATAta gccgcagCAC AGCAGTCGGG CATGAGCCTC TCCAACTCTA
                   2222222000 0222000333 3334444444 4444444444 4444444444

dog_il4    20662   AGGTGATGAT GACCAAGGCC AGTGTGGAGC CCTTGAacTG CAGCAGCTGG
hum_il4    21827   AGGTGATGAT GACCAAGGCC AGTGTGGAGC CCTTGAACTG CAGCAGCTGG
mus_il4    31477   AGGTGATGAT GACTAAAGCA AATGTGGAGC CCTTGAACTG CAGCAGCTGG
                   4444444444 4433666666 6666666666 6666662299 9999999999
- Names of the aligned sequences are shown on the left hand side of the alignment.
- Numbers on the left hand side of the alignment denote the position of the first residue in a line within the respective sequence.
- As explained in our papers, DIALIGN composes alignments from so-called fragments, i.e. from un-gapped local pair-wise alignments. Capital letters denote aligned residues, i.e. residues involved in at least one of the fragments the alignment consists of. Lower-case letters denote residues not belonging to any of these selected fragments. They are not considered to be aligned by DIALIGN. Thus, if a lower-case letter is standing in the same column with other letters, this is pure chance; these residues are not considered to be homologous.
- Numbers below the alignment roughly reflect the degree of local similarity among the sequences. More precisely, they represent the sum of the "weights" of the fragments connecting residues at the respective position. The numbers are normalized such that every position gets a value between 0 and 9. Thus, these numbers reflect the relative degree of similarity within an alignment, since in every alignment the region of maximum similarity gets a score of 9.
- For multiple sequence files, a heuristic tree is constructed based on the pair-wise similarities among the sequences.
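When post-processing an alignment in DIALIGN format, the upper-case/lower-case convention described above has to be taken into account: only upper-case residues in a column count as aligned. The following sketch illustrates this for a 20-column excerpt of the second block of the example (blanks between blocks of ten columns removed); the function name `aligned_in_column` is made up.

```python
def aligned_in_column(rows, column):
    """Return the names of the sequences whose residue in the given
    0-based column is upper case, i.e. considered aligned by DIALIGN."""
    return [name for name, row in rows.items() if row[column].isupper()]


rows = {
    "dog_il4": "ATCAGATAtagGAG---TAC",
    "hum_il4": "ATCAGATAtagGAG---CAC",
    "mus_il4": "ATCAGATAtagccgcagCAC",
}

print(aligned_in_column(rows, 8))    # lower case in all three rows
print(aligned_in_column(rows, 11))   # mouse residue is lower case here
print(aligned_in_column(rows, 17))   # upper case in all three rows
```

In column 11, for instance, the mouse residue stands in the same column as the dog and human residues but is not considered homologous to them.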
This is FASTA alignment format:
>HTL2
ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA
QKGELLALICGLRAAKPWPSLNIFLDSKYLIKYLHslaigaflgtsah--
-------QT---LQAALPPLLQGKTIYLHHVRSHT------NLPDPISTF
NEYTDSLILApl--------------------------------------
----------
>MMLV
pdadhtwYTDGSSLLQEGQRKAGAAVTTETeviwaKALDAG---T---SA
QRAELIALTQALKMAEgkk-LNVYTDSRYAFATAHIHGEIYRRRGLLTSE
GKEIKNKDE---ILALLKALFLPKRLSIIHCPGHQ------KGHSAEARG
NRMADQAARKAAITETPDTStll---------------------------
----------
>HEPB
rpglcQVFADAT------PTGWGLVMGHQRMR---GTFSAPLPIHt----
--AELLAACFArsrsgan---IIGTDN-----------------------
-------------SVVLSR--------------KYTSFPWLLGCAANWI-
LRGTSFVYVPSALNPADDPSrgrlglsrpllrlpfrpttgrtslyadsps
vpshlpdrvh
>ECOL
mlkqvEIFTDGSCLGNPGPGGYGAILRYRGRE---KTFSAGytrT---TN
NRMELMAAIVALEALKEHCEVILSTDSQYVRQGITQWIHNWKKRGWKTAD
KKPVKNVDlwqrLDAALGQ--------------HQIKWEWVKGHAGHPE-
NERCDELARAAAMNPTledtgyqvev------------------------
----------
This is PHYLIP tree format:
((HTL2:0.111024, (MMLV:0.078471, ECOL:0.078471):0.032554):0.121218, HEPB:0.232242);
Trees can be visualized using the drawtree program contained in Joe Felsenstein's PHYLIP software package.
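For a quick look at the tree without PHYLIP, the parenthesized tree string can also be read directly. The sketch below is a deliberately small parser written for this illustration, not a general Newick reader: it assumes labels without blanks, quotes, or comments, and the function names are made up. It computes the distance from the root to each leaf by summing branch lengths.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    text = "".join(text.split()).rstrip(";")   # drop whitespace and final ';'
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                            # skip "("
            while True:
                children.append(node())
                if text[pos] == ",":
                    pos += 1                    # another sibling follows
                elif text[pos] == ")":
                    pos += 1                    # end of this group
                    break
                else:
                    raise ValueError(f"unexpected character at {pos}")
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        label, _, length = text[start:pos].partition(":")
        return (label, float(length) if length else 0.0, children)

    return node()


def leaf_depths(node, depth=0.0):
    """Return a dict mapping each leaf name to its distance from the root."""
    name, length, children = node
    depth += length
    if not children:
        return {name: depth}
    depths = {}
    for child in children:
        depths.update(leaf_depths(child, depth))
    return depths


tree = parse_newick(
    "((HTL2:0.111024, (MMLV:0.078471, ECOL:0.078471):0.032554):0.121218, "
    "HEPB:0.232242);"
)
depths = leaf_depths(tree)
for leaf, distance in depths.items():
    print(leaf, round(distance, 4))
```

For the example tree above, all four leaves lie at a distance of about 0.2322 from the root.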