Alignment of genomic sequences using CHAOS and DIALIGN:
DIALIGN is a widely used software program for pair-wise and multiple alignment of DNA and protein sequences. It assembles global alignments from gap-free local pair-wise alignments, so-called fragments. A number of recent studies have used DIALIGN to discover functional sites in genomic sequences, e.g.:
- B. Göttgens, L. Barton, J. Gilbert, A. Bench, M. Sanchez, S. Bahn,
S. Mistry, D. Grafham, A. McMurray, M. Vaudin, E. Amaya, D. Bentley, and
A. Green (2000).
Analysis of vertebrate SCL loci identifies conserved enhancers.
Nature Biotechnology, 18:181-186.
- B. Göttgens, L. Barton, M. Chapman, A. Sinclair, B. Knudsen, D. Grafham,
J. Gilbert, J. Rogers, D. Bentley, and A. Green (2002).
Transcriptional regulation of the stem cell leukemia gene (SCL): comparative analysis of five vertebrate SCL loci.
Genome Res., 12:749-759.
- J. Fitch, S. Gardner, T. Kuczmarski, S. Kurtz, R. Myers, L. Ott, T. Slezak,
E. Vitalis, A. Zemla, and P. McCready (2002).
Rapid Development of Nucleic Acid Diagnostics.
Proceedings of the IEEE, 90: 1708-1721.
On our web server, we use the program CHAOS to identify anchor points for DIALIGN. CHAOS was developed by Mike Brudno at Stanford University. In our experience, anchor points created by CHAOS speed up DIALIGN by one to two orders of magnitude without affecting the quality of the output alignments. Details are described in:
- M. Brudno, M. Chapman, B. Göttgens, S. Batzoglou, B. Morgenstern (2003)
Fast and sensitive multiple alignment of large genomic sequences
BMC Bioinformatics 4, 66.
- M. Brudno and B. Morgenstern (2002)
Fast and sensitive alignment of large genomic sequences.
Proceedings IEEE Computer Society Bioinformatics Conference,
Stanford University, pp. 138-147.
- B. Morgenstern, O. Rinner, S. Abdeddaïm, D. Haase, K. Mayer,
A. Dress, H.-W. Mewes (2002)
Exon Discovery by Genomic Sequence Alignment.
Bioinformatics 18, 777-787.
Input sequence file:
CHAOS/DIALIGN requires a single ASCII file containing the sequences to be aligned in FASTA format:
>HTL2
LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
SLNIFLDSKYLIKYLHSLAIGAFLGTSAHQTLQAALPPLLQGKTIYLHHVRSHTNLPDPI
STFNEYTDSLILAPL
>MMLV
PDADHTWYTDGSSLLQEGQRKAGAAVTTETEVIWAKALDAGTSAQRAELIALTQALKMAE
GKKLNVYTDSRYAFATAHIHGEIYRRRGLLTSEGKEIKNKDEILALLKALFLPKRLSIIH
CPGHQKGHSAEARGNRMADQAARKAAITETPDTSTLL
>HEPB
RPGLCQVFADATPTGWGLVMGHQRMRGTFSAPLPIHTAELLAACFARSRSGANIIGTDNS
VVLSRKYTSFPWLLGCAANWILRGTSFVYVPSALNPADDPSRGRLGLSRPLLRLPFRPTT
GRTSLYADSPSVPSHLPDRVH
>ECOL
MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNRMELMAAIVALEALK
EHCEVILSTDSQYVRQGITQWIHNWKKRGWKTADKKPVKNVDLWQRLDAALGQHQIKWEW
VKGHAGHPENERCDELARAAAMNPTLEDTGYQVEV
For each sequence, the first line starts with ">" and contains the name of the sequence.
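As an illustration of the input format described above, the following Python sketch reads FASTA text into a dictionary of names and sequences. It is not part of CHAOS or DIALIGN; the function name read_fasta is hypothetical, and taking the first word after ">" as the sequence name is an assumption made for this example.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}; input order is preserved."""
    parts: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line. Assumption: the name is the first word after ">".
            words = line[1:].split()
            name = words[0] if words else ""
            if name in parts:
                raise ValueError(f"duplicate sequence name: {name!r}")
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


# Shortened excerpt of the example input (first line of two sequences only).
example = """>HTL2
LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
>ECOL
MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNRMELMAAIVALEALK
"""
records = read_fasta(example.splitlines())
for seq_name, seq in records.items():
    print(seq_name, len(seq))
```

A check like this can catch a missing ">" header or a duplicated name before a file is submitted.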
Program Output:
Our web server creates several output files containing:
- A full alignment of the input sequences in DIALIGN format.
- The same alignment in FASTA format.
- A list of the fragments, i.e. the pair-wise local gap-free alignments that DIALIGN uses to create the output alignment.
- A list of the anchor points created by CHAOS.
This is DIALIGN alignment format:
dog_il4    20565  AGAGCCTGGT CTGGAGCAAA GTTGATGTCT ACCTGTGCTT TCTTTAGCAG
hum_il4    21730  AGAGCCTGGT CTGGGGCAAA GTTGATGTCT ACCTGTGCTT TCTTTAGCAG
mus_il4    31390  ---------- ---GGGCCGA GCTGATGTCT ACCTGTGCTT TCTTTAGCAG
                  1111111111 1112222222 2222222222 2222222222 2222222222

dog_il4    20615  ATCAGATAta gGAG---TAC ACCAGTCGGG CATGAGCCTC TCCAGCTCTA
hum_il4    21780  ATCAGATAta gGAG---CAC ACCAGCCGGG CATGAGCCTC TCCAGCTCTA
mus_il4    31427  ATCAGATAta gccgcagCAC AGCAGTCGGG CATGAGCCTC TCCAACTCTA
                  2222222000 0222000333 3334444444 4444444444 4444444444

dog_il4    20662  AGGTGATGAT GACCAAGGCC AGTGTGGAGC CCTTGAacTG CAGCAGCTGG
hum_il4    21827  AGGTGATGAT GACCAAGGCC AGTGTGGAGC CCTTGAACTG CAGCAGCTGG
mus_il4    31477  AGGTGATGAT GACTAAAGCA AATGTGGAGC CCTTGAACTG CAGCAGCTGG
                  4444444444 4433666666 6666666666 6666662299 9999999999
- Names of the aligned sequences are shown on the left.
- Numbers on the left hand side of the alignment denote the position of the first residue in a line within the respective sequence.
- Capital letters denote aligned residues, i.e. residues involved in at least one of the `fragments' (= aligned segment pairs) the alignment consists of. Lower-case letters denote residues not belonging to any of these selected `fragments'. They are not considered to be aligned by DIALIGN. Thus, if a lower-case letter stands in the same column as other letters, this is pure chance; these residues are not considered to be homologous.
- Numbers below the alignment roughly reflect the degree of local similarity among the sequences. More precisely, they represent the sum of the `weights' of the fragments connecting residues at the respective position. The numbers are normalized such that every position gets a value between 0 and 9. They therefore reflect the relative degree of similarity within an alignment, since in every alignment the region of maximum similarity gets a score of 9.
- For multiple sequence files, a heuristic tree is constructed based on the pair-wise similarities among the sequences.
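The normalization of the similarity numbers can be illustrated with a small Python sketch. It assumes that a list of per-column weight sums is already available and that scaling is linear with truncation; the text above only states that values lie between 0 and 9 and that the maximum gets 9, so the exact rounding used by DIALIGN may differ. The function name similarity_digits is hypothetical.

```python
from typing import List


def similarity_digits(column_weights: List[float]) -> str:
    """Scale non-negative per-column weight sums to one digit (0-9) each."""
    peak = max(column_weights, default=0.0)
    if peak <= 0:
        # No fragment covers any column: every position gets 0.
        return "0" * len(column_weights)
    # Assumption: linear scaling with truncation; the peak column maps to 9.
    return "".join(str(int(9 * (w / peak))) for w in column_weights)


print(similarity_digits([0.0, 5.0, 10.0, 2.5]))
```

Because the scale is relative to the maximum within one alignment, digits from two different alignments are not directly comparable.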
This is FASTA alignment format:
>HTL2
ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA
QKGELLALICGLRAAKPWPSLNIFLDSKYLIKYLHslaigaflgtsah--
-------QT---LQAALPPLLQGKTIYLHHVRSHT------NLPDPISTF
NEYTDSLILApl--------------------------------------
----------
>MMLV
pdadhtwYTDGSSLLQEGQRKAGAAVTTETeviwaKALDAG---T---SA
QRAELIALTQALKMAEgkk-LNVYTDSRYAFATAHIHGEIYRRRGLLTSE
GKEIKNKDE---ILALLKALFLPKRLSIIHCPGHQ------KGHSAEARG
NRMADQAARKAAITETPDTStll---------------------------
----------
>HEPB
rpglcQVFADAT------PTGWGLVMGHQRMR---GTFSAPLPIHt----
--AELLAACFArsrsgan---IIGTDN-----------------------
-------------SVVLSR--------------KYTSFPWLLGCAANWI-
LRGTSFVYVPSALNPADDPSrgrlglsrpllrlpfrpttgrtslyadsps
vpshlpdrvh
>ECOL
mlkqvEIFTDGSCLGNPGPGGYGAILRYRGRE---KTFSAGytrT---TN
NRMELMAAIVALEALKEHCEVILSTDSQYVRQGITQWIHNWKKRGWKTAD
KKPVKNVDlwqrLDAALGQ--------------HQIKWEWVKGHAGHPE-
NERCDELARAAAMNPTledtgyqvev------------------------
----------
Note that, as in the DIALIGN format, only UPPER-CASE letters are considered to be aligned; lower-case letters are NOT aligned. Please take this into account if you run other programs on the DIALIGN output alignment.
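One possible way to take the upper-case/lower-case convention into account before passing the alignment to a column-based program is to replace lower-case residues with gap characters. The Python sketch below is only an illustration under that assumption; the function names are hypothetical. Note that masking discards the unaligned residues, so the masked rows no longer contain the full input sequences.

```python
from typing import Dict


def mask_unaligned(row: str, gap: str = "-") -> str:
    """Replace lower-case (unaligned) residues in one alignment row with gaps."""
    return "".join(gap if ch.islower() else ch for ch in row)


def mask_alignment(rows: Dict[str, str]) -> Dict[str, str]:
    """Apply mask_unaligned to every row of a {name: row} alignment."""
    return {name: mask_unaligned(row) for name, row in rows.items()}


# First 50 columns of two rows from the example FASTA alignment.
rows = {
    "HTL2": "ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA",
    "MMLV": "pdadhtwYTDGSSLLQEGQRKAGAAVTTETeviwaKALDAG---T---SA",
}
for name, row in mask_alignment(rows).items():
    print(f"{name:5s} {row}")
```

Row lengths are unchanged by the masking, so column positions still correspond to the original alignment.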
Fragment file returned by DIALIGN:
The fragments (aligned gap-free segment pairs) used by DIALIGN to assemble the alignment are returned in the following format:
1) seq: 2 3 beg: 185955 178118 len: 90 wgt: 42.00 olw: 107.68 it: 1 cons
2) seq: 1 2 beg: 201612 185943 len: 90 wgt: 39.98 olw: 105.66 it: 1 cons
3) seq: 1 2 beg: 20700 21865 len: 90 wgt: 48.94 olw: 104.80 it: 1 cons
4) seq: 1 2 beg: 109548 109795 len: 84 wgt: 37.82 olw: 104.30 it: 1 cons
5) seq: 2 3 beg: 39483 53649 len: 90 wgt: 39.09 olw: 104.27 it: 1 cons
For each fragment, the file specifies the involved sequences (seq:), the starting positions of the fragment in these sequences (beg:), the fragment length (len:), the weight score of the fragment (wgt:), its overlap weight (olw:), the iteration step of the program in which the fragment was created (it:), and information about the consistency of the fragment (cons or incons).
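The fields described above can be read with a regular expression, as in the following Python sketch. The Fragment type and the parse_fragment function are hypothetical names for this example, and the pattern assumes that every line has exactly the layout shown in the sample lines.

```python
import re
from typing import NamedTuple, Tuple


class Fragment(NamedTuple):
    index: int
    seqs: Tuple[int, int]      # seq:  the two sequences involved
    begins: Tuple[int, int]    # beg:  start position in each sequence
    length: int                # len:
    weight: float              # wgt:
    overlap_weight: float      # olw:
    iteration: int             # it:
    consistent: bool           # True for "cons", False for "incons"


# Assumption: one fragment per line, fields in the order shown in the sample.
FRAGMENT_RE = re.compile(
    r"(\d+)\)\s+seq:\s+(\d+)\s+(\d+)\s+beg:\s+(\d+)\s+(\d+)\s+len:\s+(\d+)"
    r"\s+wgt:\s+([\d.]+)\s+olw:\s+([\d.]+)\s+it:\s+(\d+)\s+(cons|incons)"
)


def parse_fragment(line: str) -> Fragment:
    """Parse one line of the fragment file into a Fragment record."""
    match = FRAGMENT_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a fragment line: {line!r}")
    idx, s1, s2, b1, b2, length, wgt, olw, it, flag = match.groups()
    return Fragment(int(idx), (int(s1), int(s2)), (int(b1), int(b2)),
                    int(length), float(wgt), float(olw), int(it),
                    flag == "cons")


frag = parse_fragment(
    "1) seq: 2 3 beg: 185955 178118 len: 90 wgt: 42.00 olw: 107.68 it: 1 cons"
)
print(frag.seqs, frag.begins, frag.length, frag.consistent)
```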
Anchor points produced by CHAOS:
Anchor points produced by CHAOS are printed in the following format:
1 2 178 265 1 4165.000000
1 2 238 325 1 4165.000000
1 2 75 162 1 6237.000000
1 2 157 244 1 6237.000000
1 3 302 289 1 2619.000000
1 3 346 333 1 2619.000000
1 3 174 146 1 6309.000000
The first two entries are the sequences involved; entries 3 and 4 are the starting points in the respective sequences. Entry 5 is the length of an anchored segment pair, and the last entry is a quality score calculated by CHAOS. This score is used to prioritize anchor points when contradictory anchors are specified. In that case, anchors with high scores are accepted first; anchors with lower scores are used only if they are consistent with the previously accepted higher-scoring anchors.
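The score-based prioritization described above can be sketched in Python as a greedy selection. This is a simplified illustration, not the procedure implemented in DIALIGN: the consistency test here only compares anchors on the same pair of sequences and requires that they appear in the same order, without overlap, in both sequences. Anchors on different sequence pairs are never rejected by this sketch. All names are hypothetical.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    seq1: int
    seq2: int
    beg1: int
    beg2: int
    length: int
    score: float


def parse_anchor(line: str) -> Anchor:
    """Parse one whitespace-separated anchor line (six entries)."""
    s1, s2, b1, b2, length, score = line.split()
    return Anchor(int(s1), int(s2), int(b1), int(b2), int(length), float(score))


def compatible(a: Anchor, b: Anchor) -> bool:
    """Simplified pairwise check (assumption, see the text above)."""
    if (a.seq1, a.seq2) != (b.seq1, b.seq2):
        return True  # different sequence pairs are not compared here
    if a.beg1 > b.beg1:
        a, b = b, a
    # a comes first in sequence 1; it must also end before b in both sequences.
    return a.beg1 + a.length <= b.beg1 and a.beg2 + a.length <= b.beg2


def select_anchors(anchors: List[Anchor]) -> List[Anchor]:
    """Accept anchors in order of decreasing score, skipping conflicting ones."""
    accepted: List[Anchor] = []
    for cand in sorted(anchors, key=lambda x: x.score, reverse=True):
        if all(compatible(cand, prev) for prev in accepted):
            accepted.append(cand)
    return accepted


# Toy input: the second anchor crosses the first one and has a lower score.
toy = [parse_anchor(text) for text in (
    "1 2 10 20 5 100.0",
    "1 2 30 12 5 50.0",
    "1 2 40 50 5 10.0",
)]
for anchor in select_anchors(toy):
    print(anchor)
```

In the toy input, the anchor starting at positions 30 and 12 is dropped because it would cross the higher-scoring anchor at positions 10 and 20.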
This is PHYLIP tree format:
((HTL2:0.111024, (MMLV:0.078471, ECOL:0.078471):0.032554):0.121218, HEPB:0.232242);
Trees can be visualized using the drawtree program contained in the PHYLIP software package.
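For a quick look at the tree without PHYLIP, the bracket notation can also be read directly. The Python sketch below computes the root-to-leaf path length for every sequence. It is an illustration only and assumes a tree string with branch lengths, no internal node labels, and no quoted names; the function name leaf_depths is hypothetical.

```python
from typing import Dict, List


def leaf_depths(newick: str) -> Dict[str, float]:
    """Return root-to-leaf path lengths for a bracketed tree string."""
    text = "".join(newick.split()).rstrip(";")
    depths: Dict[str, float] = {}
    pos = 0

    def parse_node() -> List[str]:
        # Returns the leaf names below this node. Each branch length is
        # added to all leaves below it once the node has been read.
        nonlocal pos
        leaves: List[str] = []
        if text[pos] == "(":
            pos += 1  # skip "("
            while True:
                leaves.extend(parse_node())
                if text[pos] == ",":
                    pos += 1
                    continue
                pos += 1  # skip ")"
                break
        else:
            start = pos
            while pos < len(text) and text[pos] not in ":,()":
                pos += 1
            name = text[start:pos]
            depths[name] = 0.0
            leaves.append(name)
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            branch = float(text[start:pos])
            for leaf in leaves:
                depths[leaf] += branch
        return leaves

    parse_node()
    return depths


tree = ("((HTL2:0.111024, (MMLV:0.078471, ECOL:0.078471):0.032554)"
        ":0.121218, HEPB:0.232242);")
for leaf_name, depth in leaf_depths(tree).items():
    print(f"{leaf_name:5s} {depth:.6f}")
```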