CHAOS + DIALIGN: manual

Alignment of genomic sequences using CHAOS and DIALIGN:

DIALIGN is a widely used software program for pair-wise and multiple alignment of DNA and protein sequences. It assembles global alignments from gap-free local pair-wise alignments, so-called fragments. A number of recent studies have used DIALIGN to discover functional sites in genomic sequences, e.g.

B. Göttgens, L. Barton, J. Gilbert, A. Bench, M. Sanchez, S. Bahn, S. Mistry, D. Grafham, A. McMurray, M. Vaudin, E. Amaya, D. Bentley, and A. Green (2000).
Analysis of vertebrate SCL loci identifies conserved enhancers.
Nature Biotechnology, 18:181-186.
B. Göttgens, L. Barton, M. Chapman, A. Sinclair, B. Knudsen, D. Grafham, J. Gilbert, J. Rogers, D. Bentley, and A. Green (2002).
Transcriptional regulation of the stem cell leukemia gene (SCL): comparative analysis of five vertebrate SCL loci.
Genome Res., 12:749-759.
J. Fitch, S. Gardner, T. Kuczmarski, S. Kurtz, R. Myers, L. Ott, T. Slezak, E. Vitalis, A. Zemla, and P. McCready (2002).
Rapid Development of Nucleic Acid Diagnostics.
Proceedings of the IEEE, 90: 1708-1721.

While DIALIGN alignments are generally of high quality, the original program is too slow for large-scale sequence comparison. One way of speeding up DIALIGN without compromising alignment quality is to use an anchored-alignment procedure. Here, a fast local alignment search tool identifies regions of high sequence similarity among the input sequences. These similarities are then used as anchor points to reduce the search space and running time of DIALIGN.

On our web server, we use the program CHAOS to identify such anchor points. CHAOS was developed by Mike Brudno at Stanford University. In our experience, anchor points created by CHAOS speed up DIALIGN by one to two orders of magnitude without affecting the quality of the output alignments. Details are described in

M. Brudno, M. Chapman, B. Göttgens, S. Batzoglou, B. Morgenstern (2003)
Fast and sensitive multiple alignment of large genomic sequences
BMC Bioinformatics 4, 66.
M. Brudno and B. Morgenstern (2002)
Fast and sensitive alignment of large genomic sequences.
Proceedings IEEE Computer Society Bioinformatics Conference,
Stanford University, pp. 138-147.

In a first step, CHAOS is applied to create a list of anchor points. Then DIALIGN is run on the input sequences using some new options for genomic alignment that are described in

B. Morgenstern, O. Rinner, S. Abdeddaïm, D. Haase, K. Mayer, A. Dress, H.-W. Mewes (2002)
Exon Discovery by Genomic Sequence Alignment.
Bioinformatics 18, 777-787.

Input sequence file:

CHAOS/DIALIGN requires a single ASCII file containing the sequences to be aligned in FASTA format:

        >HTL2  
        LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
        SLNIFLDSKYLIKYLHSLAIGAFLGTSAHQTLQAALPPLLQGKTIYLHHVRSHTNLPDPI
        STFNEYTDSLILAPL
        >MMLV   
        PDADHTWYTDGSSLLQEGQRKAGAAVTTETEVIWAKALDAGTSAQRAELIALTQALKMAE
        GKKLNVYTDSRYAFATAHIHGEIYRRRGLLTSEGKEIKNKDEILALLKALFLPKRLSIIH
        CPGHQKGHSAEARGNRMADQAARKAAITETPDTSTLL
        >HEPB 
        RPGLCQVFADATPTGWGLVMGHQRMRGTFSAPLPIHTAELLAACFARSRSGANIIGTDNS
        VVLSRKYTSFPWLLGCAANWILRGTSFVYVPSALNPADDPSRGRLGLSRPLLRLPFRPTT
        GRTSLYADSPSVPSHLPDRVH
        >ECOL   
        MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNRMELMAAIVALEALK
        EHCEVILSTDSQYVRQGITQWIHNWKKRGWKTADKKPVKNVDLWQRLDAALGQHQIKWEW
        VKGHAGHPENERCDELARAAAMNPTLEDTGYQVEV

For each sequence, the first line starts with ">" and contains the name of the sequence.
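
The layout described above can be checked before submission with a few lines of Python. This is only a minimal sketch and not part of CHAOS or DIALIGN; the function name read_fasta is an assumption made for illustration. It treats every line starting with ">" as a new sequence name and concatenates the following lines into one sequence.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) pairs.

    Hypothetical helper for illustration; not part of CHAOS or DIALIGN.
    """
    records = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Sequence data may be spread over several lines.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">HTL2  \n    LDTAPCLFSDGS\n    STFNEYTDSLILAPL\n>MMLV\n    PDADHTWYTDGS\n"
    for seq_name, seq in read_fasta(example):
        print(seq_name, len(seq))
```

Text that appears before the first ">" line is ignored by this sketch; a stricter check might report it as an error instead.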

Program Output:

Our web server creates different output files containing:

A full alignment of the input sequences in DIALIGN format.
The same alignment in FASTA format.
A list of the fragments, i.e. the pair-wise local gap-free alignments that DIALIGN uses to create the output alignment.
A list of the anchor points created by CHAOS.

This is DIALIGN alignment format:

  
                    

dog_il4        20565   AGAGCCTGGT CTGGAGCAAA GTTGATGTCT ACCTGTGCTT TCTTTAGCAG
hum_il4        21730   AGAGCCTGGT CTGGGGCAAA GTTGATGTCT ACCTGTGCTT TCTTTAGCAG
mus_il4        31390   ---------- ---GGGCCGA GCTGATGTCT ACCTGTGCTT TCTTTAGCAG

                       1111111111 1112222222 2222222222 2222222222 2222222222


dog_il4        20615   ATCAGATAta gGAG---TAC ACCAGTCGGG CATGAGCCTC TCCAGCTCTA
hum_il4        21780   ATCAGATAta gGAG---CAC ACCAGCCGGG CATGAGCCTC TCCAGCTCTA
mus_il4        31427   ATCAGATAta gccgcagCAC AGCAGTCGGG CATGAGCCTC TCCAACTCTA

                       2222222000 0222000333 3334444444 4444444444 4444444444


dog_il4        20662   AGGTGATGAT GACCAAGGCC AGTGTGGAGC CCTTGAacTG CAGCAGCTGG
hum_il4        21827   AGGTGATGAT GACCAAGGCC AGTGTGGAGC CCTTGAACTG CAGCAGCTGG
mus_il4        31477   AGGTGATGAT GACTAAAGCA AATGTGGAGC CCTTGAACTG CAGCAGCTGG

                       4444444444 4433666666 6666666666 6666662299 9999999999

Names of the aligned sequences are shown on the left.

Numbers on the left hand side of the alignment denote the position of the first residue in a line within the respective sequence.

Capital letters denote aligned residues, i.e. residues involved in at least one of the 'fragments' (= aligned segment pairs) the alignment consists of. Lower-case letters denote residues that do not belong to any of these selected 'fragments'. They are not considered to be aligned by DIALIGN. Thus, if a lower-case letter stands in the same column as other letters, this is pure chance; these residues are not considered to be homologous.

Numbers below the alignment roughly reflect the degree of local similarity among the sequences. More precisely, they represent the sum of the 'weights' of the fragments connecting residues at the respective position. The numbers are normalized such that every position gets a value between 0 and 9. Thus, these numbers reflect the relative degree of similarity within an alignment, since in every alignment the region of maximum similarity gets a score of 9.
For multiple sequence files, a heuristic tree is constructed based on the pair-wise similarities among the sequences.
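
The 0 to 9 similarity digits described above can be illustrated with a small Python sketch. The exact normalization used by DIALIGN is not specified in this manual, so the code assumes a simple linear scaling with rounding down; the function name normalize_scores and the example weight sums are hypothetical.

```python
def normalize_scores(weight_sums):
    """Scale per-column weight sums so that the maximum maps to 9.

    Assumption: linear scaling, rounded down. DIALIGN's own rounding
    rule may differ; this only illustrates the idea of relative scores.
    """
    top = max(weight_sums, default=0.0)
    if top <= 0:
        # No fragment covers any column: every position scores 0.
        return [0] * len(weight_sums)
    return [int(9 * w / top) for w in weight_sums]


if __name__ == "__main__":
    # Hypothetical weight sums for five alignment columns.
    print(normalize_scores([0.0, 12.5, 40.0, 80.0, 80.0]))
```

Because the scaling is relative to the maximum within one alignment, digits from two different alignments are not directly comparable.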

This is FASTA alignment format:

>HTL2
ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA
QKGELLALICGLRAAKPWPSLNIFLDSKYLIKYLHslaigaflgtsah--
-------QT---LQAALPPLLQGKTIYLHHVRSHT------NLPDPISTF
NEYTDSLILApl--------------------------------------
----------
>MMLV
pdadhtwYTDGSSLLQEGQRKAGAAVTTETeviwaKALDAG---T---SA
QRAELIALTQALKMAEgkk-LNVYTDSRYAFATAHIHGEIYRRRGLLTSE
GKEIKNKDE---ILALLKALFLPKRLSIIHCPGHQ------KGHSAEARG
NRMADQAARKAAITETPDTStll---------------------------
----------
>HEPB
rpglcQVFADAT------PTGWGLVMGHQRMR---GTFSAPLPIHt----
--AELLAACFArsrsgan---IIGTDN-----------------------
-------------SVVLSR--------------KYTSFPWLLGCAANWI-
LRGTSFVYVPSALNPADDPSrgrlglsrpllrlpfrpttgrtslyadsps
vpshlpdrvh
>ECOL
mlkqvEIFTDGSCLGNPGPGGYGAILRYRGRE---KTFSAGytrT---TN
NRMELMAAIVALEALKEHCEVILSTDSQYVRQGITQWIHNWKKRGWKTAD
KKPVKNVDlwqrLDAALGQ--------------HQIKWEWVKGHAGHPE-
NERCDELARAAAMNPTledtgyqvev------------------------
----------

Note that, as in the DIALIGN format, only UPPER-CASE letters are considered to be aligned, lower-case letters are NOT aligned. Please take this into account if you run other programs on the DIALIGN output alignment.
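
One way to take the upper-case/lower-case convention into account before passing the FASTA alignment to other programs is to replace the unaligned residues by gap characters. The following Python sketch shows this; the function name mask_unaligned is an assumption, and whether masking is appropriate depends on the downstream program.

```python
def mask_unaligned(aligned_seq, gap="-"):
    """Replace lower-case (unaligned) residues with the gap character.

    Upper-case residues and existing gap characters are kept unchanged,
    so the alignment columns stay in place.
    """
    return "".join(gap if ch.islower() else ch for ch in aligned_seq)


if __name__ == "__main__":
    row = "ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA"
    print(mask_unaligned(row))
```

The masked row has the same length as the original one, so all sequences of the alignment can be processed independently.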

Fragment file returned by DIALIGN:

The fragments (aligned gap-free segment pairs) used by DIALIGN to assemble the alignment are returned in the following format:

    1) seq:   2   3  beg:  185955  178118 len:  90 wgt:  42.00 olw: 107.68 it: 1 cons
    2) seq:   1   2  beg:  201612  185943 len:  90 wgt:  39.98 olw: 105.66 it: 1 cons
    3) seq:   1   2  beg:   20700   21865 len:  90 wgt:  48.94 olw: 104.80 it: 1 cons
    4) seq:   1   2  beg:  109548  109795 len:  84 wgt:  37.82 olw: 104.30 it: 1 cons
    5) seq:   2   3  beg:   39483   53649 len:  90 wgt:  39.09 olw: 104.27 it: 1 cons

For each fragment, the file specifies the involved sequences (seq:), the starting positions of the fragment in these sequences (beg:), the fragment length (len:), the weight score of the fragment (wgt:), its overlap weight (olw:), the iteration step during the program in which the fragment has been created (it:), and information about consistency of the fragment (cons or incons).
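
The fields listed above can be extracted with a regular expression. The Python sketch below is based only on the example lines shown in this manual; the names Fragment and parse_fragment are hypothetical, and real fragment files may contain additional header lines, which this sketch simply skips by returning None.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Fragment(NamedTuple):
    index: int
    seqs: Tuple[int, int]      # seq: the two sequences involved
    begins: Tuple[int, int]    # beg: start position in each sequence
    length: int                # len:
    weight: float              # wgt:
    overlap_weight: float      # olw:
    iteration: int             # it:
    consistent: bool           # True for "cons", False for "incons"


# Pattern derived from the example lines above (an assumption, not a spec).
FRAGMENT_RE = re.compile(
    r"^\s*(\d+)\)\s+seq:\s+(\d+)\s+(\d+)\s+beg:\s+(\d+)\s+(\d+)\s+"
    r"len:\s+(\d+)\s+wgt:\s+([\d.]+)\s+olw:\s+([\d.]+)\s+"
    r"it:\s+(\d+)\s+(cons|incons)\s*$"
)


def parse_fragment(line: str) -> Optional[Fragment]:
    """Return a Fragment for a matching line, or None otherwise."""
    m = FRAGMENT_RE.match(line)
    if m is None:
        return None
    return Fragment(
        index=int(m.group(1)),
        seqs=(int(m.group(2)), int(m.group(3))),
        begins=(int(m.group(4)), int(m.group(5))),
        length=int(m.group(6)),
        weight=float(m.group(7)),
        overlap_weight=float(m.group(8)),
        iteration=int(m.group(9)),
        consistent=(m.group(10) == "cons"),
    )


if __name__ == "__main__":
    line = "    1) seq:   2   3  beg:  185955  178118 len:  90 wgt:  42.00 olw: 107.68 it: 1 cons"
    print(parse_fragment(line))
```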

Anchor points produced by CHAOS:

Anchor points produced by CHAOS are printed in the following format:

   1 2   178   265   1  4165.000000
   1 2   238   325   1  4165.000000 
   1 2   75   162   1  6237.000000
   1 2   157   244   1  6237.000000 
   1 3   302   289   1  2619.000000
   1 3   346   333   1  2619.000000 
   1 3   174   146   1  6309.000000

The first two entries are the sequences involved; entries 3 and 4 are the starting points in the respective sequences. Entry 5 is the length of the anchored segment pair, and the last entry is a quality score calculated by CHAOS. This score is used to prioritize anchor points when contradictory anchors are specified: anchors with high scores are accepted first, and anchors with lower scores are used only if they are consistent with the previously accepted higher-scoring anchors.
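
The prioritization described above can be sketched in Python as a greedy selection. This is a simplified illustration and not the procedure implemented in DIALIGN: consistency is checked only between anchors on the same pair of sequences (they must not overlap and must appear in the same order in both sequences), whereas consistency in a multiple alignment also involves anchors on other sequence pairs. The names Anchor, parse_anchors and select_anchors are hypothetical.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    seq1: int
    seq2: int
    beg1: int
    beg2: int
    length: int
    score: float


def parse_anchors(text: str) -> List[Anchor]:
    """Read six-column anchor lines; lines with other layouts are skipped."""
    anchors = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        s1, s2, b1, b2, n = map(int, fields[:5])
        anchors.append(Anchor(s1, s2, b1, b2, n, float(fields[5])))
    return anchors


def _compatible(a: Anchor, b: Anchor) -> bool:
    """Simplified pairwise check (assumption): same order, no overlap."""
    if (a.seq1, a.seq2) != (b.seq1, b.seq2):
        return True  # different sequence pairs are not checked in this sketch
    a_first = a.beg1 + a.length <= b.beg1 and a.beg2 + a.length <= b.beg2
    b_first = b.beg1 + b.length <= a.beg1 and b.beg2 + b.length <= a.beg2
    return a_first or b_first


def select_anchors(anchors: List[Anchor]) -> List[Anchor]:
    """Accept anchors in order of decreasing score if they fit the accepted set."""
    accepted: List[Anchor] = []
    for cand in sorted(anchors, key=lambda a: a.score, reverse=True):
        if all(_compatible(cand, kept) for kept in accepted):
            accepted.append(cand)
    return accepted


if __name__ == "__main__":
    text = "1 2 178 265 1 4165.000000\n1 2 75 162 1 6237.000000\n"
    for anchor in select_anchors(parse_anchors(text)):
        print(anchor)
```

Anchors with equal scores keep their input order, because Python's sorted function is stable.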

This is PHYLIP tree format:

 
((HTL2:0.111024,
(MMLV:0.078471,
ECOL:0.078471):0.032554):0.121218,
HEPB:0.232242);

Trees can be visualized using the drawtree program contained in the PHYLIP software package.
