CHAOS + DIALIGN [manual]

Alignment of genomic sequences using CHAOS and DIALIGN:

DIALIGN is a widely used software program for pair-wise and multiple alignment of DNA and protein sequences. It assembles global alignments from gap-free local pair-wise alignments, so-called fragments. A number of recent studies have used DIALIGN to discover functional sites in genomic sequences. While DIALIGN alignments are generally of high quality, the original program is too slow for large-scale sequence comparison. One way of speeding up DIALIGN without compromising alignment quality is to use an anchored-alignment procedure. Here, a fast local alignment search tool identifies regions of high sequence similarity among the input sequences. These similarities are then used as anchor points to reduce the search space and running time of DIALIGN.

On our web server, we use the program CHAOS to identify such anchor points. CHAOS was developed by Mike Brudno at Stanford University. In our experience, anchor points created by CHAOS speed up DIALIGN by one to two orders of magnitude without affecting the quality of the output alignments. Details are described

In a first step, CHAOS is applied to create a list of anchor points. Then DIALIGN is run on the input sequences using some new options for genomic alignment that are described in

Input sequence file:

CHAOS/DIALIGN requires a single ASCII file containing the sequences to be aligned in FASTA format:

        >HTL2  
        LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
        SLNIFLDSKYLIKYLHSLAIGAFLGTSAHQTLQAALPPLLQGKTIYLHHVRSHTNLPDPI
        STFNEYTDSLILAPL
        >MMLV   
        PDADHTWYTDGSSLLQEGQRKAGAAVTTETEVIWAKALDAGTSAQRAELIALTQALKMAE
        GKKLNVYTDSRYAFATAHIHGEIYRRRGLLTSEGKEIKNKDEILALLKALFLPKRLSIIH
        CPGHQKGHSAEARGNRMADQAARKAAITETPDTSTLL
        >HEPB 
        RPGLCQVFADATPTGWGLVMGHQRMRGTFSAPLPIHTAELLAACFARSRSGANIIGTDNS
        VVLSRKYTSFPWLLGCAANWILRGTSFVYVPSALNPADDPSRGRLGLSRPLLRLPFRPTT
        GRTSLYADSPSVPSHLPDRVH
        >ECOL   
        MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNRMELMAAIVALEALK
        EHCEVILSTDSQYVRQGITQWIHNWKKRGWKTADKKPVKNVDLWQRLDAALGQHQIKWEW
        VKGHAGHPENERCDELARAAAMNPTLEDTGYQVEV



For each sequence, the first line starts with ">" and contains the name of the sequence.
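
The input format described above can be read with a few lines of Python. This is a minimal sketch, not part of CHAOS or DIALIGN; the function name read_fasta is hypothetical, and the sketch assumes that everything after ">" on a header line is the sequence name and that leading or trailing whitespace can be ignored.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs.

    Assumption: the whole header line after '>' is taken as the name;
    blank lines and surrounding whitespace are ignored.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """
        >HTL2
        LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
        SLNIFLDSKYLIKYLHSLAIGAFLGTSAHQTLQAALPPLLQGKTIYLHHVRSHTNLPDPI
        STFNEYTDSLILAPL
"""
for seq_name, seq in read_fasta(example):
    print(seq_name, len(seq))
```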

Program Output:

Our web server creates different output files containing


This is DIALIGN alignment format:

  
                    

dog_il4        20565   AGAGCCTGGT CTGGAGCAAA GTTGATGTCT ACCTGTGCTT TCTTTAGCAG
hum_il4        21730   AGAGCCTGGT CTGGGGCAAA GTTGATGTCT ACCTGTGCTT TCTTTAGCAG
mus_il4        31390   ---------- ---GGGCCGA GCTGATGTCT ACCTGTGCTT TCTTTAGCAG

                       1111111111 1112222222 2222222222 2222222222 2222222222


dog_il4        20615   ATCAGATAta gGAG---TAC ACCAGTCGGG CATGAGCCTC TCCAGCTCTA
hum_il4        21780   ATCAGATAta gGAG---CAC ACCAGCCGGG CATGAGCCTC TCCAGCTCTA
mus_il4        31427   ATCAGATAta gccgcagCAC AGCAGTCGGG CATGAGCCTC TCCAACTCTA

                       2222222000 0222000333 3334444444 4444444444 4444444444


dog_il4        20662   AGGTGATGAT GACCAAGGCC AGTGTGGAGC CCTTGAacTG CAGCAGCTGG
hum_il4        21827   AGGTGATGAT GACCAAGGCC AGTGTGGAGC CCTTGAACTG CAGCAGCTGG
mus_il4        31477   AGGTGATGAT GACTAAAGCA AATGTGGAGC CCTTGAACTG CAGCAGCTGG

                       4444444444 4433666666 6666666666 6666662299 9999999999

								     


This is FASTA alignment format:

>HTL2
ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA
QKGELLALICGLRAAKPWPSLNIFLDSKYLIKYLHslaigaflgtsah--
-------QT---LQAALPPLLQGKTIYLHHVRSHT------NLPDPISTF
NEYTDSLILApl--------------------------------------
----------
>MMLV
pdadhtwYTDGSSLLQEGQRKAGAAVTTETeviwaKALDAG---T---SA
QRAELIALTQALKMAEgkk-LNVYTDSRYAFATAHIHGEIYRRRGLLTSE
GKEIKNKDE---ILALLKALFLPKRLSIIHCPGHQ------KGHSAEARG
NRMADQAARKAAITETPDTStll---------------------------
----------
>HEPB
rpglcQVFADAT------PTGWGLVMGHQRMR---GTFSAPLPIHt----
--AELLAACFArsrsgan---IIGTDN-----------------------
-------------SVVLSR--------------KYTSFPWLLGCAANWI-
LRGTSFVYVPSALNPADDPSrgrlglsrpllrlpfrpttgrtslyadsps
vpshlpdrvh
>ECOL
mlkqvEIFTDGSCLGNPGPGGYGAILRYRGRE---KTFSAGytrT---TN
NRMELMAAIVALEALKEHCEVILSTDSQYVRQGITQWIHNWKKRGWKTAD
KKPVKNVDlwqrLDAALGQ--------------HQIKWEWVKGHAGHPE-
NERCDELARAAAMNPTledtgyqvev------------------------
----------

Note that, as in the DIALIGN format, only UPPER-CASE letters are considered to be aligned; lower-case letters are NOT aligned. Please take this into account if you run other programs on the DIALIGN output alignment.
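
A downstream script can respect the upper-case/lower-case convention by looking only at columns that contain enough upper-case residues. The following is a rough sketch under that assumption; the function name aligned_columns and the min_seqs threshold are hypothetical and not part of DIALIGN.

```python
def aligned_columns(rows, min_seqs=2):
    """Return indices of columns with at least min_seqs upper-case residues.

    rows: aligned sequences of equal length, as in the FASTA alignment
    output. Lower-case letters and '-' are treated as not aligned.
    """
    width = min(len(row) for row in rows)
    columns = []
    for i in range(width):
        n_upper = sum(1 for row in rows if row[i].isupper())
        if n_upper >= min_seqs:
            columns.append(i)
    return columns


# First ten alignment columns of the four sequences shown above.
rows = ["ldtapcLFSD", "pdadhtwYTD", "rpglcQVFAD", "mlkqvEIFTD"]
print(aligned_columns(rows))
print(aligned_columns(rows, min_seqs=4))
```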

Fragment file returned by DIALIGN:

The fragments (aligned gap-free segment pairs) used by DIALIGN to assemble the alignment are returned in the following format:
    1) seq:   2   3  beg:  185955  178118 len:  90 wgt:  42.00 olw: 107.68 it: 1 cons
    2) seq:   1   2  beg:  201612  185943 len:  90 wgt:  39.98 olw: 105.66 it: 1 cons
    3) seq:   1   2  beg:   20700   21865 len:  90 wgt:  48.94 olw: 104.80 it: 1 cons
    4) seq:   1   2  beg:  109548  109795 len:  84 wgt:  37.82 olw: 104.30 it: 1 cons
    5) seq:   2   3  beg:   39483   53649 len:  90 wgt:  39.09 olw: 104.27 it: 1 cons

For each fragment, the file specifies the involved sequences (seq:), the starting positions of the fragment in these sequences (beg:), the fragment length (len:), the weight score of the fragment (wgt:), its overlap weight (olw:), the iteration step during the program in which the fragment has been created (it:), and information about consistency of the fragment (cons or incons).
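
The fragment lines can be parsed with a regular expression that follows the field labels described above. This is a sketch based only on the example lines shown here; the names Fragment and parse_fragment are hypothetical, and real fragment files may contain lines that do not match this pattern.

```python
import re
from typing import NamedTuple, Optional


class Fragment(NamedTuple):
    number: int
    seq1: int
    seq2: int
    beg1: int
    beg2: int
    length: int
    weight: float
    overlap_weight: float
    iteration: int
    consistent: bool


# Pattern inferred from the example lines above (an assumption).
FRAGMENT_RE = re.compile(
    r"^\s*(\d+)\)\s+seq:\s+(\d+)\s+(\d+)\s+beg:\s+(\d+)\s+(\d+)\s+"
    r"len:\s+(\d+)\s+wgt:\s+([\d.]+)\s+olw:\s+([\d.]+)\s+"
    r"it:\s+(\d+)\s+(cons|incons)\s*$"
)


def parse_fragment(line: str) -> Optional[Fragment]:
    """Return a Fragment for a matching line, or None otherwise."""
    m = FRAGMENT_RE.match(line)
    if m is None:
        return None
    g = m.groups()
    return Fragment(
        int(g[0]), int(g[1]), int(g[2]), int(g[3]), int(g[4]), int(g[5]),
        float(g[6]), float(g[7]), int(g[8]), g[9] == "cons",
    )


line = "    1) seq:   2   3  beg:  185955  178118 len:  90 wgt:  42.00 olw: 107.68 it: 1 cons"
print(parse_fragment(line))
```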

Anchor points produced by CHAOS:

Anchor points produced by CHAOS are printed in the following format:
   1 2   178   265   1  4165.000000
   1 2   238   325   1  4165.000000 
   1 2   75   162   1  6237.000000
   1 2   157   244   1  6237.000000 
   1 3   302   289   1  2619.000000
   1 3   346   333   1  2619.000000 
   1 3   174   146   1  6309.000000
The first two entries are the sequences involved; entries 3 and 4 are the starting positions in the respective sequences. Entry 5 is the length of an anchored segment pair, and the last entry is a quality score calculated by CHAOS. This score is used to prioritize anchor points in case contradictory anchors are specified. In this case, anchors with high scores are accepted first; anchors with lower scores are used only if they are consistent with the previously accepted higher-scoring anchors.
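
The score-based prioritization can be illustrated with a greedy selection. This is a simplified sketch, not DIALIGN's actual consistency check: it only compares anchors between the same pair of sequences, treats two anchors as compatible when they lie on the same diagonal or when one lies entirely before the other in both sequences, and ignores conflicts that arise across different sequence pairs. The names Anchor, parse_anchors, compatible and select_anchors are hypothetical.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    seq1: int
    seq2: int
    beg1: int
    beg2: int
    length: int
    score: float


def parse_anchors(text: str) -> List[Anchor]:
    """Parse six-column anchor lines; other lines are skipped."""
    anchors = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        s1, s2, b1, b2, n = (int(x) for x in fields[:5])
        anchors.append(Anchor(s1, s2, b1, b2, n, float(fields[5])))
    return anchors


def compatible(a: Anchor, b: Anchor) -> bool:
    """Simplified pairwise check (an assumption, see the text above)."""
    if (a.seq1, a.seq2) != (b.seq1, b.seq2):
        return True  # cross-pair conflicts are NOT checked in this sketch
    if a.beg1 - a.beg2 == b.beg1 - b.beg2:
        return True  # same diagonal
    a_before_b = a.beg1 + a.length <= b.beg1 and a.beg2 + a.length <= b.beg2
    b_before_a = b.beg1 + b.length <= a.beg1 and b.beg2 + b.length <= a.beg2
    return a_before_b or b_before_a


def select_anchors(anchors: List[Anchor]) -> List[Anchor]:
    """Accept anchors in order of decreasing score if they fit the rest."""
    accepted: List[Anchor] = []
    for anchor in sorted(anchors, key=lambda x: x.score, reverse=True):
        if all(compatible(anchor, kept) for kept in accepted):
            accepted.append(anchor)
    return accepted


# Hypothetical anchors: the second one crosses the first and is rejected.
demo = [
    Anchor(1, 2, 100, 200, 10, 50.0),
    Anchor(1, 2, 150, 120, 10, 30.0),
    Anchor(1, 2, 300, 400, 10, 10.0),
]
print(select_anchors(demo))
```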

This is PHYLIP tree format:

 
((HTL2:0.111024,
(MMLV:0.078471,
ECOL:0.078471):0.032554):0.121218,
HEPB:0.232242);


Trees can be visualized using the drawtree program contained in the PHYLIP software package.
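
For a quick look at the tree file without PHYLIP, the leaf names and their branch lengths can be pulled out of the text with a regular expression. This is only a sketch for simple trees like the one above: it assumes leaf names start with a letter or underscore and contain no quotes, and it ignores the tree topology and the internal branch lengths. The function name leaf_branch_lengths is hypothetical.

```python
import re


def leaf_branch_lengths(tree_text):
    """Return {leaf name: branch length} for a simple tree string.

    Internal branches such as '):0.032554' are skipped because they
    have no name in front of the colon.
    """
    compact = "".join(tree_text.split())  # drop line breaks and spaces
    pairs = re.findall(r"([A-Za-z_]\w*):(\d+\.?\d*)", compact)
    return {name: float(length) for name, length in pairs}


tree = """
((HTL2:0.111024,
(MMLV:0.078471,
ECOL:0.078471):0.032554):0.121218,
HEPB:0.232242);
"""
print(leaf_branch_lengths(tree))
```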

