DaliLite.v5 is a standalone program for protein structural alignment and structure database search. It is based on the Dali method (Dali stands for distance matrix alignment). It consists of two wrapper scripts, import.pl and dali.pl.
1 System requirements
2 Installation
3 import.pl: Importing structure data
3.1 Example: import private structure
3.2 Example: import public structures from PDB
4 Preparing Blast database for structure database search
5 dali.pl: Structure comparison
5.1 Inputs
5.2 Outputs
5.3 Pairwise comparison examples
5.3.1 applymatrix.pl
5.4 All-against-all comparison example
5.5 Database search examples
5.5.1 Preparatory steps
5.5.2 Hierarchical search
5.5.3 Knowledge-based search
5.6 Lock file
6 Version history
7 Contact
Dali generates structural alignments using only C-alpha coordinates. The database search options use sequence comparison (Blast) for a soft clustering of the PDB. The clustering is used as a filter to select candidates for explicit structural alignment.
Replace /home/you with the parent directory where DaliLite is installed.
cd /home/you
wget http://ekhidna2.biocenter.helsinki.fi/dali/DaliLite.v5.tar.gz
tar -zxvf DaliLite.v5.tar.gz
cd /home/you/DaliLite.v5/bin
make clean
make # ignore Warnings
# if using openmpi (check OPENMPI_PATH in Makefile)
# make parallel
cd /home/you/DaliLite.v5
# the script assumes that blastp and makeblastdb are in your PATH
# if not, get them from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
./test.csh
# compare output to ./test_output
/home/you/DaliLite.v5/bin/import.pl
/home/you/DaliLite.v5/bin/dali.pl [-h]
* Import single PDB entry to ./DAT:
  import.pl --pdbfile <filename> --pdbid <xxxx> [ --dat <path> ]
* Import a subset of the PDB archive:
  bin/import.pl --pdblist <filename> [ --dat <path> ]
* Automated PDB mirroring:
  import.pl --rsync [ --pdbmirrordir <path> ] [ --dat <path> ]
* Options:
  --dat <path>  directory to store imported data [default: ./DAT]
  --pdbfile <filename>  PDB formatted file, may be compressed (.gz). <filename> includes path
  --pdbid <xxxx>  four-letter PDB identifier
  --pdblist <filename>  list of PDB entries. <filename> includes path and must match the pattern pdb????.ent
  --rsync  automated PDB mirroring
  --pdbmirrordir <path>  PDB mirror directory [default: /data/pdb]
  --clean  remove temporary files
  --verbose  verbose

This is an example of an imported data file (1pptA.dat). Comments have been inserted as lines starting with '#':
# The header line gives the structure identifier, number of residues, total number of secondary structure elements (SSEs), number of helices, number of strands, sequence of SSEs
>>>> 1pptA 36 1 1 0 H
# For each SSE, list its sequential number, start and end position, modified start and end position, length check code (0 = ok, >0 = short)
1 14 31 14 31 0
# C-alpha coordinates: (x,y,z) triples for each residue sequentially
1.5 -9.0 17.3 -1.1 -10.6 15.0 -0.6 -14.2 14.1 0.5 -14.9 10.5 -2.4 -15.0 8.1 -3.5 -18.2 6.3 -3.1 -18.0 2.5 -6.6 -18.6 1.0 -5.4 -20.7 -2.0 -4.5 -19.9 -5.6 -8.1 -20.4 -6.5 -9.6 -18.1 -4.0 -12.0 -15.4 -5.5 -10.3 -11.9 -5.9 -12.1 -10.6 -2.9 -10.5 -13.2 -0.7 -7.0 -12.4 -2.1 -7.7 -8.7 -1.4 -8.6 -9.6 2.3 -5.4 -11.8 2.5 -3.4 -8.9 1.0 -4.7 -6.5 3.7 -4.0 -8.9 6.5 -0.5 -9.7 5.2 0.1 -5.9 5.1 -0.9 -5.6 8.8 1.4 -8.5 9.7 4.3 -7.3 7.8 4.1 -3.7 9.3 4.1 -5.4 12.8 7.0 -7.7 12.2 9.1 -4.9 10.7 8.0 -2.5 13.7 6.8 0.0 11.1 3.1 0.6 11.9 2.8 3.4 9.3
# Unfolding units in terms of SSEs
>>>> 1pptA 1
# node identifier, status, parent node, two child nodes, SSEs in this node
# node status codes: + / above domain level, * / selected domain, - / below domain level, = / small domain
1 = 0 0 1 1
# Unfolding units in terms of residues
>>>> 1pptA 1
1 = 0 0 36 1 1 36
# secondary structure states per residue
-dssp "LLLLLLLLLLLLLHHHHHHHHHHHHHHHHHHLLLLL
# amino acid sequence
-sequence "GPSQPTYPGDDAPVEDLIRFYDNLQQYLNVVTRHRY
# COMPND record from PDB entry
-compnd " MOLECULE: AVIAN PANCREATIC POLYPEPTIDE;
# copied from DSSP output: sequential residue number, chain identifier, PDB residue number, accessibility, C-alpha coordinates "
-acc 1 A 1 101 1.500 -9.000 17.300
# lines for residues 2-35 removed
-acc 36 A 36 224 2.800 3.400 9.300
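The header fields described in the listing above can be read programmatically. The following is a minimal Python sketch, not part of DaliLite; the function name parse_dat_header and the dictionary keys are hypothetical, and the regular expression assumes the whitespace-separated layout shown in the listing (it also tolerates the SSE string being fused to the strand count, which is an assumption about real files).

```python
import re

# Fields of the first '>>>>' line, in the order given in the listing above:
# identifier, residues, total SSEs, helices, strands, sequence of SSEs.
# Assumption: fields are whitespace-separated; the SSE string may or may
# not be separated from the strand count.
HEADER_RE = re.compile(r"^>>>>\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*(\S*)")


def parse_dat_header(line):
    """Return the fields of the first header line of a .dat file as a dict."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError("not a DAT header line: %r" % line)
    ident, nres, nsse, nhelix, nstrand, sse = match.groups()
    return {
        "id": ident,
        "residues": int(nres),
        "sses": int(nsse),
        "helices": int(nhelix),
        "strands": int(nstrand),
        "sse_string": sse,
    }


header = parse_dat_header(">>>> 1pptA 36 1 1 0 H")
print(header["id"], header["residues"], header["sse_string"])
```

The later '>>>>' lines (unfolding units) have fewer fields and are rejected by this sketch with a ValueError.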
/home/you/DaliLite.v5/bin/import.pl --pdbfile mymodel.pdb --pdbid mine --dat /data/private/DAT --clean

import.pl accepts uncompressed and compressed (.gz) files. Each structure has a four-letter identifier; the length of four is hard-coded. Chain identifiers are appended automatically, and the resulting five-letter identifier is used in Dali's internal database. The example would create the file /data/private/DAT/mineA.dat for chain 'A'.

Structure comparison requires that all query structures are in one directory (DALIDATDIR_1). Similarly, all target structures must be in one directory (DALIDATDIR_2). DALIDATDIR_1 and DALIDATDIR_2 can be identical, but usually DALIDATDIR_2 contains public structures downloaded from the Protein Data Bank (PDB) and DALIDATDIR_1 contains private structures.
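The naming rule above (four-letter identifier plus one chain letter, stored as <identifier>.dat in the data directory) can be sketched in Python as follows. The helper name dat_path is hypothetical and is not part of DaliLite; it only illustrates how the file name in the example is formed.

```python
import posixpath


def dat_path(dat_dir, pdbid, chain):
    """Build the expected .dat file path for one chain (illustrative helper)."""
    if len(pdbid) != 4:
        # the manual states that the identifier length of four is hard-coded
        raise ValueError("structure identifier must have exactly four letters")
    if len(chain) != 1:
        raise ValueError("chain identifier must be a single character")
    return posixpath.join(dat_dir, pdbid + chain + ".dat")


print(dat_path("/data/private/DAT", "mine", "A"))
```

Checking the identifier length before calling import.pl avoids discovering the restriction only after the import step.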
/home/you/DaliLite.v5/bin/import.pl --rsync --pdbmirrordir /data/pdb --dat /data/DAT --clean

The --rsync option logs the downloaded PDB entries to pdb_update.log. If anything goes wrong in the Dali import step, you can extract the list of new PDB files and run the import step again:
grep '^..\/pdb' pdb_update.log | perl -pe 's/^/\/data\/pdb\//' > pdb_new.list # prepend path /data/pdb to PDB entries
bin/import.pl --pdblist pdb_new.list --dat /data/DAT --clean
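For readers who prefer not to chain grep and perl, the same selection can be sketched in Python. This is an illustrative equivalent of the pipeline above, using the same pattern (two arbitrary characters followed by /pdb at the start of the line) and the same /data/pdb prefix. The function name new_entries and the sample log lines are hypothetical; the real content of pdb_update.log is not shown in this manual.

```python
import re

# Same pattern as the grep command: two arbitrary characters, then "/pdb".
ENTRY_RE = re.compile(r"^../pdb")


def new_entries(log_lines, mirror_dir="/data/pdb"):
    """Return log lines that name PDB entries, with the mirror path prepended."""
    prefix = mirror_dir.rstrip("/") + "/"
    entries = []
    for line in log_lines:
        line = line.rstrip("\n")
        if ENTRY_RE.match(line):
            entries.append(prefix + line)
    return entries


# Hypothetical log content, for illustration only.
sample_log = [
    "receiving incremental file list",
    "pp/pdb1ppt.ent.gz",
    "bb/pdb1bba.ent.gz",
    "sent 120 bytes",
]
for path in new_entries(sample_log):
    print(path)
```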
Install the BLAST executables blastp and makeblastdb from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. If the blastp program is not in your $PATH (Linux environment variable), you can specify it using the --BLASTP_EXE option of dali.pl.
The following commands can be used to extract sequences from the imported structures into a FASTA file:
# create pdb.fasta
ls /data/DAT/ | perl -pe 's/\.dat//' > pdb.list
/home/you/DaliLite.v5/bin/dat2fasta.pl /data/DAT < pdb.list | awk -v RS=">" -v FS="\n" -v ORS="" ' { if ($2) print ">"$0 } ' > pdb.fasta # awk removes empty sequences

Make the PDB sequences searchable by BLAST:
makeblastdb -in pdb.fasta -out /home/you/pdb.blast -dbtype prot

The location of the database can be specified using the --BLAST_DB option of dali.pl.
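The awk filter used when creating pdb.fasta drops FASTA records whose first sequence line is empty. The following Python sketch mirrors that logic on an in-memory string, to make the behaviour explicit. The function name drop_empty_sequences and the identifier 1xxxA are hypothetical.

```python
def drop_empty_sequences(fasta_text):
    """Keep only FASTA records whose first sequence line is non-empty."""
    kept = []
    # Records are separated by ">" and fields by newlines, as in the awk call.
    for record in fasta_text.split(">"):
        fields = record.split("\n")
        if len(fields) > 1 and fields[1]:
            kept.append(">" + record)
    return "".join(kept)


# 1xxxA is a made-up entry with an empty sequence.
example = ">1pptA\nGPSQ\n>1xxxA\n\n>1bbaA\nAPLE\n"
print(drop_empty_sequences(example), end="")
```

Like the awk one-liner, the sketch only inspects the first sequence line of each record.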
The PDB is highly redundant. Database searches are faster if a non-redundant subset is searched systematically and homologs of dissimilar structures are eliminated without explicit alignment. CD-HIT (from https://github.com/weizhongli/cdhit) can be used to generate non-redundant subsets of PDB. We use all-against-all comparison by BLAST and the PDB_SELECT algorithm to generate PDB25, a non-redundant set at 25% sequence identity. Protein sequences that align globally with more than 25% sequence identity are usually structurally very similar.
USAGE:
bin/dali.pl [BASIC-OPTIONS] [MPI-OPTIONS] \
  ( --cd1 <xxxxX> | --pdbfile1 <first.pdb> [ --pdbid1 <mol1> ] | --query <query.list> ) \
  ( --matrix | --cd2 <yyyyY> | --pdbfile2 <second.pdb> [ --pdbid2 <mol2> ] | --db <target.list> \
  [ ( --hierarchical | --walk [WALK-OPTIONS] ) --repset <pdb25.list> [BLAST-OPTIONS] ] )

  --cd1 <xxxxX>  query structure identifier
  --pdbfile1 <filename>  query structure in PDB format
  --pdbid1 <xxxx>  four-letter query structure identifier (chain identifier will be appended automatically) [default: mol1]
  --query <filename>  list of query structure identifiers
  --matrix  all-against-all comparison. Generates additional outputs called 'ordered' (similarity matrix) and 'newick_unrooted' (dendrogram).
  --cd2 <yyyyY>  target structure identifier
  --pdbfile2 <filename>  target structure in PDB format
  --pdbid2 <yyyy>  four-letter target structure identifier (chain identifier will be appended automatically) [default: mol2]
  --db <filename>  list of target structure identifiers
  --hierarchical  hierarchical structure database search
  --walk  knowledge-based structure database search
  --repset <filename>  list of structure identifiers of non-redundant subset of PDB

BASIC-OPTIONS:
  --dat1 <path>  path to directory containing query data [default: ./DAT/]
  --dat2 <path>  path to directory containing target data [default: ./DAT/]
  --oneway  asymmetric structure comparison (A,B) only [default: symmetric (A,B) and (B,A)]
  --title <string>  written to output [default: test]
  --outfmt <string>  result blocks to output: summary,alignments,equivalences,transrot [default: summary]
  --clean  remove temporary files

MPI-OPTIONS:
  --np <integer>  number of processes if using openmpi (between 1 and 99) [default: 1]
  --MPIRUN_EXE <string>  location of mpirun executable [default: /usr/lib64/openmpi/bin/mpirun ]

BLAST-OPTIONS:
  --HMAX <integer>  number of top scoring representatives to send to final BLAST [default: 200]
  --KMAX <integer>  number of final BLAST hits to align structurally [default: 2000]
  --BLAST_DB <string>  location of Blast database [default: pdb.blast]
  --BLASTP_EXE <string>  location of Blast executable [default: blastp]
  --BLAST_NUM_THREADS <integer>  number of threads when running Blast [default: 32]

WALK-OPTIONS:
  --targetset <pdb25.list>  used with --H to limit the radius of the walk [default: same as --repset]
  --H <integer>  walk radius is Z-score of Hth hit in the target set [default: 100]
  --MAX_HITS <integer>  number of hits returned from walk [default: 10000]
  --MAX_DALICON <integer>  max number of comparisons performed during walk [default: 10000]

* Pairwise alignment:
  bin/dali.pl ( --cd1 <xxxxX> | --query <query.list> ) ( --cd2 <yyyyY> | --db <target.list> ) [BASIC-OPTIONS]
  bin/dali.pl --pdbfile1 first.pdb [ --pdbid1 mol1 ] --pdbfile2 second.pdb [ --pdbid2 mol2 ] [BASIC-OPTIONS]
* All-against-all comparison:
  bin/dali.pl --matrix --query <query.list> [BASIC-OPTIONS]
* Database searches:
  bin/dali.pl --hierarchical --repset <pdb25.list> ( --cd1 <xxxxX> | --query <query.list> ) --db <pdb.list> [BASIC-OPTIONS] [MPI-OPTIONS] [BLAST-OPTIONS]
  bin/dali.pl --walk --repset <pdb25.list> ( --cd1 <xxxxX> | --query <query.list> ) --db <pdb.list> [BASIC-OPTIONS] [MPI-OPTIONS] [BLAST-OPTIONS] [WALK-OPTIONS]
# import two PDB entries. They will be split into chains.
/home/you/DaliLite.v5/bin/import.pl --pdbfile ./toy_PDB/pdb1ppt.ent.gz --pdbid 1ppt --dat ./DAT > /dev/null
/home/you/DaliLite.v5/bin/import.pl --pdbfile ./toy_PDB/pdb1bba.ent.gz --pdbid 1bba --dat ./DAT > /dev/null
# pairwise alignment of two structures
/home/you/DaliLite.v5/bin/dali.pl --cd1 1pptA --cd2 1bbaA --dat1 ./DAT --dat2 ./DAT --title "output options" --outfmt "summary,alignments,equivalences,transrot" --clean 2> err

The output is 1pptA.txt, which looks like this:
# Job: output options
# Query: 1pptA
# No: Chain Z rmsd lali nres %id PDB Description
1: 1bba-A 3.6 1.8 33 36 39 MOLECULE: BOVINE PANCREATIC POLYPEPTIDE;

# Pairwise alignments

No 1: Query=1pptA Sbjct=1bbaA Z-score=3.6

DSSP  LLLLLLLLLLLLLHHHHHHHHHHHHHHHHHHLLlll
Query GPSQPTYPGDDAPVEDLIRFYDNLQQYLNVVTRhry 36
ident  |  | |||| |  |        |  | |  ||
Sbjct APLEPEYPGDNATPEQMAQYAAELRRYINMLTRpry 36
DSSP  LLLLLLLLLLLLLLLHHHHHHHHHHHHHHHHLLlll

# Structural equivalences
1: 1ppt-A 1bba-A 1 - 33 <=> 1 - 33 (GLY 1 - ARG 33 <=> ALA 1 - ARG 33 )

# Translation-rotation matrices
-matrix "1ppt-A 1bba-A U(1,.) 0.631906 -0.761372 -0.144939 -0.890845"
-matrix "1ppt-A 1bba-A U(2,.) 0.512616 0.550832 -0.658642 -10.882093"
-matrix "1ppt-A 1bba-A U(3,.) 0.581308 0.341902 0.738366 4.946664"

You can select which blocks to output using the --outfmt option. By default, only the summary block is output. The keywords for the four blocks shown above are summary, alignments, equivalences, and transrot, respectively. The first line of the output file is a job description, which can be set using the --title option.
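The summary block is convenient for downstream filtering. The following Python sketch reads the rows that follow the "# No:" header line into dictionaries. It is not part of DaliLite: the function name parse_summary and the dictionary keys are hypothetical, and the parser assumes whitespace-separated columns exactly as in the sample output above (number, chain, Z, rmsd, lali, nres, %id, then free-text description).

```python
SAMPLE = """# Job: output options
# Query: 1pptA
# No: Chain Z rmsd lali nres %id PDB Description
1: 1bba-A 3.6 1.8 33 36 39 MOLECULE: BOVINE PANCREATIC POLYPEPTIDE;

# Pairwise alignments
"""


def parse_summary(text):
    """Return the hits listed in the summary block as a list of dicts."""
    hits = []
    in_block = False
    for line in text.splitlines():
        if line.startswith("# No:"):
            in_block = True          # summary rows start after this header
            continue
        if not in_block or not line.strip():
            continue
        if line.startswith("#"):
            break                    # next block begins
        fields = line.split(None, 7)
        hits.append({
            "no": int(fields[0].rstrip(":")),
            "chain": fields[1],
            "z": float(fields[2]),
            "rmsd": float(fields[3]),
            "lali": int(fields[4]),
            "nres": int(fields[5]),
            "pct_id": int(fields[6]),
            "description": fields[7].strip() if len(fields) > 7 else "",
        })
    return hits


hits = parse_summary(SAMPLE)
print(hits[0]["chain"], hits[0]["z"])
```

A stricter Z-score cutoff can then be applied with an ordinary list comprehension, for example [h for h in hits if h["z"] >= 3.0].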
/home/you/DaliLite.v5/bin/dali.pl --cd1 1pptA --cd2 1bbaA --dat1 /home/you/DAT --dat2 /home/you/DAT --clean 2> err

You can use lists of structures as queries and targets. Each query will be compared to all targets. The following is equivalent to the example above:
echo 1pptA > query.list
echo 1bbaA > target.list
/home/you/DaliLite.v5/bin/dali.pl --query query.list --db target.list --dat1 /home/you/DAT --dat2 /home/you/DAT --clean 2> err

To systematically compare a set of queries to a non-redundant subset of PDB (prepared beforehand), you can do:
/home/you/DaliLite.v5/bin/dali.pl --query query.list --db pdb25.list --dat1 /home/you/DAT --dat2 /data/DAT --clean 2> err
# import PDB structures
/home/you/DaliLite.v5/bin/import.pl --pdbfile /home/you/DaliLite.v5/toy_PDB/pdb1ppt.ent.gz --pdbid 1ppt
/home/you/DaliLite.v5/bin/import.pl --pdbfile /home/you/DaliLite.v5/toy_PDB/pdb1bba.ent.gz --pdbid 1bba
# structural alignment, output translation-rotation matrices to 1pptA.txt
/home/you/DaliLite.v5/bin/dali.pl --cd1 1pptA --cd2 1bbaA --dat1 /home/you/DaliLite.v5/DAT --dat2 /home/you/DaliLite.v5/DAT --outfmt "summary,transrot" --clean
# transform the coordinates of the original target PDB file
/home/you/DaliLite.v5/bin/applymatrix.pl /home/you/DaliLite.v5/toy_PDB/pdb1bba.ent.gz < 1pptA.txt > sup.pdb
# we know that 1pptA:1-33 and 1bbaA:1-33 are structurally equivalent segments
# peek at the transformed coordinates of 1bba
grep ^ATOM sup.pdb | grep ' CA ' | head
# compare to 1ppt
zcat /home/you/DaliLite.v5/toy_PDB/pdb1ppt.ent.gz | grep ^ATOM | grep ' CA ' | head
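To show what a translation-rotation matrix does to a coordinate, here is a minimal Python sketch that parses the three -matrix lines from the sample output and applies them to one point. It assumes the common convention x' = R x + t, with the first three numbers of each row forming the rotation and the fourth the translation; this manual does not spell out the convention used by applymatrix.pl, so treat it as an assumption. The function names parse_transrot and transform are hypothetical.

```python
TRANSROT = '''-matrix "1ppt-A 1bba-A U(1,.) 0.631906 -0.761372 -0.144939 -0.890845"
-matrix "1ppt-A 1bba-A U(2,.) 0.512616 0.550832 -0.658642 -10.882093"
-matrix "1ppt-A 1bba-A U(3,.) 0.581308 0.341902 0.738366 4.946664"
'''


def parse_transrot(text):
    """Return the first 3x4 matrix found in the -matrix lines of the output."""
    rows = []
    for line in text.splitlines():
        if line.startswith("-matrix"):
            fields = line.replace('"', " ").split()
            rows.append([float(value) for value in fields[-4:]])
    if len(rows) < 3:
        raise ValueError("expected three -matrix lines")
    return rows[:3]


def transform(matrix, xyz):
    """Apply x' = R x + t (assumed convention) to one (x, y, z) point."""
    return tuple(
        sum(row[k] * xyz[k] for k in range(3)) + row[3] for row in matrix
    )


matrix = parse_transrot(TRANSROT)
print(transform(matrix, (1.0, 2.0, 3.0)))
```

Each row of the rotation part in the sample has unit length, which is consistent with a rigid-body transformation.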
101mA MYOGLOBIN
1a00A HEMOGLOBIN (ALPHA CHAIN)
1a87A COLICIN N
1allA ALLOPHYCOCYANIN
1binA LEGHEMOGLOBIN A

The above structures have been imported into the distribution package. Execute the command:
/home/you/DaliLite.v5/bin/dali.pl --matrix --query query.list --dat1 /home/you/DaliLite.v5/DAT --clean 2> /dev/null

In addition to five xxxxX.txt files, this similarity matrix (file 'ordered') is generated:
5
1a87A 48.7 7.4 3.1 6.4 5.6
1allA 7.4 29.7 8.1 9.0 8.7
1binA 3.1 8.1 30.8 15.2 13.6
101mA 6.4 9.0 15.2 31.7 20.6
1a00A 5.6 8.7 13.6 20.6 30.5

A structural dendrogram is generated from the similarity matrix by average linkage clustering. Branch lengths are converted to ad hoc distances, where the distance is the difference between the similarity at a node and the similarity at its parent node. For example, the branch leading to 1a00A has length 30.5 - 20.6 = 9.9, its self-similarity minus its similarity to 101mA. Structural dendrograms are output in Newick format (files 'newick' and 'newick_unrooted'):
((((1a00A_HEMOGLOBIN_ALPHA_CHAIN:9.9,101mA_MYOGLOBIN:11.1):6.2,1binA_LEGHEMOGLOBIN_A:16.4):5.8,1allA_ALLOPHYCOCYANIN:21.1):2.975,1a87A_COLICIN_N:43.075);

Newick format is accepted by many phylogenetic tree drawing programs. For example, you can paste the Newick string into iTOL or phylo.io.
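The relation between the similarity matrix and the dendrogram can be checked with a short Python sketch. It is an illustrative reimplementation, not the code used by dali.pl: it performs average linkage clustering on the 'ordered' matrix shown above, sets each branch length to the similarity level of the child node minus that of its parent (with the self-similarity on the diagonal as the level of a leaf), and labels leaves with the identifier only. The function names are hypothetical. Under these assumptions the branch lengths agree with the Newick string above, although the order of children within a node may differ.

```python
from itertools import combinations

ORDERED = """5
1a87A 48.7 7.4 3.1 6.4 5.6
1allA 7.4 29.7 8.1 9.0 8.7
1binA 3.1 8.1 30.8 15.2 13.6
101mA 6.4 9.0 15.2 31.7 20.6
1a00A 5.6 8.7 13.6 20.6 30.5
"""


def parse_ordered(text):
    """Read the matrix size, row labels, and similarity values."""
    rows = [line.split() for line in text.strip().splitlines()]
    n = int(rows[0][0])
    labels = [row[0] for row in rows[1:n + 1]]
    sim = [[float(v) for v in row[1:n + 1]] for row in rows[1:n + 1]]
    return labels, sim


def average_linkage_newick(labels, sim):
    """Cluster by mean pairwise similarity and return a Newick string."""
    # cluster = (member indices, Newick text, similarity level of the node)
    clusters = [([i], labels[i], sim[i][i]) for i in range(len(labels))]

    def mean_similarity(a, b):
        total = sum(sim[i][j] for i in a[0] for j in b[0])
        return total / (len(a[0]) * len(b[0]))

    while len(clusters) > 1:
        # merge the pair of clusters with the highest mean similarity
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: mean_similarity(*pair))
        level = mean_similarity(a, b)
        text = "(%s:%g,%s:%g)" % (a[1], a[2] - level, b[1], b[2] - level)
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append((a[0] + b[0], text, level))
    return clusters[0][1] + ";"


labels, sim = parse_ordered(ORDERED)
newick = average_linkage_newick(labels, sim)
print(newick)
```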
# mirror PDB
/home/you/DaliLite.v5/bin/import.pl --rsync --pdbmirrordir /data/pdb --dat /data/DAT --clean
# extract PDB sequences
ls /data/DAT/ | perl -pe 's/\.dat//' > pdb.list
/home/you/DaliLite.v5/bin/dat2fasta.pl /data/DAT < pdb.list | awk -v RS=">" -v FS="\n" -v ORS="" ' { if ($2) print ">"$0 } ' > pdb.fasta # awk removes empty sequences
# create PDB-Blast database
makeblastdb -in pdb.fasta -out /home/you/pdb.blast -dbtype prot
# create PDB70 non-redundant subset of PDB
cd-hit -i pdb.fasta -c 0.7 -o pdb70.fasta
grep '^>' pdb70.fasta | perl -pe 's/^>//' > pdb70.list

PDB70 is conveniently generated using cd-hit. PDB25 can be used instead of PDB70 without loss in performance. However, PDB25 requires all-against-all sequence comparison by Blast.
Remember to import your private PDB structure, if not done yet:
/home/you/DaliLite.v5/bin/import.pl --pdbfile mymodel.pdb --pdbid mine --dat /data/private/DAT --clean
/home/you/DaliLite.v5/bin/dali.pl --hierarchical --repset pdb70.list --cd1 mineA --db pdb.list --dat1 /data/private/DAT --dat2 /data/DAT --np 40 --clean

The --np option sets the number of parallel processes. The default, --np 1, runs the serial version of the software and does not require openmpi. Output is written to nxxxA.txt, where nxxxA is the query identifier (mineA.txt in this example). Targets with a Z-score above 2 are reported. The Z-score depends on structure size: the low threshold is needed to catch fold-level similarities of small domains, but for larger structures there can be thousands of hits in the output.
/home/you/DaliLite.v5/bin/dali.pl --walk --repset pdb70.list --cd1 mineA --db pdb.list --dat1 /data/private/DAT --dat2 /data/DAT --np 40 --H 100 --targetset pdb70.list --clean

Output is written to nxxxA.txt, where nxxxA is the query identifier. The knowledge-based search dynamically adjusts the Z-score threshold for output. It aims for complete coverage of hits that have a Z-score higher than the Hth (--H 100) hit belonging to the target set (--targetset pdb70.list). The purpose is to limit the amount of output, yet reach down to interesting fold-level similarities. If the query structure contains multiple domains, you are advised to search each domain separately; otherwise hits may be concentrated on one domain, leaving the others uncovered.
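The output criterion described above can be illustrated with a small Python sketch. It only shows how a cutoff follows from --H and --targetset; it is not the walk algorithm itself, which this manual does not describe. The function names, identifiers, and Z-scores below are made up for illustration, and two details are assumptions: hits at exactly the cutoff are kept, and when the target set contains fewer than H hits the lowest target-set Z-score is used.

```python
def walk_cutoff(hits, targetset, h=100):
    """Z-score of the Hth best hit that belongs to the target set."""
    target_z = sorted((z for ident, z in hits if ident in targetset),
                      reverse=True)
    if not target_z:
        return None
    # assumption: with fewer than h target-set hits, use the lowest one
    return target_z[min(h, len(target_z)) - 1]


def hits_within_radius(hits, targetset, h=100):
    """Hits whose Z-score is at or above the cutoff (any database entry)."""
    cutoff = walk_cutoff(hits, targetset, h)
    if cutoff is None:
        return []
    return [(ident, z) for ident, z in hits if z >= cutoff]


# Hypothetical hits: (identifier, Z-score)
example_hits = [("aaaaA", 20.0), ("bbbbA", 15.0), ("ccccA", 9.0),
                ("ddddA", 8.5), ("eeeeA", 4.0)]
example_targets = {"aaaaA", "ccccA", "eeeeA"}
print(hits_within_radius(example_hits, example_targets, h=2))
```

In the example, the second-best target-set hit has Z = 9.0, so bbbbA is reported even though it is not in the target set, while ddddA and eeeeA fall outside the radius.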
Directory is locked by dali.lock
there may be another DALI process running in this work directory
or, the previous run crashed: remove the dali.lock file