Preprocessed AlphaFold Database for use with DaliLite running locally

Version 2 of the AlphaFold Database is a set of about one million structural models. We have preprocessed the AlphaFold Database so that you can perform structure comparisons against it using a local installation of DaliLite. The Dali web server performs equivalent searches, but it has long queueing times. Sequence searches against the AlphaFold Database can be performed, for example, with the SANSparallel server.

Download the data here: http://ekhidna2.biocenter.helsinki.fi/dali/AF-Digest.tar.gz

The tar file contains two subdirectories, DAT/ and Digest/. Digest/AFDB2.list lists all structures of the AlphaFold Database. Subsets at 70% sequence identity (AFDB2_70.list etc.) were generated using CD-HIT. The mapping between DaliLite's internal structure identifiers and the original AlphaFold Database file names can be retrieved from the lists in Digest/*.list.

Why you need to use digested data

A few models exceed the dimensions that DaliLite can handle. The critical parameter is the number of secondary structure elements, which must not exceed 200; otherwise the program crashes. The limit cannot be changed for historical reasons (including the use of Fortran). The models listed in Table 2 were cut into two chains, labelled A and B. Cut points were selected visually in low-confidence segments between globular domains.
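The models in the digest are already split, so nothing needs to be done for them. For an oversized model of your own, the relabelling step can be sketched as below. This is only an illustration, not the procedure used to build the digest: the function name split_chain is hypothetical, the cut point still has to be chosen by inspecting the model, and the script assumes a single-chain file in standard PDB fixed-column format (chain identifier in column 22, residue number in columns 23-26).

```shell
# Relabel a single-chain PDB file: residues before the cut point stay in
# chain A, residues from the cut point onward become chain B.
# Usage: split_chain CUT < model.pdb > model_AB.pdb
split_chain() {
  awk -v cut="$1" '
    /^(ATOM  |HETATM)/ {
      resnum = substr($0, 23, 4) + 0               # residue number, columns 23-26
      chain  = (resnum >= cut) ? "B" : "A"
      $0 = substr($0, 1, 21) chain substr($0, 23)  # chain identifier is column 22
    }
    { print }                                      # all other records pass through
  '
}
```

With the cut point taken from the "chain B start" column of Table 2, a call would look like split_chain 999 < AF-P02751-F1-model_v1.pdb > split_model.pdb (the output file name is arbitrary).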

Installing

Create a directory for installation:
mkdir alphafold
cd alphafold
Download the tarball:
wget http://ekhidna2.biocenter.helsinki.fi/dali/AF-Digest.tar.gz
tar -zxvf AF-Digest.tar.gz 
You should find populated subdirectories DAT/ and Digest/ under your current working directory.
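A quick sanity check after unpacking can be sketched as follows. The function name check_digest_dirs is hypothetical; it only tests that both subdirectories exist and are not empty, and says nothing about whether their contents are complete.

```shell
# Check that DAT/ and Digest/ exist under the given directory (default: .)
# and that each contains at least one entry.
check_digest_dirs() {
  root="${1:-.}"
  for sub in DAT Digest; do
    if [ ! -d "$root/$sub" ]; then
      echo "missing: $root/$sub" >&2
      return 1
    fi
    if [ -z "$(ls -A "$root/$sub" | head -n 1)" ]; then
      echo "empty: $root/$sub" >&2
      return 1
    fi
  done
  echo "ok"
}
```

Run it from the alphafold/ directory as check_digest_dirs . and expect it to print ok.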

The internal identifiers for AF-DB fill the name space from a000 to xzzz. You can import structures from the Protein Data Bank into your local DaliLite database, because PDB identifiers start with a number and do not clash with the internal AF-DB identifiers. If you have many locally generated structures, you can store them in separate data directories, such as DAT_special_1/, DAT_special_2/, etc.
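The name-space rule can be expressed as a small helper, sketched here under the assumption that the first character alone decides the origin of an identifier (a digit for PDB entries, a letter from a to x for the digest). The function name id_namespace is hypothetical, and the check accepts a four-character entry identifier with or without a trailing chain letter.

```shell
# Classify a structure identifier (e.g. 1ppt, a000, a000A) by its first character.
id_namespace() {
  case "$1" in
    [0-9]???|[0-9]????)
      echo "pdb" ;;
    [abcdefghijklmnopqrstuvwx]???|[abcdefghijklmnopqrstuvwx]????)
      echo "afdb" ;;
    *)
      echo "other" ;;
  esac
}
```

Identifiers reported as other (for example, ones starting with y or z) lie outside both ranges and could be reserved for locally generated structures.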

Create the BLAST databases for hierarchical search:

makeblastdb -in Digest/AFDB1.fasta -dbtype prot
makeblastdb -in Digest/AFDB2.fasta -dbtype prot 

Searching

We recommend running DaliLite in hierarchical search mode. This example compares your structure to human proteins in the AlphaFold Database (assuming you have installed DaliLite.v5 in your home directory and you are in the alphafold/ directory where you installed the data as above):

~/DaliLite.v5/bin/dali.pl --hierarchical --oneway --BLAST_DB Digest/AFDB1.fasta \
--pdbfile mystructure.pdb --db Digest/HUMAN.list --repset Digest/HUMAN_70.list \
--dat1 ./ --dat2 ./DAT/ --title "my search" --np 40
The parameters that point to Digest/ and ./DAT/ refer to the digest of the AlphaFold Database. The hierarchical search is rather slow. The 70% identity subsets are significantly smaller than the full sets mainly for plants (Table 1).

DaliLite imports structures and gives each a four-letter identifier. Chains shorter than 30 amino acids are excluded. DaliLite results list both the four-letter identifier and the original file name, which is based on the UniProt accession number. Note that DaliLite detects structural similarities between compact, globular domains. Searches with non-compact and non-globular AlphaFold models therefore yield no hits with significant structural similarity.

Custom-made subsets of the AlphaFold Database

The search database for DaliLite is defined by a list of internal Dali identifiers (four-letter entry identifier + one-letter chain identifier).

AlphaFold Database v.2 contains one million model structures covering model species and Swiss-Prot (Table 1). You can map the amino acid sequences of interest to the nearest match in AlphaFold Database v.2 by running BLAST against Digest/AFDB2.fasta. The resulting list of identifiers refers to the ./DAT/ directory, which you have already populated with structures from the Digest.
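The mapping step might look like the sketch below. It assumes BLAST+ is installed, that makeblastdb has been run on Digest/AFDB2.fasta as described under Installing, that the sequence identifiers in that FASTA file are the internal Dali identifiers, and that a target list holds one identifier per line. None of these details are confirmed here, so inspect the files before relying on it. The function names and the E-value cutoff of 1e-5 are illustrative choices.

```shell
# Search protein sequences (FASTA file given as $1) against the digest.
# Tabular output (-outfmt 6) has the query in column 1 and the subject in column 2.
blast_to_digest() {
  blastp -query "$1" -db Digest/AFDB2.fasta \
         -outfmt 6 -evalue 1e-5 -max_target_seqs 1
}

# Read tabular BLAST output on stdin, keep the first (best) row per query,
# and print the unique subject identifiers, one per line.
best_hit_ids() {
  awk -F'\t' '!seen[$1]++ { print $2 }' | sort -u
}
```

A target list could then be produced with blast_to_digest myproteins.fasta | best_hit_ids > mySubset.list, where myproteins.fasta is a placeholder for your own sequence file.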

AlphaFold Database v.4 contains 200 million model structures covering almost all proteins in the UniProt database. Here you find instructions for (1) obtaining the UniProt accession numbers of a given species from UniProt, (2) downloading the model structures from EBI, (3) importing the model structures into a locally installed DaliLite. Create your subsets in a separate project directory to avoid clashes with the Digest's name space.
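Step (2) can be sketched as follows. The URL pattern is an assumption based on the file naming seen in Table 2 (AF-<accession>-F1-model_v<version>) and on the public download location at EBI; verify it against the current AlphaFold Database site before bulk downloading. The function names are hypothetical, and only the first fragment (F1) of each entry is fetched.

```shell
# Build the assumed download URL of the v4 model for one UniProt accession.
afdb_url() {
  printf 'https://alphafold.ebi.ac.uk/files/AF-%s-F1-model_v4.pdb\n' "$1"
}

# Read accessions (one per line) on stdin and download each model into the
# directory given as $1. Blank lines are skipped.
fetch_models() {
  outdir="${1:-.}"
  mkdir -p "$outdir"
  while read -r acc; do
    [ -n "$acc" ] || continue
    wget -q -P "$outdir" "$(afdb_url "$acc")" || echo "failed: $acc" >&2
  done
}
```

For example, fetch_models ./models < accessions.txt downloads into a separate project directory, which keeps the files away from the Digest's DAT/ directory. Step (3), the import into DaliLite, is not covered by this sketch.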

Having created your target database list (mySubset.list), you can run structural comparisons between structures of interest and the target database (take care to point to the correct --dat1 and --dat2 directories):


~/DaliLite.v5/bin/dali.pl --cd1 a000A --db mySubset.list \
--dat1 ./DAT/ --dat2 ./DAT/ --title "a000A against mySubset" --np 40

Table 1: Subset lists

Short      Scientific name                Common name      FullSet  Subset70
AF         AlphaFold Database version 1   All models        364717    241174
ARATH      Arabidopsis thaliana           Arabidopsis        27400     22895
CAEEL      Caenorhabditis elegans         Nematode worm      19645     18233
CANAL      Candida albicans               C. albicans         5974      5829
DANRE      Danio rerio                    Zebrafish          24640     20023
DICDI      Dictyostelium discoideum       Dictyostelium      12620     11484
DROME      Drosophila melanogaster        Fruit fly          13432     13074
ECOLI      Escherichia coli               E. coli             4301      4174
HUMAN      Homo sapiens                   Human              23332     18899
LEIIN      Leishmania infantum            L. infantum         7924      7708
MAIZE      Zea mays                       Maize              39220     27990
METJA      Methanocaldococcus jannaschii  M. jannaschii       1773      1740
MOUSE      Mus musculus                   Mouse              21558     18146
MYCTU      Mycobacterium tuberculosis     M. tuberculosis     3979      3896
ORYSJ      Oryza sativa                   Asian rice         43581     38243
PLAF7      Plasmodium falciparum          P. falciparum       5186      5016
RAT        Rattus norvegicus              Rat                21254     18017
SCHPO      Schizosaccharomyces pombe      Fission yeast       5124      4961
SOYBN      Glycine max                    Soybean            55693     31054
STAA8      Staphylococcus aureus          S. aureus           2882      2812
TRYCC      Trypanosoma cruzi              T. cruzi           19053      9255
YEAST      Saccharomyces cerevisiae       Budding yeast       6019      5615
AFDB2      AlphaFold Database version 2   -                 992000    701022
swissprot  swissprot                      -                 571708    201968
AJECG      Ajellomyces capsulatus         -                   9172      9142
BRUMA      Brugia malayi                  -                   8719      7007
CAMJE      Campylobacter jejuni           -                   1580      1572
9EURO1     Cladophialophora carrionii     -                  11113     11103
DRAME      Dracunculus medinensis         -                  10895     10504
ENTFC      Enterococcus faecium           -                   2798      2697
9EURO2     Fonsecaea pedrosoi             -                  12473     12401
HAEIN      Haemophilus influenzae         -                   1665      1605
HELPY      Helicobacter pylori            -                   1569      1487
KLEPH      Klebsiella pneumoniae          -                   5754      5573
9PEZI1     Madurella mycetomatis          -                   9504      9216
MYCLE      Mycobacterium leprae           -                   1572      1556
MYCUL      Mycobacterium ulcerans         -                   8955      7732
NEIG1      Neisseria gonorrhoeae          -                   2060      1991
9NOCA1     Nocardia brasiliensis          -                   8292      8189
ONCVO      Onchocerca volvulus            -                  12015     11560
PARBA      Paracoccidioides lutzii        -                   8767      8699
PSEAE      Pseudomonas aeruginosa         -                   5445      5186
SALTY      Salmonella typhimurium         -                   4698      4384
SCHMA      Schistosoma mansoni            -                  13821      9302
SHIDS      Shigella dysenteriae           -                   3934      3445
SPOS1      Sporothrix schenckii           -                   8629      8606
STRR6      Streptococcus pneumoniae       -                   1990      1913
STRER      Strongyloides stercoralis      -                  12781     11934
TRITR      Trichuris trichiura            -                   9677      9040
TRYB2      Trypanosoma brucei             -                   8464      7983
WUCBA      Wuchereria bancrofti           -                  12694     12394

Table 2: AlphaFold models split into two chains

id    short      original file              chain B start
cwy4  DANRE      AF-A0A0R4II06-F1-model_v1  1303
c8mq  RAT        AF-F1M5Q4-F1-model_v1       742
e020  HUMAN      AF-P02751-F1-model_v1       999
fcn8  HUMAN      AF-O75369-F1-model_v1      1035
fh10  TRYCC      AF-Q4CU46-F1-model_v1      1226
fiaz  TRYCC      AF-Q4CTN6-F1-model_v1      1195
finb  TRYCC      AF-Q4DVS3-F1-model_v1      1200
fjig  TRYCC      AF-Q4DFV2-F1-model_v1      1200
fjyd  TRYCC      AF-Q4CRW2-F1-model_v1      1218
flus  TRYCC      AF-Q4CTC1-F1-model_v1      1120
flw0  TRYCC      AF-Q4DH14-F1-model_v1      1228
fned  TRYCC      AF-Q4CSQ4-F1-model_v1      1151
fnld  TRYCC      AF-Q4CY82-F1-model_v1      1243
fop7  TRYCC      AF-Q4CST2-F1-model_v1      1208
fpak  TRYCC      AF-Q4D802-F1-model_v1      1240
fpjj  TRYCC      AF-Q4CZ74-F1-model_v1      1250
fpnd  TRYCC      AF-Q4CTR3-F1-model_v1      1209
fqdk  TRYCC      AF-Q4CX92-F1-model_v1      1248
frwd  TRYCC      AF-Q4CUD3-F1-model_v1      1186
frx2  TRYCC      AF-Q4CX06-F1-model_v1      1250
fsm1  TRYCC      AF-Q4CXH5-F1-model_v1      1250
ftc2  TRYCC      AF-Q4CSF7-F1-model_v1      1132
fuhh  TRYCC      AF-Q4CSJ3-F1-model_v1      1231
fusl  TRYCC      AF-Q4CT27-F1-model_v1      1236
fuuc  TRYCC      AF-Q4CTS2-F1-model_v1      1230
fu0u  TRYCC      AF-Q4CVN2-F1-model_v1      1210
fvnn  TRYCC      AF-Q4D1P3-F1-model_v1      1252
f91q  MOUSE      AF-Q8BTM8-F1-model_v1      1064
ggud  MOUSE      AF-Q8R4Y4-F1-model_v1      1126
ij5j  PSEAE      AF-Q9I2M3-F1-model_v2      1406
iqli  BRUMA      AF-A0A5S6P8V9-F1-model_v2   964
is2z  BRUMA      AF-A0A5S6P8X1-F1-model_v2  1150
j0n4  DRAME      AF-A0A158Q5A1-F1-model_v2   917
o43k  swissprot  AF-P15921-F1-model_v2      1210
ofty  swissprot  AF-Q8X8V7-F1-model_v2      1137
v862  TRITR      AF-A0A077Z2J4-F1-model_v2  1008