SANS: Suffix Array Neighborhood Search

SANS is a program for searching protein sequence databases. It is as sensitive as BLAST when sequence identity is 50-100%. In batch mode SANS is faster than BLAST: for example, a bacterial genome can be compared to Uniprot in one hour with SANS, versus 100 hours with BLAST.

Download SANS from here

Requirements:

  1. Linux operating system
  2. gfortran (Fortran-90 compiler)
  3. Perl
  4. RAM: at least 1 byte per amino acid (current Uniprot is ~ 8 Gbytes)
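
As a rough sizing check, the rule of 1 byte per amino acid can be applied by counting the residues in the database FASTA files. The sketch below is illustrative and not part of SANS: the function name count_residues is hypothetical, and it assumes that gzip and awk are installed.

```shell
# Hypothetical helper (not part of SANS): count amino acids in FASTA files
# to estimate the minimum RAM needed at 1 byte per residue.
count_residues() {
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc "$f" ;;   # decompress gzipped FASTA to stdout
            *)    cat "$f" ;;        # plain FASTA is passed through as-is
        esac
    done | awk '
        /^>/ { next }                              # skip header lines
        { gsub(/[ \t\r]/, ""); n += length($0) }   # count residue characters
        END { printf "%.0f\n", n }                 # total number of residues
    '
}
```

For example, `count_residues uniprot-sprot.fasta.gz uniprot-trembl.fasta.gz` prints the total residue count, which approximates the minimum number of bytes of RAM needed for that database.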

Installation:

  1. tar -zxvf sans.tar.gz
  2. cd sans
  3. edit the MEMORY parameter in globals.f to match your computer's RAM
  4. edit HOME and ESCAPED_HOMEDIR in Makefile
  5. make install

Running SANS:

The inputs are FASTA-formatted protein sequences. Multiple database.fasta files are accepted, and they may be gzipped. The maximum size of the query sequences is 2 G amino acids. The default number of top hits returned is 250. A search takes three steps: 1) indexing the database, 2) indexing the query sequences, 3) searching:
  1. perl sansformatdb.pl db-name database.fasta
  2. perl sansindex.pl query-name query.fasta db-name
  3. perl sans.pl query-name db-name [number-of-top-hits-to-output]
For example, to compare Dickeya solani against Uniprot and report the top 1000 hits:
  1. perl sansformatdb.pl uniprot /shared/databases/uniprot/uniprot-sprot.fasta.gz /shared/databases/uniprot/uniprot-trembl.fasta.gz
  2. perl sansindex.pl solanii /data/liisa/dickeya_solanii.fasta uniprot
  3. perl sans.pl solanii uniprot 1000 > solanii.test

Step 1 creates a number of indices for the uniprot database in files with the prefix db-name (uniprot.pin, uniprot.psq, uniprot.phr, uniprot.SA, uniprot.ISA). Step 2 creates a number of indices for the query sequences in files with the prefix query-name (solanii.pin, solanii.psq, solanii.phr, solanii_uniprot.ISA_mapped). Step 3 uses the precomputed indices for the database and query sequences, with prefixes db-name and query-name, and computes the similarity scores between the query and database proteins.
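
The three steps above can be chained in a small wrapper so that a later step runs only if the earlier one succeeded. This is a minimal sketch, not part of SANS: the function name run_sans and the SANS_DRY_RUN variable are hypothetical, and the script assumes it is run from the directory that contains the SANS Perl scripts.

```shell
# Hypothetical wrapper (not part of SANS) chaining the three documented steps.
# Usage: run_sans DB_NAME QUERY_NAME QUERY_FASTA TOP_HITS DB_FASTA...
# Set SANS_DRY_RUN=1 to print the commands instead of running them.
run_sans() (
    if [ "$#" -lt 5 ]; then
        echo "usage: run_sans DB_NAME QUERY_NAME QUERY_FASTA TOP_HITS DB_FASTA..." >&2
        exit 1
    fi
    db=$1 query=$2 query_fasta=$3 top=$4
    shift 4                      # remaining arguments are the database FASTA files
    run=""
    if [ "${SANS_DRY_RUN:-0}" = "1" ]; then run="echo"; fi
    # Each step runs only if the previous one succeeded.
    $run perl sansformatdb.pl "$db" "$@" &&
    $run perl sansindex.pl "$query" "$query_fasta" "$db" &&
    $run perl sans.pl "$query" "$db" "$top"
)
```

With the example above, this would be called as `run_sans uniprot solanii /data/liisa/dickeya_solanii.fasta 1000 /shared/databases/uniprot/uniprot-sprot.fasta.gz /shared/databases/uniprot/uniprot-trembl.fasta.gz > solanii.test`. The function body runs in a subshell, so its variables do not leak into the calling shell.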

The columns of the output are:

  1. query protein identifier
  2. sbjct protein identifier
  3. query numerical identifier
  4. sbjct numerical identifier
  5. SANS score
  6. bitscore (BLOSUM62)
  7. number of aligned positions
  8. sequence identity
  9. query coverage
  10. sbjct coverage
  11. sbjct description line
Advanced options are shown by the --help option of individual programs.
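
The column layout listed above lends itself to simple post-filtering with awk. The sketch below keeps hits at or above a chosen sequence identity and prints the query identifier, sbjct identifier and bitscore. It is illustrative only: the function name filter_hits is hypothetical, the output is assumed to be tab-separated, and the threshold must be given in whatever units the identity column actually uses.

```shell
# Hypothetical helper (not part of SANS): keep hits whose sequence identity
# (column 8) is at least MIN_IDENTITY; print query id, sbjct id, bitscore.
# ASSUMPTION: columns are tab-separated; adjust -F if the output differs.
# Usage: filter_hits MIN_IDENTITY [FILE...]   (reads stdin if no FILE given)
filter_hits() {
    min=$1
    shift
    awk -F '\t' -v min="$min" '
        $8 + 0 >= min + 0 { print $1 "\t" $2 "\t" $6 }   # columns 1, 2 and 6
    ' "$@"
}
```

For example, `filter_hits 0.9 solanii.test` would list the hits in solanii.test with identity of at least 0.9, assuming identity is reported as a fraction.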

Reference:

J.P. Koskinen & L. Holm (2012) SANS: high-throughput retrieval of protein sequences allowing 50% mismatches. Bioinformatics 28: i438-i443