SANSPANZ framework: For developers

SANSPANZ is a framework for homology-based protein function prediction and for the evaluation of GO predictions. It is written in Python and has a modular design that is easy to extend.

SANSPANZ provides access methods to two core servers: SANSparallel, for sequence similarity search, and the DictServer, a key/value store of metadata related to protein annotations (GO annotations, GO class frequencies, description line word frequencies, species taxonomy, etc). SANSPANZ client scripts access the servers via TCP (local servers) or HTTP protocols (remote servers). The remote servers are maintained by the authors (Holm group).

SANSPANZ accepts query sequences in FASTA format or tabular format similar to BLAST output format 6. The main loop breaks the input data into batches and, in the case of raw protein sequences, calls the SANSparallel server to get the sequence neighbourhood for each query. For all database hits, the DictServer is called to retrieve annotation metadata.

Data processing is organized around a two-dimensional data matrix containing individuals (rows) and attributes (columns). Columns and rows can be appended to SpreadSheets by classes that inherit from the Operator class. Operators are defined as either row-wise (per database hit) or block-wise (per input query), depending on the nature of the statistic being calculated. Within SANSPANZ, all data processing is performed by Operators, which can also be chained together to create composite Operators.
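The row-wise versus block-wise distinction can be sketched in plain Python. This is an illustration only: the class names below are hypothetical stand-ins, not the SANSPANZ RowOperator/BlockOperator API, and plain dicts stand in for spreadsheet rows.

```python
# Illustrative sketch: a row-wise operator transforms one row (one database
# hit) at a time; a block-wise operator summarises all rows of one query.
# These classes are made up for the example, not SANSPANZ classes.

class UppercaseDesc:
    """Row-wise: acts on a single row."""
    def process(self, row):
        row["desc"] = row["desc"].upper()

class HitCounter:
    """Block-wise: acts on all rows belonging to one query."""
    def process(self, block):
        return {"qpid": block[0]["qpid"], "hits": len(block)}

block = [{"qpid": "P1", "desc": "kinase"}, {"qpid": "P1", "desc": "transferase"}]
row_op, block_op = UppercaseDesc(), HitCounter()
for row in block:
    row_op.process(row)            # per-hit transformation
summary = block_op.process(block)  # per-query summary row
```

In the real framework the summary row would be appended to a second spreadsheet keyed by the query, as shown in the taxonassignment example later in this document.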

The idea behind the SANSPANZ framework is that new functionality can be developed by simply writing a new operator class, which is called from the standard workflow (Figure 1).

Overview of the class library

There is help text in the Python code. The main classes are listed below.

Parameters	Defines the workspace with spreadsheets, dictionaries, and parameters
Runner	The main loop handling the input data stream
DictServer	Handles communications with the core servers
SpreadSheet	A data object
operators	Subclassed from myoperator. Modify the contents of spreadsheets. The only part a developer needs to touch.

The operators can call utility functions, including the following modules:

XMLParser	Specific functions to parse SANSparallel results
HYGE	Functions for calculating hypergeometric statistics
GSZ	Functions for calculating the Gene Set Z-score statistic
Read_and_Print	Functions for loading dictionaries from files
PannzerFunctions	Specific functions used by the Pannzer predictor
GO_dist_functions	A collection of GO distance related functions
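The HYGE module itself is not reproduced in this document. As an illustration of the kind of statistic it provides, the hypergeometric tail probability underlying enrichment tests can be computed with the standard library alone (the function below is a generic sketch, not the HYGE API):

```python
# Illustrative sketch of a hypergeometric enrichment statistic: the
# probability of seeing at least k annotated hits in a sample of n,
# when K of the N database proteins carry the annotation.
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(N population, K successes, n draws)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# e.g. 4 of 10 sampled hits carry a GO term that annotates 20 of 100 proteins
p = hypergeom_tail(4, 20, 10, 100)
```

A small tail probability indicates that the GO term is over-represented in the sequence neighbourhood relative to the database background.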


Figure 1. The main loop of SANSPANZ.

Generic script

This script applies an operator to the input data stream. All parameters are stored in the dictionary glob.param. Default values can be overwritten by command-line options or a configuration file. The glob object handle is passed to all operators. The appropriate data structures are created when Runner instantiates the operator. The lazyRunner method has two obligatory arguments: the input file ("--" for STDIN) and a list of output files ("--" for STDOUT). The default input format is FASTA.
import Parameters, Runner
if __name__ == '__main__' :
   glob=Parameters.WorkSpace(configFile=None)
   z=Runner.Runner(glob,operator_name=glob.param['input_OPERATOR'],CHUNK=glob.param['input_CHUNK'], liveData=glob.param['input_LIVEDATA'])
   z.lazyRunner(glob.param['input_FILE'], glob.param['input_OUTFILES'].split(","), queryspecies=glob.param['input_QUERYSPECIES'], colnames=glob.param['input_COLNAMES'], block_column_name=glob.param['input_BLOCK_COLUMN_NAME'], input_format=glob.param['input_FORMAT'])
The customized script below uses hardcoded parameters to execute the Pannzer method of function prediction, starting from a set of sequences in FASTA format. Setting glob.param['CONN_REMOTE']=True ensures that the SANSparallel server and the DictServer are accessed remotely. The query species is defined using the -s option on the command line.
import Parameters, Runner
if __name__ == '__main__' :
   glob=Parameters.WorkSpace()
   glob.param['CONN_REMOTE']=True
   z=Runner.Runner(glob,operator_name='Pannzer')
   z.lazyRunner("--",[None, "DE.out", "GO.out", "anno.out"],queryspecies=glob.param['input_QUERYSPECIES'])
The lazyRunner method has a number of qualifiers that can be used to pass information about the structure of the input data. As mentioned above, the default input format is FASTA-formatted sequences (input_format="FASTA"); the alternative is tabular data (input_format="tab").

Raw sequences are automatically sent to the SANSparallel server to get the sequence neighbourhood for each query and converted to a tabular format.

Tabular data is assumed to have column names on the first row (comment lines starting with '#' are ignored). The column separator is a tab. If the input data does not have a header row, you can define column names using the colnames qualifier. It is a string of concatenated column names, separated by spaces. For example, GO predictions could have three columns defined by colnames="qpid goid score". Column order is free, as the operators identify data columns by the header string.
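Because operators look up columns by header string, column order is irrelevant. The idea can be sketched in a few lines (the helper below is an illustration, not the SANSPANZ use_columns API):

```python
# Illustrative sketch: operators resolve column names to indices once,
# then read cells by index, so the input column order is free.
def column_indices(header, wanted):
    """Map requested column names to their positions in the header row."""
    return [header.index(name) for name in wanted]

header = "qpid goid score".split()          # e.g. from colnames="qpid goid score"
goid_col, qpid_col = column_indices(header, ["goid", "qpid"])
row = "P12345\tGO:0005524\t0.87".split("\t")
value = row[goid_col]                       # cell accessed by name-derived index
```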

Depending on the nature of the operator, the input data stream can be grouped by the variable defined by block_column_name. The default is 'qpid' (query protein identifier) when the data is a list of sequence neighbours of the query protein.
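The grouping behaviour can be pictured with itertools.groupby, which forms blocks of consecutive rows sharing the value in the blocking column (an illustration of the concept, not the actual main loop; groupby assumes the stream is already ordered by that column):

```python
# Illustrative sketch: group a tabular stream into per-query blocks,
# as the main loop does with block_column_name='qpid'.
from itertools import groupby

rows = [
    ["P1", "GO:0003677", "0.9"],
    ["P1", "GO:0005524", "0.8"],
    ["P2", "GO:0016021", "0.7"],
]
qpid_col = 0  # rows are assumed pre-sorted on the blocking column
blocks = [list(group) for _, group in groupby(rows, key=lambda r: r[qpid_col])]
```

Each block is then handed to a block-wise operator's process() method in turn.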

Finally, lazyRunner can be given the queryspecies qualifier if the query species is not defined in the input data. The species information is used by Pannzer.

Worked example

This command applies the taxonassignment operator to user inputs, which are FASTA-formatted protein sequences.
python /data/liisa/PANNZER2/runsanspanz.py -m taxonassignment -i example_input.fasta -o ',--' 2> /dev/null
This is the input:
>scaffold100_0M.1052_C SIZE=0 RANGE=448350-449212 (+) RNA-directed DNA polymerase
YFKRIIHHNQVGFIPGLQGWFNILKSINVIQYINKRKNKNHMILSIDAEKAFDNVQHPFLIKTLQSVGIEGTYLNIIKAIYEKPTANIILNGEKLRAFPLRSGTWQGCPLSPLLFNIVLEVLVTAIRQQKIKGIIGKEEVKLSLFADDMILYVENPKDSTPKLELIQEFSKVAGYKFNAQKSIAFIYTNNKTEEREIKESIPFTIAPKTIR
YLGINLTKEAKNLYCENYKILMKAIEEDTKKWKNVPCLWIGRTNIVRMSMLPRAIYTFNEILSKYHQHFSKKWN
>scaffold100_0M.1321_A SIZE=0 RANGE=497083-497991 (+) Endonuclease (Fragment)
KNIHIDQWHRIESPDMDPQLYGQLIFSKAGKNIQWKKDSLFNKWCWENWTSIYRRMKLNHSLTPYTKINSKWMKDLNVRQEAIKVLEENIDSNLLDIGHSNFFQDMSPKARETKVKMNFWDFIKIKSFCTAKETVNKTKRQPMEWEKIFANDTTDKGLVSKIYKELLKLNTQKTNNQVKKWAEDMNRHFSEEDIQMANRHMKKCSSSLAIR
EIQIKTTLRYHLTPVRMAETDKARNNKCWRGCGERGTLLHCWWECKLVPLWKTVWSFLKKLKIELLYDPAIALLGIYPKDSDVVKRRAICT
This is the output (the numbers may change with database updates):
qpid    taxon   count   total   frequency       rank
scaffold100_0M.1052_C   Eukaryota; Metazoa      186     195     0.953846153846  1
scaffold100_0M.1321_A   Eukaryota; Metazoa      208     220     0.945454545455  1

Implementation

The taxonassignment operator (operators/taxonassignment.py) is implemented as a composite operator which calls two simpler operators. Each operator works on one or more spreadsheets. The rows of a spreadsheet represent data associated with a primary key. If you compute a summary over a group of rows, you can output the summary data to a new spreadsheet which uses the group as the new primary key. Below, we dissect the simpler operator classes first. The first operator illustrates simple data manipulation within a spreadsheet. The second operator illustrates how to import data from online dictionaries. Finally, the third operator computes a summary over groups of rows in the first spreadsheet and outputs the summary to a second spreadsheet.

Taxon: a simple operator

Operators create new columns in spreadsheets. This is a simple operator (operators/taxon.py) which does some string processing and demonstrates the essential elements of an operator class.
from myoperator import RowOperator
import sys

# Clone a rowwise operator. Class name and file name (operators/classname.py) must be identical.
class taxon(RowOperator) :
    """Extract taxon from lineage to taxon_depth."""
    def __init__(self,glob):
        # define nicknames for column indices
        [self.lineage_col, self.taxon_col]=glob.use_sheet("data").use_columns(["lineage","taxon"])
        self.taxon_depth=int(glob.param['TAXON_DEPTH'])

    def process(self, row):
        x=row[self.lineage_col]
        tmp=x.split(";")
        if len(tmp) > self.taxon_depth:
            row[self.taxon_col]=";".join(tmp[0:self.taxon_depth])
        else:
            row[self.taxon_col]=";".join(tmp)

Lineage operator: using dictionaries

operators/lineage.py is another simple operator with much the same anatomy as the taxon operator except that it uses a dictionary.
 1: import sys
 2: from myoperator import RowOperator
 3: 
 4: # Clone a rowwise operator. Class name and file name (operators/classname.py) must be identical.
 5: class lineage(RowOperator) :
 6:    """Do a dictionary lookup for the taxonomic lineage of a species."""
 7:     def __init__(self,glob):
 8:         self.glob=glob
 9:         # define nicknames for column indices
10:         [self.species_col, self.lineage_col]=self.glob.use_sheet("data").use_columns(["species","lineage"])
11:         # use online dictionary. Object handles in glob are hardcoded
12:         self.glob.use_online_dictionaries(["LINEAGE"])
13: 
14:     def process(self,row):
15:         species=row[self.species_col]
16:         row[self.lineage_col]=self.get_lineage(species)
17: 
18:     def get_lineage(self,species):
19:         ucspecies=species.upper() # lineage keys are UPPERCASE species
20:         if ucspecies in self.glob.lineage: return(self.glob.lineage[ucspecies])
21:         if species != "unknown": sys.stderr.write("# Warning: unknown species ||%s||\n" %species)
22:         return("unclassified")
  • Line 12 subscribes to the LINEAGE dictionary.
  • Line 20 does a lookup in the LINEAGE dictionary.

    Dictionaries can be online or local. We use online dictionaries to save memory and loading time. The DictServer.lookup_key_values() method queries an online dictionary with a (table, key) tuple and it returns a (table, key, value) tuple.

    Runner.lazyRunner caches data for all dictionaries below, because the same keys are likely to recur frequently in lists of homologous proteins. For LINEAGE, keys are taken from column 'species'. For GODICT, keys are taken from column 'spid' or 'qpid'. For DESCCOUNT and WORDCOUNT, keys are taken from column 'desc'. GOIDELIC downloads a copy of the entire GO structure (about 44,000 GO terms). Object handles to the cached dictionaries are defined in glob and fixed.
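The caching idea can be sketched as follows. This is an illustration of the concept only: cached_lookup and fake_fetch are made-up names, not the DictServer API.

```python
# Illustrative sketch: repeated keys in lists of homologous proteins are
# served from a local cache; only unseen keys are sent to the server.
def cached_lookup(keys, cache, fetch):
    """fetch() stands in for a remote (table, key) -> value query."""
    missing = [k for k in dict.fromkeys(keys) if k not in cache]
    if missing:
        cache.update(fetch(missing))
    return [cache[k] for k in keys]

calls = []
def fake_fetch(keys):               # hypothetical remote lookup
    calls.append(list(keys))
    return {k: k.lower() for k in keys}

cache = {}
cached_lookup(["HUMAN", "MOUSE", "HUMAN"], cache, fake_fetch)
cached_lookup(["HUMAN", "RAT"], cache, fake_fetch)
# only two server round-trips despite five key lookups
```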

    The available online dictionaries are the following. All keys must be uppercase.

    Subscribe to	Table name	Key	Value	Object handle to cached data
    LINEAGE	LINEAGE	species scientific name	Taxonomic lineage of species	glob.lineage
    TAXID	TAXID	species scientific name	NCBI numeric identifier of taxon	glob.taxid
    WORDCOUNT	WORDCOUNT	word	Number of descriptions in uniprot that contain word	glob.wordcounts
    WORDCOUNT	NWORDTOTAL	None	Total number of unique words (cleaned) in uniprot DE-lines	glob.nwordtotal (constant)
    DESCCOUNT	DESCCOUNT	uniprot description	Number of identical descriptions in uniprot	glob.desccounts
    DESCCOUNT	NPROT	None	Total number of protein entries in uniprot	glob.nprot (constant)
    GODICT	GODICT	uniprot accession	GO identifiers assigned to uniprot entry by GOA	glob.GOdict
    GOIDELIC	GOCOUNTS	GO identifier	Total number of uniprot entries assigned to GO term (propagated to parents)	glob.GOcounts
    GOIDELIC	GOPARENTS	GO identifier	GO identifiers of parental line	glob.GOparents
    GOIDELIC	GODESC	GO identifier	Short description of GO term	glob.godesc
    GOIDELIC	ONTOLOGY	GO identifier	BP, MF or CC	glob.ontology
    GOIDELIC	IC	GO identifier	Information content of GO term	glob.IC
    GOIDELIC	EC	GO identifier	Inverse of ec2go	glob.EC
    GOIDELIC	KEGG	GO identifier	Inverse of kegg2go	glob.KEGG

    You can naturally also load your own dictionary from a file using, for example, Read_and_Print.read_dict_data(). The data should be tab-separated (key,value) pairs. Attach your dictionary to the workspace object (here: glob) if it needs to be visible to other operators.
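A minimal sketch of loading such a file is shown below, in the spirit of Read_and_Print.read_dict_data (the exact signature of that function is not reproduced here; read_dict is an illustrative stand-in):

```python
# Illustrative sketch: load a private dictionary from tab-separated
# (key, value) pairs, skipping blank lines and '#' comments.
import io

def read_dict(stream):
    d = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        key, value = line.split("\t", 1)
        d[key] = value
    return d

# a file-like object stands in for open("mydict.tsv")
data = io.StringIO("# comment\nHUMAN\tHomo sapiens\nMOUSE\tMus musculus\n")
d = read_dict(data)
```

Attaching the result to the workspace object (e.g. glob.mydict = d) makes it visible to other operators, as noted above.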

    Taxonassignment: a composite operator using multiple spreadsheets

    This operator calls the above simple operators to add lineage and taxon columns to the inputs. It then proceeds to compute a summary for each query sequence. In this case, it outputs the support of the most frequent taxon.
from myoperator import BlockOperator
import operator

# Clone a blockwise operator. Class name and file name (operators/classname.py) must be identical.
class taxonassignment(BlockOperator) :
    """Outputs most frequent taxon per block."""
    def __init__(self,glob, maxrank=1):
        # parameters used by self.process()
        self.glob=glob
        self.maxrank=maxrank
        # define private object handles to spreadsheets (glob maps handles by name)
        [self.data,self.summary_data]=glob.use_sheets(["data","summary"])
        # define nicknames for column indices
        [self.qpid_col1,self.isquery_col,self.taxon_col1]=self.data.use_columns(["qpid","isquery","taxon"])
        self.summary_data.use_columns(["qpid","taxon","count","total","frequency","rank"])
        # this is a composite operator
        [self.a,self.b]=glob.use_operators(['lineage','taxon'])

    def process(self, block):
        # call preprocessing operators
        for row in block:
            self.a.process(row) # add lineage
            self.b.process(row) # add taxon
        # get query
        qpid='unknown'
        for row in block:
            if row[self.isquery_col]=="1": qpid=row[self.qpid_col1] # string comparison
        # compute summary statistics
        n=0
        cnt={}
        for row in block:
            n+=1
            taxon=row[self.taxon_col1]
            if not taxon in cnt: cnt[taxon]=0
            cnt[taxon]+=1
        # reverse sort by counts
        sorted_cnt = sorted(cnt.items(), key=operator.itemgetter(1), reverse=True)
        # save top ranks in summary table
        rank=0
        for (taxon,count) in sorted_cnt:
            rank+=1
            freq=0
            if n>0: freq=count/float(n)
            # assume fixed column order
            datarow=[qpid,taxon,str(count),str(n),str(freq),str(rank)]
            self.summary_data.append_row(datarow)
            if rank>=self.maxrank: break
    
    

    Implementing and evaluating novel scoring functions for function prediction

    Scoring function

    The Pannzer method involves homology search, filtering and clustering of the sequence neighborhood, and scoring GO classes found in the sequence neighborhood. The scoring is done by an operator called RM3 and different variants of scoring functions are implemented as functions within the RM3 class. This way they can share basic statistical variables that are computed by the RM3.process() function. The function below implements a simple baseline function called 'SANS':
            def SANS_score(self,goid):
                    "Frequency of GO classes in sequence neighborhood. Returns (raw score, PPV)."
                    obs=float(self.goclasscount[goid])/self.sampleSize
                    ppv=obs
                    return(obs,ppv)
    
    Scoring functions can be specified by name from the command line (--PANZ_PREDICTOR, --eval_SCOREFUNCTIONS), and the RM3.process method will then execute the appropriate xxx_score function.
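The dispatch-by-name pattern can be sketched as follows. The Scorer class below is an illustrative stand-in, not the real RM3 class; only the SANS_score body is taken from the listing above.

```python
# Illustrative sketch: scoring functions named xxx_score share statistics
# computed on the instance and are dispatched by name, as RM3 does for
# the names given via --PANZ_PREDICTOR.
class Scorer:
    def __init__(self, goclasscount, sample_size):
        self.goclasscount = goclasscount   # GO term -> count in neighbourhood
        self.sampleSize = sample_size      # neighbourhood size

    def SANS_score(self, goid):
        "Frequency of GO classes in sequence neighborhood. Returns (raw score, PPV)."
        obs = float(self.goclasscount[goid]) / self.sampleSize
        return (obs, obs)

    def score(self, name, goid):
        # name-based dispatch to the matching xxx_score method
        return getattr(self, name + "_score")(goid)

s = Scorer({"GO:0005524": 30}, 100)
raw, ppv = s.score("SANS", "GO:0005524")
```

A new scoring variant is then just one more xxx_score method on the class, reachable by name without touching the dispatch logic.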

    Evaluation

    Let's assume we have a test set of protein sequences called testset.fasta and a reference of truth in GAF format (qpid, goid) called testset_truth.gaf. The reference of truth is preprocessed by propagating the direct GO term assignments to all parents in the GO hierarchy:
    python runsanspanz.py -m gaf2propagated -i testset_truth.gaf -f tab -c "qpid goid" -o ",testset_truth_propagated"
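What "propagating to all parents" means can be sketched in a few lines. This is an illustration of the concept only; the toy parent map is made up, and the real gaf2propagated operator uses the GOPARENTS dictionary.

```python
# Illustrative sketch: expand each direct GO assignment with all of its
# ancestors in the GO hierarchy (child -> parent(s) map).
def propagate(direct_goids, parents):
    out = set()
    for goid in direct_goids:
        stack = [goid]
        while stack:
            g = stack.pop()
            if g not in out:
                out.add(g)
                stack.extend(parents.get(g, []))
    return out

parents = {"GO:C": ["GO:B"], "GO:B": ["GO:A"]}   # toy hierarchy
expanded = propagate({"GO:C"}, parents)
```

Propagation makes the truth set consistent with predictors that score parent terms, so that a correct but more general prediction is not counted as a miss.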
    
    Below, we save the result from homology search for reuse later with different prediction parameters. Scoring functions for prediction and evaluation are selected with the --PANZ_PREDICTOR and --eval_SCOREFUNCTIONS arguments.
    python runsanspanz.py -m SANS -f FASTA -i testset.fasta -o testset.fasta.sans
    python runsanspanz.py -m Pannzer -f tab -i testset.fasta.sans --PANZ_PREDICTOR "RM3,ARGOT,JAC,HYGE,SANS" -o ',,testset.fasta.sans.GO,'
    python runsanspanz.py -m GOevaluation -f tab -i testset.fasta.sans.GO --eval_TRUTH testset_truth_propagated --eval_SCOREFUNCTIONS "RM3_PPV ARGOT_PPV JAC_PPV HYGE_PPV SANS_PPV" -o ',,testset.fasta.sans.GO.eval'
    

    Summary: how to write your own SANSPANZ application

    The main application script

    The main application script is very short. It must instantiate a Parameters.WorkSpace object (glob) which is passed to the Runner object. Runner.lazyRunner must be informed about the format and source of input data and about the destinations of output data. Below is a minimal example.
    import Parameters, Runner
    if __name__ == '__main__' :
        glob=Parameters.WorkSpace()
        z=Runner.Runner(glob, operator_name='myfavourite')
        z.lazyRunner("--", ["--"],input_format="tab",colnames="col1 col2")
    

    Operator classes

    You can implement a new functionality by writing a new operator in the SANSPANZ/operators directory. The operator class is subclassed from BlockOperator or RowOperator, for example:
    from myoperator import BlockOperator
    class myfavourite(BlockOperator):
    
    or
    from myoperator import RowOperator
    class myfavourite(RowOperator):
    
    (The TextOperator type was created as an afterthought; it bypasses the main loop and passes each line of the input stream directly to the operator's process() function. operators/obo.py is a TextOperator that parses the GO hierarchy out of an OBO file.)

    The __init__ method is called with a Parameters.WorkSpace() object handle, for example:

    	def __init__(self,glob):
    
    The __init__ method should define:
    1. which spreadsheets are used, for example
      	 [self.data,self.summary_data]=glob.use_sheets(["data","summary"])
      
      or
      	self.data=glob.use_sheet("data")
      
    2. which columns are going to be processed, for example
      	[self.qpid_col1,self.isquery_col,self.taxon_col1]=self.data.use_columns(["qpid","isquery","taxon"])
      	[self.qpid_col2,self.taxon_col2,self.count_col,self.total_col,self.freq_col,self.rank_col]=self.summary_data.use_columns(["qpid","taxon","count","total","frequency","rank"])
      
      or
      	[self.lineage_col, self.taxon_col]=glob.use_sheet("data").use_columns(["lineage","taxon"])
      
    3. which dictionaries are going to be used, for example
      	glob.use_online_dictionaries(["LINEAGE","WORDCOUNT","DESCCOUNT","GOIDELIC","GODICT"])
      
      or
      	glob.load_dictionaryname()
      
      where dictionaryname is one of lineage, wordcounts, desccounts, GOdict, GOdict_noIEA, goidelic, nprot or nwordtotal. The respective data file names are defined by parameters in glob.param. Small private dictionaries can be loaded from a file using the function Read_and_Print.read_dict_data().
    4. which other operators are going to be used, for example
      	[self.a,self.b]=glob.use_operators(['lineage','taxon'])
      
    Every operator class must have a method called process, for example:
    	def process(self, block):
    
    or
    	def process(self, row):
    

    The process() method writes new values to the cells of the spreadsheet objects, and this is the part that is entirely up to you. The Runner object (see main application script) automatically takes care of all input and output.

    The operator class can also have a finalise() method, which is called after the main loop has finished. It can, for example, print the total number of queries that have been processed.
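The finalise() pattern can be sketched as follows. The class below is a self-contained stand-in (it does not subclass the real SANSPANZ base classes) and only illustrates the lifecycle: process() per block during the main loop, finalise() once at the end.

```python
# Illustrative sketch: an operator that counts processed blocks and
# reports the total in finalise() after the main loop has finished.
import sys

class countingoperator:
    def __init__(self):
        self.nqueries = 0

    def process(self, block):
        self.nqueries += 1       # one block per query

    def finalise(self):
        sys.stderr.write("# processed %d queries\n" % self.nqueries)

op = countingoperator()
for block in ([["P1"]], [["P2"]]):   # toy stand-in for the input stream
    op.process(block)
op.finalise()
```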

    Novel function prediction scores