We present GOParGenPy, a fast method for generating GO data matrices. The link to the BMC publication is here. The link to the pre-print version of the manuscript is here. The package provides methods that:
  1. Parse the input annotation file.
  2. Read the standard gene_ontology_edit.obo file, parse it, and store all GO classes and their attributes.
  3. Link the GO annotations to their parent nodes. The linking also looks up alternative IDs for terms that have become obsolete.
  4. Print the list of genes with the added parent nodes.
  5. Print a sparse or full matrix with genes as rows and GO classes as columns. Using the sparse matrix is encouraged.
  6. Process the sparse matrix in R.
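
The parsing and parent-linking steps above can be sketched roughly as follows. This is a minimal illustration, not GOParGenPy's actual code: the function names `parse_obo` and `with_ancestors`, the toy GO IDs, and the embedded OBO snippet are all assumptions made for the example. It follows only `is_a` relations and resolves only `alt_id` entries.

```python
from collections import defaultdict

# Toy OBO snippet with made-up IDs (hypothetical, for illustration only).
OBO_TEXT = """\
[Term]
id: GO:0000001
name: root process

[Term]
id: GO:0000002
name: child process
alt_id: GO:0000099
is_a: GO:0000001 ! root process

[Term]
id: GO:0000003
name: grandchild process
is_a: GO:0000002 ! child process
"""


def parse_obo(text):
    """Return (parents, alt_ids) collected from the [Term] stanzas."""
    parents = defaultdict(set)  # term id -> set of direct is_a parents
    alt_ids = {}                # alternative id -> primary id
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[len("id: "):]
            parents[current]  # register the term even if it has no parent
        elif in_term and current and line.startswith("alt_id: "):
            alt_ids[line[len("alt_id: "):]] = current
        elif in_term and current and line.startswith("is_a: "):
            # Drop the trailing "! name" comment after the parent id.
            parents[current].add(line[len("is_a: "):].split("!")[0].strip())
    return parents, alt_ids


def with_ancestors(go_id, parents, alt_ids):
    """Return the term (alt_id resolved) together with all its ancestors."""
    start = alt_ids.get(go_id, go_id)
    seen = set()
    stack = [start]
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


parents, alt_ids = parse_obo(OBO_TEXT)
print(sorted(with_ancestors("GO:0000003", parents, alt_ids)))
print(sorted(with_ancestors("GO:0000099", parents, alt_ids)))
```

Applying `with_ancestors` to every annotated term of a gene and taking the union gives that gene's row of GO classes with parent nodes added.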

GOParGenPy allows you to:

  1. Generate a GO binary matrix from different data sources in a fraction of the time required by other methods.
  2. Select up-to-date GO annotations for the analyzed genes.
  3. Select up-to-date GO structure.
  4. Alternatively, use an older version of the GO annotations and/or GO structure. This may be needed when:
    1. Replicating earlier work.
    2. Comparing performance with an analysis tool that uses a specific older GO structure.
  5. Generate a GO binary matrix that can be used to compare various analysis tools in different analysis platforms.
  6. Run analyses with in-house generated GO annotations.
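
To illustrate what a GO binary matrix in sparse form can look like, here is a small sketch. The function names, the gene names, the toy GO IDs, and the (row, column) pair layout are assumptions for the example. The actual file layout written by GOParGenPy is not described in this section.

```python
def to_sparse_triplets(gene_to_terms):
    """Map {gene: set of GO ids} to (genes, terms, list of (row, col) pairs).

    Each (row, col) pair marks a 1 in the binary matrix; all other cells are 0.
    """
    genes = sorted(gene_to_terms)
    terms = sorted(set().union(*gene_to_terms.values()))
    col = {term: j for j, term in enumerate(terms)}
    pairs = [
        (i, col[term])
        for i, gene in enumerate(genes)
        for term in sorted(gene_to_terms[gene])
    ]
    return genes, terms, pairs


def to_dense(n_rows, n_cols, pairs):
    """Expand (row, col) pairs into a full 0/1 matrix (list of lists)."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for i, j in pairs:
        matrix[i][j] = 1
    return matrix


# Hypothetical annotations, assumed to have parent terms already added.
annotations = {
    "geneA": {"GO:0000001", "GO:0000002"},
    "geneB": {"GO:0000001"},
}

genes, terms, pairs = to_sparse_triplets(annotations)
dense = to_dense(len(genes), len(terms), pairs)
print(genes, terms, pairs)
print(dense)
```

The sparse form stores one pair per annotation, so its size grows with the number of annotations and not with genes times GO classes, which is why it tends to be the more practical choice for large data sets.
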

GOParGenPy does not do the following:

  1. Annotate the sequences. The user has to find and select the best annotation source for the analyzed data.
    1. The best source for GO annotations is usually the latest annotation from
  2. Select the GO structure. The user has to find and select the GO structure to use.
    1. Again, the best source is usually the latest structure at
  3. Perform statistical analysis or visualization. GOParGenPy only generates the matrix, which can then be used for various tasks on different analysis platforms, such as R or MATLAB.