GOParGenPy: High-Throughput method to generate Gene Ontology data matrices

©Holm Group, Institute of Biotechnology

University of Helsinki

 

         

Link to new version of GOParGenPy:  

Our new web page and the new version of GOParGenPy are available here. Users will be redirected to that page automatically in the future.

DESCRIPTION:  

We present GOParGenPy, a fast method for generating GO data matrices. A pre-print version of the manuscript is available here. The package does the following:

·         Parses the input annotation file

·         Reads standard ‘gene_ontology_edit.obo’ file, parses it and stores all the GO classes and their attributes.

·         Links each annotated GO term to its parent nodes. The linking step can also look up alternative ids for terms that have become obsolete.

·         Prints out the list of genes with added parent nodes.

·         Prints out the sparse or full matrix with genes as rows and GO classes as columns. Using the sparse matrix is encouraged.

·         Provides helper functions for processing the sparse matrix in R and MATLAB.
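The parsing and parent-linking steps listed above can be illustrated with a minimal sketch. It is not the GOParGenPy implementation: the function names `parse_obo` and `ancestors` are hypothetical, the GO ids in the toy text are made up, only `is_a` links are followed, and it assumes that the alternative-id lookup corresponds to the `alt_id:` tags of the OBO format.

```python
from collections import defaultdict

def parse_obo(lines):
    """Return (parents, alt_to_primary) read from the [Term] stanzas of OBO text."""
    parents = defaultdict(set)   # term id -> ids of its direct is_a parents
    alt_to_primary = {}          # secondary id (alt_id) -> primary term id
    term_id, in_term = None, False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest here.
            in_term = (line == "[Term]")
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[4:].strip()
            parents[term_id]  # register the term even if it has no parents
        elif in_term and term_id and line.startswith("is_a: "):
            # "is_a: GO:0000001 ! name" -> keep only the id before the "!"
            parents[term_id].add(line[6:].split("!")[0].strip())
        elif in_term and term_id and line.startswith("alt_id: "):
            alt_to_primary[line[8:].strip()] = term_id
    return dict(parents), alt_to_primary

def ancestors(term, parents):
    """Collect every ancestor of a term by walking the is_a links upward."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Toy OBO text with made-up ids (assumption: not taken from the real GO file).
TOY_OBO = """[Term]
id: GO:0000001
name: toy root

[Term]
id: GO:0000002
is_a: GO:0000001 ! toy root

[Term]
id: GO:0000003
alt_id: GO:0000009
is_a: GO:0000002 ! toy middle
""".splitlines()

parents, alt_to_primary = parse_obo(TOY_OBO)
# Map a secondary id to its primary id before looking up the ancestors.
term = alt_to_primary.get("GO:0000009", "GO:0000009")
print(term, sorted(ancestors(term, parents)))
```

A gene annotated with the secondary id would thus be given the primary id plus both ancestor terms, which is the kind of row the gene list with added parent nodes is described as containing.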

 

GOParGenPy allows you to:

·         Generate a GO binary matrix from different data sources in a fraction of the time needed by other methods

·         Select up-to-date GO annotations for the analyzed genes

·         Select up-to-date GO structure

·         Alternatively, use an older version of the GO annotations and/or GO structure. This may be needed when:

1.      Replicating earlier work

2.      Comparing performance with an analysis tool that uses a specific older GO structure.

·         Generate a GO binary matrix that can be used to compare various analysis tools in different analysis platforms.

·         Analyze in-house generated GO annotations

 

GOParGenPy does not do the following things:

·         Annotate the sequences. The user has to find or select the optimal annotation source for the analyzed data.

1.      The best source for GO annotations is usually the latest annotation from www.geneontology.org

·         Select the GO structure. The user has to find or select the GO structure to use.

1.      Again, the best source is usually the latest structure from www.geneontology.org

·         Perform any statistical analysis or visualization. It only generates the matrix, which can then be used for various tasks in analysis platforms such as R or MATLAB.

 

 

          DOWNLOAD & INSTALLATION

·         The package and tutorial can be downloaded here.

·         Python v2.5 or v2.6 must be installed.

·         GOParGenPy can be run on UNIX or Windows operating systems.

 

Installation

·         Extract the downloaded package.

·         You should get the following files:

1.      GOParGenPy        

2.      GeneOntology.py

3.      gene_ontology_edit.obo.2013-03-01   [Standard OBO file of 01.03.2013]

4.      birch_unigenes_annotation.tab  [User-defined tab-separated annotation, input type 1]

5.      gene_association.tair                    [Species-specific annotation, input type 2]

6.      gene_ass.goa_uniprot                   [Sample UniProt-GOA annotation, input type 3 (first 80975 gene ids)]

7.      process_sparse.R                            [Sample code for processing sparse matrix in R]

8.      process_sparse.m                           [Sample code for processing sparse matrix in MATLAB]

        

·         Copy the GOParGenPy and GeneOntology.py files to your target directory and run the commands described in the USAGE and TUTORIAL sections.

 

 

            USAGE:

            Command(s):

                          For generating sparse/full matrix:

                     $ python   GOParGenPy   obo_file   -i  [input_value]  -h  [header_value]  -a [altid_value]   user_annotation_file_name    gene_col_num   GO_col_num    mat_flag

 

                  Parameter description:

 

                          $ python GOParGenPy    --help

 

·         obo_file    :  Standard Gene Ontology OBO file. The default version (gene_ontology_edit.obo.2013-03-01) shipped with this package can be used, or another version can be downloaded from here. This is the GO structure the program uses to map GO classes to their parents.

·         input_value : Integer value 1, 2, or 3.

                        1:  For user defined tab separated input annotation file.

                        2:  For species specific annotation file. Example format can be found here.

                        3:  For UNIPROT-GOA file. Example format can be found here.

·         header_value : Boolean Type “T” or “F”.

                        T:  if the input file contains header.

                        F:  if the input file does not contain header.

     Example header(s):

           For input_value: 1

ref         Unigene accession            Unigene sequence              Number of ESTs       …                      GO ids

           For input_value: 2

!gaf-version: 2.0

!Project_name: WormBase

!....

           For input_value: 3

!gaf-version: 2.0

·         altid_value: Boolean type “T” or “F”

                     T: if the user wants to look for alternate ids of obsolete GO terms.

                     F: if the user does not want to look for alternate ids of obsolete GO terms.

·        user_annotation_file_name:  Input annotation file. The file must be tab separated.

     Example file formats:

           For input_value : 1 

ref         Unigene accession               Unigene sequence              Number of ESTs                         GO ids             

10466   S_C0000100109D05R2       AAAAGAAGTGGATAA                  34                                 GO:0019538,GO:0003674

4737     S_C0S00200071C03R1        CTCTCTCGGGCCCCAC                   1                                 GO:0031224,GO:0070085,GO:0035250

                                                                                                                                                                                                 

           For input_value : 2

!gaf-version: 2.0

!Project_name: WormBase

!….

WB      WBGene00000001    aap-1       GO:0005942      PMID:12520011|PMID:12654719     IEA     INTERPRO:IPR001720      C

WB      WBGene00000001    aap-1       GO:0035014      PMID:12520011|PMID:12654719     IEA     INTERPRO:IPR001720      F

….. 

                                                        

           For input_value : 3

!gaf-version: 2.0

UniProtKB    A0A000  moeA5    GO:0003824    GO_REF:0000002  IEA     InterPro:IPR015421     F    MoeA5   A0A000_9ACTO|moeA5

UniProtKB    A0A001  moeD5    GO:0005524    GO_REF:0000002  IEA     InterPro:IPR003439     F    MoeD5   A0A001_9ACTO|moeD5

……

                                                     

·        gene_col_num:  the column number(s) of the tab separated annotation file that contain the gene ids and other descriptors, e.g. 2,4,5,6 or just 2.

         Multiple columns will be combined with ‘||’ into a single identifier.

         Example: For input file format of type 1

         If user wants only gene id column then value of gene_col_num should be 2

         Or

         If the user wants other descriptors together with the gene id column, enter comma separated values without spaces for gene_col_num, such as 1,2,3,4

·        GO_col_num : the column number of tab separated annotation file that contains GO IDs.

         Example: For input file format of type 2, the value of GO_col_num should be 5 (in GAF 2.0 files the GO ID is the fifth column; the fourth column, Qualifier, is often empty)

·        mat_flag : S or F

                 S: For printing in sparse binary matrix.

                 F: For printing in full binary matrix.                          
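As an illustration of how the gene_col_num and GO_col_num values could be applied to one line of a tab separated annotation file, here is a minimal sketch. The function name `extract_gene_and_go` is hypothetical and this is not the GOParGenPy code; it assumes 1-based column numbers, as the examples above suggest, and comma separated GO ids within the GO column. The sample row is shortened from the type 1 example, so its GO ids sit in column 5.

```python
def extract_gene_and_go(line, gene_col_num, go_col_num):
    """Return (identifier, list_of_go_ids) for one tab separated line.

    gene_col_num is a string such as "2" or "1,2,3,4" (1-based columns);
    go_col_num is the 1-based number of the column holding the GO ids.
    """
    fields = line.rstrip("\n").split("\t")
    gene_cols = [int(c) for c in gene_col_num.split(",")]
    # Several descriptor columns are combined with '||' into one identifier.
    identifier = "||".join(fields[c - 1] for c in gene_cols)
    go_field = fields[int(go_col_num) - 1]
    go_ids = [g.strip() for g in go_field.split(",") if g.strip()]
    return identifier, go_ids

# Shortened toy row (assumption: 5 columns instead of the full file layout).
row = "10466\tS_C0000100109D05R2\tAAAAGAAGTGGATAA\t34\tGO:0019538,GO:0003674"

print(extract_gene_and_go(row, "2", "5"))
print(extract_gene_and_go(row, "1,2", "5"))
```

Passing the column list as a single comma separated string mirrors the way the value is given on the command line.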

                               

       

              Output files:

                  GOParGenPy generates the following files in the same directory:

·         *_preprocess_file.txt: Tab separated annotation file that contains gene ids and other descriptors and associated GO terms.

·         *_gene_go.txt: Tab separated file that contains gene ids and other descriptors and associated GO terms and their combined ancestor terms.

·         *_no_go.txt:  Contains the GO terms from *_preprocess_file.txt that are not defined in the OBO file used.

·         *_rownames.txt:  Contains the gene ids and other descriptors that define the row names of the generated sparse matrix.

·         *_colnames.txt: Contains all GO terms of the *_gene_go.txt file; these are the column names of the generated sparse matrix.

·         *_sparse.txt:  The sparse matrix representation of the *_gene_go.txt file. The first column gives the row and the second column gives the column of an entry in the binary matrix.

 

                          where * represents the corresponding input file name.
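To show how the row/column pairs of a *_sparse.txt file relate to the full binary matrix, here is a minimal sketch. The function name `sparse_to_full` is hypothetical, and the sketch assumes whitespace separated pairs with 1-based indices; check the indexing of your own output files before relying on it.

```python
def sparse_to_full(pair_lines, n_rows, n_cols, one_based=True):
    """Expand (row, column) index pairs into a full 0/1 matrix (list of lists)."""
    offset = 1 if one_based else 0
    full = [[0] * n_cols for _ in range(n_rows)]
    for line in pair_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        row, col = int(parts[0]) - offset, int(parts[1]) - offset
        full[row][col] = 1
    return full

# Toy pairs (assumption: made-up data, not GOParGenPy output).
pairs = ["1\t1", "1\t3", "2\t2"]
for matrix_row in sparse_to_full(pairs, n_rows=2, n_cols=3):
    print(matrix_row)
```

Here n_rows and n_cols would correspond to the number of entries in the *_rownames.txt and *_colnames.txt files. For large annotation sets the full matrix can become very big, which is why the sparse form is encouraged.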

         

 

         

 

          TUTORIAL:

·         Download the tutorial and package here for UNIX and here for Windows.

·         Extract the files in your current working directory. You should get the following files:

1.      GOParGenPy

2.      GeneOntology.py

3.      gene_ontology_edit.obo.2013-03-01   [Sample OBO file as of 01.03.2013]

4.      birch_unigenes_annotation.tab  [User-defined tab-separated annotation, input type 1]

5.      gene_association.tair                    [Species-specific annotation, input type 2]

6.      gene_ass.goa_uniprot                   [Sample UniProt-GOA annotation, input type 3 (first 80975 gene ids)]

7.      process_sparse.R                            [Sample code for processing sparse matrix in R]

8.      process_sparse.m                           [Sample code for processing sparse matrix in MATLAB]

 

                                 N.B.: If you use your own annotation file, put it in the same directory as the package.

 

·         Case 1: The input is a user-defined annotation file, e.g. ‘birch_unigenes_annotation.tab’. Here the gene id column number is 2 and the GO id column number is 8. Sample command:

 

         $ python   GOParGenPy     gene_ontology_edit.obo.2013-03-01   -i  1  -h T  -a T   birch_unigenes_annotation.tab  2    8    S

         Or

         $ python   GOParGenPy     gene_ontology_edit.obo.2013-03-01   -i  1  -h T  -a F   birch_unigenes_annotation.tab    1,2,3,4    8   S

 

·         Case 2: The input is a species-specific annotation file, e.g. ‘gene_association.tair’. Sample command:

 

        $ python   GOParGenPy     gene_ontology_edit.obo.2013-03-01    -i  2   -h T   -a T  gene_association.tair   2    5   S    

        Or

        $ python   GOParGenPy     gene_ontology_edit.obo.2013-03-01    -i  2   -h T   -a F  gene_association.tair   2,3,7,9,11    5   S

 

·         Case 3: The input is a UNIPROT-GOA file, e.g. ‘gene_ass.goa_uniprot’. Sample command:

 

        $ python   GOParGenPy     gene_ontology_edit.obo.2013-03-01    -i  3   -h T  -a T   gene_ass.goa_uniprot   2    5   S   

        Or 

        $ python   GOParGenPy     gene_ontology_edit.obo.2013-03-01    -i  3   -h T  -a F   gene_ass.goa_uniprot   2,3    5    S   

·         After running GOParGenPy, you will get the output files listed above.

 

       Reading data to R/MATLAB:

 

·         For processing the generated sparse matrix in a statistical computing environment such as R, we provide sample helper functions. Sample usage:

>source ("process_sparse.R")

## Read the sparse matrix, column names and row names; returns a list object.

>GO_tair_sparse <- read_sparse_data ("gene_association_sparse.txt","gene_association_colnames.txt", "gene_association_rownames.txt")

## Extract the sparse matrix from the list

>mat  <- GO_tair_sparse$sparse_data

## Long or short column names can be taken from the column names

>colnames(mat) <- GO_tair_sparse[[2]][,2]

## Row names matrix can include several alternative name columns

## These can be merged into a single column, which can be taken as

## Row names to data matrix

>dat_sz <- dim(mat)

>tmp_row_names <- rep("a", dat_sz[1])

>for( l in 1:dat_sz[1]){

   tmp_row_names[l] <- paste(GO_tair_sparse[[3]][l,2], GO_tair_sparse[[3]][l,3])

   }

>rownames(mat) <- tmp_row_names

## Generated matrix can be now used in the data analysis

## Make sure that rows in the different data files match each other

 

·         For processing the generated sparse matrix in MATLAB, we provide sample helper functions. Sample usage:

% Read the sparse matrix, row names and column names files

>> [a, b, c] = process_pyth_data ('gene_association_sparse.txt','gene_association_rownames.txt','gene_association_colnames.txt');

 

 

GOParGenPy and PLANT ONTOLOGY

 

 Here we demonstrate how a customized version of GOParGenPy can be applied to other ontology resources, such as Plant Ontology (PO), which uses the OBO format to store PO classes in a DAG structure.

·         Download the example tutorial package here.

·         Extract the files in your current working directory. You should get the following files:

1.      GOParGenPy

2.      GeneOntology.py

3.      plant_ontology.obo (Downloaded from here).

4.      po_anatomy_gene_arabidopsis_tair.assoc (Downloaded from here).

5.      po_temporal_gene_arabidopsis_tair.assoc (Downloaded from here).

·         On the command line, execute the following commands:

 

$ python   GOParGenPy     plant_ontology.obo    -i  2   -h T   po_anatomy_gene_arabidopsis_tair.assoc   2    5   S

and

$ python   GOParGenPy     plant_ontology.obo    -i  2   -h T   po_temporal_gene_arabidopsis_tair.assoc   2    5   S

·         After running the customized GOParGenPy code, you will get the output files described above.

·         The resulting sparse matrices can be loaded into an analysis environment such as R or MATLAB for further Plant Ontology analysis, as shown in the tutorial section.

 

 

 

     CONTACT

    For any bugs, issues, queries or suggestions, please feel free to contact us at:

      ajay.kumar-at-helsinki.fi

      petri.toronen-at-helsinki.fi