Organism / Association file
GENERATOR supports 8 species, for which the Gene Ontology annotations are downloaded from GO.current.annotations.shtml. The source database of each annotation set is indicated in brackets.

Genelist
GENERATOR takes lists of gene identifiers, symbols or their synonyms, which must correspond to the fields indicated in the annotation files or README files at GO.current.annotations.shtml. The identifiers must be separated by line breaks. Before performing the factorization, GENERATOR removes from the analysis all genes that are not associated with the selected ontology (genes without any GO terms in the current ontology).
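The filtering step can be illustrated with a minimal Python sketch. It is not GENERATOR's own code: the function name `split_annotated` and the `annotations` dictionary (gene identifier mapped to its GO terms in the selected branch) are hypothetical, and synonym lookup is left out.

```python
def split_annotated(gene_list_text, annotations):
    """Split a line-break separated gene list into annotated and unannotated genes.

    `annotations` is a hypothetical dict: identifier -> set of GO terms
    in the selected ontology branch.
    """
    # One identifier per line; blank lines and surrounding spaces are ignored.
    genes = [line.strip() for line in gene_list_text.splitlines() if line.strip()]
    kept = [g for g in genes if annotations.get(g)]
    dropped = [g for g in genes if not annotations.get(g)]
    return kept, dropped


example_annotations = {
    "geneA": {"GO:0006412"},
    "geneB": set(),          # known gene, but no term in this branch
}
kept, dropped = split_annotated("geneA\ngeneB\n\ngeneC\n", example_annotations)
```

Here `geneB` and `geneC` would be dropped, because neither has a GO term in the chosen branch.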

Ontologies
The data for the GENERATOR clustering and analysis are obtained from the Gene Ontology database. One of its three main branches (biological process, molecular function or cellular component) must be selected for the analysis.

Max nb of clusters
GENERATOR produces a series of partitional clusterings, starting from two groups and ending at the user-selected number of partitions. This field sets that maximum number of partitions.

NMF iterations
Indicates how many times the update rules of the Non-negative Matrix Factorization (NMF) algorithm are repeated. Each additional iteration brings the factorization closer to a local optimum. The default of 100 iterations should be enough in most cases.

NMF repeats
GENERATOR repeats the clustering into K partitions as many times as indicated in this field. Of these runs, the best-converged result (the one with the smallest least squares error) is selected to represent the Kth level.
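How the three settings above (maximum number of clusters, NMF iterations, NMF repeats) interact can be sketched in dependency-free Python. This is an illustration under stated assumptions, not GENERATOR's implementation: it assumes the widely used multiplicative update rules for the least squares objective, a non-negative gene-by-GO-class input matrix `v`, and a simple "largest weight" rule for assigning genes to clusters. All function names are hypothetical.

```python
import random

EPS = 1e-9  # guards against division by zero in the update rules


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def squared_error(v, w, h):
    """Least squares error between V and the product WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf_once(v, k, iterations, rng):
    """One NMF run: random start, then `iterations` multiplicative updates."""
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + EPS for _ in range(k)] for _ in range(n_rows)]
    h = [[rng.random() + EPS for _ in range(n_cols)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + EPS) for j in range(n_cols)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + EPS) for j in range(k)]
             for i in range(n_rows)]
    return w, h, squared_error(v, w, h)


def best_nmf(v, k, iterations=100, repeats=10, seed=0):
    """Repeat the factorization and keep the run with the smallest error."""
    rng = random.Random(seed)
    runs = [nmf_once(v, k, iterations, rng) for _ in range(repeats)]
    return min(runs, key=lambda run: run[2])


def cluster_labels(w):
    """Assumed assignment rule: each gene goes to its largest-weight factor."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]


# Toy gene-by-class matrix; one clustering level per K from 2 up to the maximum.
v = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
max_clusters = 3
levels = {k: best_nmf(v, k, iterations=100, repeats=5)
          for k in range(2, max_clusters + 1)}
```

Each entry of `levels` holds the factor matrices and the error of the best run for that K; `cluster_labels(levels[2][0])` would give one possible partition into two groups.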

p-value cutoff (C)
The p-value (C) measure indicates the over-representation of the GO class in a cluster when compared to the complete genome (Bonferroni corrected p-value of Fisher's exact test). By default it is used for sorting the classes within the result clusters. However, it is partly dependent on the preceding clustering, and can therefore rank highly some classes that are not over-represented in the whole user-given gene list. Such classes are filtered out by using p-value (O) as a cutoff. On the other hand, p-value (C) may also raise some classes that do not represent the contents of the cluster very well. These are filtered out by using p-value (S) as a cutoff. The p-values (C) and (S) indicate over-representation but are not statistically analyzable because of their dependency on the clustering. The p-values are calculated with Fisher's exact test (1) from the hypergeometric probability density (2), using the following formulae:

(1) Fisher's Exact test

(2) Hypergeometric probability density

where:
x = number of genes that associate with the GO class in subset A
n = number of genes in subset A
N = number of genes that associate with the GO class in background set B
M = number of genes in background set B
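
With the symbols defined above, the two formulae can be written in their standard form as follows. This is a reconstruction from the definitions, assuming the usual one-sided (over-representation) tail of Fisher's exact test:

```latex
% (1) Fisher's exact test, one-sided: probability of observing x or more
%     class members in subset A
p = \sum_{k=x}^{\min(n,\,N)} h(k;\, n, N, M)

% (2) Hypergeometric probability density
h(k;\, n, N, M) = \frac{\binom{N}{k}\,\binom{M-N}{n-k}}{\binom{M}{n}}
```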


With p-value (O), subset A is the user-given gene list (e.g. co-expressed genes), whereas with p-values (C) and (S) subset A is a single cluster. In p-value measures (C) and (O) the background set B is the whole genome, and in p-value (S) it is the user-given gene list.

Because of multiple testing, each of the p-value measures is Bonferroni corrected by multiplying it by the number of executed tests, i.e. the number of GO classes in the user-selected Gene Ontology branch. It should still be noted that p-values (S) and (C) do not become statistically analyzable as a result of the correction.
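
The choice of subset A and background set B for the three measures, together with the Bonferroni correction, can be sketched in Python as follows. The function names and the example counts are hypothetical, the test is assumed to be the one-sided (over-representation) form, and capping the corrected value at 1.0 is an added convention rather than something stated above.

```python
from math import comb


def hypergeom_pmf(k, n, N, M):
    """Probability of exactly k class members among n genes drawn from M,
    of which N belong to the class."""
    return comb(N, k) * comb(M - N, n - k) / comb(M, n)


def over_representation_p(x, n, N, M):
    """One-sided Fisher's exact test: probability of x or more class members."""
    return sum(hypergeom_pmf(k, n, N, M) for k in range(x, min(n, N) + 1))


def bonferroni(p, n_tests):
    # Multiply by the number of tests; the cap at 1.0 is an assumption.
    return min(1.0, p * n_tests)


def three_p_values(in_cluster, cluster_size, in_list, list_size,
                   in_genome, genome_size, n_classes):
    """Corrected p-values (C), (O) and (S) for one GO class and one cluster."""
    raw = {
        # (C): cluster against the whole genome
        "C": over_representation_p(in_cluster, cluster_size, in_genome, genome_size),
        # (O): whole gene list against the whole genome
        "O": over_representation_p(in_list, list_size, in_genome, genome_size),
        # (S): cluster against the user-given gene list
        "S": over_representation_p(in_cluster, cluster_size, in_list, list_size),
    }
    return {name: bonferroni(p, n_classes) for name, p in raw.items()}


# Made-up counts: 3 of 5 cluster genes, 4 of 10 list genes and
# 8 of 100 genome genes belong to the class; 2 classes were tested.
result = three_p_values(3, 5, 4, 10, 8, 100, n_classes=2)
```

For the (S) measure in this toy case the uncorrected value is 66/252, so the corrected value is 132/252.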

p-value cutoff (O)
The p-value (O) measure indicates the statistical significance of the GO class over-representation in the whole user-given gene list when compared to the whole genome (Bonferroni corrected p-value of Fisher's exact test; see the p-value cutoff (C) section above for the formulae). By default it is used in the cluster description for filtering out the classes that are not over-represented in the whole gene list. p-value (O) is statistically analyzable, as it is not dependent on the clustering.

p-value cutoff (S)
The p-value (S) measure indicates the over-representation of the GO class in a cluster when compared only to the user-given gene list (Bonferroni corrected p-value of Fisher's exact test; see the p-value cutoff (C) section above for the formulae). By default it is used in the cluster description for filtering out the classes that are under-represented and therefore do not represent the cluster contents very well. Optionally, it can be used for sorting the classes within the clusters, which shows the classes that have become enriched as a result of the clustering. The p-value (S) values are not statistically analyzable, as they are dependent on the clustering.
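
The default roles of the three cutoffs (sorting by (C), filtering by (O) and (S)) can be summarized in a short Python sketch. The record layout, the function name `describe_cluster`, the cutoff values of 0.05 and the use of "less than or equal" comparisons are all assumptions made for illustration.

```python
def describe_cluster(classes, cutoff_c=0.05, cutoff_o=0.05, cutoff_s=0.05):
    """Keep the GO classes that pass all three cutoffs, sorted by p-value (C).

    `classes` is a hypothetical list of dicts with keys "name", "C", "O", "S"
    holding the Bonferroni corrected p-values of each class for one cluster.
    """
    kept = [
        c for c in classes
        if c["C"] <= cutoff_c and c["O"] <= cutoff_o and c["S"] <= cutoff_s
    ]
    # Smallest p-value (C) first, i.e. the strongest over-representation on top.
    return sorted(kept, key=lambda c: c["C"])


example = [
    {"name": "class 1", "C": 0.010, "O": 0.020, "S": 0.030},
    {"name": "class 2", "C": 0.001, "O": 0.400, "S": 0.010},  # fails (O)
    {"name": "class 3", "C": 0.002, "O": 0.010, "S": 0.900},  # fails (S)
    {"name": "class 4", "C": 0.004, "O": 0.030, "S": 0.020},
]
description = describe_cluster(example)
```

In this made-up input, class 2 is dropped because it is not over-represented in the whole gene list, and class 3 because it does not stand out within the list; the remaining classes are ordered by p-value (C).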



This tool was developed by Matti Kankainen (University of Helsinki) and Petri Pehkonen (University of Kuopio)
© 2006 University of Helsinki and University of Kuopio