Web page of Gene Set Z-score

Petri Toronen, Pauli J. Ojala, Pekka Marttinen and Liisa Holm

This page presents the preprint of the manuscript describing the Gene Set Z-score. We also provide the supplementary texts, supplementary figures and supplementary tables. In addition, some figures that were not included in the original article or in the supplementary material are shown. Matlab functions and scripts are distributed for testing the presented method. An R function that calculates the GSZ score is also provided. It is intended for R programmers, not for end users.

Summary

One of the central tasks in biosciences is the analysis of gene expression data with the aid of functional gene classes (typically Gene Ontology classes, GO classes). Normally such work first generates differential gene expression scores and then computes a class level score by combining the signal of the class members with a class level scoring function. We propose a novel class scoring function, the Gene Set Z-score (GSZ-score), for the analysis of gene classes with gene expression data. It can be considered a gene expression weighted hypergeometric Z-score.
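For orientation, the plain (unweighted) hypergeometric Z-score that the description above builds on can be sketched in a few lines of Python. This is not the GSZ-score itself, which additionally weights genes by their expression scores as described in the manuscript. The function and parameter names (`hypergeom_z`, `z_profile`, `total_genes` and so on) are illustrative and do not come from the distributed Matlab or R code.

```python
import math


def hypergeom_z(total_genes, class_size, top_n, observed_hits):
    """Z-score of the number of class members among the top_n genes,
    standardized with the mean and variance of the hypergeometric null."""
    if not (0 < class_size < total_genes and 0 < top_n < total_genes):
        raise ValueError("class_size and top_n must lie strictly between 0 and total_genes")
    p = class_size / total_genes
    mean = top_n * p
    # The factor (N - n) / (N - 1) is the finite population correction.
    var = top_n * p * (1 - p) * (total_genes - top_n) / (total_genes - 1)
    return (observed_hits - mean) / math.sqrt(var)


def z_profile(ranked_genes, gene_class):
    """Hypergeometric Z-score at every threshold position of a ranked gene list.

    ranked_genes: genes ordered by differential expression score (best first).
    gene_class:   set of genes belonging to the class being tested.
    """
    total = len(ranked_genes)
    size = sum(1 for g in ranked_genes if g in gene_class)
    hits = 0
    profile = []
    # The last position is skipped because the null variance is zero there.
    for n, gene in enumerate(ranked_genes[:-1], start=1):
        hits += gene in gene_class
        profile.append(hypergeom_z(total, size, n, hits))
    return profile
```

A class whose members cluster at the top of the ranked list produces large positive values early in the profile, while a class scattered evenly stays near zero.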

We first compared its performance with other popular class scoring functions. We selected the standard Kolmogorov-Smirnov test (used in the original Gene Set Enrichment Analysis), the modified Kolmogorov-Smirnov test (used in the current Gene Set Enrichment Analysis), the hypergeometric test calculated at every threshold position (similar to iterative Group Analysis) and a modified t-test calculated between the class members and class non-members. All other parts of the analysis were kept exactly identical in these comparisons. Our evaluations include:

  • Detailed comparison of class scoring functions on artificial datasets with various positive signals.
  • Cross-validation style comparison with a split real dataset. Scoring functions were tested using half of the data and evaluated using the positive GO classes reported from the other half of the data. Positive GO classes were selected a) first using the same method that was being evaluated and b) next using all the methods in combination.
  • Comparison of empirical p-values obtained with each scoring function across the top scoring GO classes from two real datasets.
  • Evaluation of biologically relevant GO classes across the top scoring classes.
  • Pairwise comparison of GSZ with each of the competing scoring functions, looking at the GO classes that showed the largest differences in empirical p-values in two datasets.

    Our scoring function outperformed the other functions in these comparisons.
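Several of the evaluations above rely on empirical p-values. As a minimal sketch of how such a p-value can be obtained, the Python code below compares an observed class score with scores from random gene classes of the same size. It assumes a simple gene-label (row) randomization and uses the mean member score as a stand-in scoring function. The names `empirical_p_value`, `permutation_null` and `mean_score` are hypothetical, and the manuscript's actual permutation scheme and scoring functions may differ.

```python
import random


def empirical_p_value(observed, null_scores):
    """One-sided empirical p-value: fraction of null scores at least as large
    as the observed score, with the usual +1 correction so p is never zero."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def permutation_null(gene_scores, class_size, score_fn, n_perm=1000, seed=0):
    """Null distribution of a class score under random gene classes.

    gene_scores: dict mapping gene name to differential expression score.
    score_fn:    any function (gene_scores, gene_class) -> float.
    """
    rng = random.Random(seed)
    genes = list(gene_scores)
    return [
        score_fn(gene_scores, set(rng.sample(genes, class_size)))
        for _ in range(n_perm)
    ]


def mean_score(gene_scores, gene_class):
    """Stand-in class scoring function (an assumption, not the GSZ-score)."""
    return sum(gene_scores[g] for g in gene_class) / len(gene_class)
```

Any class scoring function with the same signature can be passed as `score_fn`, so the same null distribution machinery can serve several scoring functions while the rest of the analysis stays identical.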

    We also compared actual program packages (Gene Set Analysis, Gene Set Enrichment Analysis, Signal Pathway) with our analysis pipeline using two real datasets. We again monitored the biologically relevant classes. We kept the settings identical across the program packages. These included the number of permutations, the analyzed set of GO classes, and the minimum and maximum size of an allowed GO class.

    Our analysis pipeline also surpassed the others in this comparison.

    Manuscript preprints

    Here is the preprint manuscript version (tables and figures embedded). Here's the link to the article at the BMC web site.

    Poster from ISMB 2009

    Here is a PDF of the poster presented at ISMB 2009. It summarizes some of the findings from the manuscript.

    Supplementary texts

    Supplementary figures

    Supplementary figures are described at the very end of the manuscript. These are PDF files.

    Supplementary tables

    Supplementary tables are also described at the very end of the manuscript. These are Excel files.

    Additional material

    This material was not included in the submission. These details may still be informative, as they show an omitted analysis of the effect of various prior variance weights on the stability of the Z-score across different threshold positions and across different class sizes. The analysis shows only row randomizations.

    These figures are Encapsulated PostScript files.

    Matlab code and demo datasets

    Here's a link to a folder that includes the Matlab code and data.

    Here's a link to the *.tar.gz package. This package excludes demo datasets 1 and 2.

    The code is mostly documented. Demo data are currently available. Check the README for details. Comments on the code are very welcome.

    R code

    Here is the central function coded in R. No wrapper function is included, so this is for R programmers and not for end users.

    Demo data in R are here. There are no explanations yet, but the data should be self-explanatory.