Mathematical Modelling of Gene Expression
Gene expression can be regulated in a graded or a binary fashion, though the majority of eukaryotic genes are either fully activated or not expressed at all in individual cells. This binary response might be an inherent property of many eukaryotic promoters. However, analysis of transcription under the control of the yeast GAL1 promoter has suggested that graded and binary modes of transcription are not mutually exclusive: both can occur at the same promoter when the activity of different signalling pathways is varied. It was shown that these experimental observations can be explained by mathematical modelling based on simple principles.
Work was continued on a quantitative model for the regulation of Sex-lethal (Sxl), a gene that controls sex determination in the somatic cells of the fruit fly. Early during embryogenesis, the sexual fate is cell-autonomously determined by the X/A ratio. The X/A ratio is a polygenic signal made up of proteins encoded by genes on the X chromosome and the autosomes as well as by maternal effect genes. Most of the genes involved encode transcription factors that function as either repressors or activators. While the single dose of X chromosomes leaves Sxl virtually unexpressed in males, the double dose in females ensures stable and permanent production of Sxl. The presence or absence of Sxl is decisive in directing the development of sexual dimorphism as well as dosage compensation. The integrated model of the system is composed of three parts: complex formation of transcription factors, transcriptional regulation at the promoter of Sxl, and post-transcriptional control by alternative splicing. Considering all these processes together leads to a reappraisal of prevailing over-simplistic ideas in the field.
The model shows that a high level of activator alone is not sufficient. Instead, two conditions must be fulfilled simultaneously for transcription to be induced: the quantity of activators must be sufficiently high and the quantity of repressors must be sufficiently low. This is illustrated in Figure 1, where the plane representing the concentrations of activators and repressors is divided into a yellowish region where transcription is allowed and a blackish region where transcription is silent. The trajectory of wild type males remains in the region where transcription is disallowed. In contrast, the trajectory of wild type females visits the region that allows transcription for a relatively long period of time. Consequently, substantial production of Sxl is restricted to females.
Numerical simulations also enable us to assess the effects of gene knock-outs. Wild type females have two copies of the X-linked genes scute (sc) and sisterless-A (sisA). The effects of reducing the dosage of scute and/or sisterless-A in females are studied in Figure 1B. The mutant with a reduced dose of scute has a trajectory that is shifted towards lower values of activator, but the trajectory still visits the region that allows transcription. The accumulation of Sxl protein is therefore sufficient to induce female development, and the simulations predict viability, in agreement with the experimental observation. The mutant with a reduced dose of sisterless-A has a trajectory that is shifted towards a higher value of repressor and a lower value of activator. Despite these changes, the simulation predicts that the quantity of Sxl protein accumulated over time remains sufficient to ensure female development, again in agreement with experimental observations. The mutant with a reduced dose of both scute and sisterless-A has a trajectory that does not visit the region allowing transcription, so very little Sxl protein is produced. Viability of this female mutant is expected to be severely reduced, in conformity with the experimental observations.
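The two-condition rule described above can be illustrated with a minimal sketch. The functional form (a product of a Hill activation term and a Hill repression term), the parameter values, the threshold and the toy trajectories are all assumptions made for illustration; they are not taken from the published model.

```python
def transcription_rate(activator, repressor, k_act=1.0, k_rep=1.0, n=4):
    """Relative promoter activity in [0, 1].

    Assumed form: Hill activation multiplied by Hill repression, so the
    rate is high only when activator is high AND repressor is low.
    """
    activation = activator**n / (k_act**n + activator**n)
    repression = k_rep**n / (k_rep**n + repressor**n)
    return activation * repression


def is_permissive(activator, repressor, threshold=0.5):
    """True if the point lies in the region where transcription is allowed."""
    return transcription_rate(activator, repressor) > threshold


def permissive_time(trajectory, dt=1.0):
    """Total time a trajectory of (activator, repressor) points spends
    inside the permissive region, assuming one point every dt time units."""
    return sum(dt for a, r in trajectory if is_permissive(a, r))


# Hypothetical trajectories, invented purely to exercise the functions.
male_like = [(0.2, 1.0), (0.4, 1.2), (0.5, 1.5)]
female_like = [(0.5, 1.0), (2.0, 0.5), (3.0, 0.4), (1.0, 2.0)]

print(permissive_time(male_like))    # 0: never enters the permissive region
print(permissive_time(female_like))  # 2.0: two of four points are permissive
```

In such a sketch, a dosage mutant would correspond to a shifted trajectory, and the quantity of interest is how long that trajectory stays inside the permissive region.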
Analysing the Transductome
Current protein-protein interaction networks, however incomplete they may still be, are so complex that they cannot be perceived or analysed as a whole. We have developed a method that automatically and reliably extracts known signalling cascades in yeast, yielding results that are consistent with available biological knowledge and meet human expectations. The method produces an overview of a large body of biological literature in an easily understandable graphical format. The transfer of an external signal from a given receptor to transcription factors in the nucleus is modelled based on concepts of information flow. This results in a method that works unsupervised and without prior training, clustering or pre-filtering of the input data. The method is essentially an adaptation of the maximum spanning tree algorithm and results in fast and accurate reconstruction of pathways. The complete calculation and visualisation take less than a minute on a standard desktop PC. The algorithm is very tolerant of noise in the input data because of its dynamic filtering, which resorts to weak associations only if no stronger links are left. The fast and robust greedy strategy makes the method well suited for an interactive environment in which researchers can explore and navigate all possible signalling pathways. We call this superset of signalling pathways the transductome. Because S. cerevisiae is comparatively well characterised, we have focussed our analyses on this model organism. However, the pathway extraction strategy is general, insensitive to noise in biological networks and readily applicable to any receptor or organism, and even to networks built using different experimental data or computational techniques.
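The greedy idea behind a maximum spanning tree can be sketched as follows. This is a generic Prim-style construction rooted at a receptor, not the adaptation used in the method itself; the node names, the confidence weights and the function names `grow_tree` and `path_to` are hypothetical.

```python
import heapq


def grow_tree(edges, root):
    """Grow a maximum spanning tree from `root` over an undirected graph.

    `edges` maps (u, v) pairs to an interaction weight. At each step the
    strongest link leaving the current tree is attached, so weak links
    are used only when no stronger one is available.
    Returns a dict mapping each reached node to its parent in the tree.
    """
    adjacency = {}
    for (u, v), weight in edges.items():
        adjacency.setdefault(u, []).append((weight, v))
        adjacency.setdefault(v, []).append((weight, u))

    parent = {root: None}
    # heapq is a min-heap, so weights are negated to pop the largest first.
    heap = [(-w, root, v) for w, v in adjacency.get(root, [])]
    heapq.heapify(heap)
    while heap:
        _, u, v = heapq.heappop(heap)
        if v in parent:
            continue  # already attached through a stronger link
        parent[v] = u
        for w, nxt in adjacency.get(v, []):
            if nxt not in parent:
                heapq.heappush(heap, (-w, v, nxt))
    return parent


def path_to(parent, target):
    """Read off the tree path from the root to `target`, or None if unreachable."""
    if target not in parent:
        return None
    path = []
    node = target
while_node = None  # placeholder removed below
```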
Counting All Domain Families
Evolution modifies and recombines existing building blocks instead of inventing everything from scratch. In the protein world these building blocks have been termed 'domains', and the identification and characterisation of new domains and domain families is a major goal of protein science. Grouping domains into families is useful in two ways. Firstly, it leads to more sensitive detection of new members and improved discrimination against spurious hits in database searches. The essential conserved features of a family are enhanced by using profiles (position-specific scoring matrices), hidden Markov models or patterns (regular expressions). Secondly, having established family membership, one can place the query sequence in the context of the evolutionary tree of the family for accurate inference of biological function. It is also easier to spot inconsistent similarity-derived annotations in the tree context.
An intermediate step during the ADDA clustering process
Traditionally, domain families have been defined manually. Recently, automated methods have been developed that systematically try to find shared building blocks between proteins. The most sensitive methods employ exhaustive structure comparisons, but are limited by the availability of structural data, which is still scarce. Methods that use sequence data alone cover more of protein space. We have developed an exhaustive and complete sequence-based domain decomposition and family classification. The method, ADDA, first decomposes sequences into domains. Domains are defined based on a probabilistic model that compensates for artefacts due to sequence fragments and motif alignments. The model is biased towards conservative decisions in domain cutting to avoid trivial over-fragmentation of sequences. After domain decomposition, ADDA clusters domains into families using the transitivity property of homology, where homology is inferred by profile-profile comparison between domains. This ensures that domains classified in the same family share common sequence motifs.
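Clustering by transitivity can be illustrated with a union-find structure: if domain A is homologous to B and B to C, all three end up in one family. This is a generic single-linkage sketch and not the ADDA implementation; the domain identifiers are hypothetical, and the list of homologous pairs stands in for the outcome of accepted profile-profile comparisons.

```python
def cluster_by_transitivity(domains, homologous_pairs):
    """Group domains into families as connected components of the
    homology graph, using union-find with path compression."""
    parent = {d: d for d in domains}

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in homologous_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for d in domains:
        families.setdefault(find(d), set()).add(d)
    # Sorted output makes the result deterministic.
    return sorted(sorted(members) for members in families.values())


# Hypothetical domains and accepted homology links.
domains = ["d1", "d2", "d3", "d4", "d5"]
pairs = [("d1", "d2"), ("d2", "d3"), ("d4", "d5")]
print(cluster_by_transitivity(domains, pairs))
# [['d1', 'd2', 'd3'], ['d4', 'd5']]
```

Because a single false link merges two families under pure transitivity, the stringency of the pairwise homology test matters a great deal in any scheme of this kind.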
Application to a database of 782,238 non-redundant proteins yielded 450,462 domains in 33,879 domain families containing at least two members with less than 40 % sequence identity. Validation against the manually curated family classifications SCOP and Pfam showed that ADDA achieves almost perfect unification of various large domain families. At the same time, the contamination of clusters by unrelated domains remains at a very low level. The definition of domain families is a prerequisite for many bioinformatics analyses. ADDA is based on precisely formulated concepts, which are implemented in a simple and fast algorithm that produces high-quality clusters. Numerous future applications will profit from ADDA's global annotation of all sequences with domains.
Improvements to Sequence Alignments
The transitivity property of homology can be used to generate alignments between proteins that are homologous but not detectably similar in direct pairwise comparison. We have developed a novel method which derives a consensus over all possible transitive alignment paths between two proteins. Large sequence distances are covered by using many layers of more closely spaced intermediate sequences as stepping stones. The method improves both the coverage and reliability of sequence alignment compared to PSI-BLAST, the current baseline standard. In particular, our method yields accurate alignments between proteins which are indirect PSI-BLAST neighbours. Interestingly, the method performed better with harder prediction targets in the CASP5 competition.
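The notion of a consensus over transitive alignment paths can be sketched with a simple voting scheme. The sketch below handles only one layer of intermediate sequences and does not enforce collinearity of the final alignment; the dictionary representation, the `min_votes` cutoff and the toy alignments are assumptions for illustration, not the published method.

```python
from collections import Counter, defaultdict


def compose(a_to_c, c_to_b):
    """Chain two alignments, each a dict of residue index -> residue index.

    A residue of A is mapped to B only if its partner in the intermediate
    sequence C is itself aligned to B.
    """
    return {i: c_to_b[k] for i, k in a_to_c.items() if k in c_to_b}


def consensus_alignment(paths, min_votes=2):
    """Vote over all A -> C -> B paths and keep, for each residue of A,
    the most frequently proposed partner in B if it has enough support."""
    votes = defaultdict(Counter)
    for a_to_c, c_to_b in paths:
        for i, j in compose(a_to_c, c_to_b).items():
            votes[i][j] += 1

    result = {}
    for i, counter in votes.items():
        j, count = counter.most_common(1)[0]
        if count >= min_votes:
            result[i] = j
    return result


# Three hypothetical intermediates, each giving (A -> C, C -> B).
paths = [
    ({0: 0, 1: 1, 2: 2}, {0: 1, 1: 2, 2: 3}),
    ({0: 5, 1: 6}, {5: 1, 6: 2}),
    ({1: 0, 2: 1}, {0: 2, 1: 4}),
]
print(consensus_alignment(paths))  # {0: 1, 1: 2}
```

Here residue 2 of A receives two conflicting proposals with one vote each, so it is left unaligned; requiring agreement between independent paths is what trades a little coverage for reliability.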