Mathematical Modelling of Gene Expression
Gene expression can be regulated in a graded or a binary fashion, though the majority of eukaryotic genes are either
fully activated or not expressed at all in individual cells. This binary response might be an inherent property of many
eukaryotic promoters. However, analysis of transcription under the control of the yeast GAL1 promoter has suggested that
graded and binary modes of transcription are not mutually exclusive, but both can occur at the same promoter when the
activity of different signalling pathways is varied. It was shown that these experimental observations can be explained
by mathematical modelling based on simple principles.
Work was continued on a quantitative model for the regulation of Sex-lethal (Sxl), a gene that controls sex
determination in the somatic cells of fruit flies. Early in embryogenesis, the sexual fate is determined cell-autonomously
by the X/A-ratio. The X/A-ratio is a polygenic signal made up of proteins encoded by genes on the
X chromosome and the autosomes as well as by maternal effect genes. Most of the genes involved encode transcription factors,
functioning as either repressors or activators. While a single dose of the X chromosome leaves Sxl virtually unexpressed
in males, the double dose in females ensures stable and permanent production of Sxl. The presence or absence of Sxl is
decisive in directing the development of sexual dimorphism as well as dosage compensation. The integrated model of the
system is composed of three parts: complex formation of transcription factors, transcriptional regulation at the promoter
of Sxl, and post-transcriptional control by alternative splicing. The consideration of all these processes together leads
to a reappraisal of prevailing over-simplistic ideas in the field.
The model shows that a high level of activator alone is not sufficient, but instead two conditions must be simultaneously
fulfilled for transcription to be induced: the quantity of activators must be sufficiently high and the quantity of
repressors must be sufficiently low. This is illustrated in Figure 1 where the plane representing the concentrations of
activators and repressors is divided into a yellowish region where transcription is allowed and a blackish region
where transcription is silent. The trajectory of wild-type males remains in the region where transcription is silent.
In contrast, the trajectory of wild-type females visits the region that allows transcription for a relatively long period
of time. Consequently, substantial production of Sxl is restricted to females. Numerical simulations also enable us to assess
the effects of gene knock-outs. Wild-type females have two copies of the X-linked genes scute (sc) and sisterless-A (sisA).
The effects of reducing the dosage of scute and/or sisterless-A in females are shown in Figure 1B. The mutant with a
reduced dose of scute has a trajectory that is shifted towards lower activator values, but the trajectory still visits
the region that allows transcription. Consequently, the accumulation of Sxl protein is sufficient to induce female
development, and the simulations predict viability, in agreement with experimental observation. The mutant with a reduced
dose of sisterless-A has a trajectory that is shifted towards higher repressor and lower activator values.
Despite these changes, the simulation predicts that the quantity of Sxl protein accumulated over time remains sufficient
to ensure female development, which again agrees with experiment. The mutant with a reduced dose of both scute and sisterless-A
has a trajectory that does not visit the region allowing transcription, so very little Sxl protein is produced.
The viability of this female mutant is expected to be severely reduced, in conformity with the experimental observations.
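The two-condition rule described above can be illustrated with a minimal Python sketch. It is not the integrated model itself: the rectangular permissive region, the threshold names act_min and rep_max, and the trajectories are toy assumptions chosen only to show how the time spent in the permissive region separates two trajectories.

```python
def transcription_allowed(activator, repressor, act_min=1.0, rep_max=1.0):
    """Both conditions must hold: enough activator AND little enough repressor.

    act_min and rep_max are hypothetical thresholds, not fitted parameters.
    """
    return activator >= act_min and repressor <= rep_max


def time_in_permissive_region(trajectory, dt=1.0, act_min=1.0, rep_max=1.0):
    """Total time a sampled (activator, repressor) trajectory spends where
    transcription is allowed, assuming one sample every dt time units."""
    return sum(
        dt
        for activator, repressor in trajectory
        if transcription_allowed(activator, repressor, act_min, rep_max)
    )


# Toy trajectories (arbitrary units, invented for illustration only).
# The "male" trajectory has half the activator of the "female" one.
female = [(0.2, 0.5), (0.8, 0.6), (1.4, 0.7), (1.8, 0.9), (1.5, 1.3)]
male = [(a / 2, r) for a, r in female]

print("female:", time_in_permissive_region(female))
print("male:  ", time_in_permissive_region(male))
```

In this sketch a high activator level alone is not enough: the last "female" sample has ample activator but is excluded because the repressor level exceeds the assumed ceiling.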
Analysing the Transductome
Current protein-protein interaction networks, however incomplete they may still be, are so complex that they cannot
be perceived or analysed as a whole. We have developed a method that automatically and reliably extracts known
signalling cascades in yeast, yielding results that are consistent with available biological knowledge and meet
human expectations. The method produces an overview of a large body of biological literature in an easily
understandable graphical format. The transfer of an external signal from a given receptor to transcription factors
in the nucleus is modelled based on concepts of information flow. This results in a method that works unsupervised
and without prior training, clustering or pre-filtering of the input data. The method is essentially an adaptation
of the maximum spanning tree algorithm and results in fast and accurate reconstruction of pathways. The complete
calculation and visualisation take less than a minute on a standard desktop PC. The algorithm is very
tolerant to noise in the input data because its dynamic filtering resorts to weak associations only if no stronger
links are left. The fast and robust greedy strategy makes the method well suited for an interactive environment in
which researchers can explore and navigate all possible signalling pathways. We call this superset of signalling
pathways the transductome. Because S. cerevisiae is comparatively well characterised, we have focussed our analyses
on this model organism. However, the pathway extraction strategy is general, insensitive to noise in biological
networks and readily applicable to any receptor or organism, and even to networks built using different experimental
data or computational techniques.
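The greedy idea behind the method can be sketched in Python as a Prim-style maximum spanning tree grown outwards from a receptor. This is a generic illustration, not the published implementation: the function names, the edge-confidence scores and the toy network below are assumptions made for the example.

```python
import heapq


def grow_signal_tree(edges, receptor):
    """Prim-style maximum spanning tree grown from `receptor`.

    edges: iterable of (protein_a, protein_b, confidence), undirected.
    Returns {protein: parent} for every protein reachable from the receptor.
    """
    neighbours = {}
    for a, b, w in edges:
        neighbours.setdefault(a, []).append((w, b))
        neighbours.setdefault(b, []).append((w, a))

    parent = {receptor: None}
    # heapq is a min-heap, so confidences are negated to pop the strongest
    # remaining link first; weak links are used only when nothing stronger is left.
    frontier = [(-w, receptor, nxt) for w, nxt in neighbours.get(receptor, [])]
    heapq.heapify(frontier)
    while frontier:
        _, src, dst = heapq.heappop(frontier)
        if dst in parent:
            continue  # already attached through a link at least as strong
        parent[dst] = src
        for w, nxt in neighbours.get(dst, []):
            if nxt not in parent:
                heapq.heappush(frontier, (-w, dst, nxt))
    return parent


def path_to(parent, target):
    """Follow parent pointers back to the receptor; return the path in signal order."""
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


# Toy network with invented confidence scores (R = receptor, TF = transcription factor).
toy_edges = [("R", "A", 0.9), ("R", "B", 0.2), ("A", "TF", 0.8),
             ("B", "TF", 0.3), ("A", "B", 0.1)]
tree = grow_signal_tree(toy_edges, "R")
print(path_to(tree, "TF"))
```

Because the frontier always yields the strongest available link, the weak R-B and A-B associations in the toy network are never needed, which mirrors the noise tolerance described above.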
Counting All Domain Families
Evolution modifies and recombines existing building blocks instead of inventing everything from scratch.
In the protein world these building blocks have been termed 'domains' and the identification and characterisation
of new domains and domain families is a major goal of protein science. Grouping domains into families is useful
in two ways. Firstly, it leads to more sensitive detection of new members and improved discrimination against
spurious hits in database searches. The essential conserved features in a family are enhanced by using profiles
(position-specific scoring matrices), hidden Markov models, or patterns (regular expressions). Secondly, having
established family membership, one can place the query sequence in the context of the evolutionary tree of the
family for accurate inference of biological function. It is also easier to spot inconsistent similarity-derived
annotations in the tree context.
Figure: An intermediate step during the ADDA clustering process.
Traditionally, domain families have been defined manually. Recently, automated methods have been developed
that systematically try to find shared building blocks between proteins. The most sensitive methods employ
exhaustive structure comparisons, but are limited by the availability of structural data, which is still
scarce. Methods that use sequence data alone cover more of protein space. We have developed
an exhaustive and complete sequence-based domain decomposition and family classification. The method, ADDA,
first decomposes sequences into domains. Domains are defined based on a probabilistic model that compensates
for artefacts due to sequence fragments and motif alignments. The model is biased towards conservative
decisions in domain cutting to avoid trivial over-fragmentation of sequences. After domain decomposition,
ADDA clusters domains into families using the transitivity property of homology, where homology is inferred
by profile-profile comparison between domains. This ensures that domains classified in the same family share
common sequence motifs.
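Clustering by the transitivity of homology can be illustrated with a short union-find sketch in Python. It assumes that pairwise homology decisions are already available as a list of domain pairs; the profile-profile comparison that produces them, and any safeguards ADDA applies when merging, are not modelled here. The function name and the domain labels are hypothetical.

```python
def cluster_by_transitivity(domains, homologous_pairs):
    """Group domains into families: if A~B and B~C, then A, B and C share a family.

    domains: iterable of domain identifiers.
    homologous_pairs: iterable of (domain_a, domain_b) judged homologous.
    Returns a sorted list of sorted member lists.
    """
    domains = list(domains)
    parent = {d: d for d in domains}

    def find(d):
        # Walk to the family representative, halving the path on the way.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in homologous_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for d in domains:
        families.setdefault(find(d), []).append(d)
    return sorted(sorted(members) for members in families.values())


# Hypothetical domain labels; d1 and d3 are never compared directly
# but end up in one family through d2.
toy_domains = ["d1", "d2", "d3", "d4", "d5", "d6"]
toy_pairs = [("d1", "d2"), ("d2", "d3"), ("d4", "d5")]
print(cluster_by_transitivity(toy_domains, toy_pairs))
```

A single spurious pair is enough to merge two unrelated families in this simple scheme, which is why the reliability of each homology decision matters for keeping contamination low.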
Application to a database of 782,238 non-redundant proteins yielded 450,462 domains in 33,879 domain families
containing at least two members with less than 40% sequence identity. Validation against the manually curated
family classifications SCOP and Pfam showed that ADDA achieves almost perfect unification of various large
domain families. At the same time, the contamination of clusters by unrelated domains remains at a very low
level. The definition of domain families is a prerequisite for many bioinformatics analyses. ADDA is based
on precisely formulated concepts, which are implemented in a simple and fast algorithm that produces high
quality clusters. Numerous future applications will profit from ADDA's global annotation of all sequences with domains.
Improvements to Sequence Alignments
The transitivity property of homology can be used to generate alignments between proteins that are homologous
but not detectably similar in direct pairwise comparison. We have developed a novel method which derives a
consensus over all possible transitive alignment paths between two proteins. Large sequence distances are covered
by using many layers of more closely spaced intermediate sequences as stepping stones. The method improves both
the coverage and reliability of sequence alignment compared to PSI-Blast, the current baseline standard. In
particular, our method yields accurate alignments between proteins which are indirect PSI-Blast neighbours.
Interestingly, the method performed better with harder prediction targets in the