Computational and intergenic microarray approaches to investigate biological relevance of transcription factor binding site clustering in the yeast genome.
Background
So we have the sequences of entire genomes. Why do we care?
A draft sequence of the human genome was published in early 2001 [1], and genome sequence information is becoming available for a growing number of organisms. Investigating the functions and expression patterns of the vast numbers of newly discovered genes in sequenced genomes using conventional laboratory techniques is therefore impractical. However, the combination of available sequence information with powerful computational and high-throughput experimental techniques opens new possibilities for rapidly investigating the functions of uncharacterized genes. In particular, DNA microarrays permit a genome-wide investigation of the organization of gene expression [2].
Gene expression is typically controlled at the level of transcription. In higher eukaryotes, transcription is a tightly regulated process in which general, tissue-specific, and cell-type-specific proteins known as transcription factors (TFs) interact both with promoter elements in the DNA and with each other to recruit RNA polymerase to the promoter and thereby regulate transcription [3,4,5,6]. TFs bind to specific DNA sequences called TF binding sites, which are typically 5-10 bases long.
Given the short length of TF binding sites and the size of complex genomes (~10^9 bases), statistical analysis predicts that individual TF binding sites will be found not only in gene promoters, but also scattered at random throughout genomes. It stands to reason that if a given TF bound avidly to every sequence it is capable of recognizing, fruitless TF binding events could occur throughout the genome, titrating out the TF, keeping it unavailable for promoter binding, and shutting down transcription of the genes it controls. However, clustering of TF binding sites in gene promoter regions would enhance binding of the TF to those regions. Furthermore, TF binding sites clustered close together would permit cooperative TF binding, thus enhancing fruitful TF-promoter binding events.
Screening the yeast genome for known TF binding sites has demonstrated that some of these sites do indeed cluster in promoter regions [7], and that looking for clusters of binding sites may have predictive power for determining how the transcription of insufficiently characterized ORFs is regulated. In addition, finding TF binding sites may help infer the function of such ORFs using cluster analysis [8]. Yeast lends itself particularly well to the studies described here: it is straightforward to culture and maintain, its genome is small (12.4 Mbp) and fully sequenced, the positions of its ORFs are known, and most yeast genes lack introns, which can complicate systematic investigation of TF binding sites in intergenic regions.
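Here, a project proposal is described in which the biological relevance of TF binding site clustering is investigated using a computational search for TF binding site clusters in the yeast genome, followed by intergenic microarray experiments. The scope of my final project for CH391L is limited to the computational aspects of this proposal.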
While there is some knowledge about the distribution of TF binding sites in the yeast genome [7], few systematic attempts have been made to determine where TF binding sites tend to cluster upstream of yeast genes. Experimental results from yeast will hopefully provide a guideline for applying similar approaches to TF binding site clustering in higher eukaryotes, potentially simplifying the design of higher eukaryote intergenic arrays considerably. Such arrays would in turn make possible a methodical investigation of how the interaction of transcription factors with complex higher eukaryotic genomes results in normal and experimentally induced gene expression patterns [9].
Experimental Design
A. Finding TF Binding Site Clusters
First, do the in silico experiments…
The sequence of the yeast genome is available from the Saccharomyces Genome Database [10]; both the entire genome and each individual chromosome sequence can be downloaded as simple text files. The programming language PERL was used to write programs that perform the following functions:
Program 1: A program to calculate the frequencies of dinucleotides in an overlapping manner in an input DNA sequence.
Program 2: A program that assembles a scrambled genome of the same size as the yeast genome using the dinucleotide frequencies calculated in Program 1.
Program 3: A program that moves a sliding window of size w one nucleotide at a time along the yeast genome and the scrambled genome and finds transcription factor binding motifs in the window and its reverse complement.
Program 4: New! A program to calculate the frequencies of trinucleotides in an overlapping manner in an input DNA sequence.
Program 5: New! A program that assembles a scrambled genome of the same size as the yeast genome using the trinucleotide frequencies calculated in Program 4.
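The overlapping counting performed by Programs 1 and 4 can be illustrated with a minimal sketch. It is written in Python for brevity and is not the PERL code used in this project; the function name `kmer_frequencies` is hypothetical, and k = 2 or k = 3 corresponds to dinucleotides or trinucleotides.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Relative frequencies of overlapping k-mers in a DNA sequence.

    The window advances one nucleotide at a time, so "ACGCGT" with k=2
    yields AC, CG, GC, CG, GT. Characters other than A/C/G/T are not
    filtered out in this sketch.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}


# Toy input only; a real run would read a chromosome or genome text file.
dinucleotide_freqs = kmer_frequencies("ACGCGT", 2)
trinucleotide_freqs = kmer_frequencies("ACGCGT", 3)
```

Counting in an overlapping manner means every position except the last k-1 contributes one k-mer, so the frequencies always sum to 1.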
Program
3 in its final form will output the number and positions of the binding sites
in the window into a spreadsheet, which in turn will generate a graph. In the graph, the number of occurrences of a
TF binding motif will be plotted as a function of nucleotide position for both
Watson and Crick strands.
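A sliding-window search of the kind described for Program 3 might look like the following Python sketch. It is an illustration under stated assumptions, not the PERL program itself: the function names are hypothetical, only exact (non-degenerate) motifs are handled, and a site is counted only when it lies entirely inside the window.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def motif_starts(sequence, motif):
    """0-based start positions of all (possibly overlapping) exact matches."""
    starts = []
    pos = sequence.find(motif)
    while pos != -1:
        starts.append(pos)
        pos = sequence.find(motif, pos + 1)
    return starts


def window_counts(sequence, motif, w):
    """Number of motif sites on either strand in each window of size w.

    The window moves one nucleotide at a time. A match to the reverse
    complement on the given strand is a site on the opposite strand.
    A palindromic motif is counted once per position (an assumption).
    Requires w >= len(motif).
    """
    m = len(motif)
    is_start = [0] * (len(sequence) + 1)
    for strand_motif in {motif, reverse_complement(motif)}:
        for p in motif_starts(sequence, strand_motif):
            is_start[p] += 1
    # prefix[i] = number of site starts at positions 0 .. i-1
    prefix = [0]
    for value in is_start:
        prefix.append(prefix[-1] + value)
    # A site starting at p fits in window [s, s+w) if s <= p <= s + w - m.
    return [prefix[s + w - m + 1] - prefix[s]
            for s in range(len(sequence) - w + 1)]


toy = "ACGCGTTTTCGCGTAAAA"
counts = window_counts(toy, "ACGCG", 10)
n_max = max(counts)                      # largest number of sites in any window
n_dense = sum(1 for c in counts if c >= 2)  # windows at or above a threshold
```

Using prefix sums keeps the cost linear in the sequence length, which matters when the window is moved along an entire genome one nucleotide at a time.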
Simple modifications to the code of Program 3 permit searching for known sequence polymorphisms in TF binding sites using IUB codes for mixed bases. Dinucleotide frequencies, as a second-order approximation of the yeast genome sequence, are used so that the scrambled genome resembles the structure of the yeast genome more closely than a simple scrambling based on individual yeast nucleotide frequencies could accomplish [11].
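One way to assemble a scrambled sequence that preserves dinucleotide frequencies is to treat the genome as a first-order Markov chain, drawing each base according to how often it follows the previous base in the real sequence. The Python sketch below shows that approach as an assumption; the text does not specify how Program 2 does it, and the function name and `seed` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict


def scramble_by_dinucleotides(sequence, length, seed=0):
    """Generate a random sequence whose dinucleotide frequencies
    approximate those of `sequence` (first-order Markov chain).

    Assumes `sequence` is non-empty.
    """
    if length <= 0:
        return ""
    rng = random.Random(seed)
    sequence = sequence.upper()

    # transitions[a][b] = number of times base b follows base a
    transitions = defaultdict(Counter)
    for a, b in zip(sequence, sequence[1:]):
        transitions[a][b] += 1

    base_counts = Counter(sequence)
    bases = sorted(base_counts)
    base_weights = [base_counts[b] for b in bases]

    # First base is drawn from the single-nucleotide frequencies.
    current = rng.choices(bases, weights=base_weights)[0]
    out = [current]
    while len(out) < length:
        followers = transitions[current]
        if followers:
            options = sorted(followers)
            current = rng.choices(
                options, weights=[followers[o] for o in options])[0]
        else:
            # Dead end (base seen only at the very end of the input):
            # fall back to single-nucleotide frequencies.
            current = rng.choices(bases, weights=base_weights)[0]
        out.append(current)
    return "".join(out)


scrambled = scramble_by_dinucleotides("ACGT" * 50, 1000, seed=1)
```

Conditioning on two preceding bases instead of one would give the trinucleotide-based analogue. Feeding the output back into the dinucleotide counter is a quick check of how closely the scrambled and actual frequencies agree.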
B. Microarray Experiments
…then test hypotheses from
the in silico results in vitro.
The creation of yeast intergenic microarrays is described elsewhere [9]. Yeast will be grown under experimental conditions (e.g. heat shock or starvation) that induce gene expression responses in which defined transcription factors are known to participate. Cross-linking the TFs to their specific binding sites in the yeast genome by formaldehyde treatment, followed by chromatin immunoprecipitation (chromatin-IP), will isolate the segments of the genome to which TFs are bound. Reversing the cross-links of the DNA-bound TFs, followed by random-primed amplification and labeling of experimental and control samples with Cy3 and Cy5 fluorescent dyes, will yield the probe populations with which the intergenic microarrays will be probed [9]. Microarray analysis is then performed as described previously [12].
Preliminary Results
In silico, first pass:
The TF binding site chosen for generating preliminary data was the degenerate sequence ACGCGN, which is a binding site for the yeast transcription factor MBF [7]. In a genome with individual nucleotide frequencies equivalent to those of yeast (p(A) = 0.3098, p(C) = 0.1909, p(G) = 0.1906, p(T) = 0.3087), this motif would be expected to occur with a frequency of 4.101 x 10^-4 per sequence position. The window sizes used are listed in the table below. n is the number of binding sites found in a window; nmax is the largest number of TF binding sites found in any window of a given size.
window size (nt) | scrambled genome, dinucleotide freqs: n ≥ 3 | nmax | scrambled genome, trinucleotide freqs: n ≥ 3 | nmax | actual genome: n ≥ 3 | nmax |
2000 | 491 | 6 | 556 | 5 | 455 | 8 |
1000 | 138 | 4 | 150 | 4 | 163 | 6 |
750 | 33 | 4 | 83 | 4 | 122 | 5 |
500 | 7 | 4 | 38 | 3 | 89 | 5 |
250 | 7 | 3 | 14 | 3 | 42 | 5 |
200 | 6 | 3 | 9 | 3 | 36 | 5 |
175 | 6 | 3 | 8 | 3 | 33 | 5 |
150 | 6 | 3 | 5 | 3 | 28 | 5 |
125 | 1 | 3 | 2 | 3 | 21 | 4 |
100 | 1 | 3 | 1 | 3 | 12 | 4 |
85 | 1 | 3 | 1 | 3 | 10 | 4 |
70 | 1 | 3 | 1 | 3 | 8 | 3 |
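The expected frequency quoted for ACGCGN before the table follows from multiplying the per-position base probabilities, with N contributing a factor of 1. The Python sketch below reproduces that arithmetic and also shows one possible way to turn an IUB-coded motif into a regular expression for searching. The helper names are hypothetical, and independence of neighbouring bases is assumed, as in the single-nucleotide model.

```python
import re

# Standard IUB/IUPAC nucleotide codes and the bases they stand for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expected_motif_frequency(motif, base_freqs):
    """Probability that a motif starts at a given position on one strand,
    assuming each base is drawn independently with the given frequencies."""
    p = 1.0
    for symbol in motif:
        p *= sum(base_freqs[b] for b in IUB[symbol])
    return p


def iub_to_regex(motif):
    """Translate an IUB-coded motif into a regular expression string."""
    return "".join(
        IUB[s] if len(IUB[s]) == 1 else "[" + IUB[s] + "]" for s in motif)


yeast_freqs = {"A": 0.3098, "C": 0.1909, "G": 0.1906, "T": 0.3087}
p_site = expected_motif_frequency("ACGCGN", yeast_freqs)
pattern = re.compile(iub_to_regex("ACGCGN"))
```

With the yeast frequencies given in the text, `p_site` comes out at about 4.101 x 10^-4, matching the value stated there.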
Discussion of preliminary results:
While the number of windows in which three or more TF binding sites are found drops off with decreasing window size in both the scrambled and the actual genomes, it decreases considerably faster in the scrambled genomes than in the actual genome. In the actual yeast genome, for every 50% decrease in window size, the number of windows in which three or more TF binding sites are found decreases by approximately a factor of two. In the dinucleotide-scrambled genome, for every 50% decrease in window size, the number of windows in which n ≥ 3 decreases on average by a factor of ~8.
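The fold decreases quoted above can be checked against the tabulated counts with a short Python sketch. It assumes that "on average" means the arithmetic mean of the ratios between successive halvings of the window size (2000, 1000, 500, 250, 125 nt); the text does not state how the averages were obtained, so this is one plausible reading. The variable and function names are hypothetical.

```python
# Number of windows with n >= 3, taken from the table, for window sizes
# that halve at each step: 2000, 1000, 500, 250, 125 nt.
windows_with_n3 = {
    "scrambled (dinucleotide freqs)": [491, 138, 7, 7, 1],
    "scrambled (trinucleotide freqs)": [556, 150, 38, 14, 2],
    "actual genome": [455, 163, 89, 42, 21],
}


def mean_fold_decrease(counts):
    """Arithmetic mean of the ratios between consecutive counts."""
    ratios = [a / b for a, b in zip(counts, counts[1:])]
    return sum(ratios) / len(ratios)


fold = {name: mean_fold_decrease(c) for name, c in windows_with_n3.items()}
for name, value in fold.items():
    print(f"{name}: {value:.2f}-fold decrease per halving of window size")
```

Under this reading the actual genome gives a mean of roughly 2.2 and the dinucleotide-scrambled genome roughly 7.8, in line with the factors of two and ~8 stated above; the trinucleotide-scrambled genome falls in between.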
Note: new data (as of
The
tabulated results lend support to the idea that clusters of ACGCGN motifs,
found more frequently in the actual than in the scrambled genome, have
biological relevance.
Further considerations
OK Martin, what’s next?
The
following are some refinements outside the scope of the final project that may
prove useful for further work:
1) Improvement of the precision of Program 2. Currently, when the output of this program is fed into Program 1 to calculate dinucleotide frequencies, the dinucleotide frequencies of the scrambled and actual genomes differ by no more than ~0.3%. Further precision could easily be obtained by simple modifications to Program 2.
2) Expansion of Program 2 to scramble a genome based on trinucleotide frequencies. As noted by Shannon [13], the resemblance to English of “words” formed at random from single-, double-, triple- and quadruple-letter frequencies in the English language increases noticeably with the order of the n-gram. It is not unreasonable to assume that a “random” genome that more closely approximates the yeast genome at the level of trigrams would also be of value for investigating the statistical significance of motif clustering in comparison to the actual yeast genome.
→ This has already been carried out (Programs 4 and 5); see the trinucleotide-frequency results in the table above.
3) Generalization of Program
2. This program is currently designed to
scramble DNA sequences using the dinucleotide
frequencies specific to the yeast genome.
An additional subroutine could be written to expand the usefulness of
the program to any genome.
4) Statistical analysis. With the frequencies of sites from actual and scrambled genomes, the probability of finding n words in a window of size w can be calculated iteratively as the window is moved along the genome sequence nucleotide by nucleotide. For each window position, a score can be assigned based on the calculated probabilities. In addition to graphs plotting frequency of occurrence vs. nucleotide position, graphs of scores vs. frequency of occurrence could be plotted for scrambled and actual genomes. Comparison of the graphs might reveal a cutoff score above which word clustering is unlikely to be due to random chance alone. Running Program 3 on a large number of scrambled genomes generated with Program 2 would give a spread of the types of values tabulated above. The width of this spread would be a measure of the validity of results from statistical analysis.
5) Testing validity of results
for other known TF binding sites. Here,
the only TF binding site investigated was the ACGCGN motif; the relevance of
binding site clustering in general terms should be investigated for other
transcription factors as well.
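As a rough illustration of the statistical analysis proposed in item 4, the Python sketch below estimates the probability of finding at least n sites in a window of size w under a Poisson approximation, and computes an empirical p-value from counts obtained on many scrambled genomes. Both are assumptions made for illustration rather than the method of this proposal: the Poisson model treats positions as independent, which overlapping words violate, and the function names are hypothetical.

```python
import math


def prob_at_least(n, w, motif_len, p_site, strands=2):
    """Poisson approximation to P(at least n sites in a window of size w).

    p_site is the per-position probability of a site on one strand;
    the expected count is (positions in window) * strands * p_site.
    """
    positions = (w - motif_len + 1) * strands
    lam = positions * p_site
    below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                for k in range(n))
    return 1.0 - below


def empirical_p_value(observed, scrambled_values):
    """Fraction of scrambled-genome values at least as large as the observed
    one, with the usual +1 correction so the estimate is never exactly 0."""
    exceed = sum(1 for v in scrambled_values if v >= observed)
    return (exceed + 1) / (len(scrambled_values) + 1)


# Per-position frequency of ACGCGN from the single-nucleotide model.
p_window = prob_at_least(3, 500, 6, 4.101e-4)

# Hypothetical example: one observed count versus three scrambled replicates.
p_emp = empirical_p_value(89, [7, 38, 5])
```

With only a handful of scrambled genomes the empirical p-value cannot fall below 1/(replicates + 1), which is one reason a large number of scrambled genomes would be needed to judge the spread.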
Acknowledgements
This project comprises contributions from three individuals, to whom I owe my thanks:
· Dr. Edward Marcotte, in whose bioinformatics class I learned some PERL programming, and about algorithms to analyze DNA and protein sequence data.
· Dr. Vishy Iyer, one of the pioneers in the field of intergenic/promoter microarrays.
· Mr. Patrick Killion, himself a bioinformatician in the making, who showed kindness and patience as he helped me through programming problems. The programs used in the in silico analysis would have taken me much longer to write and perfect without his expert advice.
Bibliography
1. Lander, E.S., Linton, L.M., Birren, B. et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860-921
3. Struhl, K. (2001) Gene regulation: a paradigm for precision. Science 293, 1054-1055
4. Struhl, K. (1997) Selective roles for TATA-binding protein factors in vivo. Genes Funct 1, 5-9
10. http://genome-www.stanford.edu/Saccharomyces/
11. Sandberg, R., Winberg, G., Branden, C. I. (2001) Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier. Genome Res 11(8), 1404-1409
12. http://www.microarrays.org/
13. Shannon, C.E. (1948) A mathematical theory
of communication. The