The following research proposal is based on my rotation project in the laboratory of Dr

Computational and intergenic microarray approaches to investigate biological relevance of transcription factor binding site clustering in the yeast genome.

Background

So we have the sequences of entire genomes. Why do we care?

A draft sequence of the human genome was published in early 2001¹, and increasing amounts of genome sequence information are becoming available for a growing number of organisms. Investigating the functions and expression patterns of the vast numbers of newly discovered genes in sequenced genomes using conventional laboratory techniques has thus become impractical. However, the combination of available sequence information with powerful computational and high-throughput experimental techniques opens new possibilities for rapidly investigating the functions of uncharacterized genes. In particular, the use of DNA microarrays permits a genome-wide investigation of the organization of gene expression².

Gene expression is typically controlled at the level of transcription. In higher eukaryotes, transcription is a tightly regulated process in which general, tissue-specific, and cell-type-specific proteins known as transcription factors (TFs) interact both with promoter elements in the DNA and with each other to recruit RNA polymerase to the promoter³⁻⁶. TFs bind to specific DNA sequences called TF binding sites, which are typically 5-10 bases long.

Given the sizes of TF binding sites and complex genomes (~10⁹ bases), statistical analysis predicts that individual transcription factor binding sites will be found not only in gene promoters, but also randomly scattered throughout genomes. Inductive reasoning suggests that if a given TF bound avidly to every sequence it is capable of recognizing, fruitless TF binding events could occur throughout the genome, titrating out the TF, keeping it unavailable for promoter binding and shutting down transcription of the genes controlled by the TF. However, clustering of TF binding sites in gene promoter regions would enhance binding of TF to those regions. Furthermore, clustering TF binding sites close together would permit co-operativity of TF binding, thus enhancing fruitful TF-promoter binding events.

Screening the yeast genome for known TF binding sites has demonstrated that some of these sites do indeed cluster in promoter regions⁷, and that looking for clustering of binding sites may have predictive power in finding out how the transcription of insufficiently characterized ORFs is regulated. In addition, finding TF binding sites may help infer function of such ORFs using cluster analysis⁸. Yeast lends itself particularly well to the studies described here, as it is straightforward to culture and maintain, its genome size is small (12.4 Mbp), its genome is sequenced, the positions of its ORFs are known, and most yeast genes lack introns that can complicate systematic investigation of TF binding sites in intergenic regions.

Here, a project proposal is described in which the biological relevance of TF binding site clustering is investigated using:

computational techniques, to find clustering of TF binding sites in intergenic regions in the yeast genome, and
microarray analysis using intergenic microarrays to test the computationally obtained results.

The scope of my final project for CH391L is limited to the computational aspects of this proposal.

While there is some knowledge about the distribution of TF binding sites in the yeast genome⁷, few systematic attempts have been made to determine where TF binding sites tend to cluster upstream of yeast genes. Experimental results from yeast should provide a guideline for applying similar approaches to look for clustering of TF binding sites in higher eukaryotes, potentially simplifying the design of higher eukaryote intergenic arrays to a significant degree. Such arrays in turn would make possible a methodical investigation of how the interaction of transcription factors with complex higher eukaryotic genomes results in normal and experimentally induced gene expression patterns⁹.

Experimental Design

A. Finding TF Binding Site Clusters

First, do the in silico experiments…

The sequence of the yeast genome is available on the Saccharomyces Genome Database¹⁰, and both the entire genome and each individual chromosome sequence are available as simple text files. The programming language PERL was used to write programs that performed the following functions:

Program 1: A program to calculate the frequencies of dinucleotides in an overlapping manner in an input DNA sequence.

Program 2: A program that assembles a scrambled genome of the same size as the yeast genome using the dinucleotide frequencies calculated in Program 1.

Program 3: A program that moves a sliding window of size w one nucleotide at a time along the yeast genome and the scrambled genome and finds transcription factor binding motifs in the window and its reverse complement.

Program 4: New! A program to calculate the frequencies of trinucleotides in an overlapping manner in an input DNA sequence.

Program 5: New! A program that assembles a scrambled genome of the same size as the yeast genome using the trinucleotide frequencies calculated in Program 4.
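The overlapping counting done by Programs 1 and 4 can be illustrated with a minimal Python sketch. This is not the original PERL code; the function name kmer_frequencies and its parameters are hypothetical, chosen only for illustration. "Overlapping" is taken to mean that the counting window advances one nucleotide at a time.

```python
from collections import Counter

def kmer_frequencies(seq, k=2):
    """Return relative frequencies of overlapping k-mers in seq.

    k=2 corresponds to dinucleotides, k=3 to trinucleotides.
    """
    seq = seq.upper()
    # Slide one nucleotide at a time so that consecutive k-mers overlap.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}

# Overlapping dinucleotides of ACGCGT are AC, CG, GC, CG, GT.
freqs = kmer_frequencies("ACGCGT", k=2)
print(freqs["CG"])  # CG accounts for 2 of the 5 dinucleotides
```

Calling the same function with k=3 would give the trinucleotide frequencies.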

Program 3 in its final form will output the number and positions of the binding sites in the window into a spreadsheet, which in turn will generate a graph. In the graph, the number of occurrences of a TF binding motif will be plotted as a function of nucleotide position for both Watson and Crick strands.
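The sliding-window search could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the original PERL program: the names IUB, reverse_complement, motif_starts and window_counts are hypothetical, a site is counted only if it lies entirely inside the window, and a site that matches on both strands at the same position is counted once per strand.

```python
import re
from bisect import bisect_left, bisect_right

# IUB codes for mixed bases, written as regular-expression character classes.
IUB = {"A": "A", "C": "C", "G": "G", "T": "T",
       "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
       "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
       "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(motif):
    return motif.translate(COMPLEMENT)[::-1]

def motif_starts(seq, motif):
    # The lookahead reports every match, including overlapping ones.
    pattern = re.compile("(?=" + "".join(IUB[b] for b in motif) + ")")
    return [m.start() for m in pattern.finditer(seq)]

def window_counts(seq, motif, w):
    """Number of motif sites (both strands) inside each window of size w.

    Element i of the result is the count for the window starting at
    position i; the window moves one nucleotide at a time.
    """
    k = len(motif)
    # Searching the reverse-complemented motif on the given strand is
    # equivalent to searching the motif on the opposite strand.
    starts = sorted(motif_starts(seq, motif)
                    + motif_starts(seq, reverse_complement(motif)))
    counts = []
    for left in range(len(seq) - w + 1):
        # Sites starting in [left, left + w - k] fit entirely in the window.
        counts.append(bisect_right(starts, left + w - k)
                      - bisect_left(starts, left))
    return counts

print(window_counts("ACGCGTTTTACGCGA", "ACGCGN", 6))
```

Locating all sites once and then counting them per window with a binary search avoids rescanning the sequence for every window position.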

Simple manipulations to the code of Program 3 permit searching for known sequence polymorphisms for TF binding sites using IUB codes for mixed bases. Dinucleotide frequencies, as a second-order approximation of the yeast genome sequence, are used to make a scrambled genome resemble the structure of the yeast genome more closely than a simple scrambling based on individual yeast nucleotide frequencies could accomplish¹¹.
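One way to assemble a scrambled genome from dinucleotide statistics is to draw each base conditional on the previous one (a first-order Markov chain). The Python sketch below takes that approach. It is an assumption about how such a scrambling could be done, not a description of how Program 2 actually works, and the function name scramble_by_dinucleotides is hypothetical.

```python
import random
from collections import Counter, defaultdict

def scramble_by_dinucleotides(seq, length=None, seed=None):
    """Generate a random sequence whose dinucleotide statistics
    approximate those of seq (first-order Markov chain; an assumed
    method, not necessarily that of the original Program 2)."""
    rng = random.Random(seed)
    length = len(seq) if length is None else length
    if not seq or length <= 0:
        return ""
    # Count how often each base follows each other base.
    transitions = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        transitions[a][b] += 1
    bases = sorted(set(seq))
    base_weights = [seq.count(b) for b in bases]
    current = rng.choices(bases, weights=base_weights)[0]
    out = [current]
    while len(out) < length:
        followers = transitions[current]
        if followers:
            current = rng.choices(list(followers),
                                  weights=list(followers.values()))[0]
        else:
            # Dead end (base seen only at the very end of seq):
            # restart from the single-base frequencies.
            current = rng.choices(bases, weights=base_weights)[0]
        out.append(current)
    return "".join(out)

scrambled = scramble_by_dinucleotides("ACGCGTTTACGCGAATTCCGG" * 10, seed=1)
print(len(scrambled))  # same length as the input, 210
```

Because the output is random, its dinucleotide frequencies only approximate those of the input, and the agreement improves with sequence length.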

B. Microarray Experiments

…then test hypotheses from the in silico results in vitro.

The creation of yeast intergenic microarrays is described elsewhere⁹. Yeast will be grown under experimental conditions (e.g. heat shock, starvation, etc.) that induce gene expression responses in which defined transcription factors are known to participate. Cross-linking the TFs to their specific binding sites in the yeast genome by formaldehyde treatment, followed by chromatin-IP, will isolate the segments of the genome to which TFs are bound. Reversing the cross-links of the DNA-bound TFs, followed by random-primed amplification and labeling of experimental and control samples with Cy3 and Cy5 fluorescent dyes, will yield probe populations with which the intergenic microarrays will be probed⁹. Microarray analysis will then be performed as described previously¹².

Preliminary Results

In silico, first pass:

The TF binding site chosen for generating preliminary data was the degenerate sequence ACGCGN, which is a binding site for the yeast transcription factor MBF⁷. In a genome with individual nucleotide frequencies equivalent to those of yeast (p(A) = 0.3098, p(C) = 0.1909, p(G) = 0.1906, p(T) = 0.3087), this motif would be expected to occur with a frequency of 4.101 × 10⁻⁴. The window sizes used are listed in the table below. n is the number of binding sites found in a window; n_max is the largest number of TF binding sites found for a given window size. The n ≥ 3 columns give the number of windows in which three or more binding sites were found.

window size (nt)	dinucleotide-scrambled genome: n ≥ 3	n_max	trinucleotide-scrambled genome: n ≥ 3	n_max	actual genome: n ≥ 3	n_max
2000	491	6	556	5	455	8
1000	138	4	150	4	163	6
750	33	4	83	4	122	5
500	7	4	38	3	89	5
250	7	3	14	3	42	5
200	6	3	9	3	36	5
175	6	3	8	3	33	5
150	6	3	5	3	28	5
125	1	3	2	3	21	4
100	1	3	1	3	12	4
85	1	3	1	3	10	4
70	1	3	1	3	8	3

Discussion of preliminary results:

While the number of windows in which three or more TF binding sites are found drops off with decreasing window size in both scrambled and actual genomes, this number decreases considerably faster in the dinucleotide-scrambled genome than in the actual genome. In the actual yeast genome, for every 50% decrease in window size, the number of windows in which three or more TF binding sites are found also decreases by approximately a factor of two. In the dinucleotide-scrambled genome, for every 50% decrease in window size, the number of windows in which n ≥ 3 decreases on average by a factor of ~8.

Note: new data (as of 12/18/2001)! New programs, Programs 4 and 5, were written in order to scramble the yeast genome according to the frequencies of occurrence of trinucleotides in the yeast genome. This second scrambled genome was run through Program 3, with the new tabulated results above. Note that with decreasing window size in this trigram-scrambled genome, the number of windows in which three or more motifs are found still drops off faster than in the actual yeast genome, but slower than in the digram-scrambled genome. In the trigram-scrambled genome, for every 50% decrease in window size, the number of windows in which n ≥ 3 decreases on average by a factor of ~4.
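The average fold decreases quoted in this discussion can be reproduced from the tabulated n ≥ 3 counts. The Python sketch below assumes that only the window sizes that are exact halvings of one another (2000, 1000, 500, 250 and 125 nt) are used and that the fold changes are averaged arithmetically; the averaging method is not stated in the text, so both are assumptions. The function name mean_fold_decrease is hypothetical.

```python
# n >= 3 window counts copied from the table, keyed by window size (nt).
COUNTS = {
    "dinucleotide-scrambled": {2000: 491, 1000: 138, 500: 7, 250: 7, 125: 1},
    "trinucleotide-scrambled": {2000: 556, 1000: 150, 500: 38, 250: 14, 125: 2},
    "actual": {2000: 455, 1000: 163, 500: 89, 250: 42, 125: 21},
}

def mean_fold_decrease(series):
    """Arithmetic mean of the fold decrease per halving of window size."""
    sizes = sorted(series, reverse=True)
    folds = [series[big] / series[small] for big, small in zip(sizes, sizes[1:])]
    return sum(folds) / len(folds)

for genome, series in COUNTS.items():
    print(genome, round(mean_fold_decrease(series), 1))
```

Under these assumptions the means come out at roughly 7.8, 4.3 and 2.2 for the dinucleotide-scrambled, trinucleotide-scrambled and actual genomes, in line with the factors of ~8, ~4 and ~2 given in the text.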

The tabulated results lend support to the idea that clusters of ACGCGN motifs, found more frequently in the actual than in the scrambled genome, have biological relevance.

Further considerations

OK Martin, what’s next?

The following are some refinements outside the scope of the final project that may prove useful for further work:

1) Improvement of the precision of Program 2. Currently, when the output of this program is fed back into Program 1 to calculate dinucleotide frequencies, the dinucleotide frequencies of the scrambled genome differ from those of the actual genome by no more than ~0.3%. Further precision could easily be obtained by simple modifications to Program 2.

2) Expansion of Program 2 to scramble a genome based on trinucleotide frequencies. As noted by Shannon¹³, the resemblance to English of mono-, di-, tri- and tetragrams of “words” formed at random based on single-, double-, triple- and quadruple-letter frequencies in the English language increases noticeably for higher order n-grams. It is not unreasonable to assume that a “random” genome that more closely approximates the yeast genome at the level of trigrams would also be of value to investigate statistical significance of motif clustering in comparison to the actual yeast genome.

→ This has already been carried out; see the trinucleotide-scrambled results above.

3) Generalization of Program 2. This program is currently designed to scramble DNA sequences using the dinucleotide frequencies specific to the yeast genome. An additional subroutine could be written to expand the usefulness of the program to any genome.

4) Statistical analysis. With the frequencies of sites from actual and scrambled genomes, the probabilities of finding n words in a window of size w can be calculated in an iterative manner as the window is moved along the genome sequence nucleotide by nucleotide. For each window position, a score can be assigned based on calculated probabilities. In addition to graphs plotting frequency of occurrence vs. nucleotide position, graphs of scores vs. frequency of occurrence could be plotted for scrambled and actual genomes. Comparison of the graphs might show a cutoff score above which word clustering occurs due to more than random chance. Running Program 3 on a large number of scrambled genomes scrambled with Program 2 would give a spread of the types of values tabulated above. The width of this spread would be a measure of the validity of results from statistical analysis.

5) Testing validity of results for other known TF binding sites. Here, the only TF binding site investigated was the ACGCGN motif; the relevance of binding site clustering in general terms should be investigated for other transcription factors as well.
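As a starting point for the statistical analysis proposed in item 4, the probability of finding at least n sites in a window of size w can be approximated with a binomial model. The Python sketch below is a simplification under stated assumptions: motif occurrences at different positions are treated as independent, only one strand is considered, and p is the per-position motif frequency (for example the 4.101 × 10⁻⁴ calculated for ACGCGN). The function name prob_at_least is hypothetical.

```python
from math import comb

def prob_at_least(n, w, p, k=6):
    """Approximate P(at least n motif sites in a window of w nt).

    Binomial model: each of the w - k + 1 possible start positions for a
    k-mer is treated as an independent trial with success probability p.
    """
    trials = w - k + 1
    below = sum(comb(trials, i) * p**i * (1 - p)**(trials - i)
                for i in range(n))
    return 1.0 - below

# Chance of three or more ACGCGN sites in a 500-nt window, single strand.
print(prob_at_least(3, 500, 4.101e-4))
```

A per-window score could then be derived from this probability, for instance its negative logarithm, and compared between actual and scrambled genomes.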

Acknowledgements

This project comprises contributions from three individuals, to whom I owe my thanks:

· Dr. Edward Marcotte, in whose bioinformatics class I learned some PERL programming, and about algorithms to analyze DNA and protein sequence data.

· Dr. Vishy Iyer, one of the pioneers in the field of intergenic/promoter microarrays.

· Mr. Patrick Killion, himself a bioinformatician in the making, who showed kindness and patience as he helped me through programming problems. The programs used in the in silico analysis would have taken me much longer to write and perfect without his expert advice.

Bibliography

1. Lander, E.S., Linton, L.M., Birren, B. et al (2001) Initial sequencing and analysis of the human genome. Nature 409, 860-921

2. Lockhart, D. J. and Winzeler, E. A. (2000) Genomics, gene expression and DNA arrays. Nature 405, 827-836

3. Struhl, K. (2001) Gene regulation: a paradigm for precision. Science 293, 1054-1055

4. Struhl, K. (1997) Selective roles for TATA-binding protein factors in vivo. Genes Funct 1, 5-9

5. Mencia, M. and Struhl, K. (2001) Region of Yeast TAF 130 Required for TFIID To Associate with Promoters. Mol Cell Biol 21(4), 1145-1154

6. Geisberg, J., Holstege, F.C., Young, R. et al (2001) Yeast NC2 Associates with the RNA Polymerase II Preinitiation Complex and Selectively Affects Transcription In Vivo. Mol Cell Biol 21(8), 2736-2742

7. Spellman, P.T., Sherlock, G., Zhang, M.Q. et al (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9(12), 3273-3297

8. Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci 95, 14863-14868

9. Iyer, V.R., Horak, C.E., Scafe, C.S. et al (2001) Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409, 533-538

10. http://genome-www.stanford.edu/Saccharomyces/

11. Sandberg, R., Winberg, G., Branden, C. I. (2001) Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier. Genome Res 11(8), 1404-1409

12. http://www.microarrays.org/

13. Shannon, C.E. (1948) A mathematical theory of communication. The Bell System Technical Journal 27, pp. 379-423, 623-656