National Sun Yat-sen University, thesis/dissertation, SNP Haplotype Block Inference and Tag Selection Algorithm

Title	SNP Haplotype Block Inference and Tag Selection Algorithm
Department	Department of Computer Science and Engineering
Year, semester	Academic Year 92 (ROC calendar, 2003-2004), spring semester	Language	English
Degree	Master	Number of pages	50
Author	Chia-Ling Sun (孫嘉璘)
Advisor	Chang-Biau Yang (楊昌彪)
Convenor	Mau-Shiang Chang (張貿翔)
Advisory Committee	Cheng-Wen Co (柯正雯); You-Ling Shiue (薛佑玲); Bang-Ye Wu (吳邦一)
Date of Exam	2004-07-09	Date of Submission	2004-07-27
Keywords	diversity, haplotype, SNP (single nucleotide polymorphism)
Statistics	This thesis/dissertation has been browsed 5685 times and downloaded 2280 times.

Abstract
A SNP (single nucleotide polymorphism, pronounced "snip") is a difference at a single nucleotide position within the human population. These differences can be detected in the human genome and occur about once every 1000 base pairs. Only two nucleotides are possible at each SNP position. Because SNPs are abundant and have low diversity, they are well suited as genetic markers for capturing human disease traits. Recent research has shown that the human genome has a block-like structure and that only limited haplotype diversity is observed within each block. Consequently, a small fraction of the SNPs in a block is enough to capture its haplotype diversity; these SNPs are called tagSNPs. We propose a fixed-diversity approach to capture the diversity of the entire data set. After partitioning the data into haplotype blocks, we provide objective measures for evaluating whether the partition is appropriate. Using our algorithm, we find that the diversity of the chromosome 21 SNPs is 0.5. The resulting blocks share common properties with the haplotype data downloaded from the NCBI web site. Finally, we develop an algorithm for tagSNP selection within each block and obtain a compression ratio of 0.78.
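
The idea of tagSNPs can be illustrated with a minimal sketch. The record does not describe the thesis's tag selection algorithm, so what follows is an assumption made for illustration only: a generic greedy heuristic that picks SNP columns until every distinct haplotype in a block has a unique pattern on the chosen columns. The function names `distinct_patterns` and `greedy_tag_snps` and the toy block are hypothetical.

```python
from itertools import combinations


def distinct_patterns(haplotypes, columns):
    """Return the set of allele patterns seen when only `columns` are kept."""
    return {tuple(h[c] for c in columns) for h in haplotypes}


def greedy_tag_snps(haplotypes):
    """Greedy heuristic (an illustrative assumption, not the thesis's method).

    `haplotypes` is a list of equal-length strings over two alleles,
    one string per haplotype, one character per SNP position.
    """
    unique = sorted(set(haplotypes))
    n_snps = len(unique[0])
    # Pairs of distinct haplotypes that still look identical on the chosen tags.
    unresolved = set(combinations(range(len(unique)), 2))
    tags = []
    while unresolved:
        # Choose the unused column that separates the most unresolved pairs.
        best = max(
            (c for c in range(n_snps) if c not in tags),
            key=lambda c: sum(unique[i][c] != unique[j][c] for i, j in unresolved),
        )
        tags.append(best)
        # Keep only the pairs that the new tag does not separate.
        unresolved = {(i, j) for i, j in unresolved if unique[i][best] == unique[j][best]}
    return sorted(tags)


# Toy block: four haplotypes over five SNP positions (made-up data).
block = ["00110", "00111", "11000", "11001"]
tags = greedy_tag_snps(block)
print(tags)  # [0, 4]
print(len(distinct_patterns(block, tags)))  # 4
```

In this toy block, two of the five SNP positions are enough to tell all four haplotypes apart. How the thesis defines and computes its compression ratio is not stated in this record, so the sketch does not attempt to reproduce it.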

Table of Contents
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
Chapter 1. Introduction
  1.1 Introduction to SNP
  1.2 Characteristics of the Blocks
  1.3 The Need of tagSNP Selection
Chapter 2. Previous Works
  2.1 Different Criteria for Block and tagSNP Selection
  2.2 The Fixed-diversity Approach
  2.3 The NP-Complete Property
  2.4 SCP Approaches
Chapter 3. Main Ideas
  3.1 Data Format
  3.2 Problem Definition
  3.3 Diversity Calculation within One Block
    3.3.1 Classification
    3.3.2 Calculation for Diversity
  3.4 Setting Fixed-diversity Threshold
Chapter 4. Our Methods
  4.1 Diversity Calculation
  4.2 Dealing with Missing Data
    4.2.1 Method 1
    4.2.2 Method 2
    4.2.3 Method 3
    4.2.4 Method 4
  4.3 Tag Selection Idea
Chapter 5. Results and Discussion
  5.1 The Power of Fixed-diversity Threshold
    5.1.1 Random Data Testing
    5.1.2 Properties of Block Length
    5.1.3 Implication of Block Number Variance between Diversities
    5.1.4 Secondary Block Boundary Effects
  5.2 Adopting Haplotype Data in Verification
  5.3 Evaluation of the Partition Results
    5.3.1 Definitions of Penalty Function
    5.3.2 Statistics of Partition Results
  5.4 The Number of Required Tags
Chapter 6. Conclusion
BIBLIOGRAPHY


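Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: available on campus immediately, available off campus after one year. Availability: Campus: available; Off-campus: available. File: etd-0727104-171533.pdf
Printed copies
Public-access information for printed theses is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. Availability: available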
