國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,共同多重集區間的高效率演算法,Efficient Algorithms for the Common Multiset Interval Problem

論文名稱 Title	共同多重集區間的高效率演算法 Efficient Algorithms for the Common Multiset Interval Problem
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	103 學年度第 1 學期 The fall semester of Academic Year 103	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	47
研究生 Author	陳慶耀 Cing-Yao Chen
指導教授 Advisor	楊昌彪 Chang-Biau Yang
召集委員 Convenor	李宗南 Chungnan Lee
口試委員 Advisory Committee	安興彥, 彭永興 Hsing-Yen Ann; Yung-Hsing Peng
口試日期 Date of Exam	2015-01-26	繳交日期 Date of Submission	2015-02-08
關鍵字 Keywords	行為知識空間、共同多重集區間、大學能力程式檢定、最長共同子序列、共同區間、組合語言碼 Assembly Code, CI, LCS, CPE, BKS, CMI
統計 Statistics	本論文已被瀏覽 5696 次，被下載 395 次 The thesis/dissertation has been browsed 5696 times, has been downloaded 395 times.

中文摘要
對於兩序列 A = a_1 a_2 a_3 … a_m 以及 B = b_1 b_2 b_3 … b_n，一個多重集區間為 ∆(A, i, j) = [a_x | i ≤ x ≤ j]；共同多重集區間（common multiset interval，CMI）是一個同時出現在兩序列的多重集，亦即存在 i_A, j_A, i_B, j_B 使得 ∆(A, i_A, j_A) = ∆(B, i_B, j_B)，其中 1 ≤ i_A ≤ j_A ≤ m 且 1 ≤ i_B ≤ j_B ≤ n。先前，研究者推出了用以找到兩個排列（permutation）以及序列（sequence）的共同區間演算法。在這篇碩士論文中，我們推出兩個用來在兩序列中找到共同多重集區間的演算法。第一個演算法是 occurrence counting 演算法，它計算元素在兩輸入序列所有區間中的出現次數，並且計算元素出現次數的差值。它的時間複雜度是 O(n^3)，n 代表輸入序列的長度。第二個演算法是 hash key 演算法，使用質數乘積以及模運算來建立雜湊表以便加快搜尋。第二個演算法的時間複雜度是 O(n^2 + Gn + qn) 或 O(n^2|Σ| + G|Σ| + q|Σ|)，G 代表答案的數量而 q 代表錯誤碰撞的數量。在我們的實驗中，我們使用 CPE (Collegiate Programming Examination of Taiwan) 中的 C/C++ 程式碼作為我們分類用的資料集。實驗結果顯示 BKS (behavior knowledge space) 混合 LCS (longest common subsequence) 和 CMI 可以得到比兩個方法單獨使用還要高的準確度。
Abstract
For two sequences A = a_1 a_2 a_3 … a_m and B = b_1 b_2 b_3 … b_n, a multiset interval is ∆(A, i, j) = [a_x | i ≤ x ≤ j], and a common multiset interval (CMI) is a multiset that appears in both sequences, that is, ∆(A, i_A, j_A) = ∆(B, i_B, j_B) for some i_A, j_A, i_B, j_B with 1 ≤ i_A ≤ j_A ≤ m and 1 ≤ i_B ≤ j_B ≤ n. Previously, researchers have proposed algorithms for finding the common set intervals of permutations and of sequences. In this thesis, we propose two algorithms to find the common multiset intervals of two sequences. The first is the occurrence counting algorithm, which counts the occurrences of the characters in all intervals of the two input sequences and calculates the differences of the character occurrences. Its time complexity is O(n^3), where n denotes the length of the input sequences. The second is the hash key algorithm, which uses the product of prime numbers and the modulo operation to build a hash table for quick search. The time complexity of the second algorithm is O(n^2 + Gn + qn) or O(n^2|Σ| + G|Σ| + q|Σ|), where G denotes the number of answers and q denotes the number of false collisions. In our experiments, we use C/C++ source code from the CPE (Collegiate Programming Examination of Taiwan) as the data set for classification. The experimental results show that the BKS (behavior knowledge space) method, combining the LCS (longest common subsequence) and CMI classifiers, achieves higher accuracy than either method alone.
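The two ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`first_primes`, `cmi_by_counting`, `cmi_by_hash_key`), the default modulus, and the choice to include the interval length in the hash key are all assumptions made for illustration. The sketch relies on the fact that two intervals with equal multisets must have equal length.

```python
from collections import Counter, defaultdict


def first_primes(k):
    """Return the first k primes by trial division (fine for small alphabets)."""
    primes = []
    candidate = 2
    while len(primes) < k:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def cmi_by_counting(a, b):
    """Occurrence-counting sketch: all (iA, jA, iB, jB), 1-based inclusive,
    whose intervals hold the same multiset. Cubic time overall."""
    result = []
    for ia in range(len(a)):
        for ib in range(len(b)):
            diff = defaultdict(int)  # count in A interval minus count in B interval
            nonzero = 0              # number of symbols whose difference is not 0
            length = 0
            while ia + length < len(a) and ib + length < len(b):
                # Extend both intervals by one symbol and update the differences.
                for sym, delta in ((a[ia + length], 1), (b[ib + length], -1)):
                    if diff[sym] == 0:
                        nonzero += 1
                    diff[sym] += delta
                    if diff[sym] == 0:
                        nonzero -= 1
                length += 1
                if nonzero == 0:  # every symbol occurs equally often in both
                    result.append((ia + 1, ia + length, ib + 1, ib + length))
    return sorted(result)


def cmi_by_hash_key(a, b, modulus=1_000_003):
    """Hash-key sketch: one prime per symbol, key = product of primes mod modulus.
    Matching keys are verified by counting, so false collisions are rejected."""
    symbols = sorted(set(a) | set(b))
    prime_of = dict(zip(symbols, first_primes(len(symbols))))

    def interval_keys(seq):
        # Yield ((length, key), (i, j)) for every interval, 0-based inclusive.
        for i in range(len(seq)):
            key = 1
            for j in range(i, len(seq)):
                key = key * prime_of[seq[j]] % modulus
                yield (j - i + 1, key), (i, j)

    table = defaultdict(list)
    for key, span in interval_keys(a):
        table[key].append(span)

    result = []
    for key, (ib, jb) in interval_keys(b):
        for ia, ja in table.get(key, ()):
            # Verification step: rejects candidates that collided by accident.
            if Counter(a[ia:ja + 1]) == Counter(b[ib:jb + 1]):
                result.append((ia + 1, ja + 1, ib + 1, jb + 1))
    return sorted(result)
```

For example, `cmi_by_counting("ab", "ba")` returns `[(1, 1, 2, 2), (1, 2, 1, 2), (2, 2, 1, 1)]`, and `cmi_by_hash_key` returns the same list. Passing a deliberately small modulus forces collisions, which makes the role of the verification step, and of the collision term q in the stated complexity, easy to observe.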

目次 Table of Contents
Thesis Approval Form (Chinese)
Thesis Approval Form (English)
Acknowledgments
Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
  1.1 Definitions
Chapter 2. Previous Works
  2.1 The Common Set Intervals of Two Permutations
    2.1.1 Algorithm 1 of Uno and Yagiura
    2.1.2 Algorithm 2 of Uno and Yagiura
    2.1.3 Algorithm 3 of Uno and Yagiura
    2.1.4 Algorithm 4 of Uno and Yagiura
  2.2 The Common Set Intervals of k Permutations
  2.3 The Common Set Intervals of Two Sequences
  2.4 The Common Set Intervals of k Sequences
  2.5 The Longest Common Subsequence Problem
  2.6 The Behavior Knowledge Space Method
Chapter 3. Algorithms for the Common Multiset Interval Problem
  3.1 The Occurrence Counting Algorithm
  3.2 The Hash Key Algorithm
Chapter 4. Experimental Results
  4.1 Assembly Process
  4.2 BKS with LCS and CMI
Chapter 5. Conclusions
Bibliography

參考文獻 References
[1] H.-Y. Ann, C.-B. Yang, C.-T. Tseng, and C.-Y. Hor, “A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings,” Information Processing Letters, Vol. 108, pp. 360–364, 2008.
[2] M.-P. Béal, A. Bergeron, S. Corteel, and M. Raffinot, “An algorithmic view of gene teams,” Theoretical Computer Science, Vol. 320, No. 2-3, pp. 395–418, 2004.
[3] K.-Y. Cheng, K.-S. Huang, and C.-B. Yang, “The longest common subsequence problem with the gapped constraint,” Proc. of the 30th Workshop on Combinatorial Mathematics and Computation Theory, pp. 80–85, 2013.
[4] M. Clauss, M. Bernt, and M. Middendorf, “A common interval guided aco algorithm for permutation problems,” 2013 IEEE Symposium on Swarm Intelligence (SIS), pp. 64–71, 2013.
[5] G. Didier, “Common intervals of two sequences,” Algorithms in Bioinformatics, Vol. 2812, pp. 17–24, 2003.
[6] M. Y. Galperin and E. V. Koonin, “Who’s your neighbor? new computational approaches for functional genomics,” Nature Biotechnology, Vol. 18, No. 6, pp. 609–13, 2009.
[7] S. Heber and J. Stoye, “Finding all common intervals of k permutations,” In Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001, pp. 207–218, Springer Verlag, 2001.
[8] D. S. Hirschberg, “A linear space algorithm for computing maximal common subsequences,” Communications of the ACM, Vol. 18, pp. 341–343, 1975.
[9] K.-S. Huang, C.-B. Yang, and K.-T. Tseng, “Fast algorithms for finding the common subsequence of multiple sequences,” Proceedings of International Computer Symposium, Taipei, Taiwan, p. 90 (abstract, full text in CD), 2004.
[10] K.-S. Huang, C.-B. Yang, K.-T. Tseng, H.-Y. Ann, and Y.-H. Peng, “Efficient algorithms for finding interleaving relationship between sequences,” Information Processing Letters, Vol. 105(5), pp. 188–193, 2008.
[11] K.-S. Huang, C.-B. Yang, K.-T. Tseng, Y.-H. Peng, and H.-Y. Ann, “Dynamic programming algorithms for the mosaic longest common subsequence problem,” Information Processing Letters, Vol. 102, pp. 99–103, 2007.
[12] Y. Huang and C. Suen, “The behavior-knowledge space method for combination of multiple classifiers,” Computer Vision and Pattern Recognition, 1993. Proceedings CVPR ’93., 1993 IEEE Computer Society Conference on, pp. 347–352, Jun 1993.
[13] J. W. Hunt and T. G. Szymanski, “A fast algorithm for computing longest common subsequences,” Communications of the ACM, Vol. 20(5), pp. 350–353, 1977.
[14] C. S. Iliopoulos and M. S. Rahman, “Algorithms for computing variants of the longest common subsequence problem,” Theoretical Computer Science, Vol. 395, pp. 255–267, 2008.
[15] C. S. Iliopoulos and M. S. Rahman, “New efficient algorithms for the LCS and constrained LCS problems,” Information Processing Letters, Vol. 106(1), pp. 13–18, 2008.
[16] W. C. Lathe, B. Snel, and P. Bork, “Gene context conservation of a higher order than operons,” Trends in Biochemical Sciences, Vol. 25, No. 10, pp. 474–479, 2000.
[17] N. Luc, J.-L. Risler, A. Bergeron, and M. Raffinot, “Gene teams: a new formalization of gene clusters for comparative genomics,” Computational Biology and Chemistry, Vol. 27, No. 1, pp. 59–67, 2003.
[18] R. Overbeek, M. Fonstein, M. D’Souza, G. D. Pusch, and N. Maltsev, “The use of gene clusters to infer functional coupling,” Proceedings of the National Academy of Sciences of the United States of America, Vol. 96, No. 6, pp. 2896–2901, 1999.
[19] R. J. Parikh, “On context-free languages,” J. ACM, Vol. 13, No. 4, pp. 570–581, Oct. 1966.
[20] Y.-H. Peng, C.-B. Yang, K.-S. Huang, C.-T. Tseng, and C.-Y. Hor, “Efficient sparse dynamic programming for the merged LCS problem with block constraints,” International Journal of Innovative Computing, Information and Control, Vol. 6, pp. 1935–1947, 2010.
[21] Y.-H. Peng, C.-B. Yang, K.-S. Huang, and K.-T. Tseng, “An algorithm and applications to sequence alignment with weighted constraints,” International Journal of Foundations of Computer Science, Vol. 21, pp. 51–59, 2010.
[22] I. Rusu, “Extending common intervals searching from permutations to sequences,” The Computing Research Repository, Vol. abs/1310.4290, 2013.
[23] T. Schmidt and J. Stoye, “Quadratic time algorithms for finding common intervals in two and more sequences,” In Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, CPM 2004, pp. 347–58, Springer, 2004.
[24] I. Stewart, Galois theory. Chapman Hall/CRC Mathematics, third ed., 2003.
[25] J. Tamames, “Evolution of gene order conservation in prokaryotes,” Genome Biology, Vol. 2, No. 6, pp. research0020.1–research0020.11, 2001.
[26] J. Tamames, G. Casari, C. Ouzounis, and A. Valencia, “Conserved clusters of functionally related genes in two bacterial genomes,” Journal of Molecular Evolution, Vol. 44, No. 1, pp. 66–73, Jan. 1997.
[27] T. Uno and M. Yagiura, “Fast algorithms to enumerate all common intervals of two permutations,” Algorithmica, Vol. 26, No. 2, pp. 290–309, 2000.

電子全文 Fulltext
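This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the text without authorization, so as not to violate the law.
Thesis access permission: user-defined availability
Available: Campus: available; Off-campus: available
etd-0108115-142017.pdf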
紙本論文 Printed copies
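Availability information for printed copies is relatively complete from Academic Year 102 onward. To inquire about the availability of printed copies from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available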



Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS