National Sun Yat-sen University, thesis/dissertation, Improving Data Quality: Development and Evaluation of Error Detection Methods

Title	Improving Data Quality: Development and Evaluation of Error Detection Methods (資料品質改善之研究：錯誤資料偵測技術之發展與評估)
Department	Department of Information Management
Year, semester	Academic Year 90, second (spring) semester	Language	English
Degree	Master	Number of pages	50
Author	Nien-Chiu Lee (李念秋)
Advisor	Chih-Ping Wei (魏志平)
Convenor	San-Yih Hwang (黃三益)
Advisory Committee	Chao-Min Chiu (邱兆民)
Date of Exam	2002-07-23	Date of Submission	2002-07-25
Keywords	Semantic Constraint, Error Detection, Data Quality, Outlier Detection, Data Cleaning, Decision Tree Induction
Statistics	This thesis has been viewed 5703 times and downloaded 11945 times.

Abstract
High-quality data are essential to decision support in organizations. However, estimates have shown that 15-20% of the data within an organization's databases can be erroneous. Some databases contain a large number of errors, which can cause serious problems when the data are used for managerial decision-making. To improve data quality, data cleaning is needed, and many organizations have already initiated such efforts. Broadly, data quality problems fall into three categories: incompleteness, inconsistency, and incorrectness. Of the three, incorrectness is the major source of low-quality data. This research therefore focuses on detecting incorrect data in databases in order to improve data quality. Based on the semantic constraint framework, we developed a set of error detection methods: uniqueness detection, domain detection, attribute value dependency detection, attribute domain inclusion detection, and entity participation detection. Empirical evaluation showed that some of the proposed techniques (e.g., uniqueness detection) achieved low miss rates and low false alarm rates. Overall, the proposed methods together identified around 50% of the errors introduced by subjects during the experiments.
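To make the terms in the abstract concrete, the following is a minimal Python sketch of two of the named constraint types (uniqueness and domain) and of the two evaluation measures. It is not the thesis's implementation: the function names, the sample records, and the exact rate definitions (miss rate as the share of actual errors left unflagged, false alarm rate as the share of flagged records that are not errors) are assumptions made for illustration, and the thesis may define them differently.

```python
from collections import Counter


def uniqueness_violations(rows, key):
    """Return indices of rows whose value for `key` occurs more than once."""
    counts = Counter(row[key] for row in rows)
    return {i for i, row in enumerate(rows) if counts[row[key]] > 1}


def domain_violations(rows, attr, allowed):
    """Return indices of rows whose `attr` value lies outside `allowed`."""
    return {i for i, row in enumerate(rows) if row[attr] not in allowed}


def miss_rate(flagged, actual):
    """Assumed definition: fraction of actual errors that were not flagged."""
    if not actual:
        return 0.0
    return len(actual - flagged) / len(actual)


def false_alarm_rate(flagged, actual):
    """Assumed definition: fraction of flagged records that are not errors."""
    if not flagged:
        return 0.0
    return len(flagged - actual) / len(flagged)


# Hypothetical records; indices 2, 3, and 4 are taken to be the true errors.
rows = [
    {"id": 1, "grade": "A"},
    {"id": 2, "grade": "B"},
    {"id": 2, "grade": "C"},  # duplicated id
    {"id": 4, "grade": "Z"},  # value outside the grade domain
    {"id": 5, "grade": "B"},  # wrong but plausible value; no check catches it
]
actual = {2, 3, 4}

flagged = uniqueness_violations(rows, "id") | domain_violations(
    rows, "grade", {"A", "B", "C", "D", "F"}
)

print(sorted(flagged))
print(miss_rate(flagged, actual), false_alarm_rate(flagged, actual))
```

In this toy data the uniqueness check flags both rows that share an id, so the correct one becomes a false alarm, while the plausible-but-wrong grade is missed. This illustrates why the two rates are reported separately.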

Table of Contents
CHAPTER 1 Introduction
  1.1 Background
  1.2 Research Motivations and Objective
  1.3 Organization of the Thesis
CHAPTER 2 Literature Review
  2.1 Data Cleaning Techniques for Data Incompleteness
  2.2 Data Cleaning Techniques for Data Inconsistency
  2.3 Data Cleaning Techniques for Data Incorrectness
CHAPTER 3 Error Detection Methods Based on Semantic Constraint Framework
  3.1 Uniqueness Detection
  3.2 Attribute Domain Detection
  3.3 Attribute Value Dependency Detection
  3.4 Domain Inclusion Detection
  3.5 Entity Participation Detection
  3.6 Summary
CHAPTER 4 Empirical Evaluation
  4.1 Database Collection for Evaluation
  4.2 Evaluation Criteria
  4.3 Evaluation Design
  4.4 Profile of Subjects
  4.5 Evaluation Results
CHAPTER 5 Conclusions and Future Research Directions
References


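Full Text
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: unrestricted (fully open on and off campus). Available: Campus: available; Off-campus: available. etd-0725102-233322.pdf
Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available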

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS