博碩士論文 etd-0725102-233322 詳細資訊
Title page for etd-0725102-233322
論文名稱
Title
資料品質改善之研究:錯誤資料偵測技術之發展與評估
Improving Data Quality: Development and Evaluation of Error Detection Methods
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
50
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2002-07-23
繳交日期
Date of Submission
2002-07-25
關鍵字
Keywords
資料品質、錯誤偵測、決策樹歸納法、語意限制、異常點偵測、資料淨化
Semantic Constraint, Error Detection, Data Quality, Outlier Detection, Data Cleaning, Decision Tree Induction
統計
Statistics
本論文已被瀏覽 5703 次,被下載 11945
The thesis/dissertation has been browsed 5703 times, has been downloaded 11945 times.
中文摘要
在組織中,決策所需資料的品質是影響決策好壞的重要依據,但是根據估計顯示,在組織中15%-20%的資料庫可能含有錯誤的資料。在資料庫中錯誤的資料常常導致了潛在的決策問題,為了改善資料的品質,資料淨化(Data Cleaning)的技術必須要被實行。廣泛地說,資料品質的問題可以被分為三種,分別是:不完整性(Incompleteness)、不一致性(Inconsistency)以及不正確性(Incorrectness)。在這三種資料品質的問題中,不正確性資料的問題是資料品質低落的主要來源。因此,本研究的目的在資料庫中偵測不正確的資料以改善資料品質。根據語意限制(semantic constraint)的架構,本研究發展一套錯誤偵測的方法,偵測的範圍包括唯一性偵測(uniqueness detection)、值域偵測(domain detection)、屬性相依性偵測(attribute value dependency detection)、屬性值域間包容性偵測(attribute domain inclusion detection)以及實體參與性偵測(entity participation detection)。實證評估結果顯示在部份提出的方法中(如唯一性偵測)可以得到較低的失誤率(miss rate)和錯誤警報率(false alarm rate)。整體而言,本研究所提出的偵測方法可以在錯誤資料中偵測出約50%的錯誤資料。
Abstract
High-quality data is essential to decision support in organizations. However, estimates have shown that 15-20% of the data in an organization’s databases can be erroneous. Some databases contain a large number of errors, which can cause serious problems when the data is used for managerial decision-making. To improve data quality, data cleaning efforts are needed, and many organizations have already initiated them. Broadly, data quality problems fall into three categories: incompleteness, inconsistency, and incorrectness. Of the three, data incorrectness is the major source of low-quality data. This research therefore focuses on error detection as a means of improving data quality. Based on the semantic constraint framework, we developed a set of error detection methods: uniqueness detection, domain detection, attribute value dependency detection, attribute domain inclusion detection, and entity participation detection. Empirical evaluation results showed that some of the proposed techniques (e.g., uniqueness detection) achieved low miss rates and low false alarm rates. Overall, the error detection methods together identified around 50% of the errors introduced by subjects during the experiments.
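To make the terms in the abstract concrete, the following is a minimal sketch of two of the named checks (uniqueness detection and domain detection) together with the two evaluation measures. It is not the thesis's implementation, which this record does not describe. The function names, the sample records, and the definitions of miss rate (undetected errors divided by actual errors) and false alarm rate (flagged records that are not errors, divided by all flagged records) are assumptions made for illustration.

```python
from collections import Counter


def detect_uniqueness_violations(records, key):
    """Flag every record whose key value occurs more than once (assumed check)."""
    counts = Counter(r[key] for r in records)
    return {i for i, r in enumerate(records) if counts[r[key]] > 1}


def detect_domain_violations(records, attribute, allowed):
    """Flag every record whose attribute value is outside the allowed set."""
    return {i for i, r in enumerate(records) if r[attribute] not in allowed}


def miss_rate(flagged, actual_errors):
    """Share of actual errors that were not flagged (assumed definition)."""
    if not actual_errors:
        return 0.0
    return len(actual_errors - flagged) / len(actual_errors)


def false_alarm_rate(flagged, actual_errors):
    """Share of flagged records that are not actual errors (assumed definition)."""
    if not flagged:
        return 0.0
    return len(flagged - actual_errors) / len(flagged)


# Hypothetical sample data; indices 2, 3 and 4 are the injected errors.
records = [
    {"id": 1, "grade": "A"},
    {"id": 2, "grade": "B"},
    {"id": 2, "grade": "C"},  # duplicated id
    {"id": 4, "grade": "Z"},  # grade outside the domain
    {"id": 5, "grade": "B"},  # wrong but plausible value, invisible to both checks
]
actual_errors = {2, 3, 4}

flagged = detect_uniqueness_violations(records, "id") | detect_domain_violations(
    records, "grade", {"A", "B", "C", "D", "F"}
)

print("flagged indices:", sorted(flagged))
print("miss rate:", miss_rate(flagged, actual_errors))
print("false alarm rate:", false_alarm_rate(flagged, actual_errors))
```

In this sketch the uniqueness check flags both records that share an id, so the correct one of the pair counts as a false alarm, while the plausible-but-wrong grade is missed. That illustrates why constraint-based checks can leave a share of errors undetected.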
目次 Table of Contents
CHAPTER 1 Introduction
1.1 Background
1.2 Research Motivations and Objective
1.3 Organization of the Thesis
CHAPTER 2 Literature Review
2.1 Data Cleaning Techniques for Data Incompleteness
2.2 Data Cleaning Techniques for Data Inconsistency
2.3 Data Cleaning Techniques for Data Incorrectness
CHAPTER 3 Error Detection Methods Based on Semantic Constraint Framework
3.1 Uniqueness Detection
3.2 Attribute Domain Detection
3.3 Attribute Value Dependency Detection
3.4 Domain Inclusion Detection
3.5 Entity Participation Detection
3.6 Summary
CHAPTER 4 Empirical Evaluation
4.1 Database Collection for Evaluation
4.2 Evaluation Criteria
4.3 Evaluation Design
4.4 Profile of Subjects
4.5 Evaluation Results
CHAPTER 5 Conclusions and Future Research Directions
References
電子全文 Fulltext
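This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.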
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
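Availability information for printed theses is relatively complete from academic year 102 onward. To inquire about the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.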
開放時間 available 已公開 available
