National Sun Yat-sen University, thesis/dissertation, Prediction of Potential Customers with Data Streams

Title	Prediction of Potential Customers with Data Streams (利用數據流預測潛在顧客)
Department	Department of Applied Mathematics
Year, semester	Spring semester, Academic Year 107 (2018-2019)	Language	English
Degree	Master	Number of pages	33
Author	Shie, Yi-Jen (謝宜臻)
Advisor	Mei-Hui Guo (郭美惠)
Convenor	Mong-Na Lo Huang (羅夢娜)
Advisory Committee	Shih-Feng Huang (黃士峰); Liang-Ching Lin (林良靖); Sheng-Li Tzeng (曾聖澧)
Date of Exam	2019-07-09	Date of Submission	2019-07-29
Keywords	recall, precision, AUC, prediction accuracy, XGBoost classification, gradient boosting, logistic regression, aggregated features, e-commerce, data streams

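Abstract (translated from the Chinese)
Predicting potential customers is an important issue for the e-commerce industry. This study analyzes the issue using two data sets. The first comes from the Kaggle competition "Acquire Valued Shoppers Challenge"; it contains customers' shopping records and coupon information and is 22GB in size. The main goal is to predict which customers will purchase again after receiving a coupon. The second comes from a Taiwanese e-commerce company; it contains customers' order transaction records and product information. The main goal is to predict, from past purchase records, which customers will go on to buy a new product. We first aggregate the data streams into a new data matrix in which each row represents one customer and the features are obtained by aggregating that customer's past transaction information from the other data streams. For prediction we use logistic regression, gradient boosting, and the XGBoost classifier. The evaluation criteria are AUC, prediction accuracy, recall, and precision. The prediction results give e-commerce companies useful information for keeping their advertising budget in line with their goals.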
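The four evaluation criteria named in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made for illustration. AUC is computed here by its rank interpretation, the fraction of (positive, negative) pairs in which the positive customer receives the higher score, with ties counted as one half.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, precision, recall and AUC for binary labels and scores.

    The threshold of 0.5 is an illustrative assumption, not a value
    taken from the thesis.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "auc": auc(y_true, scores),
    }


# Toy data: 1 = customer bought again, 0 = did not.
labels = [1, 0, 1, 1, 0, 0]
predicted_scores = [0.9, 0.4, 0.6, 0.3, 0.7, 0.2]
print(classification_metrics(labels, predicted_scores))
```

Accuracy, precision, and recall depend on the chosen threshold, whereas AUC depends only on how the scores rank the customers, which is one reason the two kinds of criterion are usually reported together.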
Abstract
Predicting potential customers is an important challenge for e-commerce companies. In this study, we analyze this problem using two data sets. The first data set, from the Kaggle competition "Acquire Valued Shoppers Challenge", contains all the given transactions and is over 22GB in size. The objective is to predict which customers will remain loyal after a promotional period. The second data set is from a local e-commerce company and contains customer purchase history and product information. The objective is to predict which potential customers will buy a new product, given their past purchase information. We first transform the data streams into a new data format in which columns represent aggregated features and rows represent customers. The feature engineering is based on aggregating each customer's past transactions. The prediction models we consider are logistic regression, gradient boosting, and the XGBoost classifier. To evaluate the models, we use the area under the receiver operating characteristic curve (AUC) and prediction accuracy. The results provide useful information for e-commerce companies to keep their advertising budget in line with promotional and marketing goals.
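The transformation from a transaction stream to a customer-by-feature table can be sketched as follows. This is a minimal illustration, not the thesis's feature set: the field names `customer_id`, `amount`, and `category`, and the four aggregated features, are hypothetical. The function reads the stream in a single pass, so the input can be a generator over a file too large to hold in memory.

```python
from collections import defaultdict


def aggregate_stream(transactions):
    """Collapse a stream of transactions into one feature row per customer.

    `transactions` is any iterable of dicts with the (hypothetical) keys
    'customer_id', 'amount' and 'category'.
    """
    running = defaultdict(
        lambda: {"n_transactions": 0, "total_spend": 0.0, "categories": set()}
    )
    for tx in transactions:
        acc = running[tx["customer_id"]]
        acc["n_transactions"] += 1
        acc["total_spend"] += tx["amount"]
        acc["categories"].add(tx["category"])

    # Rows are customers, columns are aggregated features.
    return {
        customer_id: {
            "n_transactions": acc["n_transactions"],
            "total_spend": acc["total_spend"],
            "mean_spend": acc["total_spend"] / acc["n_transactions"],
            "n_categories": len(acc["categories"]),
        }
        for customer_id, acc in running.items()
    }


# Toy stream of three transactions from two customers.
stream = [
    {"customer_id": "A", "amount": 10.0, "category": "food"},
    {"customer_id": "B", "amount": 5.0, "category": "toys"},
    {"customer_id": "A", "amount": 30.0, "category": "toys"},
]
features = aggregate_stream(stream)
print(features["A"])
```

Each resulting row could then be paired with a label (for example, whether the customer purchased again) and passed to a classifier such as logistic regression or gradient boosting.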

Table of Contents
Thesis approval form
Thesis public authorization form
Acknowledgements
Abstract (Chinese)
Abstract (English)
1 Introduction
  1.1 Background
  1.2 Structure of Thesis
2 Machine Learning Model
  2.1 Logistic regression
  2.2 Gradient Boosting
  2.3 Extreme Gradient Boosting
3 Acquire Valued Shoppers Challenge
  3.1 Data Description
  3.2 Feature Engineering
  3.3 Training set and Test set
  3.4 Leave-one-offer-out method
  3.5 Prediction Results
4 Local E-commerce Company
  4.1 Data Description
  4.2 Feature Engineering
  4.3 Training set and Test sets
  4.4 Prediction Results
5 Conclusion


Fulltext
This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined release date. Available for download on campus from 2024-07-29 and off campus from 2024-07-29.
Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. Available from 2024-07-29.
