Anatomy of Massive Data Mining
Zhangxi Lin
CAABI, Texas Tech University
FIFE, Southwestern University of Finance & Economics
Cellphone: 18610660375, QQ/WeChat: 155970
http://zlin.ba.ttu.edu
Zhangxi.lin@ttu.edu
2015-06-16

Agenda
Business Data Examples
Review - Data mining procedure
Two-stage predictive modeling
Handling unstructured data
◦ Text Mining: CRM at Alibaba’s B2B Call Center
◦ Sentiment Analysis: Media-Aware Stock Trading Based on Public Web Information
Understanding the nature of human beings in socio-economic context
◦ Cyber Credit Assessment for Internet Finance

Survey
Data processing
1. I know how to cleanse data
2. I know how to do data exploration
3. I know how to fix data quality problems
Data mining
1. I know how to develop a decision tree model
2. I know the principles of classification modeling
3. I know how to calculate Gini or entropy given a decision tree split
4. I know how to use a confusion matrix to assess the performance of a classification model
Tools
1. I can do SAS programming
2. I know how to use SAS Enterprise Miner
3. I know how to use other data mining tools

To conduct good research projects in big data, the following skills are highly recommended:
◦ Data preparation: aggregation, cleansing, conversion, quality checking
◦ Managing massive data with DBMS and DW
◦ Basic data mining skills: classification, clustering, association analysis, and text mining
◦ Understanding of basic algorithms: CHAID, CRT, K-Means, SOM, etc.
◦ Ability to explain data mining results correctly

Advanced data mining techniques
Data quality diagnosis
Handling imbalanced datasets
Handling missing values
Coping with the curse of dimensionality
Multi-stage modeling
Two-stage classification modeling
Model performance assessment

BUSINESS DATA EXAMPLES

Dataset provided by Qiyi Network at Chongqing (tables and their key columns):
◦ Table 1 data_affix: order_sn
◦ Table 2 order_air: order_sn
◦ Table 3 order_air_user: order_sn
◦ Table 4 order_be-selled: order_sn
◦ Table 5 order_caigou: order_sn
◦ Table 6 order_data: order_sn, user_id
◦ Table 7 order_refund: order_sn, refund_id, user_id
◦ Table 8 order_refund_log: order_sn, refund_id, user_id
◦ Table 9 order_rights: order_sn, user_id
◦ Table 10 order_table: order_sn, user_id
◦ Table 11 order_ship: order_sn

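Beijing 1039 Traffic Radio (ad revenue: 3 billion RMB/year)
Traffic data sources, each described by how the data enters the system, its degree of standardization, its location/direction information, and whether it is quantitative or qualitative:
◦ Traffic management bureau or transport commission: road conditions captured by cameras or other means are edited into text and passed to the traffic information center. Standardization: high. Location/direction: accurate. Type: quantitative.
◦ Fixed collection points: the system automatically dials each collection point's landline; the collection point presses the key for [Congested], [Slow], or [Clear] according to road conditions, and the system generates a standardized text message for the traffic information center. Standardization: high. Location/direction: accurate. Type: qualitative.
◦ Floating vehicles: client software preinstalled on phones issued by the traffic radio station periodically reports vehicle driving data; road conditions are inferred from the phone's GPS and the vehicle speed. Standardization: high. Location/direction: accurate (if the phone's GPS is off, location and direction are missing). Type: quantitative.
◦ Traffic reporters: reporters phone in road conditions, and staff at the traffic information center enter them into the system manually from the call. Standardization: high. Location/direction: accurate. Type: qualitative.
◦ Traffic information volunteers: volunteers across the city report road conditions on their own initiative through the traffic radio app or the SMS platform. Standardization: low. Location/direction: reports are not guaranteed to be complete or clear. Type: qualitative.
The data sample provided here is one week of floating-vehicle data (covering both routine traffic conditions and incidents).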

Beijing’s Floating Vehicle Data Data: Location (X, Y) and Time

Taxis in Fuzhou This map is updated every 15 seconds Data: Location (X, Y) and Time

REVIEW - DATA MINING PROCEDURE

Data Mining Process (ISQS 6347, Data & Text Mining)

Types of Attributes (Variables)
There are different types of attributes:
◦ Nominal. Examples: ID numbers, eye color, zip codes
◦ Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
◦ Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
◦ Ratio. Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
◦ Distinctness: = ≠
◦ Order: < >
◦ Addition: + -
◦ Multiplication: * /
Each attribute type adds one property to the previous type:
◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & addition
◦ Ratio attribute: all 4 properties

Discrete and Continuous Attributes
Discrete Attribute
◦ Has only a finite or countably infinite set of values
◦ Examples: zip codes, counts, or the set of words in a collection of documents
◦ Often represented as integer variables
◦ Note: binary attributes are a special case of discrete attributes
Continuous Attribute
◦ Has real numbers as attribute values
◦ Examples: temperature, height, or weight
◦ Practically, real values can only be measured and represented using a finite number of digits
◦ Continuous attributes are typically represented as floating-point variables

Important Characteristics of Structured Data
◦ Dimensionality: curse of dimensionality
◦ Sparsity: only presence counts
◦ Quality: missing values, typos, outliers, etc.
◦ Resolution (frequency): patterns depend on the scale

Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
Definitions of density and of distance between points, which are critical for clustering and outlier detection, become less meaningful.
[Figure: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points, as the number of dimensions grows]
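The experiment described on this slide can be approximated with a short script. This is a minimal sketch, not the original figure's code: it assumes the points are drawn uniformly from the unit hypercube, and it normalizes the max-min gap by the minimum distance and reports it on a log10 scale, which is one common way to present this experiment. The function name `distance_contrast` is made up for illustration.

```python
import math
import random
from itertools import combinations


def distance_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over all pairwise Euclidean distances.

    Assumption: points are sampled uniformly from the unit hypercube.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # 500 points, as on the slide; the list of dimensions is arbitrary.
    for dims in (2, 5, 10, 20, 50):
        contrast = distance_contrast(500, dims)
        print(f"{dims:>3} dims: log10((max - min) / min) = {math.log10(contrast):.2f}")
```

The printed value should shrink as the number of dimensions grows: the nearest and farthest pairs end up at nearly the same distance, which is why distance-based clustering and outlier detection lose discriminating power.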

Dimensionality Reduction
Purpose:
◦ Avoid the curse of dimensionality
◦ Reduce the amount of time and memory required by data mining algorithms
◦ Allow data to be more easily visualized
◦ May help to eliminate irrelevant features or reduce noise
Techniques:
◦ Principal Component Analysis
◦ Singular Value Decomposition
◦ Others: supervised and non-linear techniques

Feature Subset Selection
Another way to reduce the dimensionality of data
Redundant features
◦ Duplicate much or all of the information contained in one or more other attributes
◦ Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
◦ Contain no information that is useful for the data mining task at hand
◦ Example: a student's ID is often irrelevant to the task of predicting the student's GPA
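One simple way to spot redundant numeric features, such as the price and sales-tax example, is to look for pairs with a very high absolute correlation. The sketch below is illustrative only: the slide does not prescribe a method, the 0.95 cutoff and the 8% tax rate are arbitrary assumptions, and the helper names `pearson` and `redundant_pairs` are hypothetical. A correlation check addresses redundancy only; deciding that a feature is irrelevant needs the target variable and domain knowledge.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def redundant_pairs(columns, threshold=0.95):
    """Return pairs of column names whose |correlation| is at least threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(columns[first], columns[second])) >= threshold:
                pairs.append((first, second))
    return pairs


# Made-up data: sales_tax is exactly 8% of price, so it carries no new information.
columns = {
    "price": [10.0, 25.0, 40.0, 55.0, 80.0],
    "sales_tax": [0.8, 2.0, 3.2, 4.4, 6.4],
    "quantity": [3, 1, 4, 1, 5],
}

if __name__ == "__main__":
    print(redundant_pairs(columns))
```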

Data Quality
What are data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
◦ Noise and outliers
◦ Missing values
◦ Duplicate data

Noise
Noise refers to modification of original values
◦ Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on a television screen
[Figure panels: Two Sine Waves; Two Sine Waves + Noise]

Outliers
Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set
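The slide defines outliers but does not name a detection method. As one possible illustration, the sketch below applies the common interquartile-range rule (values more than 1.5 IQR beyond the quartiles); the rule, the multiplier `k`, the function name `iqr_outliers`, and the sample readings are all assumptions introduced here, not part of the original material.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


if __name__ == "__main__":
    # Made-up readings with one value far from the rest.
    readings = [10, 12, 11, 13, 12, 11, 95]
    print(iqr_outliers(readings))
```

This rule only looks at one attribute at a time; an object can be unremarkable on every single attribute and still be an outlier in combination.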

Missing Values
Reasons for missing values
◦ Information is not collected (e.g., people decline to give their age and weight)
◦ Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values
◦ Eliminate data objects
◦ Estimate missing values
◦ Ignore the missing value during analysis
◦ Replace with all possible values (weighted by their probabilities)
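The four handling strategies listed above can be sketched on a toy data set. This is a minimal illustration under stated assumptions: missing values are encoded as `None`, mean imputation stands in for "estimate missing values" (other estimators are equally valid), and all function names and records are invented for the example.

```python
from collections import Counter
from statistics import mean


def eliminate_objects(records):
    """Strategy 1: drop every record that has at least one missing value."""
    return [r for r in records if None not in r.values()]


def available(records, attr):
    """Strategy 3: ignore missing entries and use only the observed values."""
    return [r[attr] for r in records if r[attr] is not None]


def impute_mean(records, attr):
    """Strategy 2: estimate a missing numeric value by the attribute mean."""
    fill = mean(available(records, attr))
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in records]


def weighted_candidates(values):
    """Strategy 4: all observed categories, weighted by relative frequency."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Made-up records; None marks a missing value.
records = [
    {"age": 23, "weight": 70.5},
    {"age": None, "weight": 82.0},
    {"age": 35, "weight": None},
    {"age": 41, "weight": 64.0},
]

if __name__ == "__main__":
    print(eliminate_objects(records))
    print(mean(available(records, "age")))
    print(impute_mean(records, "age"))
    print(weighted_candidates(["red", "blue", None, "red"]))
```

Eliminating objects is the simplest option but discards half of this toy data set, which is why the other strategies are usually preferred when many records are incomplete.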

Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another
◦ Major issue when merging data from heterogeneous sources
Examples:
◦ Same person with multiple email addresses
Data cleaning
◦ Process of dealing with duplicate data issues
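The "same person with multiple email addresses" case can be illustrated with a small merge routine. This is only a sketch: it assumes that a case- and whitespace-normalized name is enough to identify a person, which is rarely safe in practice because distinct people can share a name. The record layout and the function names are hypothetical.

```python
def normalize_name(name):
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def merge_duplicates(records):
    """Merge records that share a normalized name, collecting distinct emails."""
    merged = {}
    for rec in records:
        key = normalize_name(rec["name"])
        entry = merged.setdefault(key, {"name": rec["name"].strip(), "emails": []})
        email = rec["email"].strip().lower()
        if email not in entry["emails"]:
            entry["emails"].append(email)
    return list(merged.values())


# Made-up records, as if merged from two source systems.
records = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann  lee ", "email": "a.lee@example.org"},
    {"name": "Bob Ray", "email": "bob@example.com"},
    {"name": "Ann Lee", "email": "ANN@example.com"},
]

if __name__ == "__main__":
    for person in merge_duplicates(records):
        print(person)
```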

Data Preprocessing Tasks
Main tasks
◦ Sampling
◦ Aggregation
◦ Feature creation
◦ Attribute transformation
◦ Dimensionality reduction
◦ Feature subset selection

The Process of Classification
[Figure: records with categorical and continuous attributes and a class label; Training Set, Learn Classifier, Model, Test Set]
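The flow in this figure (split the labeled records, learn a classifier on the training set, apply the model to the test set) can be sketched with a one-split decision tree chosen by Gini impurity. Everything specific here is an assumption made for illustration: the synthetic data, the 70/30 split, the single continuous attribute, and the function names. It is not the course's own procedure, which uses SAS Enterprise Miner.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def learn_stump(rows):
    """Find the threshold on x that minimizes the weighted Gini of the split.

    rows is a list of (x, label) pairs; returns (threshold, left_label, right_label).
    """
    xs = sorted({x for x, _ in rows})
    if len(xs) < 2:
        raise ValueError("need at least two distinct attribute values")
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, t, majority(left), majority(right))
    return best[1], best[2], best[3]


def predict(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def confusion_matrix(stump, rows):
    """Count (actual, predicted) label pairs."""
    return Counter((y, predict(stump, x)) for x, y in rows)


if __name__ == "__main__":
    rng = random.Random(42)
    rows = []
    for _ in range(300):
        # Synthetic rule: "bad" below 40, "good" otherwise, with 10% label noise.
        x = rng.uniform(0, 100)
        y = "bad" if x < 40 else "good"
        if rng.random() < 0.1:
            y = "good" if y == "bad" else "bad"
        rows.append((x, y))
    train, test = rows[:210], rows[210:]  # 70% training set, 30% test set
    stump = learn_stump(train)
    matrix = confusion_matrix(stump, test)
    correct = sum(n for (actual, pred), n in matrix.items() if actual == pred)
    print("threshold:", round(stump[0], 2))
    print("test accuracy:", round(correct / len(test), 3))
    print(dict(matrix))
```

The confusion matrix is computed on the test set only, so it estimates how the model behaves on records it has not seen during learning.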

Data Mining Tools

SAS Enterprise Miner v13.2
Basics
◦ How to use the application main menu
◦ Using the pop-up menus
◦ Enterprise Miner documentation
◦ Project – Diagram
The SEMMA methodology
◦ Sample
◦ Explore
◦ Modify
◦ Model
◦ Assess

Case: German credit benchmark data set
1000 observations
Clean data
Target variable: “Good_Bad”
Cost: $1 loss for a “false negative” vs. $5 loss for a “false positive”
Prior probability of the target variable: 0.9:0.1 vs. sample probability 0.7:0.3
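The two settings on this slide, unequal misclassification costs and a sample whose class mix differs from the population prior, each change how a predicted probability should be used. The sketch below shows the standard arithmetic for both. It assumes the 0.1 (prior) and 0.3 (sample) figures refer to the same, rarer class, and it keeps "positive" and "negative" abstract because the slide does not say which class is which. The function names are hypothetical, and this is not how SAS Enterprise Miner exposes these settings.

```python
def adjust_for_prior(p_sample, sample_prior, true_prior):
    """Rescale a probability estimated on the sample to the population prior.

    p_sample, sample_prior and true_prior all refer to the same class.
    """
    num = p_sample * true_prior / sample_prior
    den = num + (1 - p_sample) * (1 - true_prior) / (1 - sample_prior)
    return num / den


def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting 'positive' has the lower expected cost.

    Predicting positive costs (1 - p) * cost_fp on average; predicting negative
    costs p * cost_fn. The two are equal at p = cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def decide(p_positive, cost_fp=5.0, cost_fn=1.0):
    """Pick the label with the lower expected cost (slide's $5 / $1 losses)."""
    return "positive" if p_positive > cost_threshold(cost_fp, cost_fn) else "negative"


if __name__ == "__main__":
    # A model trained on the 0.7:0.3 sample outputs 0.5 for the rarer class.
    p = adjust_for_prior(0.5, sample_prior=0.3, true_prior=0.1)
    print("probability under the 0.9:0.1 prior:", round(p, 4))
    print("decision threshold:", round(cost_threshold(5.0, 1.0), 4))
    print("decision at p = 0.9:", decide(0.9))
```

With a 5:1 cost ratio the cut-off moves from the default 0.5 to about 0.83, so the model must be much more certain before it makes the prediction whose error is the expensive one.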

SAS Enterprise Miner

The Analytic Workflow
[Figure: analytic workflow, with the following boxes]
◦ Define analytic objective
◦ Select cases
◦ Extract input data
◦ Validate input data
◦ Repair input data
◦ Apply analysis
◦ Transform input data
◦ Generate deployment methods
◦ Integrate deployment
◦ Gather results
◦ Assess observed results
◦ Refine analytic objective

Open Source Data Mining Software: RapidMiner
RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics.
In a poll by KDnuggets, a data-mining newspaper, RapidMiner ranked second among data mining/analytic tools used for real projects in 2009 and first in 2010.
The RapidMiner project was started in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the Artificial Intelligence Unit of the University of Dortmund.
In 2006, Ingo Mierswa and Ralf Klinkenberg founded the company Rapid-I, which is now the main contributor among the more than 30 international developers working on RapidMiner.

TWO-STAGE PREDICTIVE MODELING

TEXT MINING

SENTIMENT ANALYSIS
