Processing and Analyzing Large log from Search Engine Meng Dou 13/9/2012.

1 Processing and Analyzing Large log from Search Engine Meng Dou 13/9/2012

2 Big data: challenges and opportunities. Web-browsing data, social network communications, and sensor data -> behavior data. Google and Facebook, for example, are Big Data companies. Big data processing: extracting useful information that reflects user behavior from massive logs. Instance data management. Data analysis: behavior data (such as web logs) can be used to improve and support business processes. Data mining, process mining, and so on.

3 [Diagram: big data stack] Raw data -> cloud storage for unstructured data: distributed file system (HDFS), key-value databases (HBase, Cassandra, MongoDB). Big data processing: cloud computing (Map/Reduce framework). Big data access: Hive, NoSQL. Analytic applications: BI/reporting, data mining, machine learning. For process mining: raw data and instance data, distributed file system (HDFS), NoSQL (Cassandra), cloud computing (Map/Reduce framework), process mining.

4 Case study: search engine company. Channels: news, page, image, maps, music, navigation. Dataset: 66 million clicks in one month, 2.2 million clicks per day -> generate behavior in 10 minutes. User behavior: visiting path (Referer); search result effectiveness; ads clicking behavior; source and destination of user visits; robot behavior recognition and analysis; visiting page layout; behavior comparison and product improvement; user grouping and recommendation.

5 Data features. The log contains massive information in a well-recorded format. Large scale, with great growth potential. Real-time analysis.

6 Existing tools. Data extraction: XESame, ProM Import. Process mining: ProM. 1) With a large data set, analysis is slow and in most situations it crashes. 2) Offline analysis -> real-time analysis. Cloud storage / non-relational DB -> instance data (XES): extracting data from the cloud.

7 System structure. Log processing -> understandable model: extracting useful information that reflects user behavior from massive logs.

8 Convert raw log to instance data (event log) with Map/Reduce
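The conversion named on this slide can be sketched roughly as follows. This is a minimal single-process simulation of the map, shuffle, and reduce steps, not the presenter's actual job. The tab-separated log layout, the function names, and the sample lines are all assumptions made for illustration; the case key (IP+UA) follows the case definition given on the Process Discovery slide. Splitting one visitor's events into separate visits is left out.

```python
from collections import defaultdict

# Hypothetical raw log layout (an assumption, not the real format):
# timestamp \t ip \t user_agent \t url

def map_line(line):
    """Map step: emit (case_id, (timestamp, activity)) for one log line."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4:
        return  # skip malformed lines
    timestamp, ip, user_agent, url = parts
    case_id = f"{ip}|{user_agent}"  # IP+UA identifies the visitor
    yield case_id, (timestamp, url)

def reduce_case(case_id, events):
    """Reduce step: order one case's events by timestamp into a trace."""
    return case_id, [activity for _, activity in sorted(events)]

def run_job(lines):
    """Simulate map -> shuffle -> reduce in a single process."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            shuffled[key].append(value)  # shuffle: group values by key
    return dict(reduce_case(k, v) for k, v in shuffled.items())

# Made-up sample lines, deliberately out of order and with one bad line.
sample_log = [
    "2012-09-13T10:00:05\t10.0.0.1\tFirefox\t/news",
    "2012-09-13T10:00:01\t10.0.0.1\tFirefox\t/home",
    "2012-09-13T10:00:03\t10.0.0.2\tChrome\t/image",
    "broken line",
]
event_log = run_job(sample_log)
```

On a real cluster the map and reduce functions would run on separate nodes and the framework would perform the grouping; the per-case ordering in the reduce step is what turns scattered clicks into traces.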

10 Hardware: CPU Intel Xeon 2.40 GHz, RAM 2 GB, 14 nodes.

fileSize          | logNum     | OnePCTime             | MapReduceTime | MapNum | ReduceNum
8.84 MB           | 36422      | 5 s, 921 ms           | 7 s           | 3      | 15
65.8 MB           | 218177     | 30 s, 846 ms          | 25 s          | 3      | 15
112 MB            | 772241     | 48 s, 559 ms          | 30 s          | 3      | 15
One day (371 MB)  | 2,200,000  | 2.5 minutes           | 1.3 minutes   | 40     | 15
One week          | 15,000,000 | 20 minutes (expected) | 2.5 minutes   | 280    | 15
One month         | 66,000,000 | 2 hours (expected)    | 6 minutes     | 1200   | 15
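As a rough reading of the timings on this slide, the speedup of Map/Reduce over a single PC can be computed directly. The sketch below uses only the three largest rows; the single-PC times for one week and one month are the slide's own "expected" estimates, not measurements, and the variable names are illustrative.

```python
# Timings from the slide, converted to seconds: (one PC, Map/Reduce).
# The one-week and one-month single-PC values are "expected" estimates.
timings = {
    "one day": (2.5 * 60, 1.3 * 60),
    "one week": (20 * 60, 2.5 * 60),
    "one month": (2 * 60 * 60, 6 * 60),
}

# Speedup = single-PC time divided by Map/Reduce time.
speedups = {name: one_pc / mr for name, (one_pc, mr) in timings.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x")
```

The ratio grows with input size (about 1.9x for one day, 8x for one week, 20x for one month), which is consistent with fixed job overhead mattering less as the log gets larger.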

11 Process discovery. Algorithms: Alpha miner, Heuristic miner, Fuzzy miner, sequence model. One instance (case) is defined as one visit by one visitor, identified by IP+UA or CookieID. The activity definition varies with the analysis requirements.
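The Alpha and Heuristic miners both start from the directly-follows relation between activities inside each case. A minimal sketch of that first step is shown below, together with the standard Heuristics Miner dependency measure, (|a>b| - |b>a|) / (|a>b| + |b>a| + 1). The traces are made up for illustration, and this is not ProM's implementation.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often activity a is immediately followed by b in a case."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

def dependency(counts, a, b):
    """Heuristics Miner dependency measure between activities a and b."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)

# Hypothetical cases: each list is one visit, each item one activity.
traces = [
    ["web", "news", "image"],
    ["web", "news"],
    ["web", "image", "news"],
]
df = directly_follows(traces)
```

Here `dependency(df, "web", "news")` is high because "web" is followed by "news" twice and never the reverse, while "news" and "image" follow each other equally often and get a dependency of zero.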

12 Behavior analysis

User behavior pattern        | Range | Activity                   | Data selection
Interaction between channels | all   | ContentType                |
Web map visiting path        | all   | Referer/URL                |
Webpage layout               | news  | ContentType+PageType+Block | (Channel=news) AND (PageType=19 5)
Webpage layout               | image | ContentType+PageType+Block | (Channel=image) AND (PageType=43 5)
Searching result             | all   |                            |
Behavior grouping            | all   |                            |
Registration                 |       |                            |
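The data selection and activity columns for the webpage layout analysis can be sketched as a filter plus a composite activity name. The field names follow the slide; the record layout, the helper names, and the sample values (including the PageType numbers) are placeholders, since the real values are not fully legible in the transcript.

```python
def select(events, channel, page_type):
    """Data selection: (Channel = channel) AND (PageType = page_type)."""
    return [
        e for e in events
        if e["Channel"] == channel and e["PageType"] == page_type
    ]

def activity_name(event):
    """Activity = ContentType + PageType + Block."""
    return "+".join(str(event[k]) for k in ("ContentType", "PageType", "Block"))

# Hypothetical click records; PageType and Block values are placeholders.
events = [
    {"Channel": "news", "ContentType": "news", "PageType": 1, "Block": "top"},
    {"Channel": "news", "ContentType": "news", "PageType": 2, "Block": "side"},
    {"Channel": "image", "ContentType": "image", "PageType": 1, "Block": "top"},
]

selected = select(events, channel="news", page_type=1)
activities = [activity_name(e) for e in selected]
```

Changing which fields go into `activity_name` is how the same log yields different models, which matches the point that the activity varies with the requirements.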

15 Active visitor’s visiting path

17 Main page

19 Sequence model

21 XES statistics

22 Conclusion. This project is a good way into the data analysis field, combining web data analysis, process mining, and cloud computing technology. Future work: 1) More algorithms and technologies should be applied to this data set. 2) Behavior comparison and user recommendation still need to be done. 3) Can process mining analyze behavior that does not follow a fixed pattern? Also: 1) Log sampling. 2) Detect errors in the logs before applying analysis techniques to them. 3) Extend ProM or XESame with a function for converting data from a key-value database or cloud storage to an event log.

23 Feedback. 1) What are the real questions? 2) Why process mining?

24 Thank you! Meng Dou, 13/9/2012

