Processing and Analyzing Large Logs from a Search Engine. Meng Dou, 13/9/2012.

Presentation transcript:

Processing and Analyzing Large Logs from a Search Engine. Meng Dou, 13/9/2012

2 Big data: challenges and opportunities
- Web-browsing data, social network communications, and sensor data -> behavior data. Google and Facebook, for example, are Big Data companies.
- Big data processing: extracting useful information that reflects user behavior from massive logs; instance data management.
- Data analysis: behavior data (like web logs) can be used for improving and supporting business processes, through data mining, process mining and so on.

3 Architecture diagram
- Unstructured data (raw data)
- Cloud storage: Distributed File System (HDFS); key-value databases (HBase, Cassandra, MongoDB)
- Big Data processing: cloud computing (Map/Reduce framework)
- Big Data access: Hive, NoSQL
- Analytic applications: BI/reporting, data mining, machine learning, process mining
Pipeline shown: raw data (Distributed File System, HDFS) -> cloud computing (Map/Reduce framework) -> instance data (NoSQL, Cassandra) -> process mining

Case study: Search Engine Company 4
Channels: News, Page, Image, Maps, Music, Navigation
Dataset: 66 million clicks in one month, 2.2 million clicks per day -> generate behavior in 10 minutes
User behavior:
- Visiting path (Referer)
- Searching result effectiveness
- Ad clicking behavior
- Source and destination of user visits
- Robot behavior recognition and analysis
- Visiting page layout
- Behavior comparison and product improvement
- User grouping and recommendation

Data features 5
- Contains massive information in a well-recorded format
- Large scale, with high growth potential
- Real-time analysis

Existing tools 6
Data extraction: XESame, ProM Import
Process mining: ProM
1) With a large data set, analysis is slow and in most situations it crashes.
2) Analysis is offline only, whereas real-time analysis is wanted.
Extracting data from the cloud: cloud storage / non-relational DB -> instance data (XES)
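For orientation, instance data in XES is an XML document made of traces (cases) that contain events. The following is a minimal sketch of that serialization, not the extraction code used in the talk. The function name `to_xes` and the sample visitor data are hypothetical; the attribute keys `concept:name` and `time:timestamp` are the standard XES ones. A file meant for ProM would also carry extension declarations, which are omitted here.

```python
import xml.etree.ElementTree as ET

def to_xes(traces):
    """Serialize {case_id: [(activity, iso_timestamp), ...]} to a bare XES string."""
    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in traces.items():
        trace = ET.SubElement(log, "trace")
        # The trace-level concept:name holds the case identifier.
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        for activity, timestamp in events:
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
            ET.SubElement(event, "date", {"key": "time:timestamp", "value": timestamp})
    return ET.tostring(log, encoding="unicode")

# Hypothetical sample: one visitor who moves from web search to news.
xes = to_xes({
    "visitor-1": [
        ("web", "2012-09-01T10:00:01+00:00"),
        ("news", "2012-09-01T10:00:05+00:00"),
    ]
})
```

Building the whole tree in memory is only suitable for small samples; at the scale discussed here the traces would have to be written out incrementally.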

System Structure 7
Log processing -> understandable model: extracting useful information that reflects user behavior from massive logs

Converting raw logs to instance data (event logs) with Map/Reduce 8
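The conversion can be pictured as a map step that keys each log line by visitor and a reduce step that orders each visitor's events into a trace. The sketch below simulates this locally in plain Python and is not the job used in the talk. The tab-separated line format, the field names, and the functions `map_line`, `reduce_case` and `run_local` are all assumptions made for illustration; only the use of IP+UA as the case key comes from the slides.

```python
from collections import defaultdict
from datetime import datetime

def map_line(line):
    """Map step: parse one raw log line and emit (case_id, event)."""
    # Hypothetical format: timestamp \t ip \t user_agent \t channel \t url
    ts, ip, ua, channel, url = line.rstrip("\n").split("\t")
    case_id = f"{ip}|{ua}"  # IP+UA identifies the visitor
    event = {"time": datetime.fromisoformat(ts), "activity": channel, "url": url}
    return case_id, event

def reduce_case(case_id, events):
    """Reduce step: order one visitor's events by time to form a trace."""
    return case_id, sorted(events, key=lambda e: e["time"])

def run_local(lines):
    """Stand-in for the shuffle phase: group mapped pairs by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        key, event = map_line(line)
        groups[key].append(event)
    return dict(reduce_case(k, v) for k, v in groups.items())

# Hypothetical sample lines, deliberately out of time order.
raw = [
    "2012-09-01T10:00:05\t1.2.3.4\tFirefox\tnews\t/news/1",
    "2012-09-01T10:00:01\t1.2.3.4\tFirefox\tweb\t/search?q=a",
    "2012-09-01T10:00:03\t5.6.7.8\tChrome\timage\t/image/9",
]
event_log = run_local(raw)
```

This sketch puts all of a visitor's clicks into one trace. Splitting them into separate visits would need an extra rule, such as an inactivity timeout, which the slides do not specify.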


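10
fileSize        | logNum     | OnePCTime             | MapReduceTime | MapNum | ReduceNum
8.84 MB         | ...        | ... s, 921 ms         | 7 s           | ...    | ...
... M           | ...        | ... s, 846 ms         | 25 s          | ...    | ...
... M           | ...        | ... s, 559 ms         | 30 s          | 3      | 15
One day (371M)  | 2,200,000  | 2.5 minutes           | 1.3 minutes   | 40     | 15
One week        | 15,000,000 | 20 minutes (expected) | 2.5 minutes   | 280    | 15
One month       | 66,000,000 | 2 hours (expected)    | 6 minutes     | ...    | ...
CPU: Intel Xeon 2.40 GHz; RAM: 2 GB; 14 nodes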

Process Discovery 11
Miners: Alpha miner, Heuristic miner, Fuzzy miner, Sequence model
One instance (case) is defined as one visit by one visitor, identified by IP+UA or CookieID.
The activity varies with the requirements of the analysis.
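Discovery algorithms such as the Alpha miner and the Heuristic miner start from the directly-follows relation: how often activity b occurs immediately after activity a within a case. The sketch below computes only that relation, not a full miner. The function name `directly_follows` and the sample traces are hypothetical.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often activity b immediately follows activity a within a trace."""
    counts = Counter()
    for trace in traces:
        # Pair each activity with its successor in the same trace.
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

# Hypothetical traces, each a list of activity names for one visit.
traces = [
    ["web", "news", "image"],
    ["web", "news"],
    ["web", "image"],
]
df = directly_follows(traces)
```

Because the activity definition changes with the question being asked, the same function can be reused with traces built from a different activity attribute.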

Behavior analysis 12
User behavior pattern        | Range | Activity                   | Data selection
Interaction between channels | all   | ContentType                |
Web Map visiting path        | all   | Referer/URL                |
Webpage layout               | news  | ContentType+PageType+Block | (Channel=news) AND (PageType=195)
                             | image | ContentType+PageType+Block | (Channel=image) AND (PageType=435)
Searching result             | all   |                            |
Behavior grouping            | all   | Registration               |
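Each row of this table amounts to a filter on the events plus a rule for naming activities. A minimal sketch of both follows, assuming events are dictionaries with `Channel`, `ContentType`, `PageType` and `Block` fields. The helper names and the sample values (including the page types) are hypothetical and are not taken from the data set.

```python
def select_events(events, channel=None, page_type=None):
    """Data selection: keep events matching the given channel and page type."""
    return [
        e for e in events
        if (channel is None or e["Channel"] == channel)
        and (page_type is None or e["PageType"] == page_type)
    ]

def activity_name(event, fields):
    """Activity definition: join the chosen attributes, e.g. ContentType+PageType+Block."""
    return "+".join(str(event[f]) for f in fields)

# Hypothetical events; the field values are placeholders.
events = [
    {"Channel": "news", "ContentType": "news", "PageType": "list", "Block": "top"},
    {"Channel": "news", "ContentType": "news", "PageType": "detail", "Block": "body"},
    {"Channel": "image", "ContentType": "image", "PageType": "list", "Block": "grid"},
]
selected = select_events(events, channel="news", page_type="list")
names = [activity_name(e, ["ContentType", "PageType", "Block"]) for e in selected]
```

Passing `None` for a criterion corresponds to the rows whose range is "all" and whose data selection is empty.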


Active visitor’s visiting path 15


Main page 17


Sequence model 19


XES statistics 21

Conclusion 22
This is a good project for getting into the data analysis field, combining web data analysis, process mining, and cloud computing technology.
Future work:
1 More algorithms and technologies should be applied to this data set.
2 Behavior comparison and user recommendation still need to be accomplished.
3 Can process mining analyze behavior that does not follow a definite pattern?

1 Log sampling.
2 Detect errors in the logs before applying analysis techniques to them.
3 Extend the function for "converting data from key-value database or cloud storage to event log" in ProM or XESame.

Feedback 23
1 What are the real questions?
2 Why process mining?

Thank you! Meng Dou, 13/9/2012