Data Mining on the Web via Cloud Computing
COMS E6125 Web Enhanced Information Management
Presented by Hemanth Murthy



Data Mining on the Web via Cloud Computing
Introduction to:
 Web Mining
 Cloud computing infrastructure
 Apache's Hadoop
Web usage mining using Hadoop's HDFS and Map/Reduce technologies

What is Web Mining?
Web mining applies data mining techniques to the Web to discover user patterns, for example:
 what users are looking for on the Internet,
 what type of information users are seeking,
 how data available on the Web can be structured.
Why Web Mining?
 The amount of information available on the Web is enormous.
 It is difficult for users to find and utilize that information.
 It is not easy for content providers to classify and catalog documents.

Types of Web Mining
Web mining comes in three types:
 Web usage mining
 Web content mining
 Web structure mining
Web usage mining applies data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications.
Web content mining is the automatic search of information available online, and involves mining the content of Web data.
Web structure mining is concerned with the description and organization of that content.

More on Web Usage Mining
Preprocessing:
 converts the usage, content, and structure information in the available data sources into a form suitable for mining;
 regarded as the most difficult task in Web usage mining.
Pattern discovery:
 uses algorithms and techniques from data mining, machine learning, statistics, and pattern recognition.
Pattern analysis:
 many redundant rules or patterns are found during the discovery phase;
 the main objective here is to filter out such data, which aids the analysis;
 tools include SQL queries and visualization techniques such as graphing patterns.
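The preprocessing step above can be sketched in plain Python for server access logs in the Common Log Format (a minimal, illustrative sketch: the regex, the 30-minute session timeout, and the function names are assumptions made for this example, not part of the deck):

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Matches entries in the Common Log Format, e.g.:
# 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) \S+'
)

def parse_line(line):
    """Extract (host, timestamp, url) from one log line, or None if malformed."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group("time").split()[0], "%d/%b/%Y:%H:%M:%S")
    return m.group("host"), ts, m.group("url")

def sessionize(lines, timeout=timedelta(minutes=30)):
    """Group requests into per-host sessions, splitting on long gaps (assumed 30 min)."""
    by_host = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec:
            host, ts, url = rec
            by_host[host].append((ts, url))
    sessions = []
    for host, hits in by_host.items():
        hits.sort()
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((host, current))
                current = []
            current.append(url)
        sessions.append((host, current))
    return sessions
```

A real preprocessing stage would also merge content and structure information and handle proxies and caching, which is why it is considered the hardest step.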

Cloud Computing
Use of existing commodity hardware:
 reduces the cost of the services;
 helps teams concentrate on deploying services faster;
 offers more flexibility.
Virtualization is used as the standard deployment object:
 provides an abstraction between the hardware and the computing software;
 enables loose coupling of resources.
Services are delivered over the network.

HDFS - Hadoop Distributed File System
Data-parallel, but processing is sequential: data is processed in a batch-oriented fashion. Data communication happens via the distributed file system, so latency is an issue; HDFS is designed for high throughput rather than low latency. At Facebook, jobs that took more than a day were cut down to less than a day by using Hadoop.

Important characteristics of HDFS
 Resilience to hardware failure.
 Streaming data access.
 Large data sets.
 Moving computation is cheaper than moving data.

Web Mining, HDFS and Map/Reduce
HDFS can serve as the storage backbone for Web mining applications. HDFS replicates data across several nodes in the cluster to ensure robustness and data recovery in case of failure. Map/Reduce is a framework for realizing distributed computing - the compute cloud.
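The Map/Reduce paradigm named above can be illustrated with a toy single-machine simulation in Python (the function names here are invented for illustration; a real Hadoop job distributes the map and reduce tasks across the cluster and is typically written in Java):

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to each input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate values by key (Hadoop does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: count hits per URL from (host, url) log records -
# the kind of aggregation a Web-usage-mining job would run.
def hit_mapper(record):
    host, url = record
    yield url, 1

def sum_reducer(key, values):
    return sum(values)

records = [("1.2.3.4", "/index"), ("5.6.7.8", "/index"), ("1.2.3.4", "/about")]
hits = reduce_phase(shuffle(map_phase(records, hit_mapper)), sum_reducer)
# hits == {"/index": 2, "/about": 1}
```

The same split into a stateless mapper, a framework-managed shuffle, and a per-key reducer is what lets Hadoop scale the computation across commodity nodes.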

Web Mining & HIVE
Hive was developed by the Facebook Data Infrastructure Team to exploit the features of Hadoop's HDFS and Map/Reduce. It is a next-generation infrastructure designed with the goal of providing a data processing system that:
 enables easy data summarization;
 supports ad-hoc querying and analysis of large volumes of data.
It also allows users to embed custom map/reduce functions.
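Hive exposes a SQL-like language (HiveQL) for this kind of summarization. The flavor of query it enables can be illustrated with plain SQL run against an in-memory SQLite table standing in for a hypothetical Hive page-views table (table and column names are made up for this sketch); in Hive itself, the same GROUP BY would be compiled into Map/Reduce jobs over HDFS:

```python
import sqlite3

# An in-memory SQLite table standing in for a (hypothetical) Hive table of page views.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (host TEXT, url TEXT, viewed_on TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("1.2.3.4", "/index", "2010-03-01"),
     ("5.6.7.8", "/index", "2010-03-01"),
     ("1.2.3.4", "/about", "2010-03-02")],
)

# A Hive-style summarization: daily hit counts per URL.
rows = conn.execute(
    "SELECT url, viewed_on, COUNT(*) AS hits "
    "FROM page_views GROUP BY url, viewed_on ORDER BY url"
).fetchall()
# rows == [("/about", "2010-03-02", 1), ("/index", "2010-03-01", 2)]
```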

Web Usage Mining Architecture using HDFS, Map/Reduce and HIVE
How Apache Hadoop can be used in Web usage mining: the system uses HDFS as the storage cloud, the Map/Reduce framework as the compute cloud, and Hive to format and query the data.

Web Usage Mining Architecture

References
 HDFS
 Map/Reduce
 Web Mining: Information and Pattern Discovery on the World Wide Web
 Ashish Thusoo: Hive - A Petabyte Scale Data Warehouse using Hadoop

References
 Dhruba Borthakur: Hadoop Introduction
 Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan: Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data

Thank You!