 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.



Introduction
- Many projects use HBase to store large amounts of data for distributed computation.
- Processing this data is a challenge for programmers.
- Frequent terms are useful in many ways in the field of machine learning.
- Examples: frequently purchased items, frequently asked questions, etc.

Problem
- These HBase projects create indexes on multiple data sets.
- Using these indexes, it is easy to find the frequency of a single word.
- It is hard to find the frequency of a combination of words, for example "cloud computing".
- Searching for the words separately may return results such as "scientific computing" or "cloud platform".

Objective
- This project focuses on finding the frequency of a combination of words.
- We use data mining concepts and the Apriori algorithm.
- We use MapReduce and HBase for the implementation.

Survey Topics
- Apriori algorithm
- HBase
- MapReduce

Data Mining
What is data mining?
- The process of analyzing data from different perspectives
- Summarizing the data into useful information

Data Mining
How does data mining work?
- Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries.
What technology infrastructure is needed? Two critical technological drivers answer this question:
- Size of the database
- Query complexity

Apriori Algorithm
- The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
- Association rule mining is a widely applied data mining approach.
- Association rules are derived from frequent itemsets.
- It uses a level-wise search based on the frequent itemset property: every subset of a frequent itemset must also be frequent.
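The level-wise search described above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the project's MapReduce implementation; the function name `apriori`, the sample `docs`, and the 50% threshold are assumptions chosen for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count step: one pass over the data per level.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical input: each "transaction" is the set of words in one document.
docs = [
    {"cloud", "computing", "platform"},
    {"cloud", "computing"},
    {"scientific", "computing"},
    {"cloud", "platform"},
]
result = apriori(docs, min_support=0.5)
```

With this sample input, {cloud, computing} and {cloud, platform} survive as frequent pairs, while the three-word candidate is pruned because {computing, platform} is not frequent.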

Algorithm Flow

Apriori Algorithm & Problem Description
If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support. If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:
- Shoes → Jacket: Support = 50%, Confidence = 66%
- Jacket → Shoes: Support = 50%, Confidence = 100%
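The support and confidence figures above can be checked with a short calculation. The transaction table from the slide is not part of the transcript, so the four transactions below are a hypothetical data set chosen only because it reproduces the stated numbers; the helper names `support` and `confidence` are likewise illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical transactions (assumption: not taken from the original slide).
transactions = [
    {"Shoes", "Shirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Jeans"},
    {"Shirt", "Sweatshirt"},
]

pair_support = support(transactions, {"Shoes", "Jacket"})
shoes_to_jacket = confidence(transactions, {"Shoes"}, {"Jacket"})
jacket_to_shoes = confidence(transactions, {"Jacket"}, {"Shoes"})
```

Here {Shoes, Jacket} appears in 2 of 4 transactions (support 50%). Shoes appears in 3 transactions, so Shoes → Jacket has confidence 2/3, which the slide shows as 66%; Jacket appears in 2 transactions, both with Shoes, so Jacket → Shoes has confidence 100%.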

Apriori Algorithm Example
[Figure: database D is scanned repeatedly to build candidate sets C1, C2, C3 and frequent itemsets L1, L2, L3; minimum support = 50%]

Apriori Advantages & Disadvantages
Advantages:
- Uses the large itemset property
- Easily parallelized
- Easy to implement
Disadvantages:
- Assumes the transaction database is memory resident
- Requires many database scans

HBase
What is HBase?
- A Hadoop database
- Non-relational
- Open-source, distributed, versioned, column-oriented store
- Modeled after Google Bigtable
- Runs on top of HDFS (Hadoop Distributed File System)
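To make "versioned, column-oriented" concrete, here is a toy in-memory model of how a cell is addressed by row key, column (family:qualifier), and timestamp. It is only a sketch of the data model, not the HBase client API; the class `ToyTable`, the column name `freq:count`, and the use of a word combination as the row key are all assumptions for illustration.

```python
class ToyTable:
    """Toy model of a versioned, column-oriented table (not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = {}

    def put(self, row, column, value, timestamp):
        cells = self._rows.setdefault(row, {}).setdefault(column, {})
        cells[timestamp] = value

    def get(self, row, column, versions=1):
        """Return up to `versions` values for the cell, newest first."""
        cells = self._rows.get(row, {}).get(column, {})
        newest_first = sorted(cells.items(), reverse=True)
        return [value for _, value in newest_first[:versions]]


# Hypothetical schema: the row key is a word combination, the cell is its count.
table = ToyTable()
table.put("cloud computing", "freq:count", 10, timestamp=1)
table.put("cloud computing", "freq:count", 12, timestamp=2)

latest = table.get("cloud computing", "freq:count")
history = table.get("cloud computing", "freq:count", versions=2)
```

A read returns the newest version by default, and older versions stay available on request, which is the behavior the "versioned" bullet refers to.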

Map Reduce
- A framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster).
- Processing occurs on data stored either in a filesystem (unstructured) or in a database (structured).
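The map, shuffle, and reduce phases can be imitated on one machine to show the programming model. This is a simplified sketch, assuming a single process and in-memory data; the names `map_reduce`, `word_mapper`, and `sum_reducer` are illustrative and do not come from Hadoop.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over each record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    # Map phase plus shuffle: collect every emitted value under its key.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    return sum(values)


lines = ["cloud computing", "scientific computing", "cloud platform"]
counts = map_reduce(lines, word_mapper, sum_reducer)
```

On a real cluster the map calls run in parallel on different nodes and the framework handles the grouping step; only the mapper and reducer functions are written by the programmer.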

Map Reduce

Mapper and Reducer
Mappers:
- FrequentItemsMap: finds the combinations and assigns the key-value pair for each combination
- CandidateGenMap
- AssociationRuleMap
Reducers:
- FrequentItemsReduce
- CandidateGenReduce
- AssociationRuleReduce
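The first mapper and reducer pair might look roughly like the following. The slides do not show the code, so this is a guess at the general shape: the function names mirror the slide, but their bodies, the `size` and `min_count` parameters, and the choice to treat a "combination" as words co-occurring in the same document (rather than adjacent words) are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def frequent_items_map(document, size=2):
    """Emit (combination, 1) for each word combination found in a document."""
    words = sorted(set(document.lower().split()))
    for combo in combinations(words, size):
        yield combo, 1


def frequent_items_reduce(combo, counts, min_count):
    """Sum the counts for one combination; drop it if below min_count."""
    total = sum(counts)
    return (combo, total) if total >= min_count else None


def run(documents, size=2, min_count=2):
    # Stand-in for the framework's shuffle step: group values by key.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in frequent_items_map(doc, size):
            grouped[key].append(value)
    results = {}
    for combo, counts in grouped.items():
        out = frequent_items_reduce(combo, counts, min_count)
        if out is not None:
            results[out[0]] = out[1]
    return results


# Hypothetical input documents.
documents = [
    "cloud computing on hbase",
    "cloud computing platform",
    "scientific computing",
]
frequent_pairs = run(documents, size=2, min_count=2)
```

Sorting the words before forming combinations gives each combination a single canonical key, so ("cloud", "computing") and ("computing", "cloud") are counted together.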

Flow Chart
[Figure: flow chart with Yes/No decision branches]

Schedule
- 1 week: talking to the experts at FutureGrid
- 1 week: survey of HBase and the Apriori algorithm
- 4 weeks: kick-starting the implementation of the Apriori algorithm
- 2 weeks: testing the code and collecting the results

Results

Conclusion
- Execution takes longer on a single node.
- Performance improves as the number of mappers increases.
- When the data is very large, single-node execution takes more time and behaves erratically.

Screenshot

Known Issues
- When word frequencies are very low in a large data set, the reducer takes more time.
- Example: a text paragraph in which words are not repeated often.

Future Work
- The analysis can be repeated with Twister and other platforms.
- The algorithm can be extended to other applications that use machine learning techniques.

Questions?