1 Frequent Word Combinations Mining and Indexing on HBase. Hemanth Gokavarapu, Santhosh Kumar Saminathan

2 Introduction • Many projects use HBase to store large amounts of data for distributed computation • Processing this data is a challenge for programmers • Frequent terms help in many machine-learning tasks • e.g. frequently purchased items, frequently asked questions

3 Problem • These HBase projects build indexes over the data • With these indexes, the frequency of a single word is easy to find • Finding the frequency of a combination of words is hard • For example: “cloud computing” • Searching the words separately may return results like “scientific computing” or “cloud platform”

4 Objective • This project focuses on finding the frequency of a combination of words • We use data-mining concepts and the Apriori algorithm • We build on MapReduce and HBase

5 Survey Topics • Apriori Algorithm • HBase • MapReduce

6 Data Mining What is data mining? • The process of analyzing data from different perspectives • Summarizing data into useful information

7 Data Mining How does data mining work? • Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries. What technological infrastructure is needed? Two critical drivers answer this question: • Size of the database • Query complexity

8 Apriori Algorithm • The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules • Association rules are a widely applied data-mining approach • Association rules are derived from frequent itemsets • Apriori performs a level-wise search using the downward-closure property: every subset of a frequent itemset must itself be frequent
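To make the level-wise idea concrete, here is a minimal, self-contained Java sketch of Apriori: frequent (k-1)-itemsets are joined into k-item candidates, which are kept only if they meet minimum support. The tiny in-memory transaction database is an assumed example; a full implementation would also prune candidates whose subsets are infrequent.

```java
import java.util.*;

// Minimal level-wise Apriori sketch over an in-memory transaction database.
public class AprioriSketch {

    // Support of a candidate = number of transactions containing all its items.
    static int support(List<Set<String>> transactions, Set<String> candidate) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(candidate)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("shoes", "jacket", "socks"),
            Set.of("shoes", "jacket"),
            Set.of("shoes", "tie"),
            Set.of("jacket", "tie"));
        int minSupport = 2; // 50% of 4 transactions

        // L1: frequent single items.
        Set<Set<String>> frequent = new HashSet<>();
        Set<String> items = new TreeSet<>();
        db.forEach(items::addAll);
        for (String item : items) {
            Set<String> c = Set.of(item);
            if (support(db, c) >= minSupport) frequent.add(c);
        }

        // Level-wise loop: join frequent (k-1)-sets into k-sets, prune by support.
        while (!frequent.isEmpty()) {
            frequent.forEach(s -> System.out.println(s + " support=" + support(db, s)));
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> a : frequent) {
                for (Set<String> b : frequent) {
                    Set<String> union = new TreeSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1 && support(db, union) >= minSupport) {
                        next.add(union);
                    }
                }
            }
            frequent = next;
        }
    }
}
```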

9 Algorithm Flow

10 Apriori Algorithm & Problem Description If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support. If the minimum confidence is 50%, then the only two rules generated from this 2-itemset with confidence greater than 50% are: Shoes → Jacket (Support = 50%, Confidence = 66%) and Jacket → Shoes (Support = 50%, Confidence = 100%)
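The slide's percentages can be checked with a little arithmetic. The deck does not show the underlying database, so the transaction counts below (4 transactions, Shoes in 3, Jacket in 2, both together in 2) are assumed values chosen to reproduce its numbers.

```java
// Worked check of the slide's support/confidence figures, using assumed counts.
public class RuleMetrics {
    public static void main(String[] args) {
        int total = 4;   // assumed number of transactions
        int shoes = 3;   // transactions containing Shoes
        int jacket = 2;  // transactions containing Jacket
        int both = 2;    // transactions containing both items

        double support = 100.0 * both / total;          // 50%
        double confShoesJacket = 100.0 * both / shoes;  // 66.7% (slide rounds to 66%)
        double confJacketShoes = 100.0 * both / jacket; // 100%

        System.out.printf("support=%.0f%%, conf(Shoes->Jacket)=%.1f%%, conf(Jacket->Shoes)=%.0f%%%n",
                support, confShoesJacket, confJacketShoes);
    }
}
```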

11 Apriori Algorithm Example [Diagram: database D is scanned to produce candidate itemsets C1, C2, C3 and frequent itemsets L1, L2, L3; minimum support = 50%]

12 Apriori Advantages & Disadvantages • ADVANTAGES: Uses the large-itemset property; Easily parallelized; Easy to implement • DISADVANTAGES: Assumes the transaction database is memory-resident; Requires many database scans

13 HBase What is HBase? • The Hadoop database • Non-relational • An open-source, distributed, versioned, column-oriented store • Modeled after Google Bigtable • Runs on top of HDFS (the Hadoop Distributed File System)
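As a rough illustration of how word-combination counts could be stored and read back, here is a minimal sketch using the modern HBase Java client (1.0+); the original project from this era would likely have used the older HTable API. The table name "wordpairs" and column family "f" are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Store and read back one word-pair frequency in HBase.
public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("wordpairs"))) {

            // Row key is the word combination; one column holds its count.
            Put put = new Put(Bytes.toBytes("cloud computing"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(42L));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("cloud computing")));
            long count = Bytes.toLong(result.getValue(Bytes.toBytes("f"), Bytes.toBytes("count")));
            System.out.println("cloud computing -> " + count);
        }
    }
}
```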

14 MapReduce • A framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster) • Processing occurs on data stored either in a filesystem (unstructured) or in a database (structured)
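The canonical MapReduce example is word count, which is also the single-word-frequency case this project generalizes: mappers emit (word, 1) pairs in parallel over input splits, the framework groups by key, and reducers sum the counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Classic Hadoop word count: map emits (word, 1), reduce sums per word.
public class WordCount {

    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit (word, 1) for each token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // total count per word
        }
    }
}
```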

15 MapReduce

16 Mapper and Reducer • Mappers: FrequentItemsMap (finds the combinations and assigns a key value to each combination), CandidateGenMap, AssociationRuleMap • Reducers: FrequentItemsReduce, CandidateGenReduce, AssociationRuleReduce
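The slide names these classes but does not show their bodies; here is one plausible sketch of FrequentItemsMap, emitting each adjacent word pair as the key so the matching reducer (a per-key sum like SumReducer above) can total pair frequencies. The tokenization and pairing rules are assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical FrequentItemsMap: emit (adjacent word pair, 1) for each line.
public class FrequentItemsMap extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            // Key is the two-word combination, e.g. "cloud computing".
            context.write(new Text(words[i] + " " + words[i + 1]), ONE);
        }
    }
}
```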

17 Flow Chart [diagram with Yes/No branches]

18 Schedule • 1 week: talking to the experts at FutureGrid • 1 week: survey of HBase and the Apriori algorithm • 4 weeks: implementing the Apriori algorithm • 2 weeks: testing the code and gathering results

19 Results

20 Conclusion • Execution takes more time on a single node • As the number of mappers increases, performance improves • When the data is very large, single-node execution takes much longer and behaves erratically

21 Screenshot

22 Known Issues • When frequencies are very low in a large data set, the reducer takes more time • e.g. a text paragraph in which words are rarely repeated

23 Future Work • The analysis can be repeated with Twister and other platforms • The algorithm can be extended to other applications that use machine-learning techniques

24 References • http://en.wikipedia.org/wiki/Text_mining • http://en.wikipedia.org/wiki/Apriori_algorithm • http://hbase.apache.org/book/book.html • http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_apriori.html • http://www.codeproject.com/KB/recipes/AprioriAlgorithm.aspx • http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf

25 Questions?

