HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase.


1 HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase

2 Introduction Many projects on HBase create indexes on multiple data sets. Finding the frequency of a single word is easy. Finding the frequency of a combination of words is hard. For example: "cloud computing".

3 Objective This project focuses on finding the frequency of a combination of words. We use data mining concepts and the Apriori algorithm. We will use MapReduce and HBase.

4 Survey Topics Apriori algorithm; HBase; MapReduce

5 Data Mining What is data mining? The process of analyzing data from different perspectives and summarizing it into useful information.

6 Data Mining How does data mining work? Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries. What technological infrastructure is needed? Two critical technological drivers answer this question: the size of the database and the query complexity.

7 Apriori Algorithm Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. Association rules are a widely applied data mining approach. Association rules are derived from frequent itemsets. Apriori uses a level-wise search based on the frequent itemset property: every subset of a frequent itemset must itself be frequent.
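The level-wise search can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the project's implementation: the function name `apriori`, the `min_support` fraction, and the sample transactions are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([i]) for i in items})
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transaction database, used only for illustration.
sample = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(sample, min_support=0.5)
```

Each pass of the `while` loop corresponds to one level: candidates of size k are built only from frequent itemsets of size k-1, which is what keeps the search space small.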

8 Algorithm Flow

9 Apriori Algorithm & Problem Description If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support. If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are: Shoes → Jacket (support = 50%, confidence = 66%) and Jacket → Shoes (support = 50%, confidence = 100%).
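The slide's transaction table did not survive in this transcript, so the sketch below uses a hypothetical four-transaction database chosen only because it reproduces the quoted figures. Support is the fraction of transactions containing an itemset, and the confidence of X → Y is support(X ∪ Y) / support(X).

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical data (an assumption, not the table from the slide).
transactions = [
    {"Shoes", "Skirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Jeans"},
    {"Shirt", "Sweatshirt"},
]

pair_support = support({"Shoes", "Jacket"}, transactions)           # 2 of 4
shoes_to_jacket = confidence({"Shoes"}, {"Jacket"}, transactions)   # 2 of 3
jacket_to_shoes = confidence({"Jacket"}, {"Shoes"}, transactions)   # 2 of 2
```

With this data, Shoes appears in three transactions and Jacket in two, both times together with Shoes, which is why the two rules have different confidence despite sharing the same support.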

10 Apriori Algorithm Example [Figure: repeated scans of database D producing candidate sets C1, C2, C3 and frequent itemsets L1, L2, L3; minimum support = 50%]

11 Apriori Advantages & Disadvantages ADVANTAGES: Uses the large itemset property; easily parallelized; easy to implement. DISADVANTAGES: Assumes the transaction database is memory resident; requires many database scans.

12 HBase What is HBase? The Hadoop database. Non-relational. An open-source, distributed, versioned, column-oriented store. Modeled after Google Bigtable. Runs on top of HDFS (Hadoop Distributed File System).

13 HBase Architecture

14 Map Reduce A framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).
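As a rough illustration of the programming model, the sketch below simulates the map, shuffle, and reduce phases in a single Python process, counting adjacent word pairs such as "cloud computing". It does not use the Hadoop API; the function names and input lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit ((word, next_word), 1) for each adjacent word pair in a line."""
    words = line.lower().split()
    for pair in zip(words, words[1:]):
        yield pair, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(lines):
    mapped = (kv for line in lines for kv in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


# Hypothetical input lines, used only for illustration.
counts = run_job(["cloud computing is fun", "I like cloud computing"])
```

On a real cluster the map calls run in parallel on different nodes and the framework performs the shuffle; only the map and reduce functions are written by the user.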

15 Map Reduce

16 How Combination works cont. The approach is similar to the frequent itemset mining problem, but only adjacent words are mined. The idea is that if a phrase (a combination of words) is frequent, then its subsets (the shorter phrases it contains) are also frequent.
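One way to apply this idea to adjacent words is sketched below, under stated assumptions: text is split on whitespace, a phrase is "frequent" when it occurs at least `min_count` times, and a k-word phrase is counted only if both of its (k-1)-word sub-phrases were frequent at the previous level. The function name, parameters, and sample documents are hypothetical.

```python
from collections import Counter


def frequent_phrases(docs, min_count, max_len=5):
    """Return {phrase_tuple: count} for adjacent-word phrases seen >= min_count times."""
    tokenized = [d.lower().split() for d in docs]
    frequent = {}
    prev = {}
    for k in range(1, max_len + 1):
        counts = Counter()
        for words in tokenized:
            for i in range(len(words) - k + 1):
                gram = tuple(words[i:i + k])
                # Prune: skip phrases whose leading or trailing sub-phrase is infrequent.
                if k > 1 and (gram[:-1] not in prev or gram[1:] not in prev):
                    continue
                counts[gram] += 1
        level = {g: c for g, c in counts.items() if c >= min_count}
        if not level:
            break  # No frequent phrase of length k, so none of length k + 1 either.
        frequent.update(level)
        prev = level
    return frequent


# Hypothetical documents, used only for illustration.
docs = [
    "cloud computing on hbase",
    "cloud computing with map reduce",
    "map reduce on hbase",
]
phrases = frequent_phrases(docs, min_count=2)
```

Because only adjacent words are considered, candidates are contiguous n-grams rather than arbitrary item combinations, so the pruning check needs just two sub-phrases per candidate.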

17 Schedule 1 week: talking to the experts at FutureGrid. 1 week: survey of HBase and the Apriori algorithm. 4 weeks: kick-start the implementation of the Apriori algorithm. 2 weeks: testing the code and getting the results.

18 References http://en.wikipedia.org/wiki/Text_mining http://en.wikipedia.org/wiki/Apriori_algorithm http://hbase.apache.org/book/book.html

19 Questions?

