Word Co-occurrence Chapter 3, Lin and Dyer

Why is co-occurrence important? Read Chapter 3. It will help you with Lab2 as well as the final exam, help with future projects, and help you in interviews in big-data analytics. It is a simple method with big impact: co-occurrence is the 2-gram case, and n-grams are its extension (Google has supported this). And of course, how you define co-occurrence is a domain-dependent issue: for text, within a sentence, paragraph, etc.; temporally, co-occurrence within a day, week, or month. More complex models exist, such as Blei's LDA.

Intelligence and Scale of Data Intelligence is a set of discoveries made by federating and processing information collected from diverse sources. Information is a cleansed form of raw data. For statistically significant information we need a reasonable amount of data; for gathering good intelligence we need a large amount of information. As Jim Gray points out in The Fourth Paradigm, an enormous amount of data is generated by millions of experiments and applications. Thus intelligence applications are invariably data-heavy, data-driven, and data-intensive. Data is gathered from the web (public or private, covert or overt) and generated by a large number of domain applications.

Intelligence (or the origins of big-data computing?) The Search for Extra-Terrestrial Intelligence (the seti@home project). The Wow! signal: http://www.bigear.org/wow.htm

Characteristics of intelligent applications Google search: how is it different from the regular search that existed before it? It took advantage of the fact that hyperlinks within web pages form an underlying structure that can be mined to determine the importance of various pages. Restaurant and menu suggestions: instead of "Where would you like to go?", "Would you like to go to CityGrille?" This requires the capacity to learn from previous data: habits, profiles, and other information gathered over time. Inference in a collaborative and interconnected world: Facebook friend suggestions. Large-scale data requiring indexing... Did you know Amazon intends to ship things before you order them?

Review 1: MapReduce Algorithm Design "Simplicity" is the theme: fast, simple operations over a large set of data. Most web, mobile, and internet application data yields to embarrassingly parallel processing. The general idea: you write the Mapper and Reducer (and optionally a Combiner and a Partitioner); the execution framework takes care of the rest. Of course, you configure the splits, the number of reducers, the input path, the output path, etc.
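As a concrete illustration of what the programmer configures, here is a minimal Hadoop driver sketch in Java. The class names (CooccurrenceDriver, PairsMapper, PairsReducer) are hypothetical; the mapper and reducer themselves are sketched after the pseudocode slides below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CooccurrenceDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word co-occurrence");
    job.setJarByClass(CooccurrenceDriver.class);
    job.setMapperClass(PairsMapper.class);     // the Mapper you write
    job.setCombinerClass(PairsReducer.class);  // optional Combiner (safe here: summing is associative)
    job.setReducerClass(PairsReducer.class);   // the Reducer you write
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(4);                  // the # of reducers
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Everything else (splits, scheduling, shuffling) is handled by the framework, exactly as the slide says.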

Review 2: The programmer has NO control over:
-- where a mapper or reducer runs (which node in the cluster)
-- when a mapper or reducer begins or finishes
-- which input key-value pairs are processed by a specific mapper
-- which intermediate key-value pairs are processed by a specific reducer

Review 3: However, what control does a programmer have?
1. The ability to construct complex data structures as keys and values, to store and communicate partial results
2. The ability to execute user-specified code at the beginning of a map or reduce task, and termination code at the end
3. The ability to preserve state in both mappers and reducers across multiple input/intermediate values (counters)
4. The ability to control the sort order and the order of distribution to reducers
5. The ability to partition the key space among the reducers
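A minimal Java sketch of three of these hooks, assuming the org.apache.hadoop.mapreduce API; the class names and the first-letter partitioning scheme are illustrative assumptions, not anything prescribed by the text.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

public class StatefulMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private long recordsSeen = 0;  // state preserved across calls to map() (point 3)

  @Override
  protected void setup(Context context) {
    // user-specified code run once at the beginning of the map task (point 2)
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    recordsSeen++;
    context.getCounter("stats", "records").increment(1);  // a user-defined counter
  }

  @Override
  protected void cleanup(Context context) {
    // termination code run once at the end of the task; recordsSeen is still in scope
  }
}

// Point 5: partition the key space yourself, e.g., by the key's first letter.
class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    return s.isEmpty() ? 0 : (s.charAt(0) % numPartitions);
  }
}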

Let's move on to co-occurrence (Section 3.2). Word counting is not the only example. Another example is the co-occurrence matrix of a large corpus: an n x n matrix where n is the number of unique words in the corpus ("corpora" is the plural). With i and j as row and column indices, cell M(i, j) holds the number of times w(i) co-occurred with w(j). For example, with w(i) = "Basketball" and w(j) = "March", the count on today's Twitter feed would exceed 1000, more than it would have been in December. Let's look at the algorithm; you need this for your Lab2.
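Before the MapReduce versions, here is a tiny in-memory sketch (plain Java, no Hadoop; the sentence and the sentence-wide co-occurrence window are made up for illustration) of what cell M(i, j) counts:

import java.util.HashMap;
import java.util.Map;

public class TinyCooccurrence {
  public static void main(String[] args) {
    String[] tokens = "the dog saw the cat".split(" ");
    // M(i, j): how often word i co-occurs with word j within the sentence
    Map<String, Map<String, Integer>> m = new HashMap<>();
    for (int i = 0; i < tokens.length; i++) {
      for (int j = 0; j < tokens.length; j++) {
        if (i == j) continue;  // a position does not co-occur with itself
        m.computeIfAbsent(tokens[i], k -> new HashMap<>())
         .merge(tokens[j], 1, Integer::sum);
      }
    }
    // e.g., m.get("dog").get("the") == 2, because "the" appears twice
    System.out.println(m);
  }
}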

Word Co-occurrence – Pairs version

class Mapper
  method Map(docid a, doc d)
    for all term w in doc d do
      for all term u in Neighbors(w) do
        Emit(pair (w, u), count 1)          // emit count 1 for each co-occurrence

class Reducer
  method Reduce(pair p, counts [c1, c2, ...])
    s <- 0
    for all count c in counts [c1, c2, ...] do
      s <- s + c                            // sum co-occurrence counts
    Emit(pair p, count s)
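A hedged Java translation of the pairs pseudocode for Hadoop. Encoding the pair (w, u) as the single Text key "w:u" and the neighbor window of 2 are my assumptions; Lin and Dyer use a custom pair Writable, which is the more efficient choice.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairsMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private static final int WINDOW = 2;  // assumed definition of Neighbors(w)

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] terms = value.toString().split("\\s+");
    for (int i = 0; i < terms.length; i++) {
      if (terms[i].isEmpty()) continue;
      int lo = Math.max(0, i - WINDOW);
      int hi = Math.min(terms.length - 1, i + WINDOW);
      for (int j = lo; j <= hi; j++) {
        if (j == i) continue;
        context.write(new Text(terms[i] + ":" + terms[j]), ONE);  // Emit(pair(w, u), 1)
      }
    }
  }
}

class PairsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int s = 0;
    for (IntWritable c : counts) s += c.get();  // sum co-occurrence counts
    context.write(pair, new IntWritable(s));    // Emit(pair p, count s)
  }
}

The pairs version emits one key-value pair per co-occurrence event, so the combiner in the driver sketch above does real work here.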

Word Co-occurrence – Stripes version

class Mapper
  method Map(docid a, doc d)
    for all term w in doc d do
      H <- new AssociativeArray
      for all term u in Neighbors(w) do
        H{u} <- H{u} + 1                    // tally words co-occurring with w
      Emit(term w, stripe H)

class Reducer
  method Reduce(term w, stripes [H1, H2, H3, ...])
    Hf <- new AssociativeArray
    for all stripe H in stripes [H1, H2, H3, ...] do
      Sum(Hf, H)                            // element-wise sum of many small stripes into one
    Emit(term w, stripe Hf)
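A matching hedged sketch of the stripes version, using Hadoop's MapWritable as the associative array H; the window size and whitespace tokenization are the same assumptions as in the pairs sketch.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StripesMapper
    extends Mapper<LongWritable, Text, Text, MapWritable> {
  private static final int WINDOW = 2;  // assumed definition of Neighbors(w)

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] terms = value.toString().split("\\s+");
    for (int i = 0; i < terms.length; i++) {
      MapWritable stripe = new MapWritable();  // H <- new AssociativeArray
      int lo = Math.max(0, i - WINDOW);
      int hi = Math.min(terms.length - 1, i + WINDOW);
      for (int j = lo; j <= hi; j++) {
        if (j == i) continue;
        Text u = new Text(terms[j]);
        IntWritable c = (IntWritable) stripe.get(u);
        stripe.put(u, new IntWritable(c == null ? 1 : c.get() + 1));  // H{u} <- H{u} + 1
      }
      context.write(new Text(terms[i]), stripe);  // Emit(term w, stripe H)
    }
  }
}

class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
  @Override
  protected void reduce(Text term, Iterable<MapWritable> stripes, Context context)
      throws IOException, InterruptedException {
    MapWritable hf = new MapWritable();  // Hf <- new AssociativeArray
    for (MapWritable h : stripes) {
      for (Map.Entry<Writable, Writable> e : h.entrySet()) {  // Sum(Hf, H), element-wise
        Text u = new Text(e.getKey().toString());  // copy: Hadoop may reuse objects
        IntWritable cur = (IntWritable) hf.get(u);
        int add = ((IntWritable) e.getValue()).get();
        hf.put(u, new IntWritable(cur == null ? add : cur.get() + add));
      }
    }
    context.write(term, hf);  // Emit(term w, stripe Hf)
  }
}

A driver for this version would also call job.setMapOutputValueClass(MapWritable.class), since the map output value type differs from the pairs job.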

Run both versions on AWS and evaluate the two approaches.

Summary/Observations
1. Word co-occurrence is proposed as a solution for evaluating association.
2. Two methods are proposed: pairs and stripes.
3. An MR implementation is designed (pseudocode).
4. It is implemented on MR on the Amazon cloud.
5. The approaches are evaluated and their relative performance studied (R2, runtime, scale).

Lab2 Discussion Build an MR data pipeline; all big-data processing is done in MR.
Twitter: get tweets by keyword; clean using MR (NOT with RStudio); analyze using MR.
NYTimes: get news by keyword; clean using MR; analyze using MR.
Common Crawl: get data, filter by keyword using MR, clean using MR, analyze using MR.