Google N-Gram Patterns CS 8621 Fall 2007 By Team Flamengo: Darshan Paranjape Bin Lan Anurag Jain Vishnu Pedireddy.


Project Goal Build a co-occurrence network using Google unigram and bi-gram data. Analyze the network layout and minimize the problem by finding disjoint networks. Extract the path information for user query words. Prune the network using unigram and association score cutoff.

System Flow Network Building (File I/O) Network Analysis (Find disjoint network) Path Finding Association Score Cutoff Unigram Cutoff

Data Structures 2 important aspects of co-occurrence network- Nodes & Edges. Data structure “node” stores the information about the unigram data. Data structure “edge” stores the information about the bi-gram data. Implementation choices: Array of structures or Linked list.

Data Structures

    typedef struct edge edge;      /* forward declaration so node can point to edge */

    typedef struct node {
        char *token;               // Word string
        long int freq;             // Unigram frequency
        int index;                 // Location in unigram array
        edge *incoming;            // Starting pointer of incoming linked list
        edge *curr_incoming;       // Current pointer of incoming linked list
        edge *outgoing;            // Starting pointer of outgoing linked list
        edge *curr_outgoing;       // Current pointer of outgoing linked list
        int has_seen;              // Variable used while finding distinct networks
        int is_checked;            // Variable used while finding distinct networks
        int count_outgoing;        // Total number of outgoing edges
        int count_incoming;        // Total number of incoming edges
        long int total_out_weight; // Sum of all outgoing weights
        int incoming_nodes_added;  // Variable used in the beta stage
        int outgoing_nodes_added;  // Variable used in the beta stage
    } node;

    struct edge {
        int index;                 // Location in unigram array
        long int freq;             // Weight associated with edge
        struct edge *next;         // Pointer to next entry in the linked list
        int marked;                // Variable used in beta stage
        int marked_before;         // Variable used in beta stage
    };

Network Building Steps 1) Count the number of unigrams. 2) Memory allocation. 3) Read the unigram file. 4) Bi-gram file distribution. 5) Finding the index from the unigram array. 6) Adding incoming and outgoing edge information.
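The last two steps above can be sketched as follows. This is a minimal sketch, not the team's code: it assumes the unigram array is sorted in strcmp order, uses a reduced version of the node and edge structures, and prepends to each linked list instead of appending through the curr_incoming / curr_outgoing pointers. The names find_index, push_edge and add_bigram are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reduced stand-ins for the deck's structures (only the fields used here). */
typedef struct edge {
    int index;            /* index of the neighbouring unigram */
    long freq;            /* bi-gram frequency */
    struct edge *next;
} edge;

typedef struct node {
    const char *token;
    long freq;
    edge *outgoing;
    edge *incoming;
    int count_outgoing;
    int count_incoming;
    long total_out_weight;
} node;

/* Binary search over an array sorted in strcmp order; returns -1 if absent. */
int find_index(const node *unigrams, int n, const char *token)
{
    assert(unigrams != NULL && token != NULL);
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(token, unigrams[mid].token);
        if (cmp == 0) return mid;
        if (cmp < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}

/* Prepend a new edge; returns the new head, or NULL if allocation fails. */
static edge *push_edge(edge *head, int index, long freq)
{
    edge *e = malloc(sizeof *e);
    if (e == NULL) return NULL;
    e->index = index;
    e->freq = freq;
    e->next = head;
    return e;
}

/* Record the bi-gram "first second freq".
   Returns 0 on success, -1 if a word is unknown or memory runs out. */
int add_bigram(node *unigrams, int n, const char *first,
               const char *second, long freq)
{
    int a = find_index(unigrams, n, first);
    int b = find_index(unigrams, n, second);
    if (a < 0 || b < 0) return -1;

    edge *out = push_edge(unigrams[a].outgoing, b, freq);
    if (out == NULL) return -1;
    unigrams[a].outgoing = out;
    unigrams[a].count_outgoing++;
    unigrams[a].total_out_weight += freq;

    edge *in = push_edge(unigrams[b].incoming, a, freq);
    if (in == NULL) return -1;
    unigrams[b].incoming = in;
    unigrams[b].count_incoming++;
    return 0;
}
```

Each bi-gram costs two binary searches, so a processor that holds m bi-grams over n unigrams does O(m log n) lookups.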

Network Analysis Two-step algorithm: 1. Local analysis (based on the information on each processor). 2. The global result is derived from the local results.

Local Analysis Two checking bits are used during this process (has_seen and is_checked): –First, the function check_all_branch is applied to a node A. Inside the function, node A’s has_seen bit is set because we have “seen” node A. –Second, check_all_branch is applied to all the neighbors of node A. –Finally, node A’s is_checked bit is set because all neighbors of node A have been covered, so node A is “checked”. –During the local analysis, each disjoint network is assigned a unique network ID.
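A minimal sketch of this local pass is shown below. It keeps the names has_seen, is_checked and check_all_branch from the slide; the network_id field, the label_local_networks driver and the reduced structures are assumptions added for illustration. Incoming and outgoing edges are both treated as neighbor links, and the recursion is kept for clarity even though a very large network would need an explicit stack.

```c
#include <assert.h>
#include <stddef.h>

typedef struct edge {
    int index;            /* index of the neighbouring node */
    struct edge *next;
} edge;

typedef struct node {
    edge *incoming;
    edge *outgoing;
    int has_seen;         /* set as soon as the node is reached */
    int is_checked;       /* set once all of its neighbours were visited */
    int network_id;       /* hypothetical field: local network ID */
} node;

/* Visit node i and everything reachable from it, in either edge direction. */
static void check_all_branch(node *nodes, int i, int id)
{
    if (nodes[i].has_seen) return;
    nodes[i].has_seen = 1;
    nodes[i].network_id = id;
    for (edge *e = nodes[i].outgoing; e != NULL; e = e->next)
        check_all_branch(nodes, e->index, id);
    for (edge *e = nodes[i].incoming; e != NULL; e = e->next)
        check_all_branch(nodes, e->index, id);
    nodes[i].is_checked = 1;
}

/* Assign one ID per locally connected group; returns the number of groups. */
int label_local_networks(node *nodes, int n)
{
    assert(nodes != NULL || n == 0);
    int next_id = 0;
    for (int i = 0; i < n; i++)
        if (!nodes[i].has_seen)
            check_all_branch(nodes, i, next_id++);
    return next_id;
}
```

A node with no edges on this processor ends up in a network of its own, which is why the local result is only a partial view.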

Global Analysis After the local analysis, each processor has a unique but incomplete view of the network, based on the edge information it stores. –For example: on CPU A, node 1 is connected to nodes 2, 3, and 5, and their network ID is 5. Meanwhile, on CPU B, node 1 is connected to nodes 2, 6, and 7, and their network ID is 2.

Global Analysis (Cont.) During the global analysis we combine the local results so that the global result reflects the real network layout. –In the previous example, our algorithm tells every processor that network 5 on CPU A is actually connected to network 2 on CPU B. Hence, node 1 is connected not only to nodes 2, 3, and 5, but also to nodes 6 and 7. (Note that the network ID of the same node can differ from CPU to CPU, but the node ID or index is always the same on all CPUs.)
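One way to combine the local results is a union-find (disjoint-set) structure over node indices. The slides do not say which technique the team used, so this is a swapped-in illustration rather than their algorithm. It assumes each CPU reports an array of local network IDs indexed by node, with -1 for nodes that have no edges on that CPU. The names merge_local_results, find_root, unite and MAX_NODES are hypothetical.

```c
#include <assert.h>

#define MAX_NODES 1024            /* illustrative upper bound */

static int parent[MAX_NODES];

static int find_root(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];   /* path halving */
        x = parent[x];
    }
    return x;
}

static void unite(int a, int b)
{
    a = find_root(a);
    b = find_root(b);
    if (a != b) parent[b] = a;
}

/* local_id[c * n + i] is the network ID that CPU c gave node i, or -1 if
   node i has no edges on CPU c. On return, global_id[i] holds a
   representative node index for node i's global network.
   Returns the number of global networks. */
int merge_local_results(const int *local_id, int cpus, int n, int *global_id)
{
    assert(n <= MAX_NODES);
    for (int i = 0; i < n; i++) parent[i] = i;

    for (int c = 0; c < cpus; c++) {
        const int *ids = local_id + c * n;
        for (int i = 0; i < n; i++) {
            if (ids[i] < 0) continue;
            /* Join i with the first earlier node sharing its local ID.
               Quadratic scan, kept simple for the sketch. */
            for (int j = 0; j < i; j++) {
                if (ids[j] == ids[i]) { unite(j, i); break; }
            }
        }
    }

    int count = 0;
    for (int i = 0; i < n; i++) {
        global_id[i] = find_root(i);
        if (global_id[i] == i) count++;
    }
    return count;
}
```

With the example from the slides (nodes 1, 2, 3, 5 labeled 5 on CPU A and nodes 1, 2, 6, 7 labeled 2 on CPU B), nodes 1, 2, 3, 5, 6 and 7 all end up with the same representative, because nodes 1 and 2 appear in both local networks.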

PRINTING METHOD AND OBLIGATIONS To print paths of a specific length with target ‘X’ at the center: 1) We collect X’s disjoint network on the master processor. 2) We then recursively print the paths. Note: 1) We don’t print cyclic paths. (A path is cyclic when the same edge occurs in it more than once.) 2) We print complete paths. (A path is complete if its last node has no parent, no child, or neither.) 3) We do not collect the whole disjoint network for the target, only the part of it needed for the specified length.

Avoiding Self-Loops (Edge-Marking Method) While building up a path to be printed, we mark every edge that has been included in the path. If a marked edge occurs again we skip it and move to the next connected edge, and so on until we find an unmarked edge. In this way we avoid self-loops. Note: 1) We do not mark the last outgoing edge, as marking it would be useless. 2) We do not collect nodes for the last nodes in printed paths.

Printing Complete Paths To print complete paths we use the property that the last node in a complete path has no parent, no child, or neither. If the last node in the path has no child or parent and the path length is less than or equal to the specified length, we print that path.
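The edge-marking idea can be sketched on a simplified problem. The function below walks only outgoing edges from a start node and counts the paths it would print, whereas the deck's method grows paths in both directions around a central target. It is therefore an illustration of the marking and backtracking step under that assumption, not a reimplementation of the team's printer. The name count_paths and the reduced structures are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct edge {
    int index;            /* index of the node this edge leads to */
    long freq;            /* bi-gram frequency */
    struct edge *next;
    int marked;           /* 1 while the edge is part of the current path */
} edge;

typedef struct node {
    const char *token;
    edge *outgoing;
} node;

/* Count forward paths from cur that use at most `remaining` more edges and
   never reuse an edge. A path is counted when it reaches the requested
   length or stops at a node with no children (a complete path). A branch
   whose only continuations are already-marked edges contributes nothing. */
int count_paths(node *nodes, int cur, int remaining)
{
    assert(remaining >= 0);
    if (remaining == 0) return 1;
    if (nodes[cur].outgoing == NULL) return 1;

    int found = 0;
    for (edge *e = nodes[cur].outgoing; e != NULL; e = e->next) {
        if (e->marked) continue;          /* edge already on this path */
        e->marked = 1;
        found += count_paths(nodes, e->index, remaining - 1);
        e->marked = 0;                    /* unmark when backtracking */
    }
    return found;
}
```

Unmarking on the way back is what lets the same edge appear in several different printed paths while still never appearing twice within one path.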

1gm (Vocab) File And 2gm File To Be Read

1gm file:
A 1000
B 2000
F

2gm file to read:
A A 100
A B 200
B A 400
B B 500
F A 600

PRINT PATHS TARGET ‘A’ LENGTH 2 NUMBER OF PROCESSORS 2

Number Of Processors 2 | Target Token A | Length 2
A’s disjoint network distributed among the processors. Each entry is index,frequency, with indices A = 0, B = 1, F = 2.

Processor 1
|A| outgoing: 0,100 -> 1,200
    incoming: 0,100 -> 1,400
|B| outgoing: 0,400
    incoming: 0,200
|F| outgoing: NULL
    incoming: NULL

Processor 2
|A| outgoing: NULL
    incoming: 2,600
|B| outgoing: 1,500
    incoming: 1,500
|F| outgoing: 0,600
    incoming: NULL

Number Of Processors 2 | Target Token A | Length 2 (Before Printing)
Collect A’s disjoint network on Processor 1. Each entry is index,frequency, with indices A = 0, B = 1, F = 2.

Processor 1
|A| outgoing: 0,100 -> 1,200
    incoming: 0,100 -> 1,400 -> 2,600
|B| outgoing: 0,400 -> 1,500
    incoming: 0,200 -> 1,500
|F| outgoing: 0,600
    incoming: NULL

Processor 2
|A| outgoing: NULL
    incoming: 2,600
|B| outgoing: 1,500
    incoming: 1,500
|F| outgoing: 0,600
    incoming: NULL

Print Path For ‘A’ And Length 2
[Figure: the network in graph form and in processor form]

Network in processor form:
|A| outgoing: 0,100 -> 1,200
    incoming: 0,100 -> 1,400 -> 2,600
|B| outgoing: 0,400 -> 1,500
    incoming: 0,200 -> 1,500
|F| outgoing: 0,600
    incoming: NULL

Print Path For ‘A’ And Length 2
[Figure: the full network and A’s length-2 network, with A at the center. Legend: Printed (Unmarked), Might Be Printed (Marked), Not Printed.]

Query Target: A Length: 2

Candidate paths in the order they are explored:
1) A->(100)->A->(100)->A (SKIPPED)
2) B->(400)->A->(100)->A->(100)->A (SKIPPED)
3) B->(400)->A->(100)->A->(100)->B->(400)->A (SKIPPED)
4) B->(400)->A->(100)->A->(100)->B->(500)->B (PRINTED)
5) F->(600)->A->(100)->A->(100)->A (SKIPPED)
6) F->(600)->A->(100)->A->(200)->B->(400)->A (PRINTED)
7) F->(600)->A->(100)->A->(200)->B->(500)->B (PRINTED)
8) A->(200)->B->(400)->A->(100)->A->(100)->A (SKIPPED)
9) A->(200)->B->(400)->A->(100)->A->(200)->B (SKIPPED)
10) A->(200)->B->(400)->A->(200)->B (SKIPPED)
11) B->(500)->B->(400)->A->(100)->A->(100)->A (SKIPPED)
12) B->(500)->B->(400)->A->(100)->A->(200)->B (PRINTED)
13) B->(500)->B->(400)->A->(200)->B->(400)->A (SKIPPED)
14) B->(500)->B->(400)->A->(200)->B->(400)->B (SKIPPED)
15) F->(600)->A->(100)->A->(100)->A (SKIPPED)
16) F->(600)->A->(100)->A->(200)->B (PRINTED)
17) F->(600)->A->(200)->B->(400)->A (PRINTED)
18) F->(600)->A->(200)->B->(500)->B (PRINTED)

SOLUTION (paths printed, in order):
B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B
B->(500)->B->(400)->A->(100)->A->(200)->B
F->(600)->A->(100)->A->(200)->B
F->(600)->A->(200)->B->(400)->A
F->(600)->A->(200)->B->(500)->B

IMPORTANT OBSERVATIONS Paths And Unigram Cut-Off: Unigram Cut-Off is the heart. It lets us reduce the size of the network as much as we want, which lets us run our system even with a small amount of memory. Paths And Associative Cut-Off: If Unigram Cut-Off is the heart, then Associative Cut-Off is the soul. It lets us run the system with large path lengths even when disk space is scarce. It also helps reduce the size of the collected network.

IMPORTANT OBSERVATIONS CONTD. Ques: Which system is suitable for printing? Ans: We can use IBM BladeCenter or SGI Altix 3700 BX2. IBM BladeCenter: It can be used only with high unigram cut-offs and very small associative cut-offs, due to lack of memory and disk space respectively. SGI Altix 3700 BX2: It can be used even with no unigram cut-off, thanks to its large memory, but it still needs very small associative cut-offs because disk space is limited.

TARGET LIST It is hard to pick the best target list for the Google n-gram data, because there is one really big network and many others with only 1, 2, or a few more nodes. If we pick a word from the big network, we might end up with that whole network residing on the master processor. This makes IBM BladeCenter unsuitable for printing because of its memory limitation and forces us to choose ALTIX. On ALTIX, with enough memory, you can probably print paths for any token present in the unigram array and for any length. Note: We might still be short of disk space unless we specify a high unigram cut-off and a very low (e.g. ) associative cut-off. NO PATHS PRINTED FOR LENGTH GREATER THAN HALF THE NUMBER OF EDGES IN THE DISJOINT NETWORK

Cut-Off Frequencies And Association Cut. What is their purpose? 1. To reduce the size of the network built in system memory. 2. Tools to manipulate the structure of the graph.

Cut-Off Frequency. Design Options. 1. Create the array with all the unigrams but the edge information. 2. Create the unigram array with unigrams that are above the cut-off frequency.

Where to plug in? Create the unigram array to reflect the total number of unigrams. Before adding a unigram to the array, check whether it satisfies the unigram cut-off. Before adding edge information (from a bigram), check for the presence of both unigrams using binary search.
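The cut-off check can be sketched as an in-place filter over the unigram array. This is a minimal sketch under two assumptions: "satisfies the cut-off" is taken to mean a frequency greater than or equal to the cut-off, and the array is filtered after loading rather than while reading the file. The unigram type and apply_unigram_cutoff are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct unigram {
    const char *token;
    long freq;
} unigram;

/* Keep only entries with freq >= cutoff, preserving their order.
   Returns the number of entries kept. */
size_t apply_unigram_cutoff(unigram *v, size_t n, long cutoff)
{
    assert(v != NULL || n == 0);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i].freq >= cutoff)
            v[kept++] = v[i];
    return kept;
}
```

Because the filter preserves order, an array that was sorted by token stays sorted, so the later binary search for each bigram word still works. A bigram whose word was filtered out is simply not found and its edge is never added.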

Association Cut. Determines which bigram pairs are included in a path, based on their associative score. Unigram cut: handled during network building. Association cut: handled during path finding.

Where to plug in? It acts as a regulator for path tracking. It prunes the whole subgraph if one of the branches does not satisfy the association cut.

Network Analysis (alternate approach) An approach based on message passing that finds one network completely before moving on to the next. It is useful for obtaining the statistics of the network to which a word belongs, rather than building the whole network.

How does it work? Processor 0 is the master; the rest of the processors are slaves. Every new node found on the master is broadcast. The slaves receive the nodes and perform a localized search. The slaves then broadcast their results, to account for a disjoint network spread over several processors.

How does it work? (Cont.) The master checks whether its local list was updated. If yes, it repeats the iteration from the first step. If no, the whole network has been found.
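The iteration can be illustrated without any message passing by simulating the processors in a single program. In this sketch each processor is just an array of bigram edges, the shared in_network array stands in for the broadcasts, and the loop repeats until a full round adds no new node. It is an assumption-laden model of the control flow, not MPI code. The names pair, local_search and find_network are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct pair {
    int from;             /* index of the first word of a bi-gram */
    int to;               /* index of the second word */
} pair;

/* One localized search: any edge with exactly one endpoint already in the
   network pulls the other endpoint in. Returns 1 if something was added. */
static int local_search(const pair *edges, int m, unsigned char *in_network)
{
    int changed = 0;
    for (int k = 0; k < m; k++) {
        int a = edges[k].from, b = edges[k].to;
        if (in_network[a] != in_network[b]) {
            in_network[a] = 1;
            in_network[b] = 1;
            changed = 1;
        }
    }
    return changed;
}

/* parts[p] and sizes[p] describe the edges held by processor p.
   Fills in_network (n entries) and returns the size of target's network. */
int find_network(const pair *const *parts, const int *sizes, int procs,
                 int n, int target, unsigned char *in_network)
{
    assert(target >= 0 && target < n);
    memset(in_network, 0, (size_t)n);
    in_network[target] = 1;

    int updated = 1;
    while (updated) {                     /* repeat until no list changes */
        updated = 0;
        for (int p = 0; p < procs; p++)
            updated |= local_search(parts[p], sizes[p], in_network);
    }

    int count = 0;
    for (int i = 0; i < n; i++) count += in_network[i];
    return count;
}
```

The loop ends after a round in which no processor adds a node, which corresponds to the master finding its local list unchanged.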

Future Work Combine the network information in a better way while building the network. Faster algorithm to find disjoint networks.

Question & Answer

Thank You ! Enjoy your winter break!