
Google N-Gram Patterns CS 8621 Fall 2007 By Team Flamengo: Darshan Paranjape Bin Lan Anurag Jain Vishnu Pedireddy.


1 Google N-Gram Patterns CS 8621 Fall 2007 By Team Flamengo: Darshan Paranjape Bin Lan Anurag Jain Vishnu Pedireddy

2 Project Goal Build a co-occurrence network using Google unigram and bi-gram data. Analyze the network layout and minimize the problem by finding disjoint networks. Extract the path information for user query words. Prune the network using unigram and association score cutoff.

3 System Flow Network Building (File I/O) Network Analysis (Find disjoint network) Path Finding Association Score Cutoff Unigram Cutoff

4 Data Structures Two important aspects of a co-occurrence network: nodes and edges. Data structure “node” stores the information about the unigram data. Data structure “edge” stores the information about the bi-gram data. Implementation choices: array of structures or linked list.

5 Data Structures

typedef struct edge edge;       // Forward declaration so that node can hold edge pointers

typedef struct node {
    char *token;                // Word string
    long int freq;              // Unigram frequency
    int index;                  // Location in unigram array
    edge *incoming;             // Starting pointer of incoming linked list
    edge *curr_incoming;        // Current pointer of incoming linked list
    edge *outgoing;             // Starting pointer of outgoing linked list
    edge *curr_outgoing;        // Current pointer of outgoing linked list
    int has_seen;               // Variable used while finding distinct networks
    int is_checked;             // Variable used while finding distinct networks
    int count_outgoing;         // Total number of outgoing edges
    int count_incoming;         // Total number of incoming edges
    long int total_out_weight;  // Sum of all outgoing weights
    int incoming_nodes_added;   // Variable used in the beta stage
    int outgoing_nodes_added;   // Variable used in the beta stage
} node;

struct edge {
    int index;                  // Location in unigram array
    long int freq;              // Weight associated with edge
    struct edge *next;          // Pointer to next entry in the linked list
    int marked;                 // Variable used in beta stage
    int marked_before;          // Variable used in beta stage
};

6 Network Building Steps Count the number of unigrams. Memory Allocation. Read the unigram file. Bi-gram File Distribution. Finding index from Unigram array. Adding incoming and outgoing edge information.
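The last two building steps (finding the index in the unigram array, then adding incoming and outgoing edge information) can be sketched as follows. This is a minimal illustration, not the team's code: it assumes the unigram array is sorted by token in strcmp order, it uses a trimmed-down version of the node and edge structures, and the helper names find_index, push_edge and add_bigram are hypothetical. New edges are prepended to the lists for brevity, whereas the original structures keep curr_incoming and curr_outgoing pointers, which suggests appending.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct edge {
    int index;              /* index of the node at the other end */
    long freq;              /* bigram frequency (edge weight) */
    struct edge *next;
} edge;

typedef struct node {
    const char *token;
    long freq;
    edge *incoming;
    edge *outgoing;
    long total_out_weight;
} node;

/* Binary search over an array sorted by token; returns -1 if absent. */
int find_index(const node *nodes, int n, const char *token) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strcmp(token, nodes[mid].token);
        if (c == 0) return mid;
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}

/* Prepend a new edge to a list and return the new head. */
static edge *push_edge(edge *head, int index, long freq) {
    edge *e = malloc(sizeof *e);
    assert(e != NULL);
    e->index = index;
    e->freq = freq;
    e->next = head;
    return e;
}

/* Record the bigram "w1 w2 freq" on both endpoints.
   Returns 0 on success, -1 if either word is not in the unigram array. */
int add_bigram(node *nodes, int n, const char *w1, const char *w2, long freq) {
    int i = find_index(nodes, n, w1);
    int j = find_index(nodes, n, w2);
    if (i < 0 || j < 0) return -1;
    nodes[i].outgoing = push_edge(nodes[i].outgoing, j, freq);
    nodes[i].total_out_weight += freq;
    nodes[j].incoming = push_edge(nodes[j].incoming, i, freq);
    return 0;
}
```

For example, with the unigrams A, B and F loaded in that order, add_bigram(nodes, 3, "A", "B", 200) adds an outgoing edge on A and an incoming edge on B.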

7 Network Analysis Two-step algorithm: 1. Local analysis (based on the information on each processor). 2. The global result is derived from the local results.

8 Local Analyze Two checking bits are used during this process (has_seen and is_checked): –First, function check_all_branch is applied to a node A. Inside the function, node A’s has_seen bit is set because we have “seen” node A. –Second, function check_all_branch is applied to all the neighbors of node A. –Finally, node A’s is_checked bit is set because we have covered all neighbors of node A, so node A is “checked”. –During the local analysis, each disjoint network is assigned a unique network ID.
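The local analysis above can be sketched as a recursive traversal. This is an illustration under assumptions, not the original implementation: the network_id field and the label_networks driver are hypothetical, edges are followed in both directions so that a network is a weakly connected group of nodes, and the structures are trimmed to the fields the traversal needs. On a network with millions of nodes the recursion would likely need to be replaced by an explicit stack.

```c
#include <assert.h>
#include <stddef.h>

typedef struct edge { int index; struct edge *next; } edge;

typedef struct node {
    edge *incoming, *outgoing;
    int has_seen;       /* set as soon as the node is reached */
    int is_checked;     /* set once all of its neighbors are covered */
    int network_id;     /* hypothetical field: ID of the disjoint network */
} node;

/* Mark node i as seen, visit every neighbor, then mark it as checked. */
static void check_all_branch(node *nodes, int i, int id) {
    if (nodes[i].has_seen) return;
    nodes[i].has_seen = 1;
    nodes[i].network_id = id;
    for (edge *e = nodes[i].outgoing; e; e = e->next)
        check_all_branch(nodes, e->index, id);
    for (edge *e = nodes[i].incoming; e; e = e->next)
        check_all_branch(nodes, e->index, id);
    nodes[i].is_checked = 1;
}

/* Assign a fresh ID to every network found locally; returns the count. */
int label_networks(node *nodes, int n) {
    assert(n >= 0);
    int next_id = 0;
    for (int i = 0; i < n; i++)
        if (!nodes[i].has_seen)
            check_all_branch(nodes, i, next_id++);
    return next_id;
}
```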

9 Global Analyze After the local analysis, each processor has a unique but incomplete view of the network, based on the edge information it stores. –For example: on CPU A, node 1 is connected to nodes 2, 3, and 5, and their network ID is 5. Meanwhile, on CPU B, node 1 is connected to nodes 2, 6, and 7, and their network ID is 2.

10 Global Analyze (Cont.) So what we need to do during the global analysis is to combine the local results so that the global result reflects the real network layout. –In the previous example, our algorithm basically tells every processor that network 5 on CPU A is actually connected to network 2 on CPU B. Hence, node 1 is connected not only to nodes 2, 3, and 5, but also to nodes 6 and 7. (Note that although the network ID may differ for the same node on different CPUs, the node ID or index is always the same on all CPUs.)
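One common way to combine such local results is a union-find (disjoint-set) structure keyed by the node index, which is the same on every CPU. The sketch below is an assumption about how the merge could be done, not a description of the team's message-passing code: the names uf_init, uf_find, uf_union and merge_local are hypothetical, each CPU's result is modeled as an array that maps node index to local network ID (-1 when the CPU holds no edge for that node), and local IDs are assumed to be smaller than MAX_NODES.

```c
#include <assert.h>

#define MAX_NODES 16            /* illustrative limit for this sketch */

static int parent[MAX_NODES];

void uf_init(int n) {
    assert(n >= 0 && n <= MAX_NODES);
    for (int i = 0; i < n; i++) parent[i] = i;
}

int uf_find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];   /* path halving */
        x = parent[x];
    }
    return x;
}

void uf_union(int a, int b) {
    a = uf_find(a);
    b = uf_find(b);
    if (a != b) parent[b] = a;
}

/* Merge one CPU's local result into the global view: all nodes that share
   a local network ID on that CPU end up in the same global network. */
void merge_local(const int *local_id, int n) {
    int first[MAX_NODES];       /* first node seen for each local ID */
    for (int k = 0; k < MAX_NODES; k++) first[k] = -1;
    for (int i = 0; i < n; i++) {
        int id = local_id[i];
        if (id < 0) continue;   /* this CPU knows nothing about node i */
        assert(id < MAX_NODES);
        if (first[id] < 0) first[id] = i;
        else uf_union(first[id], i);
    }
}
```

With the example from the slides, merging CPU A's result (nodes 1, 2, 3, 5 with ID 5) and then CPU B's result (nodes 1, 2, 6, 7 with ID 2) places nodes 1, 2, 3, 5, 6 and 7 in one global network, because nodes 1 and 2 appear in both local networks.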

11 PRINTING METHOD AND OBLIGATIONS --------------------------------------------------------- To print paths of a specified length with target ‘X’ at the center: 1) We collect X’s disjoint network on the master processor. 2) We then print the paths recursively. Note: 1) We don’t print cyclic paths. (A path is cyclic when the same edge occurs in it more than once.) 2) We print complete paths. (A path is complete if its last node has no parent, no child, or neither.) 3) We do not collect the whole disjoint network for the target, only the part of it that the specified length requires.

12 Avoiding Self-Loops (Edge-Marking Method) ----------------------------------------------------------- In this method, while building up a path to be printed, we mark every edge that has been included in the path. If a marked edge occurs again, we skip it and move on to the next connected edge, and so on until we find an unmarked edge. In this way we avoid self-loops. Note: 1) We do not mark the last outgoing edge, because marking it serves no purpose. 2) We do not collect nodes for the last nodes in printed paths. Printing Complete Paths -------------------------------- To print the complete paths we take advantage of the property that the last node in a complete path has no parent, no child, or neither. If the last node in the path has no child or no parent (or neither) and the path length is less than or equal to the specified length, we print that path.
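The edge-marking idea can be sketched for the forward (outgoing) half of a path as follows. This is a simplified illustration, not the original printing routine: it only extends paths from a start node along outgoing edges, it treats a path as finished when it reaches the requested length or a node with no outgoing edges, and the names extend, print_forward_paths and MAX_LEN are hypothetical. An edge is marked while it is on the current path and unmarked when the search backtracks.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_LEN 8               /* illustrative cap on the path length */

typedef struct edge {
    int index;                  /* index of the destination node */
    long freq;                  /* edge weight */
    struct edge *next;
    int marked;                 /* 1 while the edge is on the current path */
} edge;

typedef struct node { const char *token; edge *outgoing; } node;

/* Extend the path that currently ends at node cur.
   Returns the number of paths printed below this point. */
static int extend(node *nodes, int cur, int depth, int length,
                  int *path, long *w) {
    path[depth] = cur;
    if (depth == length || nodes[cur].outgoing == NULL) {
        printf("%s", nodes[path[0]].token);
        for (int k = 1; k <= depth; k++)
            printf("->(%ld)->%s", w[k], nodes[path[k]].token);
        printf("\n");
        return 1;
    }
    int printed = 0;
    for (edge *e = nodes[cur].outgoing; e; e = e->next) {
        if (e->marked) continue;        /* edge already on this path: skip */
        e->marked = 1;
        w[depth + 1] = e->freq;
        printed += extend(nodes, e->index, depth + 1, length, path, w);
        e->marked = 0;                  /* unmark when backtracking */
    }
    return printed;
}

int print_forward_paths(node *nodes, int start, int length) {
    assert(length >= 0 && length <= MAX_LEN);
    int path[MAX_LEN + 1];
    long w[MAX_LEN + 1];
    return extend(nodes, start, 0, length, path, w);
}
```

On a network with edges A->A (100), A->B (200), B->A (400), B->B (500) and F->A (600), starting from A with length 2 this prints three forward paths, and the candidate A->(100)->A->(100)->A is skipped because it would reuse the A->A edge.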

13 1gm (Vocab) File and 2gm File To Be Read

1gm File
----------
A 1000
B 2000
F 3000

2gm file to read
--------------------
A A 100
A B 200
B A 400
B B 500
F A 600

14 PRINT PATHS TARGET ‘A’ LENGTH 2 NUMBER OF PROCESSORS 2

15 Number Of Processors 2 | Target Token A | Length 2
---------------------------------------------------------------------------
A’s disjoint network distributed among the processors.

Processor 1                          Processor 2
----------------                     -----------------
|A| -> 0,100 -> 1,200 (Outgoing)     |A| -> NULL  (Outgoing)
    -> 0,100 -> 1,400 (Incoming)         -> 2,600 (Incoming)
|B| -> 0,400                         |B| -> 1,500
    -> 0,200                             -> 1,500
|F| -> NULL                          |F| -> 0,600
    -> NULL                              -> NULL

INDICES   1gm File
-------   --------
0         A 1000
1         B 2000
2         F 3000

2gm file to read
--------------------
A A 100
A B 200
B A 400
B B 500
F A 600

16 Number Of Processors 2 | Target Token A | Length 2 (Before Printing)
--------------------------------------------------------------------------------------------------
Collect A’s Disjoint Network On Processor 1

Processor 1                                   Processor 2
----------------                              -----------------
|A| -> 0,100 -> 1,200 (Outgoing)              |A| -> NULL  (Outgoing)
    -> 0,100 -> 1,400 -> 2,600 (Incoming)         -> 2,600 (Incoming)
|B| -> 0,400 -> 1,500                         |B| -> 1,500
    -> 0,200 -> 1,500                             -> 1,500
|F| -> 0,600                                  |F| -> 0,600
    -> NULL                                       -> NULL

INDICES   1gm File
-------   --------
0         A 1000
1         B 2000
2         F 3000

2gm file to read
--------------------
A A 100
A B 200
B A 400
B B 500
F A 600

17 Print Path For ‘A’ And Length 2

NETWORK PROCESSOR FORM

Processor 1
----------------
|A| -> 0,100 -> 1,200 (Outgoing)
    -> 0,100 -> 1,400 -> 2,600 (Incoming)
|B| -> 0,400 -> 1,500
    -> 0,200 -> 1,500
|F| -> 0,600
    -> NULL

NETWORK GRAPH FORM [figure: the same network drawn as a graph over nodes A, B and F]

18 Print Path For ‘A’ And Length 2 [figure with two panels: NETWORK, and A’s LENGTH 2 NETWORK]

19 Print Path For ‘A’ And Length 2. Query Target: A, Length: 2, with A as the center node. [Figure on slides 19 to 34: A’s length-2 network, with a legend for edges that are Printed (Unmarked), Might Be Printed (Marked), or Not Printed.] Candidate: A->(100)->A->(100)->A (SKIPPED). Solution: empty.

20 Candidate: B->(400)->A->(100)->A->(100)->A (SKIPPED). Solution: unchanged.

21 Candidate: B->(400)->A->(100)->A->(200)->B->(400)->A (SKIPPED). Solution: unchanged.

22 Candidate: B->(400)->A->(100)->A->(200)->B->(500)->B (PRINTED). Added to the solution as path 1.

23 Candidate: F->(600)->A->(100)->A->(100)->A (SKIPPED). Solution: unchanged.

24 Candidate: F->(600)->A->(100)->A->(200)->B->(400)->A (PRINTED). Added to the solution as path 2.

25 Candidate: F->(600)->A->(100)->A->(200)->B->(500)->B (PRINTED). Added to the solution as path 3.

26 Candidates: A->(200)->B->(400)->A->(100)->A->(100)->A (SKIPPED) and A->(200)->B->(400)->A->(100)->A->(200)->B (SKIPPED). Solution: unchanged.

27 Candidate: A->(200)->B->(400)->A->(200)->B (SKIPPED). Solution: unchanged.

28 Candidate: B->(500)->B->(400)->A->(100)->A->(100)->A (SKIPPED). Solution: unchanged.

29 Candidate: B->(500)->B->(400)->A->(100)->A->(200)->B (PRINTED). Added to the solution as path 4.

30 Candidates: B->(500)->B->(400)->A->(200)->B->(400)->A (SKIPPED) and B->(500)->B->(400)->A->(200)->B->(500)->B (SKIPPED). Solution: unchanged.

31 Candidate: F->(600)->A->(100)->A->(100)->A (SKIPPED). Solution: unchanged.

32 Candidate: F->(600)->A->(100)->A->(200)->B (PRINTED). Added to the solution as path 5.

33 Candidate: F->(600)->A->(200)->B->(400)->A (PRINTED). Added to the solution as path 6.

34 Candidate: F->(600)->A->(200)->B->(500)->B (PRINTED). Added to the solution as path 7.

SOLUTION (Query Target: A, Length: 2)
1. B->(400)->A->(100)->A->(200)->B->(500)->B
2. F->(600)->A->(100)->A->(200)->B->(400)->A
3. F->(600)->A->(100)->A->(200)->B->(500)->B
4. B->(500)->B->(400)->A->(100)->A->(200)->B
5. F->(600)->A->(100)->A->(200)->B
6. F->(600)->A->(200)->B->(400)->A
7. F->(600)->A->(200)->B->(500)->B

35 IMPORTANT OBSERVATIONS Paths And Unigram Cut-Off ------------------------------------- Unigram Cut-Off is the heart. It lets us reduce the size of the network as much as we want, which lets us run our system even with a small amount of memory. Paths And Associative Cut-Off ---------------------------------------- If Unigram Cut-Off is the heart, then Associative Cut-Off is the soul. It lets us run the system with large path lengths even when disk space is short. It also reduces the size of the collected network.

36 IMPORTANT OBSERVATIONS CONTD. Ques: Which system is suitable for printing? Ans: We can use IBM BladeCenter or SGI Altix 3700 BX2. IBM BladeCenter ----------------------- It can be used only with high unigram cut-offs and very small associative cut-offs, because of its lack of memory and disk space respectively. SGI Altix 3700 BX2 -------------------------- It can be used even with no unigram cut-off, because it has plenty of memory, but it still needs very small associative cut-offs because its disk space is limited.

37 TARGET LIST -------------------- It is hard to pick the best target list for the Google n-gram data, because there is one really big network with 13163490 nodes, while the other networks are very small (1, 2, or a few nodes). If we pick a word from the big network, we might end up with a very large network residing on the master processor. This makes IBM BladeCenter unsuitable for printing because of its memory limitation and forces us to choose ALTIX. On ALTIX, with enough memory, you can probably print paths for any token present in the unigram array and for any length. Note: We might still run short of disk space unless we specify a high unigram cut-off and a very low associative cut-off (e.g. 0.000000000000001). NO PATHS PRINTED FOR LENGTH GREATER THAN HALF THE NUMBER OF EDGES IN THE DISJOINT NETWORK ----------------------------------------------------------------------------------------------

38 Cut-Off Frequencies And Association Cut. What is their purpose? 1. To reduce the size of the network built in system memory. 2. Tools to manipulate the structure of the graph.

39 Cut-Off Frequency. Design Options. 1. Create the array with all the unigrams but not the edge information. 2. Create the unigram array with only the unigrams that are above the cut-off frequency.

40 Where to plug in? Create the unigram array to reflect the total number of unigrams. Before adding a unigram to the array, check whether it satisfies the unigram cut-off. Before adding the edge information (from a bigram), check for the presence of each unigram using binary search.
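The two checks on this slide can be sketched as follows. The sketch rests on assumptions: a unigram "satisfies" the cut-off when its frequency is at least the cut-off value, the array is sorted by token, and the names apply_unigram_cutoff, find_unigram and bigram_survives are hypothetical. For brevity the cut-off is applied to an already loaded array rather than while reading the file.

```c
#include <assert.h>
#include <string.h>

typedef struct { const char *token; long freq; } unigram;

/* Keep only unigrams with freq >= cutoff, preserving the sort order.
   Returns the new number of entries. */
int apply_unigram_cutoff(unigram *u, int n, long cutoff) {
    assert(n >= 0);
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (u[i].freq >= cutoff)
            u[kept++] = u[i];
    return kept;
}

/* Binary search by token; returns the index or -1 if absent. */
int find_unigram(const unigram *u, int n, const char *token) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strcmp(token, u[mid].token);
        if (c == 0) return mid;
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}

/* A bigram's edge is added only if both of its words survived the cut-off. */
int bigram_survives(const unigram *u, int n, const char *w1, const char *w2) {
    return find_unigram(u, n, w1) >= 0 && find_unigram(u, n, w2) >= 0;
}
```

With the sample unigrams A 1000, B 2000 and F 3000 and a cut-off of 2000, only B and F remain, so the bigram "F A" is dropped while "B B" is kept.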

41 Association Cut. Determines which bigram pairs are included in a path, based on their associative score. Unigram cut: taken care of during network building. Association cut: taken care of during path finding.

42 Where to plug in? It plays the role of a regulator for path tracking. It prunes the whole subgraph if one of the branches does not satisfy the association cut.
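The pruning role described above can be illustrated with a small sketch. The slides do not say how the associative score is computed, so the relative_score function below is purely a stand-in (the edge's share of its source node's total outgoing weight); count_kept_edges is likewise a hypothetical helper. The sketch reads "prunes the whole subgraph" as: when an edge fails the cut, neither that edge nor anything reachable only through it is explored.

```c
#include <assert.h>
#include <stddef.h>

typedef struct edge { int index; long freq; struct edge *next; } edge;
typedef struct node { edge *outgoing; long total_out_weight; } node;

/* Stand-in score, NOT taken from the slides: the fraction of the source
   node's outgoing weight that is carried by this edge. */
double relative_score(const node *src, const edge *e) {
    assert(src->total_out_weight > 0);
    return (double)e->freq / (double)src->total_out_weight;
}

/* Count the edge traversals made when expanding paths from start for up
   to depth steps, never crossing an edge whose score is below cutoff. */
int count_kept_edges(const node *nodes, int start, int depth, double cutoff) {
    if (depth == 0) return 0;
    int kept = 0;
    for (const edge *e = nodes[start].outgoing; e; e = e->next) {
        if (relative_score(&nodes[start], e) < cutoff)
            continue;           /* prune this edge and everything below it */
        kept += 1 + count_kept_edges(nodes, e->index, depth - 1, cutoff);
    }
    return kept;
}
```

Raising the cut-off shrinks the expansion quickly: on the sample network, a depth-2 expansion from A makes 6 edge traversals with a cut-off of 0 and only 2 with a cut-off of 0.5 under this stand-in score.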

43 Network Analysis (alternate approach) An approach based on message passing that finds one network completely before moving on to the next. Useful for obtaining the statistics of the network to which a word belongs without building the whole network.

44 How does it work? Processor 0 is the master. The rest of the processors are slaves. Every new node tracked in the master is broadcast. Slaves receive the nodes and perform a localized search. Slaves broadcast their results to account for a disjoint network that is spread over processors.

45 How does it work? (Cont.) The master checks whether its local list has been updated. If yes, it continues the iteration, beginning from the first step again. If no, the whole network has been found.
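The iteration described on the last two slides can be simulated sequentially, without any message passing, to show the stopping condition. This is a sketch under assumptions: each "processor" is just a slice of the bigram edges, the shared found array plays the role of the broadcast node list (so slaves see each other's updates immediately, which a real broadcast round would not allow), and the names local_search and find_network are hypothetical.

```c
#include <assert.h>
#include <string.h>

typedef struct { int from, to; } bigram_edge;
typedef struct { const bigram_edge *edges; int count; } processor;

/* One slave pass: look through the local edges for any that touch an
   already found node and add the node at the other end.
   Returns the number of newly found nodes. */
static int local_search(const processor *p, char *found) {
    int added = 0;
    for (int k = 0; k < p->count; k++) {
        int a = p->edges[k].from, b = p->edges[k].to;
        if (found[a] && !found[b]) { found[b] = 1; added++; }
        else if (found[b] && !found[a]) { found[a] = 1; added++; }
    }
    return added;
}

/* Repeat the search on every processor until the list stops growing.
   found must have room for n entries. Returns the size of the network
   that contains start. */
int find_network(const processor *procs, int nprocs, int n, int start,
                 char *found) {
    assert(n > 0 && start >= 0 && start < n);
    memset(found, 0, (size_t)n);
    found[start] = 1;
    int size = 1, updated;
    do {
        updated = 0;
        for (int p = 0; p < nprocs; p++)
            updated += local_search(&procs[p], found);
        size += updated;
    } while (updated > 0);      /* no update: the whole network is found */
    return size;
}
```

The loop ends exactly when a full round over all processors adds nothing new, which mirrors the master's check of whether its local list was updated.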

46 Future Work Combine the network information in a better way while building the network. Faster algorithm to find disjoint networks.

47 Question & Answer

48 Thank You ! Enjoy your winter break!

