Inputs: 2-gram data, 1-gram data, target query file
Distribution of data (Premchand). Input: unigram data. Output: array of unigrams.
Find Disjoint Network (Sruthi). Input: array of unigrams, bigram data. Output: linked list, files.
Agglomeration (Andrew). Input: number of networks for each processor, linked lists. Output: number of disjoint networks.
Number of Edges (Premchand). Input: number of networks. Output: number of edges.
Reverse Network Building (Salil). Input: 2-gram data. Output: reverse network.
Network Interface Module (Salil). Input: reverse network. Output: linked list.
Path Finder Module (Anagha). Input: target query file. Output: query paths.
Final outputs: query paths, number of disjoint networks, number of nodes in each disjoint network, number of edges.
Network building background
- Using what we have: the 2gm data.
- Building a reverse network.
- Storing whatever is built.
Network Details
The folder structure:
- 1 root directory
- 63 second-level directories
- A third level of directories, with a file inside each directory
Consider the bigram “match day 2000”.
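To illustrate what a reverse network records, here is a minimal in-memory sketch. It assumes that "reverse" means each word is indexed by the words that precede it, and that each input line is "first second count" separated by whitespace. The function name and the second sample line are made up for illustration; the project itself stores the network in the directory tree described above, which this sketch does not model.

```python
from collections import defaultdict


def build_reverse_network(bigram_lines):
    """Map each second word to {preceding word: count}.

    Assumption: every line looks like "first second count".
    """
    reverse = defaultdict(dict)
    for line in bigram_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        first, second, count = parts[0], parts[1], int(parts[2])
        reverse[second][first] = count
    return dict(reverse)


# "match day 2000" is the bigram from the slide; "every day 150" is invented.
network = build_reverse_network(["match day 2000", "every day 150"])
```

Looking up `network["day"]` then gives every word that leads into "day", together with the bigram count.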
Parallelism Details
- Block allocation: lines, rather than files, are distributed amongst the processors.
- Processor 0 sends to each processor:
  - the number of lines it has to process
  - the file number from which it should start
  - the starting line
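The three values Processor 0 sends can be computed as in the sketch below. It assumes the line count of every file is known up front and that leftover lines go to the lowest-ranked processors; the function name and the tuple layout are illustrative, not taken from the project code.

```python
def block_allocation(lines_per_file, num_procs):
    """Return one (line_count, start_file, start_line) tuple per processor.

    All files are treated as one long sequence of lines that is cut into
    contiguous blocks. The first `total % num_procs` processors get one
    extra line (an assumption; the slides do not say how leftovers are split).
    """
    total = sum(lines_per_file)
    base, extra = divmod(total, num_procs)
    assignments = []
    file_idx, line_idx = 0, 0  # position of the next unassigned line
    for rank in range(num_procs):
        count = base + (1 if rank < extra else 0)
        # Move past files that have already been handed out completely.
        while file_idx < len(lines_per_file) and line_idx >= lines_per_file[file_idx]:
            file_idx += 1
            line_idx = 0
        assignments.append((count, file_idx, line_idx))
        # Advance the cursor by `count` lines, crossing file boundaries.
        remaining = count
        while remaining > 0:
            step = min(remaining, lines_per_file[file_idx] - line_idx)
            line_idx += step
            remaining -= step
            if remaining > 0:
                file_idx += 1
                line_idx = 0
    return assignments


# Three files with 5, 3 and 4 lines shared by 3 processors.
plan = block_allocation([5, 3, 4], 3)
```

With this layout the second processor starts at line 4 of file 0 and runs on into file 1, which is the point of distributing lines instead of whole files.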
Benchmarking
Processors   Timings         Number of 2-gm files
16           3 hrs 56 mins   32
32           2 hrs 47 mins   32
64           1 hr 58 mins    32
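The relative speedup implied by these timings can be worked out directly. The short calculation below uses only the three measurements above and takes the 16-processor run as the baseline (a choice made here for illustration).

```python
def minutes(hours, mins):
    """Convert an hours/minutes timing to minutes."""
    return hours * 60 + mins


# Processors -> run time in minutes, from the benchmarking figures.
timings = {16: minutes(3, 56), 32: minutes(2, 47), 64: minutes(1, 58)}

baseline = timings[16]
# Speedup relative to the 16-processor run.
speedup = {procs: baseline / t for procs, t in timings.items()}
```

Going from 16 to 64 processors halves the run time (236 minutes to 118 minutes) for the same 32 files, so four times the processors gives twice the speed.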
Space Requirements
- Not much needs to be held in memory.
- Storage requirements are large: around 5.5 GB for the Google 2-gram data.
- As a general rule, the reverse network will be approximately the same size as the original data.
Finding Disjoint Networks
Module description: This module finds the disjoint networks in the Google 2-gram data. It takes unigrams as input; for each unigram, it gets all the tokens connected to it and processes them, as described later, to find the disjoint networks.
Approach
We exploited a simple fact: if two networks of words have any word in common, the two networks are connected.
Example:
Network 1: A --> B --> C --> D --> E --> Z
Network 2: Q --> Y --> P --> R --> S --> A --> V -->
Here A appears in both networks, so the two networks are connected.
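The same check can be written in a few lines. This sketch treats each network as a plain set of words and ignores edge direction, which is an assumption made for brevity; the function name is hypothetical.

```python
# The two example networks from the slide, as sets of words.
network_1 = {"A", "B", "C", "D", "E", "Z"}
network_2 = {"Q", "Y", "P", "R", "S", "A", "V"}


def merge_if_connected(first, second):
    """Return the merged network if the two share a word, otherwise None."""
    if first & second:          # any common word means they are connected
        return first | second   # union, so the shared word is stored once
    return None


merged = merge_if_connected(network_1, network_2)
```

Because A is shared, `merged` holds the 12 distinct words of both networks.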
Distribution of Data
- Distributes the unigram data
- Follows block distribution
- Finds the number of lines in the unigram file
- Then finds the interval for the block distribution
Data Structure Used
We used a two-dimensional linked-list structure. The first linked list (the Network List) contains all the connected words, and the second linked list (the Base List) connects all the network lists.
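A two-dimensional linked list of this kind might look like the sketch below. The class and method names are invented for illustration, and inserting at the head of each list is an assumption; the slides do not show the actual node layout.

```python
class WordNode:
    """One word in a network list."""

    def __init__(self, word):
        self.word = word
        self.next = None  # next word in the same network


class BaseNode:
    """One entry of the base list; it owns the head of one network list."""

    def __init__(self):
        self.head = None  # first WordNode of this network
        self.next = None  # next network in the base list

    def add_word(self, word):
        node = WordNode(word)
        node.next = self.head  # insert at the head: O(1)
        self.head = node

    def words(self):
        node = self.head
        while node is not None:
            yield node.word
            node = node.next


class BaseList:
    """The base list: a linked list that connects all the network lists."""

    def __init__(self):
        self.head = None

    def add_network(self, words):
        base = BaseNode()
        for word in words:
            base.add_word(word)
        base.next = self.head
        self.head = base
        return base

    def networks(self):
        base = self.head
        while base is not None:
            yield base
            base = base.next
```

Checking whether a word already belongs to some network means walking every network list, which is consistent with execution slowing down as the networks grow.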
Working
1. Get the root tokens.
2. Get the words connected to the root tokens.
3. If it is the first root token, its words form the first network (figure: B1 -> 1).
4. If it is not the first root token, process each word one by one.
Nature of this network: unique
-> if connected to 1 existing network
-> if connected to some network different from the marked network
-> not connected at all
Working (contd.)
Cases:
1. None of the words of root token 2 exists in the network of root token 1 (figure: B1 -> 1, B2 -> 2).
2. Any one word exists in an already existing network (figure: B1 -> 1, 2).
Working (contd.)
3. A word is common to a network different from the marked network.
Figure: to process: 3; existing: B1 -> 1, B2 -> 2; result: B1 -> 1, 3, 2.
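The three cases can be sketched as one function that folds a root token's word group into the existing networks. This is a simplified reading of the slides: networks are Python sets rather than linked lists, the whole group is tested at once instead of word by word, and the "marked" network is taken to be the first existing network the group touches. All names are illustrative.

```python
def add_group(networks, group):
    """Fold one root token's words into a list of pairwise disjoint networks."""
    group = set(group)
    marked = None  # index of the first existing network the group touches
    i = 0
    while i < len(networks):
        if networks[i] & group:
            if marked is None:
                # Case 2: a word exists in an already existing network.
                marked = i
                networks[marked] |= group
            else:
                # Case 3: a word is common to a network other than the
                # marked one, so that network is absorbed into the marked one.
                networks[marked] |= networks.pop(i)
                continue  # the list shrank; re-check the same index
        i += 1
    if marked is None:
        # Case 1: no word exists in any network, so start a new one.
        networks.append(group)
    return networks


networks = []
add_group(networks, {"a", "b"})   # first root token: network 1
add_group(networks, {"c", "d"})   # case 1: network 2
add_group(networks, {"b", "c"})   # cases 2 and 3: joins networks 1 and 2
```

The last call mirrors the figure: group 3 touches both existing networks, and everything ends up under a single base entry.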
Observations & Conclusion
- Execution takes a lot of time; the 2gm-0031 data has 1869 networks.
- Execution is fast initially and slows down as network size increases.
- Use of a linked list of arrays to speed up the process.
Agglomeration
- Combines the work of all processors
- Finds:
  - the number of disjoint networks
  - the number of nodes in each network
- For this step to work:
  - Processors 0 and k (k = np/2) have their networks in linked lists
  - The other processors have written out their networks to files
How It Works
- Processors 1 to k-1 send their “local” number of networks to Processor 0.
- Processors k+1 to (number of processors) - 1 send theirs to Processor k.
- Processors 0 and k combine networks: each opens the files and checks whether a word is in its networks.
  - Yes: combine the two networks (eliminating redundancy).
  - No: add that network to its list of networks.
Final Step
- Processor k writes its networks to files.
- It sends its number of networks to Processor 0.
- Processor 0 then combines those networks.
Results
- Processor 0 has the list of disjoint networks.
- It prints out the number of disjoint networks.
- It prints out the number of nodes in each network.
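The two-stage combining can be simulated without MPI. In the sketch below, in-memory lists stand in for both the files and the messages, each network is a set of words, and the function names are hypothetical; it only illustrates the order in which Processors 0 and k combine.

```python
def combine(own, incoming):
    """Merge incoming networks into own; both are lists of word sets.

    `own` must be pairwise disjoint, and the result is pairwise disjoint.
    """
    for net in incoming:
        merged = set(net)
        untouched = []
        for existing in own:
            if existing & merged:
                merged |= existing          # Yes: combine, dropping duplicates
            else:
                untouched.append(existing)
        own = untouched + [merged]          # No overlap: merged is a new entry
    return own


def agglomerate(per_processor):
    """per_processor[r] is the list of networks found by processor r."""
    num_procs = len(per_processor)
    k = num_procs // 2
    root = [set(n) for n in per_processor[0]]
    for rank in range(1, k):                 # processors 1..k-1 go to 0
        root = combine(root, per_processor[rank])
    half = [set(n) for n in per_processor[k]]
    for rank in range(k + 1, num_procs):     # processors k+1..np-1 go to k
        half = combine(half, per_processor[rank])
    return combine(root, half)               # final step: k hands over to 0


result = agglomerate([
    [{"a", "b"}],
    [{"b", "c"}],
    [{"x"}],
    [{"c", "d"}, {"y"}],
])
node_counts = sorted(len(n) for n in result)
```

`len(result)` is the number of disjoint networks and `node_counts` the number of nodes in each, the two values Processor 0 prints.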
Unigram Cut-Off
- Happens during distribution of data to the processors.
- When distributing to the processors, check the condition:
  - If the frequency of the unigram is > cut-off, store it in the array for distribution.
  - Else ignore that unigram.
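A minimal sketch of the cut-off check, assuming each unigram line has the form "word frequency"; the function name and the sample data are invented for illustration.

```python
def apply_unigram_cutoff(unigram_lines, cutoff):
    """Keep only unigrams whose frequency is strictly greater than cutoff."""
    kept = []
    for line in unigram_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed format
        word, frequency = parts[0], int(parts[1])
        if frequency > cutoff:
            kept.append(word)  # goes into the array for distribution
        # else: the unigram is ignored
    return kept


# Made-up sample data: only "the" is above the cut-off of 50.
to_distribute = apply_unigram_cutoff(["the 1000", "zyx 3", "day 50"], 50)
```

Filtering before distribution means the later block distribution works on the shorter array.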
Associative Cut-off
- Happens during the path finding module.
- For each path found, find the association score:
  - If > association cut-off, include it in the path.
  - Else don't include it in the path.
Path Finding
This module supports queries against the constructed network. The aim was to explore the built network by path finding. A query lets the user specify a target word and displays the paths of a given length leading to and from that word, to the words connected to those words, and so on.
Requirements
- The specified target word should be at the center of the paths that lead into and out from it.
- Path lengths are defined in terms of the number of edges in the path to and from the target word.
Eg: was 3 (Path length: 3)
Italian --> (34) --> poor --> (34) --> girl --> (43) --> was --> (34) --> hardworking --> (432) --> and --> (23) --> beautiful
TIME: 0.432
(+ more path length 3 variations)
The number between two words represents the frequency of that bigram.
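The requirement can be sketched as a small recursive search that walks a given number of edges out of the target and the same number into it, then joins the two sides. This is an illustrative, single-process version: the dictionary layout, the function names, and the rule of never revisiting a word within one side are assumptions, and no MPI or caching is involved.

```python
def reverse_of(forward):
    """Build {second: {first: count}} from {first: {second: count}}."""
    reverse = {}
    for first, followers in forward.items():
        for second, count in followers.items():
            reverse.setdefault(second, {})[first] = count
    return reverse


def extend(word, neighbors, length, seen):
    """All chains of `length` edges leaving `word`, as lists of (count, word)."""
    if length == 0:
        return [[]]
    chains = []
    for nxt, count in neighbors.get(word, {}).items():
        if nxt in seen:
            continue  # do not walk in cycles
        for rest in extend(nxt, neighbors, length - 1, seen | {nxt}):
            chains.append([(count, nxt)] + rest)
    return chains


def find_paths(target, forward, reverse, length):
    """Paths with `length` edges into and `length` edges out of target."""
    right = extend(target, forward, length, {target})
    left = extend(target, reverse, length, {target})
    paths = []
    for l_chain in left:
        for r_chain in right:
            tokens = []
            # The left chain was walked backwards from the target.
            for count, word in reversed(l_chain):
                tokens += [word, f"({count})"]
            tokens.append(target)
            for count, word in r_chain:
                tokens += [f"({count})", word]
            paths.append(" --> ".join(tokens))
    return paths


# The bigrams and counts from the example path on the slide.
forward = {
    "Italian": {"poor": 34}, "poor": {"girl": 34}, "girl": {"was": 43},
    "was": {"hardworking": 34}, "hardworking": {"and": 432},
    "and": {"beautiful": 23},
}
paths = find_paths("was", forward, reverse_of(forward), 3)
```

With only these six bigrams loaded, the query "was 3" yields the single path shown in the example; the real data produces many more variations.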
Algorithm (Broader View)
- Read the query (target-list) file according to its file format.
- Distribute each query from the target list to the processors in parallel (using MPI).
- Each processor builds its internal tree structure and finds the entire paths.
- One processor has to be dedicated to printing. If all processors print, chaos occurs, because the full result set for a single query word must be clubbed together.
- All processors send the path results they obtain to the processor with rank 0, which is responsible for printing the individual paths obtained by each processor.
- Caching logic helped in cycle detection to some extent.
Challenges
- The recursive traversal done for all the 'from' and 'to' nodes of the given target node limits the scope of parallelism.
- Memory issue (maximum memory limit): for path lengths up to one there was no problem.
  - Eg: the Bush 1 entry has 20000+ 'from'/'to' words associated with it. For each of these 20000+ words, recursively processing their 'from' and 'to' lists takes a huge investment of memory and time. This easily hits the maximum memory limit on blade before path processing is complete.
  - Blade: maximum memory limit of 7 GB for user programs (4 million nodes in our case before it crashes).
Alternatives to overcome challenges
- Fix memory leaks: the code had memory leaks in some places. We identified the major culprits in memory consumption and freed them appropriately.
- Major bottlenecks:
  - Eg: any time a 'from' list or 'to' list for a token was obtained, the memory was not being freed.
  - The function AddToResults() was allocating memory on every path found but was not freeing it.
Alternatives to overcome challenges (continued)
- Migration to Altix: since the amount of memory available on Altix is a lot more than on blade, the chances of path finding working for path lengths greater than 1 were high.
- Exploited the good memory support on Altix by writing data to files.
- This gave good results for path lengths up to 8-9 for smaller-scope target words and 4-5 for slightly higher-scope words.
- The files to which data was written were as big as 20 GB.
Change in Methodology and Performance
The new method exploited memory and also improved performance in terms of the time required to find paths. However, because of the recursive nature of the algorithm, the inherent sequential component was fixed, and this limited performance according to Amdahl's Law.
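For reference, Amdahl's Law can be stated as follows, where s is the fraction of the run time that is inherently sequential and p is the number of processors. The symbols are introduced here for illustration; no value of s was measured for this project.

```latex
S(p) = \frac{1}{s + \dfrac{1 - s}{p}},
\qquad
\lim_{p \to \infty} S(p) = \frac{1}{s}
```

However many processors are added, the speedup cannot exceed 1/s, which is why a fixed sequential component caps the gain.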
How is this better?
Merging:
- Figure: left-side paths and right-side paths around the target word, merged into the final output of combined files.
- For each line of the left file, combine it with every line in the right file to form a complete path.
- This architecture allows parallelism at a much more granular level than just the query level.
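The merging step amounts to a Cartesian product of the two files. The sketch below uses in-memory lists of strings in place of the left and right files and assumes the target word is not repeated inside either side; the function name and the sample lines are made up.

```python
from itertools import product


def merge_paths(left_lines, right_lines, target):
    """Combine every left-side path with every right-side path."""
    return [
        f"{left} --> {target} --> {right}"
        for left, right in product(left_lines, right_lines)
    ]


# Two left-side paths and two right-side paths give four complete paths.
complete = merge_paths(["a --> b", "c --> b"], ["x", "y"], "t")
```

Each left line can be merged independently of the others, so the left file can be split across processors, which is finer-grained than assigning one whole query to each processor.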