Cluster Computing for Statistical Machine Translation Qin Gao, Kevin Gimpel, Alok Palikar, Andreas Zollmann Stephan Vogel, Noah Smith.

Slides:

Advertisements

Similar presentations

Meet Hadoop Doug Cutting & Eric Baldeschwieler Yahoo!

Advertisements

Statistical Machine Translation

Information Retrieval in Practice

Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.

 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.

Digital Library Service – An overview Introduction System Architecture Components and their functionalities Experimental Results.

LIBRA: Lightweight Data Skew Mitigation in MapReduce

Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser ICL, U. Heidelberg CIS, LMU München Statistical Machine Translation.

MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.

EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.

Statistical Machine Translation Part II – Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart

Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007) Learning for Semantic Parsing Advisor: Hsin-His.

Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.

Spark: Cluster Computing with Working Sets

Large-Scale Machine Learning Program For Energy Prediction CEI Smart Grid Wei Yin.

Hadoop(MapReduce) in the Wild —— Our current understandings & uses of Hadoop Le Zhao, Changkuk Yoo, Mark Hoy, Jamie Callan Presenter: Le Zhao

Application of RNNs to Language Processing Andrey Malinin, Shixiang Gu CUED Division F Speech Group.

Hinrich Schütze and Christina Lioma Lecture 4: Index Construction

Distributed Iterative Training Kevin Gimpel Shay Cohen Severin Hacker Noah A. Smith.

L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.

Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali.

Czech-to-English Translation: MT Marathon 2009 Session Preview Jonathan Clark Greg Hanneman Language Technologies Institute Carnegie Mellon University.

Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc

Hadoop: The Definitive Guide Chap. 8 MapReduce Features

Hadoop & Cheetah. Key words Cluster  data center – Lots of machines thousands Node  a server in a data center – Commodity device fails very easily Slot.

Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.

PFA Node Alignment Algorithm Consider the parse trees of a Chinese-English parallel pair of sentences.

THE HOG LANGUAGE A scripting MapReduce language. Jason Halpern Testing/Validation Samuel Messing Project Manager Benjamin Rapaport System Architect Kurry.

Some Thoughts on HPC in Natural Language Engineering Steven Bird University of Melbourne & University of Pennsylvania.

CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.

Advanced Signal Processing 05/06 Reinisch Bernhard Statistical Machine Translation Phrase Based Model.

MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.

Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.

Introduction to Hadoop and HDFS

f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read

MapReduce How to painlessly process terabytes of data.

Recent Major MT Developments at CMU Briefing for Joe Olive February 5, 2008 Alon Lavie and Stephan Vogel Language Technologies Institute Carnegie Mellon.

NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.

Advanced MT Seminar Spring 2008 Instructors: Alon Lavie and Stephan Vogel.

Large Scale Machine Translation Architectures Qin Gao.

MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.

Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!

Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,

1 Compiler Design (40-414)  Main Text Book: Compilers: Principles, Techniques & Tools, 2 nd ed., Aho, Lam, Sethi, and Ullman, 2007  Evaluation:  Midterm.

U N I V E R S I T Y O F S O U T H F L O R I D A Hadoop Alternative The Hadoop Alternative Larry Moore 1, Zach Fadika 2, Dr. Madhusudhan Govindaraju 2 1.

By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.

Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

LREC 2004, 26 May 2004, Lisbon 1 Multimodal Multilingual Resources in the Subtitling Process S.Piperidis, I.Demiros, P.Prokopidis, P.Vanroose, A. Hoethker,

RDFPath: Path Query Processing on Large RDF Graph with MapReduce Martin Przyjaciel-Zablocki et al. University of Freiburg ESWC May 2013 SNU IDB.

Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Related to Chapter 4:

Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies

Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart

MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.

INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.

COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University

Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.

A Simple Approach for Author Profiling in MapReduce

Advanced Computer Systems

Compiler Design (40-414) Main Text Book:

Hadoop Aakash Kag What Why How 1.

Hadoop MapReduce Framework

--Mengxue Zhang, Qingyang Li

The Basics of Apache Hadoop

Cse 344 May 4th – Map/Reduce.

Overview of big data tools

Query Processing.

MapReduce: Simplified Data Processing on Large Clusters

Presentation transcript:

Cluster Computing for Statistical Machine Translation Qin Gao, Kevin Gimpel, Alok Palikar, Andreas Zollmann Stephan Vogel, Noah Smith

Outline  Introduction  Statistical Machine Translation  How can parallel processing help MT?  INCA: An Integrated Cluster Computing Architecture for Machine Translation  Roadmap  Progress  Ongoing work

Statistical Machine Translation 澳洲是少数与北朝鲜建交的国家之一 Source Sentence Hello John … The United State government is going to… Australia is one of afew countries that has relationship with North Korea… Australia is North Korean has relationship a few… Sarah Palin's neighbors saluted their hometown girl with Alaskan Amber The idea, especially from the Democrats that I know ….. Possible Translations Which one to choose?

Components of Statistical Machine Translation Source LanguageTarget Language Interlingua Analysis Generation Transfer

Components of Statistical Machine Translation Source LanguageTarget Language Interlingua Analysis Generation ? Semantic (?) Syntactic Tree-to-Tree Syntactic Tree-to-String String-to-String

Components of Statistical Machine Translation Source LanguageTarget Language Interlingua Analysis Generation ? Semantic (?) Syntactic Tree-to-Tree Syntactic Tree-to-String String-to-String Text Normalizing Syntactic Parsing Semantic Parsing Tasks for Analysis: Text normalization Tokenization Syntactic Parsing Dependency Parsing Semantic Parsing ….

Components of Statistical Machine Translation Source LanguageTarget Language Interlingua (?) Analysis Generation ? Tasks for Transfer: Phrase extraction Rule extraction Feature value estimation …. Phrase extraction Rule extraction

Components of Statistical Machine Translation Target Language Tasks for Generation: Target language modeling Minimal Error Rate Training Decoding …. Source Language Interlingua (?) Analysis Generation ?

AnalysisTransferGeneration Tasks for Analysis: Text normalization Tokenization Syntactic Parsing Dependency Parsing Semantic Parsing …. Tasks for Transfer: Phrase extraction Generation rule extraction Feature value estimation …. Tasks for Generation: Target language modeling Minimal Error Rate Training Decoding …. g

INCA : An Integrated Cluster Computing Architecture for Machine Translation

Target: Integrated Cluster Computing Architecture for MT Provide Cluster-enabled Components Simplify experiment process with integrated framework Experiment Management : Reproduce the results

Technical Roadmap Current Processing Pipeline Word Alignment Phrase Extraction MERT Integrated Parallel Pipeline

Progress Distributed Phrase extraction (Chaski) Rule extraction for SAMT Distributed Tuning/Decoding framework (Trambo) Parallel GIZA++ (To be ported to Hadoop) POS tagging Parsing Semantic Role labeling Data Pre- processing Word Alignment Phrase/Rule Extraction Tuning and Decoding

Data Preprocessing  Examples:  Tokenization  Word segmentation (e.g. for Chinese)  POS tagging  Parsing  Semantic role labeling  Naturally parallelizable, perfect for MapReduce  However, there are still interesting problems

Error Tolerance?  Hadoop has the functionality to re-run failed jobs.  It works fine for hardware errors or “unpredictable” software errors  But it does not work for predictable errors  E.g: A certain sentence may crash the parser, restart the parser does not help  Fail-safe mechanism: Falling back to simple models  Currently implemented through individual scripts, aiming at providing the frame work

Example: Phrase-based Translation

The Speed…  On 5 million words corpus  Data preparation:  About 5 hours (word clustering, format conversion etc)  Word Alignment:  Typically 6-7 days for each iteration  Phrase Extraction: Maximum length 10  More than 5 days, depending on the IO speed of disk

Parallelized Word Alignment

Parallel GIZA++

Performance of PGIZA++ Running Time BLEU Score (Tuning) BLEU Score (Testing) CPUs GIZA++169h PGIZA++39h Per-IterationTotal Model min235 min (3.9 h) HMM31.8 min159 min (2.6 h) Model 3/425.2 min151 min (2.5 h) Comparison of Speed and BLEU score using PGIZA++ and GIZA++ Normalization time in PGIZA++ Corpus Size: 10 million sentence pairs, Chinese-English

I/O bottleneck  Currently the tool run on NFS  I/O of counts becomes the major bottleneck  The plan is to port it to Hadoop  The locality of HDFS should help

Ongoing Work: Distributed Asynchronous Online EM  Master node runs MapReduce jobs to perform E-step on mini-batches of data  When each mini-batch completes, master node updates parameters, starts a new mini-batch immediately  E-steps performed using slightly stale parameters → “asynchronous”

Phrase Extraction Tool Chaski  Chaski  Postmen of Inca Empire  They run really fast!

Phrase Model Training  Problems in phrase extraction:  Disk space  700GB when extracting phrase up to length 10, on 10 million sentences  Sorting phrases  External sorting must be used, which adds up to the disk usage

Phrase Model Training (Moses)

Hadoop FileSystem (HDFS) Block 2 Block 1 Block 3 Task 1 Block 5 Block 4 Block 6 Task 2 Block Y Block X Block Z Task k

Phrase Extraction & Reorder Model  MapReduce 1:  Extract phrase from alignments and sort by target phrases  Calculate the target  source features  MapReduce 2:  Take the input from previous step and sort by source phrases  Calculate the source  target features  MapReduce 3:  Take the input from MapReduce 1, sort by source phrases  Calculate the reordering features

Chaski Mapper Sorting Reducer Extract phrases Sort on target Score phrases Dummy Mapper Sort on source Score phrases Dummy Mapper Sort on source Learn Reordering Hadoop and HDFS handle large intermediate file and sorting Kept away from the merge operation which MapReduce is not good at

Performance of Chaski

Rule Extraction for Syntactic Augmented Machine Translation System  Extract probabilistic synchronous context-free grammar (PSCFG) rules in in a Map step from all the target-parsed training sentence pairs From syntax-annotated phrase pairs… S -> Il ne marche pas / It does not work NP -> Il / It VB -> marche / work …create rule: S -> NP ne VB pas / NP does not VB  Merge and compute weights for them in the Reduce step

Distributed Tuning/Decoding (Trambo)  Translate one sentence at a time  Split up decoding into sub-processes  Collect the output for MERT Test Set Decoder … Split MERT Merge

Trambo  Filter the phrase table and language models on a per-sentence basis -- beforehand  Each decoder instance loads faster  Memory usage is kept in check  Tuning time (with Moses decoder):  12.5 hrs on desktop machine  →70 mins using 50 nodes

Accessing Large Structured Data  The basic idea of MapReduce is to use commodity computers to build clusters  And it is specially good for Massive, Unordered, Distributed (MUD) data  However, not all data in this format:  Language models  Statistical dictionaries  Phrase / Rule database

Filtering as (temporary?) Solution  For language model  In Trambo, filtering a 11GB language model for 1600 sentences, ends up with 1TB temporary storage, with the limitation of 60 candidate target phrases for each source phrase  Distributed Language Model?

Conclusion  Working towards integrated cluster computing architecture for MT by providing cluster-enabled components  Data preprocessing  Word alignment  Phrase/Rule extraction  Tuning/Decoding

On-going Work Data preprocessing Word Alignment Phrase/Rule Extraction Tuning/Decoding System Integration Research Side Engineering Side Error tolerant framework through fall-back strategies Distributed Asynchronous Online EM Porting PGIZA++ to Hadoop Merge Chaski and SAMT rule extractor into single framework Efficient access of language models User study: Which kinds of automation can benefit researchers most Master script to integrate components Experiment management Improved phrase/rule extraction

Thanks! Questions & Suggestion?

Error tolerance, the definition?  In Hadoop, the error tolerance generally means tolerance of Hardware Errors or Unexpected System Errors.  “Expected” errors may happen:  Syntactic parsing may fail when using complicated models on long sentences.  Need to implement transparent “fall-back” mechanism to ensure completion of tasks