Presentation is loading. Please wait.

Presentation is loading. Please wait.

GOOGLE N-GRAMS ON AMAZON WEB SERVICES PART 3 Thomas Tiahrt, MA, PhD Computer Science 482 – Introduction to Text Analytics.

Similar presentations


Presentation on theme: "GOOGLE N-GRAMS ON AMAZON WEB SERVICES PART 3 Thomas Tiahrt, MA, PhD Computer Science 482 – Introduction to Text Analytics."— Presentation transcript:

1

2 GOOGLE N-GRAMS ON AMAZON WEB SERVICES PART 3 Thomas Tiahrt, MA, PhD Computer Science 482 – Introduction to Text Analytics

3 2  Data created July 2009  Version 1 file format N-gram \t year \t match_count \t page_count \t volume_count \n  N-gram is the 1gram, 2gram, 3gram, 4gram, 5gram  Year is the publication year  match_count is the occurrences for that year  page_count is the number of pages on which the ngram appeared  volume_count is the number of books where the ngram occurred Version 1

4 3  http://aws.amazon.com/datasets/8172056142375670 http://aws.amazon.com/datasets/8172056142375670  Stored in AWS Simple Storage Service (S3) AWS Public Dataset

5 4  Stored as compressed data  Luckily Hadoop supports GZIP BZIP2 LZO (see below) DEFLATE (zlib implementation)  But Hadoop does not support WinZip  And Hadoop supports LZO only if you create a version with it yourself AWS Public Dataset

6 5 Compression Format ToolAlgorithmFilename Extension Multiple files? Able to be Split? DEFLATE (zlib)No CLI toolsDEFLATE.deflateNo gzip DEFLATE+.gzNo bzip2.bz2NoYes LZOlzopLZO.lzoNo Hadoop Compression Formats Source: Hadoop The Definitive Guide

7 6 Compression FormatTool DEFLATE (zlib) org.apache.hadoop.io.compress.DefaultCodec gzip org.apache.hadoop.io.compress.GzipCodec bzip2 org.apache.hadoop.io.compress.GzipCodec LZO com.hadoop.compression.LzopCodec Hadoop Compression Formats Source: Hadoop The Definitive Guide

8 Project Assignment I 7  Use the nwcdatabucket as the bucket for input  Use the tmp folder in nwcdatabucket  Input is nwcdatabucket/tmp  Write Python code (in > 1.py files)  Find the twenty most frequently occurring 5-grams for a 10 year period.  You may hard-code the 10 year period E.g. 1950 to 1959  You need not worry about error checking the range

9 Project Assignment II 8  Setting reducers  Use the extra arguments in the bottom of the first page  The following creates 1 reducer -D mapred.reduce.tasks=1  Upload your results as a text file  Upload your Python code modules

10 The end has come. End of the Part 3 PowerPoint 9


Download ppt "GOOGLE N-GRAMS ON AMAZON WEB SERVICES PART 3 Thomas Tiahrt, MA, PhD Computer Science 482 – Introduction to Text Analytics."

Similar presentations


Ads by Google