GOOGLE N-GRAMS ON AMAZON WEB SERVICES PART 3 Thomas Tiahrt, MA, PhD Computer Science 482 – Introduction to Text Analytics.


Slide 2: Version 1
• Data created July 2009
• Version 1 file format: ngram \t year \t match_count \t page_count \t volume_count \n
• ngram is the n-gram itself (a 1-gram, 2-gram, 3-gram, 4-gram, or 5-gram)
• year is the publication year
• match_count is the number of times the n-gram occurred in that year
• page_count is the number of pages on which the n-gram appeared
• volume_count is the number of books in which the n-gram occurred
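As a minimal sketch of the Version 1 layout described above, the following Python splits one tab-separated line into its five fields. The names `NgramRecord` and `parse_v1_line` are hypothetical helpers introduced here for illustration, and the sample row is made up rather than taken from the dataset.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One row of a Version 1 n-gram file (hypothetical helper type)."""
    ngram: str
    year: int
    match_count: int
    page_count: int
    volume_count: int


def parse_v1_line(line: str) -> NgramRecord:
    """Split one tab-separated Version 1 line into typed fields."""
    ngram, year, match_count, page_count, volume_count = (
        line.rstrip("\n").split("\t")
    )
    return NgramRecord(
        ngram, int(year), int(match_count), int(page_count), int(volume_count)
    )


if __name__ == "__main__":
    # Made-up sample row for illustration only.
    sample = "the end of the day\t1955\t12\t11\t9\n"
    print(parse_v1_line(sample))
```

The words of an n-gram are separated by spaces, so splitting on tabs alone keeps the n-gram intact in the first field.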

Slide 3: AWS Public Dataset
• Stored in AWS Simple Storage Service (S3)

Slide 4: AWS Public Dataset
• Stored as compressed data
• Luckily, Hadoop supports:
    • GZIP
    • BZIP2
    • LZO (see below)
    • DEFLATE (zlib implementation)
• But Hadoop does not support WinZip
• And Hadoop supports LZO only if you build a version that includes it yourself

Slide 5: Hadoop Compression Formats

Compression Format | Tool         | Algorithm | Filename Extension | Multiple files? | Able to be Split?
DEFLATE (zlib)     | No CLI tools | DEFLATE   | .deflate           | No              | No
gzip               | gzip         | DEFLATE+  | .gz                | No              | No
bzip2              | bzip2        | bzip2     | .bz2               | No              | Yes
LZO                | lzop         | LZO       | .lzo               | No              | No

Source: Hadoop: The Definitive Guide

Slide 6: Hadoop Compression Formats

Compression Format | Hadoop CompressionCodec
DEFLATE (zlib)     | org.apache.hadoop.io.compress.DefaultCodec
gzip               | org.apache.hadoop.io.compress.GzipCodec
bzip2              | org.apache.hadoop.io.compress.BZip2Codec
LZO                | com.hadoop.compression.lzo.LzopCodec

Source: Hadoop: The Definitive Guide
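Hadoop picks a codec from the filename extension, so the two tables can be read together as a lookup from extension to codec class. The sketch below mirrors that idea in plain Python. It does not call Hadoop, and `CODEC_BY_EXTENSION` and `codec_for` are hypothetical names. The bzip2 and LZO class names are the ones I understand the Hadoop and hadoop-lzo distributions to use, so verify them against your own installation.

```python
import os
from typing import Optional

# Extension -> codec class name. The class names are assumptions to be
# checked against the Hadoop (and hadoop-lzo) version actually installed.
CODEC_BY_EXTENSION = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".lzo": "com.hadoop.compression.lzo.LzopCodec",
}


def codec_for(filename: str) -> Optional[str]:
    """Return the codec class name for a filename, or None if unrecognized."""
    _, ext = os.path.splitext(filename)
    return CODEC_BY_EXTENSION.get(ext.lower())


if __name__ == "__main__":
    for name in ("part-00000.gz", "part-00001.bz2", "archive.zip"):
        print(name, "->", codec_for(name))
```

A `.zip` file returns `None` here, matching the earlier point that Hadoop does not support WinZip archives.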

Slide 7: Project Assignment I
• Use nwcdatabucket as the bucket for input
• Use the tmp folder in nwcdatabucket
• Input is nwcdatabucket/tmp
• Write Python code (in more than one .py file)
• Find the twenty most frequently occurring 5-grams for a 10-year period
• You may hard-code the 10-year period, e.g., a period ending in 1959
• You need not worry about error checking the range
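One possible shape for the map step, offered as a sketch rather than the expected solution: read Version 1 lines from standard input, keep only rows whose year falls in the hard-coded period, and emit the n-gram with its match count. It assumes the input arrives as plain tab-separated text lines. `map_lines` is a hypothetical name, and `START_YEAR = 1950` is an assumption, since the slide gives only the end year.

```python
import sys
from typing import Iterable, Iterator

START_YEAR = 1950  # assumption: only the end year appears on the slide
END_YEAR = 1959    # end year taken from the slide's example


def map_lines(
    lines: Iterable[str], start: int = START_YEAR, end: int = END_YEAR
) -> Iterator[str]:
    """Yield 'ngram<TAB>match_count' for rows with start <= year <= end."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip rows that do not match the Version 1 layout
        ngram, year, match_count = fields[0], fields[1], fields[2]
        if year.isdigit() and start <= int(year) <= end:
            yield f"{ngram}\t{match_count}"


if __name__ == "__main__":
    # Hadoop streaming feeds input records on stdin and reads stdout.
    for out in map_lines(sys.stdin):
        print(out)
```

Filtering by year in the mapper keeps out-of-range rows from ever reaching the shuffle, which reduces the data the reducer has to sum.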

Slide 8: Project Assignment II
• Setting reducers
    • Use the extra arguments at the bottom of the first page
    • The following creates 1 reducer: -D mapred.reduce.tasks=1
• Upload your results as a text file
• Upload your Python code modules
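With a single reducer, every key arrives at the same process, so a top-20 list computed there is the overall top 20. The sketch below assumes the map step emits lines of the form `ngram<TAB>count` (an assumed format) and relies on Hadoop streaming delivering reducer input sorted by key. `reduce_lines` and `TOP_N` are hypothetical names.

```python
import heapq
import sys
from itertools import groupby
from typing import Iterable, List, Tuple

TOP_N = 20  # the assignment asks for the twenty most frequent 5-grams


def reduce_lines(lines: Iterable[str], top_n: int = TOP_N) -> List[Tuple[str, int]]:
    """Sum counts per n-gram from key-sorted input; return the top_n totals."""
    # Split each non-blank line into [ngram, count] on the last tab.
    pairs = (
        line.rstrip("\n").rsplit("\t", 1) for line in lines if line.strip()
    )
    # Adjacent lines share a key because the input is sorted by n-gram.
    totals = (
        (ngram, sum(int(count) for _, count in group))
        for ngram, group in groupby(pairs, key=lambda p: p[0])
    )
    # nlargest keeps only top_n items in memory, largest total first.
    return heapq.nlargest(top_n, totals, key=lambda t: t[1])


if __name__ == "__main__":
    for ngram, total in reduce_lines(sys.stdin):
        print(f"{ngram}\t{total}")
```

With more than one reducer, each would report its own local top 20, and a further merge step would be needed to obtain the overall list.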

Slide 9: The end has come.
End of the Part 3 PowerPoint