1 Statistical NLP: Lecture 6 Corpus-Based Work. 2 4 Text Corpora are usually big. They also need to be representative samples of the population of interest.

Slides:



Advertisements
Similar presentations
CSCI N241: Fundamentals of Web Design Copyright ©2004 Department of Computer & Information Science Introducing XHTML: Module B: HTML to XHTML.
Advertisements

Properties of Text CS336 Lecture 3:. 2 Generating Document Representations Want to automatically generate with little human intervention Use significant.
Fall 2001 EE669: Natural Language Processing 1 Lecture 4: Corpus-Based Work (Chapter 4 of Manning and Schutze) Wen-Hsiang Lu ( 盧文祥 ) Department of Computer.
Information Retrieval in Practice
Project topics Projects are due till the end of May Choose one of these topics or think of something else you’d like to code and send me the details (so.
HTML and XHTML Controlling the Display Of Web Content.
Stemming, tagging and chunking Text analysis short of parsing.
XHTML and CSS Overview. Hypertext Markup Language A set of markup tags and associated syntax rules Unlike a programming language, you cannot describe.
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
23-Jun-15 HTML. 2 Web pages are HTML HTML stands for HyperText Markup Language Web pages are plain text files, written in HTML Browsers display web pages.
What is a document? Information need: From where did the metaphor, doing X is like “herding cats”, arise? quotation? “Managing senior programmers is like.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
XHTML and CSS Overview. Hypertext Markup Language A set of markup tags and associated syntax rules Unlike a programming language, you cannot describe.
Introducing XHTML: Module B: HTML to XHTML. Goals Understand how XHTML evolved as a language for Web delivery Understand the importance of DTDs Understand.
Developing a Basic Web Page Posting Files on UMBC
Overview of Search Engines
Introducing HTML & XHTML:. Goals  Understand hyperlinking  Understand how tags are formed and used.  Understand HTML as a markup language  Understand.
* The basic components of a web site are: * Content – information displayed or accepted from users * Static – content that doesn’t change for different.
XML Basics Hope Greenberg Center for Teaching & Learning.
Evidence from Content INST 734 Module 2 Doug Oard.
1 COMP 791A: Statistical Language Processing Corpus-Based Work Chap. 4.
Lecturer: Ghadah Aldehim
Chpater 3. Outline The definition of Syntax The Definition of Semantic Most Common Methods of Describing Syntax.
ULI101 – XHTML Basics (Part II) What is Markup Language? XHTML vs. HTML General XHTML Rules Block Level XHTML Tags XHTML Validation.
April 2005CSA2050:NLTK1 CSA2050: Introduction to Computational Linguistics NLTK.
BTANT 129 w5 Introduction to corpus linguistics. BTANT 129 w5 Corpus The old school concept – A collection of texts especially if complete and self-contained:
1 Corpus-Based Work Chapter 4 Foundations of statistical natural language processing.
Sekimo Solutions mentioned by the TEI  CONCUR: an optional feature of SGML (not XML) that allows multiple.
1 HTML John Sum Institute of Technology Management National Chung Hsing University.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Introduction to HTML. Slide 1 Hard-Coding What is hard-coding? –Creating the page in a text editor just using HTML A Web designer should know how to hard-
1 What is HTML? Standardized codes Web pages SGML Descriptive markup Tags.
Smart Qualitative Data: Methods and Community Tools for Data Mark-Up SQUAD Libby Bishop Language and Computation Day University of Essex 4 October 2005.
CSA2050 Introduction to Computational Linguistics Lecture 3 Examples.
Programming Fundamentals. Today’s Lecture Why do we need Object Oriented Language C++ and C Basics of a typical C++ Environment Basic Program Construction.
Chapter 3 : Corpus-Based Work Presented By: Geoff Hulten.
©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. CSC 9010: Text Mining Applications.
What it is and how it works
1 XML eXtensible Markup Language. 2 XML vs. HTML HTML is a HyperText Markup language HTML is a HyperText Markup language Designed for a specific application,
Basic Implementation and Evaluations Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
1 Indexing The syntax for creating a index is: CREATE [UNIQUE] INDEX index_name ON table_name (column1, column2,... column_n) [ COMPUTE STATISTICS ]; Why.
Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
1 herbert van de sompel CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel
March 2006Introduction to Computational Linguistics 1 CLINT Tokenisation.
Basics of Web Based Computing. The Architecture The user’s system A Web Server What’s inside? Server software Apache or other Resources to be accessible.
Web programming Part 1: HTML 由 NordriDesign 提供
XML The Extensible Markup Language (XML ), which is comparable to SGML and modeled on it, describes how to describe a collection of data. A standard way.
Session 2: Basic HTML HTML Coding Spring 2009 The LIS Web Team Presents.
Presentation On HTML & Podcast Done by: Shamelia Young & Sheriece Williamson.
Text segmentation Amany AlKhayat. Before any real processing is done, text needs to be segmented at least into linguistic units such as words, punctuation,
Module Road Map Assignment Road Map Notice we have linked the conduit directly to the presentation layer. This is normally a bad idea!
Foundations of Statistical NLP Chapter 4. Corpus-Based Work 박 태 원박 태 원.
Week-11 (Lecture-1) Introduction to HTML programming: A web based markup language for web. Ex.
1 XML eXtensible Markup Language. 2 Introduction and Motivation Dr. Praveen Madiraju Modified from Dr.Sagiv’s slides.
Foundations of Statistical NLP Chapter 4. Corpus-Based Work 홍 정 아홍 정 아.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture5 2 August 2007.
Information Retrieval in Practice
What is HTML? Standardized codes Web pages SGML Descriptive markup
Natural Language Processing (NLP)
CS 430: Information Discovery
UNIT 15 Webpage Creator.
Topics in Linguistics ENG 331
ICEweb 2 a new way of compiling high-quality web-based components for ICE corpora Martin Weisser Center for Linguistics & Applied Linguistics, Guangdong.
Inf 722 Information Organisation
Introduction to Text Analysis
Statistical NLP: Lecture 6
Natural Language Processing (NLP)
Information Retrieval and Web Design
Natural Language Processing (NLP)
Presentation transcript:

1 Statistical NLP: Lecture 6 Corpus-Based Work

2 4 Text Corpora are usually big. They also need to be representative samples of the population of interest. 4 Corpus-Based work involves collecting a large number of counts from corpora that need to be access quickly. 4 There exists some software for processing corpora (see useful links on course homepage.

3 Looking at Text I: Low-Level Formatting Issues 4 Junk formatting/Content. Examples: document headers and separators, typesetter codes, table and diagrams, garbled data in the computer file. Also other problems if data was retrieved through OCR (unrecognized words). Often one needs a filter to remove junk content before any processing begins. 4 Uppercase and Lowercase: should we keep the case or not? The The and THE should all be treated the same but “brown” in “George Brown” and “brown dog” should be treated separately.

4 Looking at Text II: Tokenization --What is a Word? 4 An early step of processing is to divide the input text into units called tokens where each is either a word or something else like a number or a punctuation mark. 4 Periods: haplologies or end of sentence? 4 Single apostrophes 4 Hyphenation 4 Homographs --> two lexemes

5 Looking at Text III: Tokenization --What is a Word (Cont’d)? 4 Word Segmentation in other languages: no whitespace ==> words segmentation is hard 4 whitespace not indicating a word break. 4 variant coding of information of a certain semantic type. 4 Speech corpora

6 Morphology 4 Stemming: Strips off affixes and leaves a stem. þ Not that helpful in English (from an IR point of view) which has very little morphology. þ Perhaps more useful in other contexts.

7 Sentences: What is a sentence?” 4 Something ending with a ‘.’, ‘?’ or ‘!’. True in 90% of the cases. þ Sometimes, however, sentences are split up by other punctuation marks or quotes. þ Often, solutions involve heuristic methods. However, these solutions are handcoded. Some effort to automate the sentence- boundary process have also been tried.

8 Marked-Up Data I: Mark-up Schemes 4 Schemes developped to mark up the structure of text 4 Different Mark-up schemes: –COCOA format (older, and rather ad-hoc) –SGML [other related encodings: HTML, TEI, XML]

9 Marked-Up Data II: Grammatical Coding 4 Tagging corresponds to indicating the various conventional parts of speech. Tagging can be done automatically (we will talk about that in Week 9. 4 Different Tag Sets have been used: E.g., Brown Tag Set, Penn Treebank Tag Set 4 The Design of a Tag Set: Target Features versus Predictive Features.