Introduction to Human Language Technologies
Tomaž Erjavec, Karl-Franzens-Universität Graz
Lecture 1: Overview, 9.11.2007



Overview
1. a few words about me
2. a few words about you
3. introduction to HLT
4. lab work: first steps with Python

Lecturer
- Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
- Work: corpora and other language resources, standards, annotation, text-critical editions
- Web page for this course:
- assessment

Students
- background: field of study
- exposure to
  - linguistics?
  - corpus linguistics?
  - programming?

Overview of the course
1. Introduction
2. Basic processing of text
3. Working with corpora
4. Multilingual applications
5. Lexical semantics
6. …
Lectures + work with NLTK

Computer processing of natural language
- Computational Linguistics: a branch of computer science that attempts to model the cognitive faculty of humans that enables us to produce/understand language
- Natural Language Processing: a subfield of CL, dealing with specific methods to process language
- Human Language Technologies: (the development of) useful programs to process language

Languages and computers
How do computers “understand” language?
- (written) language is, for a computer, merely a sequence of characters (strings)
- words are separated by spaces
- words are separated by spaces or punctuation
- words are separated by spaces or punctuation and space
- [2,3H]dexamethasone, $ , pre- and post-natal, etc.
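The word-separation rules on this slide can be tried out directly. The following is a minimal sketch, assuming Python's standard `re` module; the function names and the sample sentence are illustrative and not part of the lecture. It shows that neither rule handles the problem cases listed above.

```python
import re

def split_on_spaces(text):
    """Rule 1: words are separated by spaces."""
    return text.split()

def split_on_spaces_or_punctuation(text):
    """Rule 2: words are separated by spaces or punctuation."""
    parts = re.split(r"[\s.,;:!?\[\]()\-]+", text)
    return [p for p in parts if p]  # drop empty strings at the edges

# Illustrative sentence built from the slide's problem cases
sample = "He took [2,3H]dexamethasone pre- and post-natal, etc."

print(split_on_spaces(sample))
print(split_on_spaces_or_punctuation(sample))
```

Rule 1 leaves punctuation glued to words ("post-natal,"), while rule 2 breaks the chemical name and the hyphenated forms into pieces. Real tokenisers therefore need exceptions and language-specific knowledge on top of such rules.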

Problems
Languages have properties that humans find easy to process, but are very problematic for computers
- Ambiguity: many words, syntactic constructions, etc. have more than one interpretation
- Vagueness: many linguistic features are left implicit in the text
- Paraphrases: many concepts can be expressed in different ways
Humans use context and background knowledge; both are difficult for computers

- Time flies like an arrow.
- I saw the spy with the binoculars.
- He left the bank at 3 p.m.

The dimensions of the problem
[Figure: three axes (scope of language resources, depth of analysis, application area); labels shown: morphology, syntax, semantics, pragmatics, identification of words]
Many applications require only a shallow level of analysis.

Structuralist and empiricist views on language
The structuralist approach:
- Language is a limited and orderly system based on rules.
- Automatic processing of language is possible with rules
- Rules are written in accordance with language intuition
The empirical approach:
- Language is the sum total of all its manifestations (written and spoken)
- Generalisations are possible only on the basis of large collections of language data, which serve as a sample of the language (corpora)
- Machine Learning: “data-driven automatic inference of rules”

Other names for the two approaches
- rationalism vs. empiricism
- competence vs. performance
- deductive vs. inductive
- Deductive method: from the general to specific; rules are derived from axioms and principles; verification of rules by observations
- Inductive method: from the specific to the general; rules are derived from specific observations; falsification of rules by observations

Empirical approach
- Describing naturally occurring language data
- Objective (reproducible) statements about language
- Quantitative analysis: common patterns in language use
- Creation of robust tools by applying statistical and machine learning approaches to large amounts of language data
- Basis for empirical approach: corpora
- Empirical turn supported by rise in processing speed of computers and their amount of storage, and the revolution in the availability of machine-readable texts (the world-wide web)
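The simplest form of quantitative analysis is counting word frequencies. The sketch below assumes only Python's standard library; the three-sentence "corpus" reuses the ambiguity examples from this lecture purely as an illustration, whereas real work would use a large corpus.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lower-case the text, extract alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Toy "corpus" for illustration only
sample = (
    "Time flies like an arrow. "
    "I saw the spy with the binoculars. "
    "He left the bank at 3 p.m."
)

freq = word_frequencies(sample)
print(freq.most_common(1))  # the single most frequent word and its count
```

The same counting, applied to millions of words, is the basis for the reproducible statements about language use mentioned above.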

The history of Computational Linguistics
- MT, empiricism ( )
- Structuralism: the generative paradigm (70-90)
- Data fights back (80-00)
- A happy marriage?
- The promise of the Web

The early years
- The promise (and need!) for machine translation
- The decade of optimism:
- The spirit is willing but the flesh is weak ≠ The vodka is good but the meat is rotten
- ALPAC report 1966: no further investment in MT research; instead development of machine aids for translators, such as automatic dictionaries, and the continued support of basic research in computational linguistics
- also quantitative language (text/author) investigations

The Generative Paradigm
Noam Chomsky’s Transformational grammar: Syntactic Structures (1957)
Two levels of representation of the structure of sentences:
- an underlying, more abstract form, termed 'deep structure',
- the actual form of the sentence produced, called 'surface structure'.
Deep structure is represented in the form of a hierarchical tree diagram, or "phrase structure tree," depicting the abstract grammatical relationships between the words and phrases within a sentence.
A system of formal rules specifies how deep structures are to be transformed into surface structures.

Phrase structure rules and derivation trees
S → NP V NP
NP → N
NP → Det N
NP → NP that S
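To see what these rules generate, they can be written down as a Python dictionary and expanded mechanically. This is a minimal sketch using only the standard library: the four S and NP rules are from the slide, while the small lexicon (Det, N, V entries) and the depth limit are assumptions added so that the example runs and terminates.

```python
from itertools import product

RULES = {
    # Phrase structure rules from the slide
    "S": [["NP", "V", "NP"]],
    "NP": [["N"], ["Det", "N"], ["NP", "that", "S"]],
    # Hypothetical lexicon, added for illustration
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["saw"]],
}

def expand(symbol, depth):
    """Return every word sequence derivable from `symbol` using
    at most `depth` nested rule applications."""
    if symbol not in RULES:   # a terminal: the word itself
        return [[symbol]]
    if depth == 0:            # budget exhausted, no derivation
        return []
    results = []
    for rhs in RULES[symbol]:
        parts = [expand(s, depth - 1) for s in rhs]
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results

for sentence in expand("S", 3)[:4]:
    print(" ".join(sentence))
```

The depth limit is needed because the rule NP → NP that S is recursive: without a bound, the grammar generates an infinite set of sentences, which is exactly the property the generative paradigm emphasises.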

Characteristics of generative grammar
- Research mostly in syntax, but also phonology, morphology and semantics (as well as language development, cognitive linguistics)
- Cognitive modelling and generative capacity; search for linguistic universals
- Strict formal specifications (at first), but problems of overpermissiveness
- Chomsky’s development: Transformational Grammar (1957, 1964), …, Government and Binding/Principles and Parameters (1981), Minimalism (1995)

Computational linguistics
- Focus in the 70’s is on cognitive simulation (with long-term practical prospects...)
- The applied “branch” of CompLing is called Natural Language Processing
- Initially following Chomsky’s theory + developing efficient methods for parsing
- Early 80’s: unification-based grammars (artificial intelligence, logic programming, constraint satisfaction, inheritance reasoning, object-oriented programming, ...)

Problems
Disadvantages of rule-based (deep-knowledge) systems:
- Coverage (lexicon)
- Robustness (ill-formed input)
- Speed (polynomial complexity)
- Preferences (the problem of ambiguity: “Time flies like an arrow”)
- Applicability? (more useful to know what is the name of a company than to know the deep parse of a sentence)
- EUROTRA and VERBMOBIL: success or disaster?

Back to data
Late 1980’s: applied methods based on data (the decade of “language resources”)
- The increasing role of the lexicon
- (Re)emergence of corpora
90’s: Human language technologies
- Data-driven shallow (knowledge-poor) methods
- Inductive approaches, esp. statistical ones (PoS tagging, collocation identification, Candide)
- Importance of evaluation (resources, methods)
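A data-driven, knowledge-poor method can be very simple. The sketch below illustrates the idea with a most-frequent-tag PoS tagger, assuming only Python's standard library. The hand-tagged training sentences, the tag names, and the default tag for unknown words are all hypothetical; statistical taggers of the period were trained on large annotated corpora and used context as well.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged sample (hypothetical, for illustration only)
TRAINING = [
    [("time", "N"), ("flies", "V"), ("like", "P"), ("an", "Det"), ("arrow", "N")],
    [("fruit", "N"), ("flies", "N"), ("like", "V"), ("a", "Det"), ("banana", "N")],
    [("he", "Pron"), ("flies", "V"), ("like", "P"), ("a", "Det"), ("bird", "N")],
]

def train(tagged_sentences):
    """Learn, for each word, the tag it carries most often in the data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="N"):
    """Tag each word with its most frequent tag; unknown words get `default`."""
    return [(w, model.get(w, default)) for w in words]

model = train(TRAINING)
print(tag(["time", "flies", "like", "an", "arrow"], model))
```

No grammar rule is written by hand here: the preference for "flies" as a verb is induced from the counts. The tagger also mis-tags "fruit flies", which is why evaluation against annotated data matters.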

The new millennium
The emergence of the Web:
- Simple to access, but hard to digest
- Large and getting larger
- Multilinguality
The promise of mobile, ‘invisible’ interfaces; HLT in the role of middle-ware

HLT applications
- Speech technologies
- Machine translation
- Question answering
- Information retrieval and extraction
- Text summarisation
- Text mining
- Dialogue systems
- Multimodal and multimedia systems
- Computer assisted: authoring; language learning; translating; lexicology; language research

HLT applications II.
Corpus tools
- concordance software
- tools for statistical analysis of corpora
- tools for compiling corpora
- tools for aligning corpora
- tools for annotating corpora
Translation tools
- programs for terminology databases
- translation memory programs
- machine translation
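Concordance software shows every occurrence of a word together with its context (keyword in context, KWIC). The following is a minimal sketch, assuming only Python's standard library; the function name, the `width` parameter, and the whitespace tokenisation are illustrative choices, not a description of any particular tool.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every
    occurrence of `keyword`, with up to `width` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

tokens = "I saw the spy with the binoculars".split()
for left, kw, right in concordance(tokens, "the", width=2):
    # Right-align the left context so the keywords line up in a column
    print(f"{left:>15} [{kw}] {right}")
```

Aligning the keyword in one column makes recurring patterns to its left and right easy to spot, which is the main use of a concordance in corpus work.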

HLT research fields
- Phonetics and phonology: speech synthesis and recognition
- Morphology: morphological analysis, part-of-speech tagging, lemmatisation, recognition of unknown words
- Syntax: determining the constituent parts of a sentence (NP, VP) and their syntactic function (Subject, Predicate, Object)
- Semantics: word-sense disambiguation, automatic induction of semantic resources (thesauri, ontologies)
- Multilingual technologies: extracting translation equivalents from corpora, machine translation
- Internet: information extraction, text mining, advanced search engines

Processes, methods, and resources
The Oxford Handbook of Computational Linguistics, Ruslan Mitkov (ed.)
- Text-to-Speech Synthesis
- Speech Recognition
- Text Segmentation
- Part-of-Speech Tagging and lemmatisation
- Parsing
- Word-Sense Disambiguation
- Anaphora Resolution
- Natural Language Generation
- Finite-State Technology
- Statistical Methods
- Machine Learning
- Lexical Knowledge Acquisition
- Evaluation
- Sublanguages and Controlled Languages
- Corpora
- Ontologies

Further reading
- Language Technology World
- The Association for Computational Linguistics (cf. Resources)
- Interactive Online CL Demos
- Natural Language Processing – course materials