Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006.
Published byModified over 6 years ago
Presentation on theme: "Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006."— Presentation transcript:
Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006
What is NLP? Natural Language Processing (NLP) is a field in Computer Science devoted to creating computers that use natural language as input and/or output.
NLP is AI-complete “The most difficult problems in AI manifest themselves in human language phenomena.” Use of language is the touchstone of intelligent behavior.
NLP Tasks NLP can be stand-alone applications or components embedded in other systems. NLP essentially is to analyze (natural language) sentences, linguistically. It involves (minimally) : 1. Determining the part-of-speech category of words (e.g. noun, verb). 2. Identifying the syntactic structure of a sentence. 3. Deriving the meaning of the sentence.
NLP Applications Speech Recognition Sentence Analysis Parsing for syntactic analysis Word sense disambiguation for semantic analysis Dialogue Systems, e.g. Tutoring systems Question-Answering Systems Machine Translation Automatic Text Summarization (e.g. Newsblaster system at U of Columbia)Newsblaster and much more
Part-Of-Speech (POS) Identifying POS is as easy as one would think – because of ambiguity. “He authored the book.” “he” – DET “author” – N or V “the” – DET “book” – N or V Without knowing the structure (or sometimes the meaning) of the sentence, ambiguous categories cannot be correctly disambiguated.
Stemming is NOT NLP Note that stemming (e.g. Porter stemmer) is a technique in Information Retrieval; it’s NOT NLP. Stemming only removes the common morphological and inflectional endings from words. “police”, “policy”, “policies” => “polic” “went” => “went” Whereas proper NLP analysis yields: “police” => “police”, “policies” => “policy” “went” => “go”
Syntactic Structure Grammar rules (often in CFG) specify the grammatically correct ways to form phrases (and a sentence). Grammar R0: R1: R2: R3: R4: R5: R6: R7: cake"" N the"" Det ate"" V John"" NP V VG NPVG VP N Det NP VP NP S S VP V NP “John”“ate” “the” DetN “cake”
Parsing Algorithms Computational complexity (with no optimization) is exponential. Various parsing algorithms: Top-down Parsing -- (top-down) derivation Bottom-up Parsing Chart Parsing Earley’s Algorithm – most efficient, O(n 3 ) Left-corner Parsing – optimization of Earley’s and lots more…
Semantics Derive the meaning of a sentence. Often applied on the result of syntactic analysis. “John ate the cake.” NP V NP ((action INGEST); syntactic verb (actor JOHN-01); syntactic subj (object FOOD)); syntactic obj To do semantic analysis, we need a (semantic) dictionary (e.g. WordNet, http://www.cogsci.princeton.edu/~wn/ ). http://www.cogsci.princeton.edu/~wn/
NLP is Hard… Understanding natural languages is hard … because of inherent ambiguity Engineering NLP systems is also hard … because of Huge amount of resources needed (e.g. grammar, dictionary, documents to extract statistics from) Computational complexity (intractable) of analyzing a sentence
Ambiguity At last, a computer that understands you like your mother.” Three possible (syntactic) interpretations: 1. It understands you as well as your mother understands you. 2. It understands that you like your mother. 3. It understands you as well as it understands your mother.
Empirical Approaches to NLP Formal analysis of natural languages is too hard and computationally heavy. Abandon complete/correct solutions, and shoot for “approximate” solutions by using cheaper/faster techniques. Collect data from documents (corpus), rather than building linguistic resources manually.
Part-Of-Speech Tagging Typically, from a corpus where words are already tagged with POS categories, collect statistics of n-word sequences (n-gram) => probabilities of POS sequences. For a given untagged/test sentence, assign the most probable POS tag for each word starting from the beginning, based on the POS tags of the preceding words. Popular techniques: Hidden Markov Model (HMM) – with parameter estimation using Viterbi, EM algorithms Transformation Rules (e.g. Brill tagger) Various ML classification algorithms
Probabilistic Parsing For ambiguous sentences, we’d like to know which parse tree is more likely than others. So we must assign probability to each parse tree … but how? A probability of a parse tree t is where r is a rule used in t. and p(r) is obtained from a (annotated) corpus. Again, parameter estimation by using various ML techniques.
Partial (Chunk) Parsing Abandon full parsing; instead aim to obtain just phrases (chunks). Build a flat structure (instead of a hierarchical tree by full parsing). Each chunk is identified by a regular expression – a finite-state automaton. => Polynomial time complexity Then a chunker is a cascade of finite- state automata.
Semantic Analysis Natural language sentences are ambiguous, largely because a word often has several meanings/senses. e.g. “bass” (n) has two senses: 1. a type of fish 2. tones of low frequency Which sense is used? “The bass part of the song is very moving.” “I went fishing for some sea bass.” It’s easy for humans, but not for computers.
Word Sense Disambiguation A task of assigning the proper sense to words in a sentence (thus a classification task). Using a training corpus annotated with proper senses, obtain statistics of n-word sequences (their word senses). Apply classification algorithms, such as: Decision Tree Support Vector Machine Conditional Random Fields (CRF) Difficulty with data sparseness. Techniques: Smoothing Backoff models
Other NLP’ish ML Tasks Text Categorization Classify a document into one of the document categories (See textbook on Naïve Bayes). Multi-lingual Retrieval Enter a query in one language, and retrieve relevant documents of ANY language. Machine Translation part is done by ML. Automatic construction of domain ontology Build a conceptual hierarchy of a specific domain and many, many more!
Analyzing Web Documents Recently there have been many NLP applications which analyze (not just retrieve) web documents Blogs – for semantic analysis, sentiment (polarity/opinion) identification Blogs Email Spam Filtering – but most often systems utilize simple word probability A general approach “Web as a corpus” – web as the vast collection of documents