RPI: Professor Jim Hendler, Simon Ellis, Kate McGuire, Nicole Negedly, Avi Weinstock, Matt Klawonn, Jenn Chan, Sarabeth Jaffe. Watson.

1 Watson @ RPI
Professor Jim Hendler, Simon Ellis, Kate McGuire, Nicole Negedly, Avi Weinstock, Matt Klawonn, Jenn Chan, Sarabeth Jaffe
Watson Technologies and Open Architecture Question Answering
Inside DeepQA: Managing complex unstructured data with UIMA
Simon Ellis, 22nd November, 2013

2 Watson @ RPI: Introduction

3 IBM Watson

4 Watson is…
- … a piece of software that will run on your laptop
  - Though very slowly
  - Specialised hardware and control platform
- … an implementation of the DeepQA concept
- … the first iteration of the cognitive computing platform
- … a very clever artificial intelligence
  - A very clever application of human intelligence

5 Inside Watson
[Figure: Watson pipeline as published by IBM; see IBM J Res & Dev 56 (3/4), May/July 2012, p. 15:2]

6 Watson @ RPI: Question Analysis (Nicole Negedly)

7 Question Analysis

8 Question analysis
- What is the question asking for?
- Which terms in the question refer to the answer?
- Given any natural language question, how can Watson accurately discover this information?
Example question: Who is the president of Rensselaer Polytechnic Institute?
Question Analysis yields:
- Focus Terms: Who, president of Rensselaer Polytechnic Institute
- Answer Types: Person, President

9 Parsing and semantic analysis
- What information about a previously unseen piece of English text can Watson determine?
- How is this information useful?
Natural Language Parsing:
- grammatical structure
- parts of speech
- relationships between words
- ...etc.
Semantic Analysis:
- meanings of words, phrases, etc.
- synonyms, entailment
- hypernyms, hyponyms
- ...etc.

10 Parsing
- Stanford's NLP toolset is used

11 Semantic relations in WordNet
- Princeton University's WordNet
- Words are grouped into sets of synonyms called synsets
- Relationships exist between noun synsets
  - hypernym/hyponym: type-of relation, e.g. "canine" is a hypernym of "dog"
  - holonym/meronym: part-of relation, e.g. "building" is a holonym of "window"
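
The type-of relation between synsets can be sketched as a small graph walk. This is a toy model written for illustration, not the WordNet API: the class `ToyHypernyms`, its methods, and the hand-entered links are all assumptions, and real WordNet chains are longer (and a synset can have more than one hypernym).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy hypernym graph; NOT the WordNet API. All names are illustrative.
class ToyHypernyms {
    // Maps a synset label to its direct hypernym (its type-of parent).
    private final Map<String, String> parent = new HashMap<>();

    void addHypernym(String hyponym, String hypernym) {
        parent.put(hyponym, hypernym);
    }

    // True if 'ancestor' is 'word' itself or is reachable via type-of links.
    boolean isA(String word, String ancestor) {
        Set<String> seen = new HashSet<>(); // guards against accidental cycles
        String current = word;
        while (current != null && seen.add(current)) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parent.get(current);
        }
        return false;
    }

    // Hand-entered, simplified links used only for this example.
    static ToyHypernyms example() {
        ToyHypernyms g = new ToyHypernyms();
        g.addHypernym("dog", "canine");
        g.addHypernym("canine", "carnivore");
        g.addHypernym("president", "person");
        return g;
    }
}
```

A check such as `ToyHypernyms.example().isA("president", "person")` is the kind of test that could relate a candidate answer to an answer type like Person.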

12 How is this useful?
- This information can be used to understand a question
- Current Question Analysis work with RPI's version of Watson
  - Creating and training machine learning classifiers
[Diagram labels: Parse Trees; Dependency Relations; Coreferences; Named Entities; Semantic Relations; Classifiers; Manually Annotated Questions; New Question; Critical Elements of Question]

13 Question analysis pipeline
Unstructured Question Text -> Parsing & Semantic Analysis -> Machine Learning Classifiers -> Structured Annotations of Question (focus, answer types, useful search queries)

14 Watson @ RPI: Candidate Generation (Kate McGuire)

15 Search Result Processing and Candidate Generation

16 Primary Search
- Primary Search generates the corpus of information from which we take candidate answers, passages, supporting evidence, and essentially all textual input to the system.
- It formulates queries based on the results of Question Analysis.
- These queries are passed to a search engine, which returns a fixed number of highly relevant documents and their ranks.

17 Search Result Processing
- Search Result Processing restructures the information in the document so it is useful.
- HTML tags are cleaned from the document
- Passage Retrieval/Chunking
  - Breaks the document down into smaller pieces
  - Adds information, such as the HTML text, length, place in the document, etc.
- Passage Parsing
  - Parse trees are formed for each passage
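
The cleaning and chunking steps can be sketched as below. This is a minimal illustration under stated assumptions, not the RPI implementation: `ToyPassageChunker` and `Passage` are hypothetical names, tags are stripped with a crude regular expression (real HTML needs a proper parser), and passages are simply sentence-like pieces split on end punctuation.

```java
import java.util.ArrayList;
import java.util.List;

class ToyPassageChunker {
    // Hypothetical passage record: the text plus its place in the document.
    static final class Passage {
        final int index;
        final String text;

        Passage(int index, String text) {
            this.index = index;
            this.text = text;
        }

        int length() {
            return text.length();
        }
    }

    // Crude tag stripping: replace each tag with a space, then tidy whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Split the cleaned text into sentence-like passages after '.', '!' or '?'.
    static List<Passage> chunk(String html) {
        List<Passage> passages = new ArrayList<>();
        for (String piece : stripTags(html).split("(?<=[.!?])\\s+")) {
            if (!piece.isEmpty()) {
                passages.add(new Passage(passages.size(), piece));
            }
        }
        return passages;
    }
}
```

Each `Passage` would then be handed to a parser to build its parse tree; that step is left out here.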

18 Candidate Generation
- Candidate Generation casts a wide net of possible answers for the question from each document.
- Using each document, and the passages created by Search Result Processing, we generate candidates using three techniques:
  - Title of Document (T.O.D.): Adds the title of the document as a candidate.
  - Wikipedia Title Candidate Generation: Adds any noun phrases within the document's passage texts that are also the titles of Wikipedia articles.
  - Anchor Text Candidate Generation: Adds candidates based on the hyperlinks and metadata within the document.
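
Wikipedia Title Candidate Generation can be approximated as a lookup of passage phrases in a set of known titles. The sketch below makes two simplifying assumptions that are not in the slides: word n-grams stand in for noun phrases (no parser is used), and the title set is a small in-memory `Set` rather than a real Wikipedia index. `ToyCandidateGenerator` is a hypothetical name.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class ToyCandidateGenerator {
    // Returns every word n-gram (1..maxWords words) of the passage found in 'titles'.
    static Set<String> titleCandidates(String passage, Set<String> titles, int maxWords) {
        // Drop punctuation so that "Institute." matches the title "Institute".
        String[] words = passage.replaceAll("[^\\p{L}\\p{N}\\s]", "").trim().split("\\s+");
        Set<String> found = new LinkedHashSet<>(); // keeps order of first appearance
        for (int start = 0; start < words.length; start++) {
            StringBuilder gram = new StringBuilder();
            for (int len = 0; len < maxWords && start + len < words.length; len++) {
                if (len > 0) {
                    gram.append(' ');
                }
                gram.append(words[start + len]);
                if (titles.contains(gram.toString())) {
                    found.add(gram.toString());
                }
            }
        }
        return found;
    }
}
```

For the passage "Shirley Ann Jackson is the president of Rensselaer Polytechnic Institute." and a title set containing both names, both would come back as candidates; scoring then decides which one answers the question.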

19 Search Result Processing and Candidate Generation

20 Watson @ RPI: Scoring & Ranking (Matt Klawonn)

21 Scoring & Ranking

22 Scoring
- Analyzes how well a candidate answer relates to the question
- Two basic types of scoring algorithm
  - Context-independent scoring
  - Context-dependent scoring

23 Types of scorers
- Context-independent
  - Question Analysis
  - Ontologies (DBpedia, YAGO, etc.)
  - Reasoning
- Context-dependent
  - Analyzes natural language that candidates appear in
  - Relies on passages found during search

24 Scorers
- Examples of scorers include
  - Passage Term Match
  - Textual Alignment
  - Skip-Bigram
- Each of these scores supportive evidence
- Scores are then merged to produce a single candidate score
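
Two of the context-dependent scorers named above, and the merging step, can be sketched on plain text. Everything here is an assumption made for illustration: `ToyScorers` is a hypothetical class, the skip-bigram scorer works on surface word order rather than on parse structure, and the merge is a simple weighted average with placeholder weights rather than a trained model.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ToyScorers {
    // Lower-cased word tokens, split on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Passage term match: fraction of distinct question terms found in the passage.
    static double passageTermMatch(String question, String passage) {
        Set<String> q = new HashSet<>(tokens(question));
        if (q.isEmpty()) {
            return 0.0;
        }
        Set<String> p = new HashSet<>(tokens(passage));
        int hits = 0;
        for (String term : q) {
            if (p.contains(term)) {
                hits++;
            }
        }
        return (double) hits / q.size();
    }

    // Ordered word pairs with at most maxSkip words between the two members.
    static Set<String> skipBigrams(String text, int maxSkip) {
        List<String> t = tokens(text);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i < t.size(); i++) {
            for (int j = i + 1; j < t.size() && j - i - 1 <= maxSkip; j++) {
                grams.add(t.get(i) + " " + t.get(j));
            }
        }
        return grams;
    }

    // Fraction of the question's skip-bigrams that also occur in the passage.
    static double skipBigramScore(String question, String passage, int maxSkip) {
        Set<String> q = skipBigrams(question, maxSkip);
        if (q.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(q);
        shared.retainAll(skipBigrams(passage, maxSkip));
        return (double) shared.size() / q.size();
    }

    // Weighted average of scorer outputs; the weights are placeholders.
    static double merge(double[] scores, double[] weights) {
        double sum = 0.0;
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            sum += scores[i] * weights[i];
            total += weights[i];
        }
        return total == 0.0 ? 0.0 : sum / total;
    }
}
```

Keeping each scorer a pure function of (question, passage) makes it easy to add further scorers and feed all their outputs to `merge`.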

25 Inside Watson
[Figure: Watson pipeline as published by IBM; see IBM J Res & Dev 56 (3/4), May/July 2012, p. 15:2]

26 Watson @ RPI: The Tao of UIMA (Simon Ellis)

27 UIMA
- Unstructured Information Management Architecture
- A platform for the analysis of unstructured information and its integration with search technologies
- Permits multi-modal analysis of collections or archives

28 UIMA
[Figure; source: http://uima.apache.org/d/uimaj-2.4.0/]

29 Unstructured information
- The most rapidly-growing source of information in existence
  - The internet
  - Print media
  - Video recordings
  - Audio recordings
  - ...
- "Unstructured information is just information that doesn't have the kind of structure you need it to have for what you're doing." [Peter Fox, X-Informatics class]

30 UIMA (again)
The UIMA platform can be thought of in four ways:
- A specification for component interfaces for, and in, an analytics pipeline
- A specification of certain design patterns for that pipeline
- An outline of 2 data representations: in-memory annotations for local analysis and XML representation for remote web integration
- An outline for possible development roles allowing tools to be used by users with a wide range of skills

31 CAS
- Common Analysis Structure (CAS)
- Object-based structure
  - Allows representation of objects, properties and values
  - Stores arbitrary data structures
    - Annotations
- Types
  - Object types may be related by single-inheritance
- Contains document being analysed, either physically or logically
- Results of analysis are shared and recorded in a CAS
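
The ideas on this slide (a shared document, typed annotations over it, and single inheritance between types) can be modelled in a few lines. This is a toy model for illustration only; `ToyCas`, `Type` and `Annotation` are hypothetical names and do not reproduce the Apache UIMA CAS API.

```java
import java.util.ArrayList;
import java.util.List;

class ToyCas {
    // A type with at most one supertype (single inheritance).
    static final class Type {
        final String name;
        final Type parent;

        Type(String name, Type parent) {
            this.name = name;
            this.parent = parent;
        }

        // True if this type is 'other' or inherits from it.
        boolean isSubtypeOf(Type other) {
            for (Type t = this; t != null; t = t.parent) {
                if (t == other) {
                    return true;
                }
            }
            return false;
        }
    }

    // An annotation marks the span [begin, end) of the document text.
    static final class Annotation {
        final Type type;
        final int begin;
        final int end;

        Annotation(Type type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    final String documentText;
    private final List<Annotation> annotations = new ArrayList<>();

    ToyCas(String documentText) {
        this.documentText = documentText;
    }

    // Analysis results are recorded in the CAS so later components can share them.
    void add(Annotation a) {
        annotations.add(a);
    }

    String coveredText(Annotation a) {
        return documentText.substring(a.begin, a.end);
    }

    // All annotations whose type is 'type' or one of its subtypes.
    List<Annotation> select(Type type) {
        List<Annotation> out = new ArrayList<>();
        for (Annotation a : annotations) {
            if (a.type.isSubtypeOf(type)) {
                out.add(a);
            }
        }
        return out;
    }
}
```

With a `Person` type declared as a subtype of `Entity`, selecting `Entity` returns person annotations too, which is the practical benefit of the inheritance relation.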

32 Annotator
- Core UIMA component type
- Contains analysis algorithms designed to work on data contained in a CAS
  - Original document
  - Annotation
  - Search evidence
  - Candidate score
  - ...
- Annotators form the building blocks of Analysis Engines

33 Analysis Engine
- Building blocks of a UIMA pipeline
- Section of code containing 1 or more annotators
- Analyses source document(s) and provides analysis results
  - Results typically represent metadata about the source
- Analysis Engines are effectively software agents that discover and record metadata

34 Example
[Figure; source: http://uima.apache.org/d/uimaj-2.4.0/]

35 Sofas and CAS Views
- Sofa
  - Subject of Analysis
  - A piece of data intended for analysis by UIMA components
- CAS View
  - A section of a CAS dedicated to one Sofa
  - Shares the same name as its Sofa
  - May be dynamically created as needed by applications or AEs
- Each Sofa permits a different perspective of an artefact

36 Example
Dr Shirley Ann Jackson
- Teacher of physics
- President, RPI
- Researcher at Bell Labs
- IBM Board of Directors
- Chairman, USNRC

37 Descriptors
- All components consist of two parts
  - Code
  - Descriptor (declaration)
- Functions of the descriptor
  - Contains metadata about the code block
    - Name
    - Structure
    - Behaviour
  - Used in component discovery, reuse, and tool composition

38 UIMA (again, again)
- Highly reliant on XML
  - Flexible
  - Extensible
- XML...
  - ... describes components and their behaviour
  - ... controls data (CAS) flow through the pipeline
  - ... is used to create larger components from subcomponents
    - Aggregate Analysis Engines

39 Aggregate Analysis Engine
- A complex analysis engine made up of other components
  - May contain simple AEs or other AAEs
  - Components further down the pipeline may rely on all output produced before them
- Performs a larger, complete task, e.g. named entity recognition:
  - language detection and tokenisation
  - part-of-speech detection
  - deep grammatical parsing
  - named entity recognition
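
The composition idea can be sketched with a composite pattern: an aggregate is itself an engine that runs its delegates in order over shared results. This is an illustrative toy, not UIMA: in UIMA the aggregation is declared in an XML descriptor, whereas here it is wired up in code, a plain `Map` stands in for the CAS, and `ToyAggregate`, `Engine` and `Aggregate` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyAggregate {
    // Minimal stand-in for an analysis engine: reads and extends shared results.
    interface Engine {
        void process(Map<String, Object> cas);
    }

    // An aggregate is itself an Engine, so aggregates can nest inside aggregates.
    static final class Aggregate implements Engine {
        private final List<Engine> delegates;

        Aggregate(Engine... delegates) {
            this.delegates = Arrays.asList(delegates);
        }

        @Override
        public void process(Map<String, Object> cas) {
            for (Engine e : delegates) {
                e.process(cas); // later engines see everything earlier ones wrote
            }
        }
    }

    // Runs a two-step pipeline: tokenise, then collect capitalised tokens.
    static Map<String, Object> run(String text) {
        Engine tokeniser = c -> c.put("tokens", Arrays.asList(((String) c.get("text")).split("\\s+")));
        Engine capitalised = c -> {
            List<String> names = new ArrayList<>();
            for (Object tok : (List<?>) c.get("tokens")) {
                String s = (String) tok;
                if (!s.isEmpty() && Character.isUpperCase(s.charAt(0))) {
                    names.add(s);
                }
            }
            c.put("capitalised", names);
        };
        Map<String, Object> cas = new HashMap<>();
        cas.put("text", text);
        new Aggregate(tokeniser, capitalised).process(cas);
        return cas;
    }
}
```

The second engine depends on the first engine's output, which mirrors the point that components further down the pipeline rely on earlier results.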

40 CAS Multiplier
- Creates 0 or more new CAS objects from an input CAS
- May be used to duplicate or merge CAS objects, e.g.
  - ... creating alternative versions of an input Sofa
  - ... breaking a large input CAS into multiple smaller pieces
  - ... aggregating multiple input CAS into a single output
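
The "one input, zero or more outputs" contract can be sketched as a small splitter with a process/hasNext/next shape. This is a toy written for illustration: `ToySplitter` is a hypothetical class, strings stand in for CAS objects, and splitting by a fixed character count is an arbitrary choice.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ToySplitter {
    private final int maxChars;
    private final Deque<String> pending = new ArrayDeque<>();

    ToySplitter(int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.maxChars = maxChars;
    }

    // Accept one input and queue zero or more smaller outputs.
    void process(String input) {
        for (int i = 0; i < input.length(); i += maxChars) {
            pending.add(input.substring(i, Math.min(input.length(), i + maxChars)));
        }
    }

    boolean hasNext() {
        return !pending.isEmpty();
    }

    String next() {
        return pending.remove();
    }

    // Convenience wrapper that drains all outputs for one input.
    static List<String> split(String input, int maxChars) {
        ToySplitter splitter = new ToySplitter(maxChars);
        splitter.process(input);
        List<String> out = new ArrayList<>();
        while (splitter.hasNext()) {
            out.add(splitter.next());
        }
        return out;
    }
}
```

An empty input yields no outputs at all, which is the "0 or more" case from the slide.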

41 Inside Watson
[Figure: Watson pipeline as published by IBM; see IBM J Res & Dev 56 (3/4), May/July 2012, p. 15:2]

42 UIMA, once more
- UIMA runs in the Java Runtime Environment
- Uses XML code to run system
  - UIMA framework reads XML dynamically and creates objects using them
  - Only the UIMA framework itself is compiled
- SO HOW DOES IT WORK?

43 How it works
- Abstract class prototyping
  - UIMA Framework objects are usually derived from a base class
- Function signature
  - UIMA Framework objects each have certain functions which can or must be overridden
    - initialize()
    - process()
- This ensures all classes are of known supertypes and have a recognisable function signature for all key functions
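
The pattern described here, a base class with one optional and one mandatory override, might look like the sketch below. The class names `ToyAnnotatorBase` and `ToyAcronymAnnotator`, the `minLength` parameter, and the method signatures are assumptions made for this example; they are not the signatures of UIMA's own base classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical base class: a framework can rely on every component having these methods.
abstract class ToyAnnotatorBase {
    protected Map<String, String> params = new HashMap<>();

    // May be overridden: called once with configuration before any processing.
    public void initialize(Map<String, String> params) {
        this.params = params;
    }

    // Must be overridden: called once per document.
    public abstract List<String> process(String document);
}

// A concrete component: reports all-uppercase words such as "IBM".
class ToyAcronymAnnotator extends ToyAnnotatorBase {
    private int minLength = 2;

    @Override
    public void initialize(Map<String, String> params) {
        super.initialize(params);
        minLength = Integer.parseInt(params.getOrDefault("minLength", "2"));
    }

    @Override
    public List<String> process(String document) {
        List<String> found = new ArrayList<>();
        for (String word : document.split("\\W+")) {
            if (word.length() >= minLength && word.matches("[A-Z]+")) {
                found.add(word);
            }
        }
        return found;
    }
}
```

Because callers only ever hold a `ToyAnnotatorBase` reference, a framework can drive any component the same way without knowing its concrete class at compile time.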

44 How it works
- Reflection
  - The ability of a computer program to examine and modify the structure and behavior (specifically the values, meta-data, properties and functions) of an object at runtime.
- XML descriptors define the nature of objects
  - class name
  - constructor parameters
  - ...
- UIMA dynamically creates objects using reflection

45 The magic code
    import java.lang.reflect.Constructor;

    // Declare a variable of the annotator type we want
    JCasAnnotator ann = null;
    // Look up the class by its fully qualified name (as named in the XML descriptor)
    Class<?> annClass = Class.forName("com.ibm.tutorial.tycor");
    // Get the public no-argument constructor of that class
    Constructor<?> cons = annClass.getConstructor();
    // newInstance() returns Object, so cast to the expected supertype
    ann = (JCasAnnotator) cons.newInstance();
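
The same reflective steps can be tried without UIMA on the classpath. The sketch below uses only standard `java.lang.reflect` calls; `ToyReflectiveFactory` is a hypothetical helper, and `java.util.ArrayList` stands in for an annotator class whose name would normally be read from a descriptor.

```java
import java.lang.reflect.Constructor;

class ToyReflectiveFactory {
    // Instantiates 'className' through its public no-argument constructor and
    // checks that the result is of the expected supertype.
    static <T> T create(String className, Class<T> expectedType) {
        try {
            Class<?> cls = Class.forName(className);       // load the class by name
            Constructor<?> cons = cls.getConstructor();    // public no-arg constructor
            Object instance = cons.newInstance();          // create the object
            return expectedType.cast(instance);            // ClassCastException if wrong type
        } catch (ReflectiveOperationException e) {
            // Covers class-not-found, missing constructor, and failed instantiation.
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }
}
```

For example, `ToyReflectiveFactory.create("java.util.ArrayList", java.util.List.class)` returns a new list even though the calling code never names `ArrayList` as a type. Wrapping the checked reflection exceptions in an unchecked one is a design choice for brevity here, not something the slides prescribe.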

46 UIMA, finally
- Effectively an interpreter for code scripted in XML and Java
- Component-oriented design makes scaling easy
  - BlueJ (Jeopardy! hardware) had over 2,000 cores (2,880 POWER7 cores)
- Most easily written in Java
  - Java runs in the Java Runtime Environment
  - Dynamic typing & reflection are therefore possible
  - Could not have been written in C++08
- An OS for multimodal, unstructured information management

47 Watson @ RPI: Questions & Answers

