UNIVERSITY OF NAIROBI SCHOOL OF COMPUTING AND INFORMATICS

Natural Language Access to Relational Databases: An Ontology Concept Mapping (OCM) Approach
Lawrence Muchemi Githiari, P80/80034/2008
Supervisor: Dr. Wanjiku Ng’ang’a
‘Viva’ for the Degree of Doctor of Philosophy in Computer Science
©2014

NLQ PROCESSING SOLUTION
A social need: NL querying of RDBs by casual users.
Target properties of the NLQ processing solution: language independence, domain independence, generic RDB access, cross-lingual support.
25 December 2017, PhD Presentation by Lawrence Muchemi

Focus of Research
Relational database access using NL. It is an active basic research issue under HLT, an ‘information access problem’ in NLP.
Previous studies have tended to concentrate on web text sources (semantic web), pre-populated ontologies (e.g. the GATE ontology, Sheffield University) or Prolog DBs.
Specific challenges tackled:
  Domain independence
  Language independence
  Cross-linguality
  Robustness (of inputs)

Problem Statement
The unresolved issue of NLQ processing for RDB access is the main research problem addressed.
Main challenge: lack of a language- and domain-independent (generic) approach that maps any given NL to a structured query language.
Most DBs have to grapple with the issue of cross-lingual interaction; this problem is also addressed.
Other challenges addressed include the lack of generic models that:
  define the processes of concept discovery from NLQ free text;
  guide the retrieval of concepts from RDB metadata;
  specify mapping algorithms, SQ generator functions and other heuristics.

Objectives
The main objective: bring forth a novel approach and an architecture devoid of language and domain dependence, while also addressing other challenges within the state of the art (noted in the problem statement).
Specific objectives:
  Develop a suitable language- and domain-independent methodology for understanding unrestrained NL text.
  Design an architectural model, and algorithms thereof, that facilitate access to data in DBs, using English and Kiswahili as case-study languages.
  Design algorithms for parsing free NL text and a data structure for holding the parsed queries.
  Design algorithms for extracting concepts from ontologies.
  Design a matching function.
  Design SPaRQL generator functions.
  Develop a prototype upon which performance evaluations can be done.

Significance of Research
Beneficiaries: researchers, developers and users.
  The solution is an important intermediate step in speech processing for RDB access.
  A gold-standard models-evaluation framework.
  A generic design, devoid of language and domain considerations.
  Users: more direct interaction with casual users. “There is a renewed interest in catering for casual user NL text interaction” (Kaufmann & Bernstein, 2007).
Contribution to the NLDBA field: a novel language- and domain-independent approach upon which NL interfaces to DBs can be built.

I. Semantic Parsing Approaches to SQL Generation
Review of various schools of thought on structured-query generation:
  1. Semantic parsing approaches to SQL generation
  2. Logic mapping
  3. Ontology-based approaches to related problems
I. Semantic parsing approaches to SQL generation
Approaches to MR generation: probabilistic; machine learning; statistical.
Commonly cited grammars for MR:
  Definite Clause Grammar (DCG), example: atieno, (λx.λy.loves(y, x)), kamau
  Combinatory Categorial Grammar (CCG), example: atieno → NP; loves → (S\NP)/NP; kamau → NP
  Synchronous Context Free Grammar (SyncCFG) (Chiang, 2006)
The grammars above may be augmented with the lambda (λ) notation.
These approaches use two back-to-back classifiers.

II. Logic Mapping Approaches
Logic mappers:
  Token matching (tagged tokens), e.g. Dittenbach & Berger (2003) and Popescu et al. (2003); finite state transduction (Garcia et al., 2008)
  Phrase-trees mapping: Interlingua (Shin & Chu, 1998); phrase-trees mapping using templates (TTM) (Muchemi, 2008)
  Syntactic-trees mapping: NL/SQL syntactic trees mapping with SVM-based learners (Giordani & Moschitti, 2010)
III. Related Ontology-based Approaches
  No reported literature on relational database access.
  Access to other repositories (e.g. semantic web) relies on bag-of-words with direct token mappings to an ontology, e.g. Querix (Kaufmann et al., 2006) and QuestIO (Tablan et al., 2008).

Notable Issues with Above Approaches
Machine learning:
  Challenges of portability (moving across domains): “… parser has to be trained on a corpus of questions specific to a db…. making portability a big issue” (Popescu et al., 2003).
  Cost of building a corpus: “…weakness of the approach is the cost of training corpus of natural language/ logical expression pairs” (Minock et al., 2008).
  Low accuracy due to the back-to-back arrangement of classifiers.
Mapping techniques:
  Errors from automatic tagging (the same as in semantic labelling, such as poor classification); manual tagging is costly.
Ontology-based techniques:
  Generic RDB access is not well studied, hence the purpose of this research work.
Cross-cutting:
  Language and domain dependence, resulting in relatively poor performance.

Trends in Approach to NL Access Problem
LR: correlation between the degree of structuredness of the repository and the preferred approach.
[Figure: trend over time of the preferred approach against repository structuredness, e.g. plain texts and HTML text, named entity entries, semantic web data]
From the above trends, it is deduced that the direction in which a generic NLDBA model should be sought is ontology concept mapping.
The power of ontologies lies in their capacity to provide context for semantics within the Resource Description Framework (RDF).

The OCM Conceptual Model
From the analysis of literature, the following OCM conceptual model was designed.
[Figure: graphic image of the conceptual model]

Issues that Needed to be Tackled Before Realization of the OCM model
1. ‘Concepts’ modeling and discovery methods
  Decoding of schema data (no controlled vocabulary).
  Design of language- and domain-independent ‘ontology concepts deducing algorithms’.
  ‘Explicit’ and ‘implicit’ concept discovery models.

Issues that Needed to be Tackled Before Realization of the OCM model
2. Cross-lingual access
  Design of schemata: ‘Feature Space Model (FSM) and Gazetteer Model’.
Mapping algorithm
  Need to enhance the current state of the art, the lexical-level, keyword-based matching method (LLKM) (Punyakanok, Roth, & Yin, 2004).
Structured query generator
  The query generator’s task is to organize ‘concepts’ into a structured query.

Research Design… Double Water Fall Strategy
Research design (RD): the overall strategy that integrates the different components of the study in a coherent and logical way, thereby addressing the overall research problem.
[Figure: double waterfall with two streams, an NLQ concepts study and a DB concepts study, feeding a common design stream]
NLQ concepts study: modeling of NLQ concepts; modeling of the query semantics transfer process (NLQ → DSF → SPaRQL); concepts discovery process; feature space modeling (FSM); design of NLP components.
DB concepts study: modeling of RDB concepts; deciphering meanings from schema data; modelling ‘concepts re-construction’; schema design (gazetteer); design of ontology processing components.
Common stream: design of common processing components (concepts mapping algorithm, structured query generator function); design of main algorithms and heuristics; assembly of components to form the OCM-based architectural model; architectural design; development of prototype; evaluation and benchmarking.

Research Methodology (1): Concepts Modeling
Stages: RDB concepts modeling; NLQ concepts modeling; NLP components design; ontology processing components design; join processing components design; architecture design; prototype development; evaluation and benchmarking.
Method: five case studies (two with primary data, three with secondary data).
The 5-point case study research design strategy (Yin, 1994) was used:
  1. Research questions
  2. Make propositions
  3. Establish analysis rigor
  4. Linking data to propositions
  5. Criteria for interpretation
Implementation protocol adopted: one developed at MIT (Zucker, 2009).
Sampling: stratified random sampling approach.
Kernelization technique used in query decomposition.

Research Methodology (continued)
2. Validation of the query semantics transfer model. Method: simulations in test bed; quantitative and qualitative.
3. Method: case studies; data collected from questionnaires and internet-based surveys.
4. Method: simulations in test bed. Evaluation (efficacy of the algorithm): experimental, 6 test databases.
5. Method: simulations in test bed. Evaluation: quantitative, for the algorithm, function and other heuristics; the experiments used sampled queries. Discussed in the evaluation segment.

Research Questions Guiding NLQ Studies
Generative-Transformation (GT) Theory: all languages have the same Deep Structure Form (for similar sentences), but their respective Surface Structure Forms (SSF) differ because of the application of different transformation rules. Noam Chomsky (1957).
DSF = simple, assertive, declarative and active.
This study is based on GT theory but concentrates on query semantics transfer in NL queries (as opposed to sentence transformations).
Research questions:
  Can the deep structure form (DSF) of queries be used in deducing the interrogative properties of NL queries?
  What type of relationships exist between DSF and SPaRQL queries, and are they language and domain independent?
  Are the processes for conversion of SSF to DSF in NL queries language and domain independent?

Case 1 & 2 Kiswahili & English Queries
Pre-study survey (group of farmers): are they potential users of an NLQ-DB access system? Respondents were regular users of veterinary services at commercial scale.
Case 1, Kiswahili queries case study:
  Solicited potential queries (for use in a simulated DB).
  25 information request areas per questionnaire.
  Purposive sampling method; sample sizes were determined on the basis of ‘theoretical saturation’.
  625 questions were collected.
Case 2, English queries case study:
  Queries were contained in a web interface maintained by the UoN MSc coordinator.
  The data collected was from the domain of students’ management (providing a domain variation from case 1).
  The queries collected were in English (providing a language contrast to case 1).
  310 questions were collected.

Other Query Sets Used
Bench-marking data (query set; number of questions; description; original source):
  1. CASE 1: Farmers Queries; 625; poultry farmers’ queries, Kiswahili; Muchemi (2008).
  2. CASE 2: UoN MSc Coordinator; 310; questions by UoN MSc students to the coordinator, English; Coordinator.
  3. CASE 3: ELF Queries to MS Northwind DB; 120; originally created by Bootra to evaluate ELF on the Microsoft Northwind DB at Virginia Commonwealth University, English; Bootra (2004).
  4. CASE 4: Computer Jobs; 500; database and queries for computer jobs, used originally by Tang under Ray Mooney for PhD work at the University of Texas at Austin, English; Tang & Mooney (2001).
  5. CASE 5: Restaurant; 250; same as above but for restaurant selection, English.
  Total: 1805 questions.

Findings 1: Prevalence of Transformation Rules
Example of a formal transformation rule:
  ‘Was a lot of water taken by the chicken?’ → ‘Chicken took a lot of water’
  Aux - NP2 - V - NP1 → NP1 - V - NP2
1. DAT = Agent Deletion, e.g. ‘[Kuku] inataga kwa mda?’ → ‘inataga kwa mda gani?’
2. PT = Passive (from active to passive, deep to surface), e.g. ‘Je, vifaranga walikula chakula?’ → ‘Je, chakula kililiwa na vifaranga?’
3. DET = Deletion of Elements (eliminates excessive words), e.g. ‘Jimbi na vifaranga walikula chakula?’ → ‘Jimbi walikula chakula na vifaranga walikula chakula?’
4. IT = Imperative Transformation (command), e.g. ‘nipe wanunuzi bora’ → ‘Wanunuzi bora’
5. CT = Coordination (two sentences are combined into one at the surface), e.g. ‘Kuku walikula chakula kimeoza’ / ‘Kuku walihara’ → ‘Kuku walikula chakula kimeoza kisha wakahara’
6. AET = Addition of Elements (adds information such as ADJ and ADV), e.g. ‘kuku amehara’ → ‘kuku mweupe amehara’
7. NT = Negation Transformation, e.g. ‘kuku amehara’ → ‘kuku hajahara’
Conclusion: there are 7 most prevalent generative transformation rules.

Findings 2: Mapping NLQ Semantics → DSF Semantics
Sampling: stratified random sampling approach.
  Each population (query set) was divided into 12 strata, based on query type (e.g. ‘who’, ‘when’, ‘what’).
  The size of each stratum was determined by the frequency of each query type.
  50 query samples were obtained from each of the 5 populations, for a total of 250 queries.
  The questions for each query type (forming the strata) were randomly selected from the original population.
Semantics transfer analysis: a 7-step ‘kernelization’ method for identification of meaning bearing components (MBCs).
Conclusion: S-V-O terms and modifiers (adjectives, adverbs etc.) are critical components in the transfer process.
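The sampling step described on this slide can be sketched as follows. This is a minimal illustration, not the thesis code: the function name `stratified_sample`, the `(query_type, text)` input shape and the largest-remainder rounding are all assumptions introduced here, and the sample size is assumed not to exceed the population size.

```python
import random
from collections import defaultdict

def stratified_sample(queries, sample_size, seed=0):
    """Proportional stratified random sample.

    queries: list of (query_type, text) pairs; the query type is the stratum.
    Returns {query_type: [sampled texts]} with sample_size items in total.
    Assumes sample_size <= len(queries).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for qtype, text in queries:
        strata[qtype].append(text)

    total = len(queries)
    # Ideal (fractional) quota per stratum, proportional to its frequency.
    exact = {t: sample_size * len(qs) / total for t, qs in strata.items()}
    quota = {t: int(e) for t, e in exact.items()}
    # Largest-remainder rounding so the quotas sum exactly to sample_size.
    leftover = sample_size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - quota[t], reverse=True)
    for t in by_remainder[:leftover]:
        quota[t] += 1

    return {t: rng.sample(strata[t], quota[t]) for t in strata}
```

With a population of 50 ‘what’, 30 ‘who’ and 20 ‘when’ queries and a sample size of 50, the strata receive 25, 15 and 10 slots respectively.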

The Query Semantics Transfer Model (QuSeT Model)
Finding 3: Does there exist a regular process in which the semantics of a query is transferred from the SSF to the MBCs?
Yes: it is modeled as the Query Semantics Transfer Model. Modeling was done after semantic analysis of data stratified across all categories of transformation rules.
[Figure: the Query Semantics Transfer Model (QuSeT Model)]

Findings 4 to 7: Analysis & Validation of QuSeT
4. What relationships exist between meaning bearing components (MBCs)?
  MBCs have a tri-partite relation: between subject, verb and object; or any 2 of these components and an interrogative (or a modifier of either); or any of S, V, O and its modifiers.
  A variable can replace any element within the tri-partite relation.
5. Deviations in QuSeT: does the transfer process occur without deviation from QuSeT?
  1st and 2nd persons in a query do not bear direct semantic reference and can be dropped.
6. Qualitative validation: does query semantics transfer conform to the QuSeT model?
  All 12 query types conformed.
  A WH-query is answered through substitution of the interrogative with a suitable MBC.
7. Quantitative validation of QuSeT (model built as a Python module):
  Kiswahili: 23 of 25 NLQs analyzed correctly. English: 24 of 25 NLQs analyzed correctly.
  Mean accuracy of the QuSeT model was therefore determined as 94%.

Finding 8: What type of relationship exists between DSF and SPaRQL?
MBCs have a tri-partite relation, as observed from QuSeT.
Example query: ‘What is the phone number of the customer whose ID is 1’
There are 3 MBCs (phone number; customer; ID). ‘What’ is an interrogative.
Only 2 triples (with the 1st element being a DB name) are possible. They have the format: ?element1 ?element2 ?element3
There is a mention of a specific row value (instance ID = 1), hence an additional FILTER clause:
  ?customer ?phone_number ?Variable1 (what?)
  ?customer ?id_number ?Variable2 (what?)
  FILTER (?id_number = "1")
Conclusion: MBC triples map directly onto SPaRQL triples.
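The mapping from MBCs to triple patterns in the customer example can be sketched as below. The function name `mbcs_to_triples` and its arguments are hypothetical, introduced only for illustration; the output deliberately follows the slide’s own ?element1 ?element2 ?element3 notation rather than executable SPARQL.

```python
def mbcs_to_triples(db_name, mbcs, instances=None):
    """Form one triple pattern per property-like MBC, in the slide's notation.

    db_name:   the MBC that names the database class (first triple element).
    mbcs:      the remaining MBCs, each used as the second triple element.
    instances: optional {mbc: value} for MBCs with a specific row value
               mentioned in the query; each one yields a FILTER clause.
    """
    instances = instances or {}
    triples, filters = [], []
    for i, mbc in enumerate(mbcs, start=1):
        # The interrogative ('what') is replaced by a fresh variable.
        triples.append(f"?{db_name} ?{mbc} ?Variable{i}")
        if mbc in instances:
            filters.append(f'FILTER (?{mbc} = "{instances[mbc]}")')
    return triples, filters


triples, filters = mbcs_to_triples(
    "customer", ["phone_number", "id_number"], {"id_number": "1"}
)
```

For the example query this yields the two triples and the single FILTER clause listed on the slide.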

Finding 9: What type of relationship exists between the formed SPaRQL queries and RDF?
[Figure: example of a full SPaRQL query alongside the OWL RDF ontology derived from Microsoft’s ‘Northwind’ DB; annotations mark the triple (blue/green), the element used as a variable, the variable that completes the triple, and the instance]
Conclusion: the formed SPaRQL and RDF are both based on triples and can therefore be mapped directly to yield answers.

Findings 10: Word Length of Concepts
Conclusion: the optimal phrase length is 3 words. This is the number of words that typically expresses most concepts (e.g. collocations), and it therefore guides any rule-based concept discovery process.

Survey on DB Schema Concepts Reconstruction
Challenge: no common nomenclature exists, hence the difficulty in decoding schema information.
Research: answers the following questions, based on collected data:
  Is there a finite set of patterns that database schema authors use in representing database schema object names?
  How can we decipher the meaning of an ‘intended concept’ from the schema name?
  How can a general ‘concepts re-construction algorithm’ be built from an ontology created from a relational database source?

The Case Study & Some Findings
Data collection: questionnaires and internet-based sources.
Sample frame: 12 training institutions; 16 software development firms.
320 randomly sampled DB schema objects.
Sampling: snowball.
Analysis: descriptive methods; pattern identification.
End product: generic algorithm for lexicon and concepts reconstruction.

Conclusions in the DB Nomenclature Studies
CONCLUSION 1: Finite set of patterns?
  There are 10 common naming formalisms, categorized into 3 clusters:
  High usage: under_score; Abbrev.; PascalMethod; Acronyms
  Medium usage: do.t; Finger_Breaking; camelCase
  Low usage: da-sh; ‘string’; SCREAM
CONCLUSION 2: Deciphering the meaning of an ‘intended concept’?
  DB developers rarely give names that do not have meaning. These meanings correlate highly with the intended concept.
CONCLUSION 3: ‘Concepts re-construction algorithm’?
  A general words recreation algorithm can be defined. It recreates words from an ontology derived from a DB.

The OWoRA Algorithm -(Ontology Words Recreation Alg. )
The function retrieves schema elements from the ontology and forms a list.
Different functions handle the various patterns found in the strings.
Split compound strings and stem.
Identify lexicon and synonyms, and form associated concepts (phrase chunks + other categories).
More details of the algorithm are found in the thesis document.
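The string-splitting step can be illustrated with a rough sketch covering several of the naming patterns reported earlier (under_score, do.t, da-sh, camelCase, PascalMethod, SCREAM). It is not the OWoRA algorithm itself, which is only described in the thesis: the function names and the small `ABBREVIATIONS` table are hypothetical, and stemming and synonym lookup are left out.

```python
import re

# Hypothetical abbreviation table; a real system would need a fuller lexicon.
ABBREVIATIONS = {"id": "identifier", "no": "number", "qty": "quantity"}

def split_schema_name(name):
    """Split a schema object name into lower-case word candidates."""
    # Remove surrounding quotes, then break on underscore, dot, dash or space.
    parts = re.split(r"[_.\-\s]+", name.strip("'\""))
    words = []
    for part in parts:
        # Break camelCase / PascalCase while keeping acronym runs together,
        # e.g. 'SupplierID' -> 'Supplier', 'ID'.
        words.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
        )
    return [w.lower() for w in words if w]

def expand_abbreviations(words):
    """Replace known abbreviations with their full forms."""
    return [ABBREVIATIONS.get(w, w) for w in words]
```

For example, `split_schema_name("SupplierID")` gives `["supplier", "id"]`, which `expand_abbreviations` turns into `["supplier", "identifier"]`.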

Evaluation of the OWoRA Algorithm
Aim: experimentally determine the efficacy of the OWoRA algorithm.
OWoRA was subjected to 6 test databases.
Measure: the number of columns positively identified (lexically and semantically), expressed as a percentage of the total number of columns.
Conclusion: results show a mean accuracy of 92.5%.

1. Structure of Gazetteer Model
Components design and construction.
[Figure: structure of the gazetteer model. Entries are populated from OWoRA and then translated (manual translation may be needed); the resulting items are matched to the FSM.]

2. Feature Space Model (FSM) Design: Experimental Investigation, Stem or Lemmatize?
Example of lemmas (roots) and stems:
  Surface form → Kuku wakitetemeka ni wagonjwa?
  Lexical form → [kuku] [tetema] [ni] [mgonjwa] ([chicken] [shiver] [are] [sick])
  Stemmed form → [kuku] [tetem-] [ni] [-gonj-] {affixes: wa-ki- and -eka for tetem-}
Methodology: simulations on the test bed.
  English: Lancaster stemmer (Paice, 1990) vs NLTK WordNet lemmatizer (Miller, 1995).
  Kiswahili: lexical DB constructed from the TUKI dictionary.
Results show that stemming gives a higher recall value.
Decision: store stemmed word forms.

What to store in the FSM: bag of words (BoW), or concept patterns (CP) + BoW?
Methodology: simulations of various configurations on the test bed.
Results summary: store all as stemmed CP + BoW:
  1. Nouns and noun phrases, identified through patterns, e.g. Kiswahili noun patterns by Ohly (1982): nominalized verbs, deverbative head with noun complement, combination of nouns, noun and adjectives, nouns with -a connector.
  2. Terms and collocations, identified through patterns, e.g. Kiswahili term patterns (Sewangi, 2001).
  3. Other phrase chunks: verb phrases, prepositional phrases.
  4. Synonyms, hypernyms.

Structure of the Feature Space Model (FSM)
[Figure: structure of the FSM, showing the items to be matched to the gazetteer]

3. Matching Function in OCM
Reviewed strategies:
  From ontology matching (Keshavarz & Lee, 2012): lexical-based strategy; semantic-augmentation strategy; constraint-based strategy; instance-based strategy; structure-based matching models.
  From document retrieval: Boolean, vector space, probabilistic and language models (Liddy, 2005).
Strategy used: the semantic augmentation model, borrowed from ontology matching. Augmentation = integrating the lexicon with the meaning of words from lexical DBs.
Developed: the Semantically Augmented Concepts Mapping (SACoMa) function. The Levenshtein algorithm (lexicon-based, edit-distance calculation) was enhanced through semantic augmentation.

A Python Implementation of Matching Function
The function maps concepts in the FSM to those in the gazetteer.
More details of the algorithm are found in the thesis document.
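Since the original listing is not reproduced here, the following is only a minimal sketch of the general idea: an edit-distance match that is widened with synonyms before comparison. It is not the SACoMa function. The `SYNONYMS` table stands in for a lexical DB and is hypothetical, as are the function names; `mu` plays the role of the Levenshtein gap µ.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical stand-in for synonyms retrieved from a lexical DB.
SYNONYMS = {"client": {"customer"}, "vendor": {"supplier"}}

def match_concept(fsm_concept, gazetteer, mu=0):
    """Return gazetteer entries within edit distance mu of the concept
    or of any of its synonyms (the 'semantic augmentation')."""
    candidates = {fsm_concept} | SYNONYMS.get(fsm_concept, set())
    return [entry for entry in gazetteer
            if any(levenshtein(c, entry) <= mu for c in candidates)]
```

With `mu=0` only exact (or exact-synonym) matches are returned; with `mu=1` a misspelling such as "suplier" also matches "supplier".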

4. Heuristics Examples
1. Handling foreign keys: when dealing with two or more tables related via a foreign key, a heuristic was developed from the collected data. It is stated as follows: “when 2 or more classes are involved in reply to a query, we introduce a triple from each participating class. The triple introduced must have the common property that originally constituted the foreign key”.
2. Handling implicit concepts: this is the discovery of implied ontology concepts. Heuristic: “If an instance is mentioned, its property is implied.” Example: in the sentence “Which products come in bottles”, “Bottle” is an instance of the “Categories” table through the property “Description”; thus, even if “Description” is not in the original sentence, it is an implied concept.
Validation: validated by analyzing 20 relevant queries for each heuristic.
Observation: the heuristics were found to hold true in all cases (results in appendix 11).
Conclusion: the heuristics are dependable and were therefore implemented.
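Both heuristics can be sketched in a few lines. The helper names, the `db:` prefix usage and the tiny `INSTANCE_INDEX` lookup are assumptions made for illustration; in the prototype the instance-to-property knowledge would come from the ontology rather than a hard-coded table.

```python
def foreign_key_triples(class_a, prop_a, class_b, prop_b, shared_var):
    """Foreign-key heuristic: one triple per participating class, both
    binding the same variable through the property that formed the key."""
    return [
        f"?{class_a} db:{prop_a} ?{shared_var}.",
        f"?{class_b} db:{prop_b} ?{shared_var}.",
    ]

# Hypothetical index from instance values to (class, property).
INSTANCE_INDEX = {"bottle": ("categories", "Description")}

def implied_concept(instance):
    """Implicit-concept heuristic: a mentioned instance implies its property."""
    return INSTANCE_INDEX.get(instance.lower())


join = foreign_key_triples("products", "CategoryID",
                           "categories", "CategoryId", "CategoryID")
```

Here `join` holds one triple for `products` and one for `categories`, both bound to `?CategoryID`, and `implied_concept("Bottle")` returns the `Description` property of `categories` even though the query never names it.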

5. Structured Query Generation Process
Sample sentence: “Give me the names and identification of suppliers from central region?”
Meaning bearing components: give (SELECT query); me (dropped); the (dropped); names; identification; suppliers; central; region.
Compound terms (concepts): Supplier+Id; Company+Id; Company+Name.
Identify MBCs and associated triples. Triples formed (possible permutations):
  Database (class): Suppliers
  Field (property) SupplierID: no instance/variable
  Field (property) CompanyName: no instance/variable
  Field (property) Region: instance “central”

Structured Query Generation Algorithm
General SPaRQL query assembly (heaping procedure). The generated SPaRQL query is assembled from four parts:
  1. Predefined URI
  2. Identified property objects
  3. Identified triples
  4. Identified filters
NB: FILTER is necessary for instantiating a property value. It is applied where there is direct mention of an instance and a property.
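The four-part assembly can be sketched as simple string building. This is an illustration under stated assumptions, not the thesis generator: `assemble_sparql` is a hypothetical name, and the prefix URI is a placeholder because the source does not give the real one.

```python
def assemble_sparql(prefix_line, select_vars, triples, filters=()):
    """Heap the four parts into one query string.

    prefix_line: part 1, the predefined URI declaration.
    select_vars: part 2, the identified property objects to return.
    triples:     part 3, (subject, property, object) name tuples.
    filters:     part 4, (variable, value) pairs for mentioned instances.
    """
    lines = [
        prefix_line,
        "SELECT " + " ".join(f"?{v}" for v in select_vars),
        "WHERE {",
    ]
    lines += [f"  ?{s} db:{p} ?{o}." for s, p, o in triples]
    lines += [f'  FILTER(?{var} = "{value}")' for var, value in filters]
    lines.append("}")
    return "\n".join(lines)


query = assemble_sparql(
    "PREFIX db: <http://example.org/placeholder#>",  # placeholder URI, not from the source
    ["SupplierID", "CompanyName"],
    [("suppliers", "SupplierID", "SupplierID"),
     ("suppliers", "CompanyName", "CompanyName"),
     ("suppliers", "Region", "Region")],
    [("Region", "central")],
)
```

The example uses the suppliers sentence from the previous slide: three triples on the Suppliers class and one FILTER for the mentioned instance “central”.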

PUTTING ALL COMPONENTS TOGETHER
The OCM-based architectural model: Ontology-based NL Access to DBs (ONLAD).
Published as a book chapter in Springer Lecture Notes in Computer Science (LNCS 2013) (Muchemi & Popowich, 2013).

Overall OCM Algorithm
Stages: knowledge comprehension; concepts discovery; query assembly; query execution.

Examples of Parsed NLQs
Sample translated queries with expected SPARQL results.
Fig. 1, one-table example. Query 1: “Give me the cities where employees come from?”
  PREFIX moon: <
  SELECT ?employees ?City
  WHERE { ?employees db:City ?City. }
Fig. 2, two-table example. Query 2: “Which products come in bottles?”
  PREFIX chema: <
  SELECT DISTINCT ?ProductID ?ProductName ?Description ?CategoryID
  WHERE {
    ?products db:ProductID ?ProductID.
    ?products db:ProductName ?ProductName.
    ?categories db:Description ?Description.
    ?products db:CategoryID ?CategoryID.
    ?categories db:CategoryId ?CategoryID.
    FILTER( ?Description = "bottled")
  }
Note (foreign key): 2 triples, one from each participating class, both having a common property (field).

EVALUATION OF OCM MODEL: Evaluation Process
The 5 query sets previously described were used.
No separation of questions: this gives a true reflection of expected performance.
Evaluation framework: a novel evaluation framework developed from literature analysis.

The Test-bed Overview
Rapid prototyping approach; simulations in test bed.
Tools and resources:
  Tri-gram language detector
  NLTK normaliser and tokenizer
  Google Translate
  Lancaster stemmer / lemmatizer (English)
  Lexical DB (Swahili) developed from TUKI, SALAMA
  POS tagging: combined trained unigram and trained Brill tagger
  Concepts generation: phrase chunker, NLTK RegExp (English); RegExp chunker + Swahili patterns (Sewangi, 2001)

A Detailed look at the OCM Based Prototype
OCM prototype overview. Resources: Protégé, Data-master, Protégé’s native RDF reasoner, WAMP server, Python scripting.

Databases Used for Evaluation
Databases used (name; number of tables; description; number of sampled queries):
  1. Chicken Farmers_db; 8 tables; DB created to mimic the one at the Thika poultry farmers’ project, also reported in Muchemi (2008); 200 queries.
  2. UoN MSc Coordinator_db; 4 tables; DB created to mimic students’ management at SCI, University of Nairobi; 200 queries.
  3. Microsoft’s Northwind_db; standard database shipped with Microsoft’s database server; 120 queries.
  4. Restaurants_db; 7 tables; DB whose schema is described in Tang & Mooney (2001) and has been quoted widely in experiments; 250 queries.
  5. Computer Jobs_db; DB whose schema is described in Tang & Mooney (2001) and has been quoted widely in experiments; 200 queries.
Sampled query sets used for evaluation:
  Sampled from the same queries used in the NLQ case studies.
  A stratified random sampling approach, with 8 strata based on complexity of queries, defined in Tablan et al. (2008) as the number of concepts per query.
  Diversity of queries was ensured by including different types, e.g. ‘where’, ‘when’.

Experimental Determination of Mean Performance of OCM Model
Procedure for testing: subject the sampled queries to OCM. 4 research assistants were used to perform the tests.
Procedure for evaluating: test and categorize the results. 4 human evaluators examined and categorized the answers generated. The evaluators were recruited from undergraduate CS students at UoN.
Answers were categorized as ‘true positive’, ‘false positive’ or ‘neg’ (no answer generated).
The procedure was repeated with TTM models for practical comparison.
Training: research assistants and evaluators were given basic training on handling input and output responses from the prototype.

Parameters in the Evaluation Framework
The evaluation framework has 8 aspects.
Four quantitative measures: 1. precision, 2. recall, 3. accuracy and 4. F-score.
Four qualitative measures: domain independence, language independence, support for cross-linguality, and effect of query complexity on the model.
Note: the design of the evaluation framework and the choice of parameters were guided by the literature review, and constitute a ‘gold-standard tool’ readily usable by other researchers.

Results: 1 Test-Set out of the 10
Test 1: model = OCM; Levenshtein gap µ = 0, then changed to µ = 1, i.e. perfect matching of strings within the gazetteer and the FSM, or an allowance of 1 insertion, deletion or substitution of a single character.
Test 2: repeat the tests above but change the model to TTM.
10 test sets were done in total.
Sample results: test sets 1 and 2, OCM and TTM, Kiswahili queries.

Summary of Results from the 10 Evaluation- Sets
Results indicate a model whose average precision at a Levenshtein distance µ of 1 is 0.75. This increases to 0.86 when µ is decreased to 0.
Accuracy increases marginally, from 0.52 to 0.53, on decreasing µ.

Effect of µ on Precision, Recall and F-Score
Precision decreases with increase of µ while recall increases. F-score, the harmonic mean of precision and recall, remains stable at 0.72.
It is true that “the higher the precision, the better the quality of the answers received”; thus, based on precision alone, µ should be restricted to 0.
Recall shows the range of questions handled: “the higher the recall, the better the range”; thus, based on recall, µ should be set to 1.
[Figure: precision, recall and F-score against Levenshtein distance µ (within the matching function)]

Experimental Determination of Domain Independence
Experimental procedure for domain independence experiments:
  Make the querying language the same as the schema language (no cross-linguality); English was used.
  Test the 4 domains one at a time: trading, job search, student management, finding restaurants.
  Determine tp, fp and neg.
  For each domain, calculate and tabulate accuracy, recall, precision and F-score.
  For the 4 domains, calculate the mean (χ), variance (ύ) and standard deviation (σ).
  Perform outlier analysis (Peirce criterion (Ross, 2003)).
  If the majority of the points are classified as non-outliers, then we conclude that the model is not significantly affected by a change in domain, hence domain-independent.

Domain-Independence Analysis
Apply the Peirce criterion (Ross, 2003) on all points > 1.00.
The parameter R was obtained from Peirce’s table for the four-data-point, one-outlier condition: R = 1.383.
Determine S, the maximum allowable deviation, for each row of results: S = R × σ.
Determine Rmax for each row: Rmax = |xi - xm|max / σ, where xm is the row mean.
If Rmax > R (equivalently, |xi - xm|max > S), the data point is classified as an outlier; otherwise it is normal.
No data was found to be an outlier, hence the conclusion that the model is domain-independent.
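A single pass of the Peirce check for four data points can be sketched as follows, using the standard form of the criterion in which a point is rejected when its deviation from the mean exceeds R × σ. The function name is hypothetical, the use of the sample standard deviation is an assumption (the slide does not say which estimator was used), and the full Peirce procedure, which iterates over an increasing number of suspected outliers, is not shown.

```python
import statistics

# Value quoted on the slide for 4 observations and 1 suspected outlier.
PEIRCE_R_N4_ONE_OUTLIER = 1.383

def peirce_outliers(values, r=PEIRCE_R_N4_ONE_OUTLIER):
    """Return the values whose deviation from the mean exceeds r * sigma."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    if sigma == 0:
        return []  # identical values: nothing can be an outlier
    max_allowed = r * sigma           # S on the slide
    return [x for x in values if abs(x - mean) > max_allowed]
```

Four closely spaced scores such as 0.70, 0.72, 0.71 and 0.73 produce no outliers, whereas in a row such as 1, 1, 1, 10 the last value is flagged.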

Experimental Determination of Language Independence
Procedure for language independence experiments:
  Evaluations were done for English and Kiswahili.
  Set the querying language the same as the schema language.
  For each language, determine true positives, false positives and no results (neg).
  Calculate and tabulate accuracy, recall and precision, then the mean (χ), variance (ύ) and standard deviation (σ).
Analysis done:
  Deviation analysis, as described in the previous experiment.
  Outlier analysis (Peirce criterion (Ross, 2003)), as described.
No data was found to be an outlier, hence the conclusion that the model is language-independent.

Experimental Determination of Cross-lingual Support
Experimental procedure for cross-lingual support experiments:
  4 experiments were done: Swahili queries on a Swahili DB; English queries on a Swahili DB; English queries on an English DB; Swahili queries on an English DB.
  Set each of the 4 arrangements. Determine true positives, false positives and no results.
  For each arrangement, calculate and tabulate accuracy, recall and precision, then the mean (χ), variance (ύ) and standard deviation (σ).
Analysis done:
  Deviation analysis, as described in the previous experiment.
  Outlier analysis (Peirce criterion (Ross, 2003)), as described.
No data was found to be an outlier, hence the conclusion that the model has good support for cross-lingual querying.

Effect of Query Complexity
[Figure: Performance against Query Complexity]
Conclusion: The model performs best with at least 2 concepts per query, with the peak occurring at 3 to 5 concepts, after which performance gradually degrades.

Comparative Analysis with other Models
Benchmarking was done through comparison with other published works, taking the best in each category.

COMPARATIVE ANALYSIS …/2
OCM: P = 0.86; R = 0.70; A = 0.53; F = 0.72
Machine Learning: M/L models require 2 back-to-back learners. Superimposing an SQL converter, e.g. Giordani & Moschitti (2010), F-Score = 0.759, onto, say, WASP (F-Score = 0.81), the overall DB-access F-Score would be (0.81 × 0.759) ≈ 0.61, which is lower than OCM’s 0.72.
Logic Mapping: PRECISE achieves an F-Score of 0.65 (no query pre-selection), which is lower than OCM’s 0.72. PRECISE uses BoW whereas OCM uses ‘Concepts’ (tokens, phrase chunks, terms and collocations), which explains OCM’s better Recall (0.70 compared to 0.55, without query pre-selection).
Ontology Access: Direct comparison is not suitable because the tasks are different. Querix accesses specific ontologies (GATE) while OCM is a generic RDB access model. In general, OCM has better Precision than Querix (0.86 compared to 0.78). However, Querix has a user-feedback intervention that guides the questions posed by the user, which explains Querix’s better Recall (0.78 compared to 0.70). In the absence of this intervention, its performance would be the same as Questio’s (0.68) because of similar linguistic processing (BoW). OCM’s good performance can be attributed to its different handling of query linguistics (Concepts Modeling).
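The machine-learning comparison rests on multiplying the F-scores of two back-to-back stages. The arithmetic can be checked with a short Python sketch; the function name is hypothetical, and treating the overall score as a simple product is the slide's own simplification, which assumes that errors in the two stages are independent.

```python
def pipeline_f_score(*stage_scores):
    """Combine per-stage F-scores by multiplication (the slide's simplification)."""
    combined = 1.0
    for score in stage_scores:
        combined *= score
    return combined


if __name__ == "__main__":
    wasp = 0.81            # F-score quoted for WASP
    sql_converter = 0.759  # F-score quoted for Giordani & Moschitti (2010)
    ocm = 0.72             # F-score quoted for OCM
    combined = pipeline_f_score(wasp, sql_converter)
    print(round(combined, 3))  # 0.615
    print(combined < ocm)      # True
```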

Conclusions
The developed models are language and domain independent (shown experimentally). Reason: the underlying concepts are based on universal language processing theories such as Generative-transformation, the Phrases, Terms & Collocations formation theories, and the MBC Identification model (developed here).
The main point of departure of the OCM (in terms of linguistic processing) from other models in ‘Ontology-based solutions’ is in ‘Concepts Formulation’. “The good results therefore indicate that the use of concepts, arising from a concepts-modeling process, as opposed to bag-of-words leads to a better performance as shown”.
Drawback: OCM requires someone to enter information that at times is regarded as obvious or superfluous. This leads to lower recall.

THEORETICAL CONTRIBUTIONS
New Approach, “The OCM Approach”: facilitates conversion of NLQ into structured queries (SPaRQL).
Framework for reviewing the trends in NL DB access approaches, the “Semantics Transfer Model” (QuSeT): models semantic transfer.
Extension of ideas postulated by Chomsky (1957) that “DSF of a query can be used in deducing the interrogative properties of a NL query and that this property is domain and language independent.”
Design of a generic algorithm, the OWoRA: models reconstruction of ontology words.

TECHNICAL CONTRIBUTIONS
Architecture for Ontology-based NL Access to DBs (ONLAD): implementation of theoretical principles into practical contributions.
Implementation of the Kiswahili Terms & Collocations discovery methodologies of Sewangi (2001) into a concrete practical contribution.
Creation of 2 standard re-usable research DATASETS (queries & databases): a Kiswahili dataset (Farming) and an English dataset (Students’ queries management domain).
OCM Components & Related Heuristics: heuristics for discovery of implicit concepts; Semantically-Augmented Concept Matching (SACoMA) function; heuristic for handling foreign keys; heuristic for SPaRQL query generation; Feature Space Model; Gazetteer Model.

3. METHODOLOGICAL CONTRIBUTIONS
Framework for Performance Evaluation: “The 8-parameter evaluation framework”.
Procedures for Evaluation of Qualitative Parameters: Domain Independence, Language Independence, Cross-lingual Querying Capacity, Effect of Query Complexity.

4. Achievements on Performance Advancement
Good performance values comparable to the state-of-the-art.
Attainment of Domain Independence.
Attainment of Language Independence.
Achievement of Cross-lingual Querying.

Recommendations for Further Work
Scalability study to multiple databases.
Discourse Processing study.
Application of OCM to Object-Oriented Databases.

Relevant Publications, Conferences & Projects
BOOK CHAPTERS
Muchemi, L. & Popowich, F. (2013). An Ontology-Based Architecture for Natural Language Access to Relational Databases. Springer Lecture Notes in Computer Science. HCI (6) 2013: Vol Las Vegas, USA. ISBN
JOURNAL PUBLICATION
CONFERENCE PROCEEDINGS
Muchemi, L. & Popowich, F. (2013). NL Access to Relational Databases: The OCM Approach. Proceedings of 7th International Conference, UAHCI 2013, Las Vegas, NV, USA, July , 2013, Proceedings Part I.
Muchemi, L. (2008). Swahili NL Access to RDbs (TTM Approach). Proceedings of 4th ICCR Conference. Makerere Univ., Kampala, Uganda, August 2008.
Muchemi, L. & Getao, K. (2007). Enhancing Citizen-Government Communication Through Natural Language Querying. Proceedings of 1st International Conference in Computer Science and Informatics (COSCIT 2007). : , Nairobi, Kenya.
UoN MSc & BSc STUDENTS' PROJECTS APPLYING CONCEPTS DEVELOPED IN THIS WORK
Kiilu, E. (2014). NL Access to Kenya Open Data. BSc Project Report, UoN, Kenya.
Kihuna, M. (2013). Accessing Wikipedia Data Using NL. BSc Project Report, UoN, Kenya.
Ikunyua, E. (2012). Automatic Characterization of Named Entities in Structured Reports. MSc Project Report, UoN, Kenya.

Acknowledgements
Thanks to almighty God for journey mercies. A PhD is a long journey with many meanders.
Supervisors:
Dr. Wanjiku Ng’ang’a, UoN: good, high-quality supervision.
Prof. Fred Popowich, Simon Fraser University, Canada: 6-month UNPAID PhD supervision in Canada and useful insights.
Dr. Kate Getao: original PhD concept.
PhD Examination Panel: for providing me this opportunity to reach this crucial milestone.
SCI admin & CBPS, led by Prof. Okelo-Odongo & Prof. Aduda: for their support.
Research Committee members, especially Prof. Waema, Prof. Omwenga, Prof. Waiganjo and Dr. Opiyo, among others: for immeasurable motivation.
Colleagues, friends & ALL present here for this viva.
THANK YOU ALL
