
1 Knowledge Extraction from Technical Documents*
  *With first-class support for Feature Modeling
  Rehan Rauf, Michal Antkiewicz, and Krzysztof Czarnecki
  Generative Software Technologies Corp., Waterloo, Canada
  © Generative Software Technologies Corp.

2 The Idea

3 Specification Documents
  Physical structures: section, table, paragraph
  Logical structures (specification elements): functional requirements, business rules, use case

4 Recognize and extract specification elements based on physical document structure

5 ET – Extraction Tool searches for template instances
  [Figure: ET scans a Spec Doc against the UC Template and finds the instances UC 1 and UC 2]

6 ET – Extraction Tool searches for template instances
  [Figure, animation step: UC 1 matched]

7 ET – Extraction Tool searches for template instances
  [Figure, animation step: UC 1 matched]

8 ET – Extraction Tool searches for template instances
  [Figure, animation step: UC 1 and UC 2 matched]
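
As a rough illustration of the search shown on slides 5 to 8, the Python sketch below scans a list of sections and keeps those whose heading looks like a use case. The `Section` type, the heading pattern, and `find_instances` are hypothetical names chosen for this example; the slides do not show ET's actual template language or matching algorithm.

```python
import re
from typing import Dict, List, NamedTuple


class Section(NamedTuple):
    """A simplified physical component: a heading plus its paragraphs."""
    heading: str
    paragraphs: List[str]


# Assumed heading convention for a use case, e.g. "Use Case 1: Place order".
HEADING_PATTERN = re.compile(r"^Use Case\s*\d*\s*:\s*(?P<name>.+)$", re.IGNORECASE)


def find_instances(sections: List[Section]) -> List[Dict[str, object]]:
    """Return one record per section whose heading matches the pattern."""
    found = []
    for section in sections:
        match = HEADING_PATTERN.match(section.heading.strip())
        if match:
            found.append({
                "name": match.group("name").strip(),
                "body": list(section.paragraphs),
            })
    return found
```

Sections that do not match the heading pattern, such as an introduction, are simply skipped, which mirrors the idea of picking template instances out of surrounding text.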

9 Precondition: Documents have been authored with some template in mind

10 Application scenarios

11 Import to Requirements Mgmt Tools
  [Figure: ET extracts functional requirements, business rules, and use cases from a Spec Doc and imports them into DOORS, HP Quality Center, RequisitePro, …]

12 Structured Query
  [Figure: QT runs a structured query over several Spec Docs and their extracted functional requirements, use cases, and business rules]
  Example query: all use cases with actor = customer
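
The example query on slide 12 could be expressed over extracted records roughly as follows. This is a minimal sketch: `ExtractedUseCase` and `use_cases_with_actor` are hypothetical names, and the slides do not show the query language that QT actually offers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractedUseCase:
    """A use case as it might look after extraction (assumed fields)."""
    name: str
    actor: str


def use_cases_with_actor(use_cases: List[ExtractedUseCase], actor: str) -> List[ExtractedUseCase]:
    """Select use cases whose actor equals `actor`, ignoring case and padding."""
    wanted = actor.strip().casefold()
    return [uc for uc in use_cases if uc.actor.strip().casefold() == wanted]
```

Calling `use_cases_with_actor(all_use_cases, "customer")` corresponds to "all use cases with actor = customer".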

13 Tracing
  [Figure: business rules and use cases traced across Spec Docs]

14 Template Conformance Checking
  [Figure: a use case in a Spec Doc checked against its template]
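
A conformance check can be pictured as comparing the components found in an instance with the components a template requires or allows. The sketch below assumes an instance is a plain dictionary from component name to text; `check_conformance` and the component names are illustrative and not taken from the tool.

```python
from typing import Dict, Iterable, List


def check_conformance(
    instance: Dict[str, str],
    required: Iterable[str],
    optional: Iterable[str] = (),
) -> List[str]:
    """Return a list of problems; an empty list means the instance conforms."""
    required = list(required)
    allowed = set(required) | set(optional)
    problems = []
    # Every required component must be present and non-empty.
    for name in required:
        if not instance.get(name, "").strip():
            problems.append(f"missing required component: {name}")
    # Components the template does not know about are reported as well.
    for name in instance:
        if name not in allowed:
            problems.append(f"unexpected component: {name}")
    return problems
```

For a use case lacking an actor, the check would report the missing component rather than silently accepting the instance.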

15 Main Challenge: Logical and Physical Variation

16 Challenge – Variation
  Instances of Use Case

17 Challenge – Variation
  Instances of Use Case
  Logical components
  Component identifiers

18 Challenge – Variation
  Instances of Use Case
  Logical components
  Component identifiers

19 Variation Types
  Two dimensions: designed vs. accidental, and logical vs. physical

20 Designed Logical Variation
  Optional component

21 Designed Logical Alternatives
  Deeper decomposition
  Different methodologies lead to logical variation

22 Designed Physical Variation
  Different formatting

23 Accidental Variation
  Logical
    Missing components, e.g., actor
  Physical
    Spelling mistakes, e.g., Actar
    Style inconsistency, e.g., italics instead of bold

24 Solution

25 ET – Extraction Tool
  Docs → PSE → physical components (sections, lists, table cells)
  Physical components + UC Template → LSE → logical components (actor, flow, extensions)
  Accidental variation is handled via a match threshold
  Designed variation is handled via the template
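
One way to picture a match threshold is approximate string matching of component identifiers, so that a misspelling such as "Actar" still counts as "Actor". The sketch below uses Python's standard `difflib.SequenceMatcher`; the similarity measure and the 0.75 threshold are assumptions made for illustration, since the slides do not say how ET computes a match.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after case folding."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def matches_identifier(text: str, identifier: str, threshold: float = 0.75) -> bool:
    """Accept `text` as an occurrence of `identifier` if it is similar enough."""
    return similarity(text.strip(), identifier) >= threshold
```

Raising the threshold makes matching stricter (fewer false matches, more missed misspellings); lowering it does the opposite.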

26 UC Template Metamodel
  Logical side: UC (Name : String), Flow, Action : String (multiplicities shown: *, 1, 1)
  Physical side: Section, Heading, List, Paragraph
  A mapping relates the logical side to the physical side
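
The metamodel on slide 26 can be approximated with a few data classes. The sketch assumes that a UC has one Flow, that a Flow has any number of Actions, and that the mapping pairs UC with Section, Name with Heading, Flow with List, and Action with Paragraph; the transcript lists these elements but not how they are connected, so both the multiplicities and the pairing are guesses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Flow:
    """Logical component: an ordered list of actions (each a string)."""
    actions: List[str] = field(default_factory=list)


@dataclass
class UC:
    """Logical component: a use case with a name and one flow."""
    name: str
    flow: Flow = field(default_factory=Flow)


# Assumed pairing of logical components with physical components.
ASSUMED_MAPPING: Dict[str, str] = {
    "UC": "Section",
    "UC.name": "Heading",
    "Flow": "List",
    "Action": "Paragraph",
}


def physical_kind(logical_component: str) -> str:
    """Look up which physical component a logical component maps to."""
    return ASSUMED_MAPPING[logical_component]
```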

27 Example Template

28 Logical Structure

29 Mapping

30 Regular Expressions

31 Lists

32 Component Nesting

33 Optional Components

34 Physical Alternatives

35 Templates with Tables

36 Logical Alternatives

37 ET – Extraction Tool
  Docs → PSE → physical components
    Basic: paragraph, cell, graphic
    Composite: sections, lists, tables, …
  Physical components + UC Template → LSE → logical components (actor, flow, extensions)

38 Physical Structure Extraction
  Docs → PSE → physical components
    Basic: paragraph, cell, graphic
    Composite: sections, lists, tables, …
  Physical components + UC Template → LSE → logical components (actor, flow, extensions)
  PSE is the only part that depends on the document format
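
The point that only PSE depends on the document format can be sketched as one extractor per format feeding a shared, format-independent step. Everything here is hypothetical: the two toy extractors, the `PSE_BY_FORMAT` registry, and `lse_find_actors` stand in for the real PSE and LSE, which the slides do not detail, and the HTML handling is deliberately naive.

```python
import re
from typing import Callable, Dict, List

# Simplified physical model: a flat list of paragraph texts.
PhysicalComponents = List[str]


def pse_plain_text(raw: str) -> PhysicalComponents:
    """Toy PSE for plain text: paragraphs are separated by blank lines."""
    blocks = re.split(r"\n\s*\n", raw)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def pse_html_paragraphs(raw: str) -> PhysicalComponents:
    """Toy PSE for HTML: take the text of <p> elements (no nested tags)."""
    found = re.findall(r"<p>(.*?)</p>", raw, flags=re.DOTALL | re.IGNORECASE)
    return [" ".join(text.split()) for text in found if text.strip()]


# Format-dependent part: one extractor per document format.
PSE_BY_FORMAT: Dict[str, Callable[[str], PhysicalComponents]] = {
    "txt": pse_plain_text,
    "html": pse_html_paragraphs,
}


def lse_find_actors(components: PhysicalComponents) -> List[str]:
    """Format-independent part: works on physical components only."""
    actors = []
    for text in components:
        match = re.match(r"Actor\s*:\s*(.+)", text, flags=re.IGNORECASE)
        if match:
            actors.append(match.group(1).strip())
    return actors
```

Supporting a new document format would then mean adding one entry to the registry, while the logical step stays unchanged.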

39 Performance

40 Can we extract logical structures from real-world documents?

41 Document Set
  43 documents
    24 from 3 companies
    11 from public sources
    6 student projects
  2,000 to 23,000 words
  Content: use cases, data objects, business rules, functional requirements, non-functional requirements, …

42 Template Development
  1) Write template manually
  2) Verify extraction (run ET with the UC Template; check instances such as UC1, UC2)
  3) Refine template

43 Results
  36 logical structures: use cases, data objects, business rules, …
  Template sizes from 3 to 52 LOC
  942 instances in total
  Nearly all instances perfectly recognized
    Recall: 100% for 33 templates; over 80% for the remaining 3
    Precision: 100% for 35 templates; 87% for the remaining 1
  Error causes
    Severe formatting problems, e.g., manual line breaks
    Forgotten ids
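
For reference, precision and recall per template can be computed from the set of extracted instances and the set of instances a human expects. The sketch uses the standard definitions; the function name and the convention of returning 1.0 for an empty set are choices made for this example, not details from the evaluation.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Precision = correct / extracted; recall = correct / expected."""
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 1.0
    recall = correct / len(expected) if expected else 1.0
    return precision, recall
```

A template that finds every expected instance and nothing else scores 100% on both measures.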

44 Other Questions
  Amount and kind of template change in refinement
    1% – 25% of LOC affected during refinement
    81% of changes concern optionality (add ? or a component)
  Number of iterations
    From 1 instance (11 cases) to 50% of all instances (6 cases), e.g., 10 out of 20 (2 cases)
    Mostly simple edits, e.g., add ?
  Implication
    Start with a few examples, then edit the template based on expert knowledge (e.g., add ?)

45 Related Work
  Import to requirements management tools
    Tools prescribe document structure
    Manual markup for fine-grained extraction
  Wrapper induction
    Machine-generated docs (web pages)
    Induced regex is not human readable (no modeling language)
  Natural language processing
    Can benefit from structure-induced semantic tags

46 Future: Template by Example
  1) Mark up sample document
  2) Extract template (TE produces the UC Template)
  3) Verify extraction (ET; check instances such as UC1, UC2)
  4) Refine template

47 Summary

48 ET – Design
  Spec Doc → PSE → physical components
  Physical components + UC Template → LSE → logical components (functional requirements, business rules, use cases)
  Application scenarios: import, tracing, conformance, query (QT)
  Template development
  Evaluation results: nearly all instances perfectly recognized; 43 real-world documents

49 Technology available at

