Published by Theodora Poole. Modified over 8 years ago.
1
ICDC | Rome, November 2001 | CROSSMARC Third meeting
French NERC (first version and results)
CROSSMARC Project IST-2000-25366
Third meeting, Rome, November 2001
2
Summary
Complete experiment on French corpus
– French mono-product corpus
– Detailed extraction performance
– Examples of limits
French NERC overview
– XML DTD for named-entity extraction
– Architecture & component description
– Development & maintenance
3
French Corpus
56 mono-product description pages
7 manufacturers: SONY, ASUS, DELL…
17 models: VAIO, INSPIRON, L8400…
6 processors: PENTIUM III, CELERON…
5 OS: WIN MILLENIUM, WIN 98…
Wide ranges of WEIGHTS, PRICES...
4
Example of extraction
5
Detailed extraction performance, as [OK, KO] counts per entity type
MANUF       [56, 0]   Small number of cases (7)
MODEL       [56, 0]   Large number of configurations (VAIO FX 101, 105, 201, 203, 205, 209, 808, PCG, QR10…)
PROCESSOR   [55, 1]   Most of the cases are PENTIUM III & CELERON
SOFT_OS     [51, 5]   Small number of cases (WIN XX)
PRICE       [35, 21]  Some limits; ambiguities due to component prices
RESOLUTION  [39, 17]  Some limits
SPEED       [41, 15]  Some limits; ambiguities due to component speeds
CAPACITY    [52, 4]   Ambiguities due to component capacities
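The [OK, KO] counts above can be turned into per-entity success rates. The following is a small Python sketch, not part of the original deck; it assumes that OK and KO count pages where the extraction was judged correct or incorrect, so that OK / (OK + KO) is a success rate over the 56 pages. The names `results`, `success_rate`, `rates` and `overall` are illustrative.

```python
# OK/KO counts copied from the slide; each pair sums to the 56 corpus pages.
results = {
    "MANUF": (56, 0),
    "MODEL": (56, 0),
    "PROCESSOR": (55, 1),
    "SOFT_OS": (51, 5),
    "PRICE": (35, 21),
    "RESOLUTION": (39, 17),
    "SPEED": (41, 15),
    "CAPACITY": (52, 4),
}


def success_rate(ok, ko):
    """Share of cases judged OK (assumed reading of the [OK, KO] pairs)."""
    return ok / (ok + ko)


rates = {name: success_rate(ok, ko) for name, (ok, ko) in results.items()}

# Pooled rate over all entity types (micro-average).
total_ok = sum(ok for ok, _ in results.values())
total = sum(ok + ko for ok, ko in results.values())
overall = total_ok / total

# List entity types from weakest to strongest.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{name:<11}{rate:6.1%}")
print(f"{'ALL':<11}{overall:6.1%}")
```

Under that reading, PRICE is the weakest entity type at 35/56 = 62.5%, followed by RESOLUTION and SPEED, which matches the slide's remarks about limits and component-level ambiguities.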
6
(1a) Limits: Information does not exist
No weight
7
(1b) Limits: Information does not exist
No Soft_OS
8
(2) Limits: Information inside an image
13990.00
9
(3) Limits: One description for several products
10
(4) Limits: Information outside of the page
11
(5) Limits: Information contains an error
Soft_OS = windows 200
12
Perspectives
Ambiguities will be managed by the Fact Extractor Module
Limits should be discussed by the Consortium
– Information does not exist
– Information inside an image
– One description for several products
– Information outside of the page
– Information contains an error
13
French NERC Overview
[Architecture diagram. Components: laptops.xml, nerc.dtd, xml2nerc, nerc-laptops.pl, Nerc.pm, product.html, extraction.html. Phases: static step, dynamic step. Arrow types: refers to, is processed by, generates. File types: XML, Perl, HTML, XHTML.]
14
nerc.dtd
DTD File
Domain independent rulebase metadescription

<!ATTLIST nerc domain CDATA #REQUIRED>
<!ATTLIST feature
  no   CDATA #REQUIRED
  name CDATA #REQUIRED
  type (STRING|INTEGER|DECIMAL|DOUBLE-INTEGER) #REQUIRED
  if   CDATA #REQUIRED
  weak CDATA #IMPLIED>
<!ATTLIST element
  norm CDATA #REQUIRED
  weak CDATA #IMPLIED>

nerc: main
– domain
feature: of a product (e.g., SPEED)
– no
– name
– type
– if
– weak
element: of a feature (e.g., MHz)
– norm
– weak
form: string or regex of an element (e.g., "[Mm][Hh][Zz]")
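To make the rulebase structure concrete, here is a minimal Python sketch (the project itself uses Perl) that loads a tiny rulebase and matches its forms against text. Everything beyond the attribute names is an assumption: the slide shows only the attribute lists, so the nesting feature > element > form, the sample values, and the empty `if` attribute (whose meaning the slide does not give) are hypothetical, as are the helper names `load_rules` and `find_forms`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rulebase instance; only the attribute names come from nerc.dtd.
RULEBASE = """
<nerc domain="laptops">
  <feature no="1" name="SPEED" type="INTEGER" if="">
    <element norm="MHz">
      <form>[Mm][Hh][Zz]</form>
    </element>
  </feature>
</nerc>
"""


def load_rules(xml_text):
    """Flatten the rulebase into one dict per <form> (assumed nesting)."""
    root = ET.fromstring(xml_text.strip())
    rules = []
    for feature in root.findall("feature"):
        for element in feature.findall("element"):
            for form in element.findall("form"):
                rules.append({
                    "domain": root.get("domain"),
                    "feature": feature.get("name"),
                    "type": feature.get("type"),
                    "norm": element.get("norm"),
                    "pattern": re.compile(form.text.strip()),
                })
    return rules


def find_forms(text, rules):
    """Return (feature, normalized element, matched text, offset) tuples."""
    hits = []
    for rule in rules:
        for match in rule["pattern"].finditer(text):
            hits.append(
                (rule["feature"], rule["norm"], match.group(0), match.start())
            )
    return sorted(hits, key=lambda hit: hit[3])


rules = load_rules(RULEBASE)
print(find_forms("Pentium III 800 Mhz", rules))
```

The `norm` attribute is used here as the canonical spelling reported for any surface form that matches, which is one plausible reading of "norm"; the sketch ignores `no`, `if` and `weak` because the slide does not say how they are used.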
15
laptops.xml (1)
XML File
Domain dependent matching rulebase description
16
laptops.xml (2)
Domain independent disambiguation
17
xml2nerc
Perl Program
Domain independent XML to Perl translator
Refers to nerc.dtd: elements, attributes, pcdata
Refers to Nerc.pm: main, matching and disambiguation algorithms
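The translator idea, turning a declarative rulebase into a specialised program, can be sketched as follows. This is an illustration in Python rather than the actual xml2nerc, which emits Perl and whose internals are not shown in the slides. The rule tuples, the CAPACITY example, and the names `generate_recognizer` and `recognize` are all hypothetical.

```python
# Hypothetical rules: (feature name, normalized element, regex form).
RULES = [
    ("SPEED", "MHz", r"[Mm][Hh][Zz]"),
    ("CAPACITY", "GB", r"[Gg][Bb]"),
]


def generate_recognizer(rules):
    """Return Python source code for a recognizer specialised to `rules`."""
    lines = ["import re", "", "PATTERNS = ["]
    for feature, norm, form in rules:
        # repr() gives correctly quoted and escaped string literals.
        lines.append(f"    ({feature!r}, {norm!r}, re.compile({form!r})),")
    lines += [
        "]",
        "",
        "def recognize(text):",
        "    found = []",
        "    for feature, norm, pattern in PATTERNS:",
        "        for m in pattern.finditer(text):",
        "            found.append((feature, norm, m.group(0)))",
        "    return found",
        "",
    ]
    return "\n".join(lines)


# Generate the program once, then load it and reuse it for every input.
source = generate_recognizer(RULES)
namespace = {}
exec(source, namespace)
recognize = namespace["recognize"]

print(recognize("700 MHz, 20 GB"))
```

Generating code once and running it many times is presumably what the overview slide's "static step" and "dynamic step" labels refer to, though the deck does not spell this out. In practice the generated source would be written to a file, as nerc-laptops.pl is, instead of being loaded with `exec`.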
18
Nerc.pm
Perl Module
Domain independent pattern matching
Domain independent disambiguation
19
nerc-laptops.pl
Generated domain dependent Perl Program
Applies pattern matching and disambiguation
Generates the named entities that are recognized
Refers to Nerc.pm: matching and disambiguation algorithms
20
FNERC Development & Maintenance
[Diagram. Components: nerc.dtd, xml2nerc / Nerc.pm, laptops.xml. Axis: domain dependent / domain independent.]
Level 0: New PCDATA string
Level 1: Attributes value
Level 2: New PCDATA regex
Level 3: New attribute value
Level 4: New attribute enum.
Level 5: New attribute
21
Perspectives
WP1: Experimenting with the NERC as a better evaluation function for the topic spider
WP2: Improving the FNERC
WP3: Implementing disambiguation techniques for the Fact Extractor Module