French NERC (first version and results)
CROSSMARC Project IST-2000-25366, Third meeting, Rome, November 2001
ICDC


1 French NERC (first version and results)
CROSSMARC Project IST-2000-25366, Third meeting, Rome, November 2001

2 Summary
Complete experiment on French corpus
– French mono-product corpus
– Detailed extraction performances
– Examples of limits
French NERC overview
– XML DTD for named-entity extraction
– Architecture & components description
– Development & maintenance

3 French Corpus
56 mono-product description pages
7 manufacturers: SONY, ASUS, DELL…
17 models: VAIO, INSPIRON, L8400…
6 processors: PENTIUM III, CELERON…
5 OS: WIN MILLENIUM, WIN 98…
Wide ranges of WEIGHTS, PRICES...

4 Example of extraction

5 Detailed extraction performances [OK, KO]
MANUF [56, 0]: small number of cases (7)
MODEL [56, 0]: large number of configurations (VAIO FX 101, 105, 201, 203, 205, 209, 808, PCG, QR10…)
PROCESSOR [55, 1]: most of the cases are PENTIUM III & CELERON
SOFT_OS [51, 5]: small number of cases (WIN XX)
PRICE [35, 21]: some limits, ambiguities due to component prices
RESOLUTION [39, 17]: some limits
SPEED [41, 15]: some limits, ambiguities due to component speeds
CAPACITY [52, 4]: ambiguities due to component capacities
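The [OK, KO] pairs above can be turned into per-entity success rates. The following is a minimal sketch, assuming that OK and KO count the pages on which an entity type was extracted correctly or incorrectly (the slide does not define them further); the function and variable names are illustrative only.

```python
# [OK, KO] counts per entity type, copied from the slide above.
counts = {
    "MANUF": (56, 0),
    "MODEL": (56, 0),
    "PROCESSOR": (55, 1),
    "SOFT_OS": (51, 5),
    "PRICE": (35, 21),
    "RESOLUTION": (39, 17),
    "SPEED": (41, 15),
    "CAPACITY": (52, 4),
}


def ok_rate(ok, ko):
    """Fraction of cases counted as OK (assumed meaning: correct extraction)."""
    return ok / (ok + ko)


rates = {name: ok_rate(ok, ko) for name, (ok, ko) in counts.items()}

# List entity types from weakest to strongest.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{name:<10} {rate:6.1%}")
```

Every pair sums to 56, the number of pages in the corpus, and PRICE comes out lowest, in line with the ambiguities the slide attributes to component prices.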

6 (1a) Limits: Information does not exist
No weight

7 (1b) Limits: Information does not exist
No Soft_OS

8 (2) Limits: Information inside an image
13990.00

9 (3) Limits: One description for several products

10 (4) Limits: Information outside of the page

11 (5) Limits: Information contains an error
Soft_OS = windows 200

12 Perspectives
Ambiguities will be managed by the Fact Extractor Module
Limits should be discussed by the Consortium
– Information does not exist
– Information inside an image
– One description for several products
– Information outside of the page
– Information contains an error

13 French NERC Overview
[Architecture diagram. Components: laptops.xml, nerc.dtd, xml2nerc, nerc-laptops.pl, Nerc.pm, product.html, extraction.html. Phases: static step, dynamic step. Arrow labels: refers to, is processed by, generates. File types: XML, Perl, HTML, XHTML.]

14 nerc.dtd
DTD file: domain-independent rulebase metadescription
<!ATTLIST nerc domain CDATA #REQUIRED>
<!ATTLIST feature no CDATA #REQUIRED
                  name CDATA #REQUIRED
                  type (STRING|INTEGER|DECIMAL|DOUBLE-INTEGER) #REQUIRED
                  if CDATA #REQUIRED
                  weak CDATA #IMPLIED>
<!ATTLIST element norm CDATA #REQUIRED
                  weak CDATA #IMPLIED>
nerc: main
– domain
feature: of a product (e.g., SPEED)
– no
– name
– type
– if
– weak
element: of a feature (e.g., MHz)
– norm
– weak
form: string or regex of an element (e.g., "[Mm][Hh][Zz]")
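The nerc / feature / element / form hierarchy can be pictured as a small data structure. The sketch below is an illustration in Python, not the project's Perl code: the class names, the value no="1" and the choice of INTEGER for SPEED are assumptions, and the `if` and `weak` attributes are left out because the slides do not explain their semantics. Only the form "[Mm][Hh][Zz]" and the names SPEED and MHz come from the slide.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    norm: str                                        # normalised value, e.g. "MHz"
    forms: List[str] = field(default_factory=list)   # strings or regexes


@dataclass
class Feature:
    no: str
    name: str
    type: str
    elements: List[Element] = field(default_factory=list)


# Hypothetical SPEED feature built around the form quoted on the slide.
speed = Feature(
    no="1",
    name="SPEED",
    type="INTEGER",
    elements=[Element(norm="MHz", forms=["[Mm][Hh][Zz]"])],
)


def find_elements(feature, text):
    """Return (norm, matched_text, start) for every form match in text."""
    hits = []
    for element in feature.elements:
        for form in element.forms:
            for m in re.finditer(form, text):
                hits.append((element.norm, m.group(0), m.start()))
    return hits


hits = find_elements(speed, "Pentium III 850 Mhz, bus 100 MHZ")
print(hits)
```

Each surface variant ("Mhz", "MHZ") is reported with the same `norm`, which is the role the `norm` attribute appears to play in the DTD.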

15 laptops.xml (1)
XML file: domain-dependent matching rulebase description

16 laptops.xml (2)
Domain-independent disambiguation

17 xml2nerc
Perl program: domain-independent XML-to-Perl translator
Refers to nerc.dtd: elements, attributes, PCDATA
Refers to Nerc.pm: main, matching and disambiguation algorithms
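The idea of translating an XML rulebase into program source (the static step) and then running the generated program (the dynamic step) can be sketched as follows. This is a Python illustration under stated assumptions, not the actual xml2nerc: the rulebase fragment is invented, the nesting of `form` inside `element` inside `feature` is inferred from the nerc.dtd slide, the empty `if` value is a placeholder, and `translate` and `RULES` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical rulebase fragment. Element and attribute names follow nerc.dtd;
# the values (and the empty "if") are invented for illustration.
RULEBASE = """
<nerc domain="laptops">
  <feature no="1" name="SPEED" type="INTEGER" if="">
    <element norm="MHz">
      <form>[Mm][Hh][Zz]</form>
    </element>
  </feature>
</nerc>
"""


def translate(xml_text):
    """Static step: turn the XML rulebase into the source code of a rule table."""
    root = ET.fromstring(xml_text)
    lines = ["import re", "RULES = ["]
    for feature in root.iter("feature"):
        for element in feature.iter("element"):
            for form in element.iter("form"):
                lines.append(
                    f"    ({feature.get('name')!r}, {element.get('norm')!r}, "
                    f"re.compile({form.text!r})),"
                )
    lines.append("]")
    return "\n".join(lines) + "\n"


generated = translate(RULEBASE)
print(generated)

# Dynamic step: load the generated code and use its rule table.
namespace = {}
exec(generated, namespace)
```

Keeping the translator free of domain knowledge means that a new domain would only need a new XML rulebase, which matches the domain-independent role the slide gives xml2nerc.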

18 Nerc.pm
Perl module
Domain-independent pattern matching
Domain-independent disambiguation

19 nerc-laptops.pl
Generated domain-dependent Perl program
Applies pattern matching and disambiguation
Generates the named entities that are recognized
Refers to Nerc.pm: matching and disambiguation algorithms
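Why disambiguation is needed after pattern matching can be seen in a small tagging sketch. It is written in Python for illustration and is not the generated Perl program: the rule (a number followed by the MHz form), the inline tag format and the sample sentence are all assumptions.

```python
import re

# Hypothetical rule: a number followed by the MHz form from the rulebase.
SPEED = re.compile(r"\d+\s*[Mm][Hh][Zz]")


def tag_entities(text, name, pattern):
    """Wrap every match of pattern in <name>...</name> tags."""
    return pattern.sub(lambda m: f"<{name}>{m.group(0)}</{name}>", text)


tagged = tag_entities("Pentium III 850 MHz, bus 100 MHz", "SPEED", SPEED)
print(tagged)
```

Pattern matching alone tags both the processor speed and the bus speed as SPEED. This is the kind of ambiguity "due to component speed" reported in the performance slide, and the one the disambiguation step is meant to resolve.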

20 FNERC Development & Maintenance
[Diagram placing maintenance levels on nerc.dtd, xml2nerc / Nerc.pm and laptops.xml, divided into domain-dependent and domain-independent parts.]
Level 0: New PCDATA string
Level 1: Attributes value
Level 2: New PCDATA regex
Level 3: New attribute value
Level 4: New attribute enum.
Level 5: New attribute

21 Perspectives
WP1: Experimenting with the NERC as a better evaluation function for the topic spider
WP2: Improving the FNERC
WP3: Implementing disambiguation techniques for the Fact Extractor Module

