Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff.


1 Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff I-KNOW'03, Friday, 4th of July

2 Chris Biemann - IKnow'03 Graz
Goals
- Extraction of multiterms from unannotated text corpora for use in information visualisation
- Example: company names

3 Topics
- Patterns of Company Names
- From Patterns to Pattern Rules
- Search and Verification Algorithm: Use Pattern Rules to find Name Parts
- Term Extraction using Name Parts
- Experiments and Evaluation
- Aggregation of Name Variants
- Application to Semantic Networks

4 Patterns of Company Names
Regular expression to capture the structure:
(ABBR(FS|CONJ)?)* (NAME|CONN)* (KIND(FS|CONJ)?)*
- FS: full stop
- CONJ: conjunctions, like +, &, ...
- ?: zero or one
- *: zero or more
Examples:
ABBReviation | NAME parts | Legal Form (KIND)
A & W | Elektrogeräte | GmbH & Co. KG
A. | Baumgarten | GmbH
 | Hagedorn | GmbH
 | Institut für Angewandte Kreativität |
DASAG | | GmbH
 | Japan Steel Works | Ltd.
K.F.C. | Germany | Inc.
LABSCO | Laboratory Supply Company | GmbH & Co. KG
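The regular expression above works on word classes, not on characters. A minimal sketch of how it could be applied, assuming each token has already been assigned one class tag and the tags are joined into a string with one trailing space per tag (the encoding and the name `matches_company_pattern` are illustrative, not taken from the slides):

```python
import re

# Regex from the slide, rewritten over a string of class tags.
# Each tag is followed by exactly one space (an assumed encoding).
COMPANY_PATTERN = re.compile(
    r"((ABBR )((FS|CONJ) )?)*"
    r"((NAME|CONN) )*"
    r"((KIND )((FS|CONJ) )?)*"
)

def matches_company_pattern(tags):
    """Return True if the whole tag sequence fits the company-name pattern."""
    if not tags:
        # The regex itself accepts the empty sequence; reject it here.
        return False
    tag_string = "".join(tag + " " for tag in tags)
    return COMPANY_PATTERN.fullmatch(tag_string) is not None

# "Japan Steel Works Ltd." tagged as NAME NAME NAME KIND FS
print(matches_company_pattern(["NAME", "NAME", "NAME", "KIND", "FS"]))
```

How "GmbH & Co. KG" is split into KIND, CONJ and FS tokens depends on the tokenizer, which the slides do not specify.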

5 How to learn missing parts
Suppose:
- "Japanese" is a known NAME
- "Inc." is a known KIND
Then:
- "Steelworks" should be NAME
- "ASD" should be ABBR
- "comprising" should not be part of the name
Use flat features:
- _UC: Upper Case
- _CAP: Capitalized
- _LC: Lower Case
- _MIX: Mixed Case
Example contexts:
- ASD Japanese Steelworks Inc. ...
- ... comprising Japanese Steelworks Inc., ...
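The slide names the four flat features but does not define them. The sketch below uses one possible reading, inferred from the rule examples on the later slides (where `_CAP*` leads to ABBR and `_UC*` matches words such as "Odeon" and "Die"): `_CAP` means all letters are capitals, `_UC` means an upper-case initial. This reading and the function name `flat_feature` are assumptions.

```python
def flat_feature(word):
    """Map a word to exactly one flat feature (assumed definitions)."""
    letters = [ch for ch in word if ch.isalpha()]
    if not letters:
        return "_MIX"          # assumption: no letters at all counts as mixed
    if all(ch.isupper() for ch in letters):
        return "_CAP"          # all capitals, e.g. an abbreviation
    if letters[0].isupper():
        return "_UC"           # upper-case initial
    if all(ch.islower() for ch in letters):
        return "_LC"           # all lower case
    return "_MIX"              # lower-case initial with capitals inside

for word in ["ASD", "Steelworks", "comprising", "dpa-AFX"]:
    print(word, flat_feature(word))
```

Under this reading "ASD" and "Steelworks" get different features, while "comprising" is the only lower-case word, which matches the distinctions the slide asks for.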

6 From Patterns to Pattern Rules
Regular expression: (ABBR(FS|CONJ)?)* (NAME|CONN)* (KIND(FS|CONJ)?)*
Pattern: ABBR NAME NAME KIND
Pattern Rules:
- _CAP* NAME NAME KIND -> ABBR
- ABBR _UC* NAME KIND -> NAME
- ABBR NAME _UC* KIND -> NAME
- ABBR NAME NAME _MIX* -> KIND
- ...
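Each rule above replaces one class of the pattern with a starred flat feature and assigns the replaced class to the word found at that position. A minimal sketch of this derivation follows; pairing every position with every feature is an assumption, since the slide shows one feature per position and then an ellipsis.

```python
FEATURES = ["_UC", "_CAP", "_LC", "_MIX"]

def pattern_rules(pattern, features=FEATURES):
    """Derive (left_hand_side, target_class) rules from one class pattern.

    For every position, the class is replaced by a starred flat feature;
    the replaced class becomes the class the rule assigns.
    """
    rules = []
    for i, target in enumerate(pattern):
        for feature in features:
            lhs = pattern[:i] + [feature + "*"] + pattern[i + 1:]
            rules.append((lhs, target))
    return rules

for lhs, target in pattern_rules(["ABBR", "NAME", "NAME", "KIND"]):
    print(" ".join(lhs), "->", target)
```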

7 Pattern Rules
Characteristics: operate on sequences of
- flat features
- classes of known words
Problem: rules match too often
- high coverage
- low precision
Example: ABBR _UC* NAME KIND -> NAME

8 Search and Verification Algorithm
Initialise pattern rules
Let unused elements := initial set of elements with class
Loop:
    For each unused element
        Find candidates for new elements by the search step
        For each candidate
            Do the verification step
            Add accepted candidates to new unused elements
    Output new unused elements
    Unused elements := new unused elements
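The loop above can be sketched as a small bootstrapping function. The search and verification steps are passed in as callables; the `max_rounds` limit and the choice to add accepted words to the known set only at the end of each round are assumptions, as the slide states no termination condition.

```python
def bootstrap(seed, search, verify, max_rounds=10):
    """Grow a dict word -> class starting from seed elements.

    search(word, known) returns candidate (word, class) pairs.
    verify(word, cls, known) returns True to accept a candidate.
    """
    known = dict(seed)
    unused = dict(seed)
    for _ in range(max_rounds):
        if not unused:
            break                      # nothing new was found last round
        new_unused = {}
        for element in unused:
            for word, cls in search(element, known):
                if word in known or word in new_unused:
                    continue           # already classified
                if verify(word, cls, known):
                    new_unused[word] = cls
        known.update(new_unused)       # "output new unused elements"
        unused = new_unused
    return known
```

A caller would supply a `search` that queries the corpus for sentences containing the element and a `verify` that checks the candidate against its own example sentences.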

9 Search Step
- Use an unused element to find example sentences: "Film" -> 100 sentences
- Apply pattern rules to obtain candidates
Pattern rule: _UC* NAME KIND -> NAME
Known elements: AG KIND, Film NAME
Fragments containing "Film": Die CineMedia Film AG übernahm die Odeon Film AG mit darunter ein Film über zu jedem Film interessante die Senator Film AG über zukunftsweisenden Film "Jurassic Park" die Lunaris Film GmbH erfolgreichsten Film der. Die Film AG stellte nach... der Odeon Film AG.
Candidates: CineMedia NAME, Odeon NAME, Senator NAME, Lunaris NAME, Die NAME, ...
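A minimal sketch of how one pattern rule could be slid over a tokenized sentence to propose candidates. It assumes whitespace tokenization and reads `_UC` as "upper-case initial, not all capitals", which is inferred from the candidates on this slide; `flat_feature` and `apply_rule` are illustrative names.

```python
def flat_feature(word):
    """Assumed feature definitions; the slides only name the features."""
    letters = [ch for ch in word if ch.isalpha()]
    if not letters:
        return "_MIX"
    if all(ch.isupper() for ch in letters):
        return "_CAP"
    if letters[0].isupper():
        return "_UC"
    if all(ch.islower() for ch in letters):
        return "_LC"
    return "_MIX"

def apply_rule(tokens, known, rule):
    """Return (word, class) candidates proposed by one pattern rule."""
    lhs, target = rule
    found = []
    for start in range(len(tokens) - len(lhs) + 1):
        window = tokens[start:start + len(lhs)]
        candidate = None
        for word, slot in zip(window, lhs):
            if slot.endswith("*"):
                # Starred slot: the word must carry the flat feature.
                if flat_feature(word) != slot[:-1]:
                    break
                candidate = word
            elif known.get(word) != slot:
                # Plain slot: the word must be known with this class.
                break
        else:
            if candidate is not None:
                found.append((candidate, target))
    return found

known = {"Film": "NAME", "AG": "KIND"}
rule = (["_UC*", "NAME", "KIND"], "NAME")
tokens = "Die CineMedia Film AG übernahm die Odeon Film AG".split()
print(apply_rule(tokens, known, rule))
```

The same call on "Die Film AG stellte" proposes "Die" as NAME, which is the kind of false candidate the verification step is meant to filter out.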

10 Verification Step
- Use a candidate to find example sentences: "Odeon" -> 30 sentences
- Apply pattern rules and check the classifications of the candidate
Pattern rule: _UC* NAME KIND -> NAME
Known elements: AG KIND, Film NAME
Fragments containing "Odeon": Die Odeon Film AG (3x) des Vorstands der Odeon Film AG rennomierte Viedovertriebskette Odeon teilen sich Hecos Odeon Sub/200/Center Rahmenvertrag mit Odeon Zwo setzt auf Odeon Film AG...
Results:
- "Odeon" is NAME in 17/30 cases -> accept
- "Senator" is NAME in 2/30 cases -> reject
- "Die" is NAME in 0/30 cases -> reject
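The accept/reject decision can be sketched as a ratio test. The slide gives only the three outcomes, not the threshold, so the default `min_ratio=0.5` below is an assumption that happens to reproduce them.

```python
def accept_candidate(hits, total, min_ratio=0.5):
    """Accept a candidate if the rules confirm its class often enough.

    hits:  number of example sentences in which a rule assigns the class
    total: number of example sentences retrieved for the candidate
    """
    if total == 0:
        return False               # no evidence, no acceptance
    return hits / total >= min_ratio

for word, hits in [("Odeon", 17), ("Senator", 2), ("Die", 0)]:
    print(word, "accept" if accept_candidate(hits, 30) else "reject")
```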

11 Extraction of Multiterms
- Patterns with delimiters can be used for extraction
- Patterns select only appropriate multiterms, not single occurrences of name parts
Extraction pattern: _DELIML NAME NAME KIND _DELIMR = company
Known elements: AG KIND, Film NAME, Orion NAME
Multiterms containing "Odeon": Die Odeon Film AG (3x) des Vorstands der Odeon Film AG rennomierte Viedovertriebskette Odeon teilen sich Hecos Odeon Sub/200/Center Rahmenvertrag mit Odeon Zwo setzt auf Odeon Film AG
Extracted multiterm: "Odeon Film AG"
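A minimal sketch of extraction with delimiters. The slide does not define `_DELIML` and `_DELIMR`; the code below assumes a delimiter is either a sentence boundary or a word without a known class, so that a match cannot be a fragment of a longer name.

```python
def extract_multiterms(tokens, known, pattern):
    """Return multiterms whose word classes equal `pattern` and that are
    delimited on both sides (assumed: boundary or unclassified word)."""
    n = len(pattern)
    terms = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if [known.get(word) for word in window] != list(pattern):
            continue
        left_ok = start == 0 or tokens[start - 1] not in known
        right_ok = start + n == len(tokens) or tokens[start + n] not in known
        if left_ok and right_ok:
            terms.append(" ".join(window))
    return terms

known = {"Odeon": "NAME", "Film": "NAME", "AG": "KIND"}
tokens = "des Vorstands der Odeon Film AG".split()
print(extract_multiterms(tokens, known, ["NAME", "NAME", "KIND"]))
```

Fragments such as "Rahmenvertrag mit Odeon Zwo" yield nothing, since a single name part does not fill the whole pattern.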

12 Experiment
Prerequisites
- take arbitrary company list
- sort words by frequency
- truncate top 1'000
- assign classes NAME, KIND
Pattern Rules
- Generate Patterns from Regexp
- Generate Pattern Rules from Patterns
- Add delimiters to Patterns to get Extraction Patterns

13 Evaluation
Input
- 1'002 Items
- 47 Pattern Rules
- 106 Extraction Patterns
Output
- over 12'000 Items
- over 6'000 multiterms (company names)
Category | Correct | With Specifier | Fractions | Errors
Example | Odeon Film AG | Grazer Andritz AG | Großmarkt GmbH | Plastik MiniDIL
Fraction | 75.80 % | 17.36 % | 6.08 % | 0.76 %

14 Aggregation of Name Variants
- Rule 1: first word is a location: remove it if the short form has high frequency
- Rule 2: generic name not aligned to term border: keep it to distinguish between subsidiaries
- Rule 3: long form has a generic name as first word: remove it if the short form has higher frequency
Long candidate | Short candidate | Correct name
Düsseldorfer Bank eG | Bank eG | Düsseldorfer Bank eG
Düsseldorfer Rheinmetall AG | Rheinmetall AG |
Mannheimer Pharmexx GmbH | Pharmexx GmbH |
Infomatec Media AG | Infomatec AG | Infomatec Media AG
JENOPTIK Automatisierungstechnik GmbH | Jenoptik GmbH | JENOPTIK Automatisierungstechnik GmbH
Jenoptik Bauentwicklung GmbH | Jenoptik GmbH | Jenoptik Bauentwicklung GmbH
Kleindienst Datentechnik GmbH | Kleindienst GmbH | Kleindienst Datentechnik GmbH
Nachrichtenagentur dpa-AFX | dpa-AFX |
Infomatec-Tochtergesellschaft Igel GmbH | Igel GmbH |
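The three rules can be sketched as a single decision function over a long/short candidate pair. The word lists `locations` and `generic`, the frequency table `freq`, and the threshold `high` for "high frequency" are all hypothetical inputs; the slide does not say how they are obtained or set.

```python
def choose_name(long_form, short_form, freq, locations, generic, high=10):
    """Pick which form of a long/short candidate pair to keep."""
    first = long_form.split()[0]
    if first in locations:
        # Rule 1: drop the location only if the short form is frequent.
        return short_form if freq.get(short_form, 0) >= high else long_form
    if first in generic:
        # Rule 3: drop a leading generic word if the short form is
        # more frequent than the long form.
        if freq.get(short_form, 0) > freq.get(long_form, 0):
            return short_form
        return long_form
    # Rule 2: an extra word inside the term is kept, since it may
    # distinguish one subsidiary from another.
    return long_form
```

With a rare short form such as "Bank eG", rule 1 keeps the long form "Düsseldorfer Bank eG", which agrees with the first row of the table.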

15 Application to Semantic Networks
Media: Helkon Media AG, ProSieben Media AG, I-D Media AG
http://www.wortschatz.uni-leipzig.de
http://www.texttech.de

16 Media Example (2): Helkon Media AG

17 Media Example (3): ProSieben Media AG

18 Media Example (4): I-D Media AG

19 Telekom Example: Telekom vs. Deutsche Telekom AG

20 Summary
- Example-based unsupervised learning algorithm for multiterm extraction
- Disambiguation of generic company name parts in knowledge representation
- More fine-grained representation of complex concepts

21 END: Thanks for your attention!

