1 Semantic Indexing with Typed Terms using Rapid Annotation
16th of August 2005
TKE-05 Workshop on Semantic Indexing, Copenhagen
Chris Biemann, University of Leipzig

2 Outline
- The benefits of typed terms and relations
- Alleviating the ontology bottleneck
- Rapid annotation
- Sources for annotation candidates
- Annotation tools
- Case study: Annotation of "Deutscher Wortschatz"
- Conclusion

3 Typed terms and relations
- The bag of words model treats all terms equally
  - Document similarity is based on all terms
  - No views on the data are possible
- Typed terms and relations
  - Multiple views on documents w.r.t. types
  - Document similarity restricted to types and augmented by relations
  - Enables some tasks of Question Answering

4 Motivating example: untyped
Documents:
1. The government official A. Smith signed a contract over the purchase of 100 tanks from weapon manufacturer B. Miller.
2. "Weapon sales increased", a government official stated, "especially tanks sell well"
3. A holiday cruise on a yacht invites to take photos of seagulls.
4. The photos show A. Smith on a cruise with B. Miller's yacht.
Similarity of terms:
        Doc 1  Doc 2  Doc 3  Doc 4
Doc 1   -
Doc 2   3      -
Doc 3   0      0      -
Doc 4   2      0      3      -
Clustering: [Figure: graph over documents 1 to 4]

5 Motivating example: type PERSON
Documents:
1. The government official A. Smith signed a contract over the purchase of 100 tanks from weapon manufacturer B. Miller.
2. "Weapon sales increased", a government official stated, "especially tanks sell well"
3. A holiday cruise on a yacht invites to take photos of seagulls.
4. The photos show A. Smith on a cruise with B. Miller's yacht.
Similarity of terms:
        Doc 1  Doc 2  Doc 3  Doc 4
Doc 1   -
Doc 2   0      -
Doc 3   0      0      -
Doc 4   2      0      0      -
Clustering: [Figure: graph over documents 1 to 4]
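The similarity values on slides 4 and 5 can be reproduced with a small sketch. It assumes that similarity is the number of shared terms, that the documents are segmented into the terms listed below, and that every type label other than PERSON (ROLE, OBJECT, EVENT) is a hypothetical placeholder; the slides state none of this explicitly.

```python
from itertools import combinations

# Hand-segmented terms for the four example documents. The segmentation and
# all type labels except PERSON are assumptions made for this sketch.
DOCS = {
    1: {"government official": "ROLE", "A. Smith": "PERSON",
        "contract": "OBJECT", "purchase": "EVENT", "tanks": "OBJECT",
        "weapon": "OBJECT", "manufacturer": "ROLE", "B. Miller": "PERSON"},
    2: {"weapon": "OBJECT", "sales": "EVENT",
        "government official": "ROLE", "tanks": "OBJECT"},
    3: {"holiday": "EVENT", "cruise": "EVENT", "yacht": "OBJECT",
        "photos": "OBJECT", "seagulls": "OBJECT"},
    4: {"photos": "OBJECT", "A. Smith": "PERSON", "cruise": "EVENT",
        "B. Miller": "PERSON", "yacht": "OBJECT"},
}


def view(doc_id, term_type=None):
    """Return the terms of a document, optionally restricted to one type."""
    return {term for term, label in DOCS[doc_id].items()
            if term_type is None or label == term_type}


def similarity(term_type=None):
    """Count shared terms for every document pair (a, b) with a < b."""
    return {(a, b): len(view(a, term_type) & view(b, term_type))
            for a, b in combinations(sorted(DOCS), 2)}


untyped = similarity()          # bag-of-words view over all terms
persons = similarity("PERSON")  # view restricted to the type PERSON
print(untyped)
print(persons)
```

Under these assumptions the untyped counts are highest for the pairs (1, 2) and (3, 4), while the PERSON view leaves only the pair (1, 4) with a non-zero count.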

6 The ontology bottleneck
- Semantic Web people believe that annotation with ontology relations will enable semantic search, ...
- Annotation: choose an ontology, label all instances in the document
- Problems:
  - New documents have to be annotated all over again
  - Merging of ontologies
  - Despite tools, users are reluctant to annotate their documents
[Figure: Doc 1 ... Doc n, each with its own annotation Anno 1 ... Anno n; merged ontology; interface]

7 Centralized annotation
- Types and relations for terms are assigned globally and once and for all.
- No (logically grounded, consistent) ontology, but a free collection of types and relations suited to the problem
- Annotation is done for document collections
[Figure: Doc 1 ... Doc n forming a document collection; a single annotation; interface]

8 Generating Candidates for Annotation
- Given N terms from the collection, it is not feasible to present all N² pairs to an annotator; most of the pairs will not be related.
- Needed: a method that produces terms with similar types and related pairs at a high rate.
- Method here:
  - Co-occurrence statistics: pairs of terms that occur significantly often together in sentences/documents
  - Co-occurrences of higher orders: pairs of terms that have similar co-occurrence statistics
- Co-occurrences reflect syntagmatic and paradigmatic relations; the former are ruled out in higher orders.
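As a rough illustration of how sentence co-occurrence narrows the N² pairs down to a short candidate list, the sketch below counts term pairs that share a sentence. The corpus, the stopword list and the raw count threshold `min_count` are assumptions for illustration; the slides call for a significance test rather than a raw count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stopword list; a real system would use a significance
# measure instead of filtering function words by hand.
STOPWORDS = {"the", "a"}


def cooccurrence_candidates(sentences, min_count=2):
    """Return term pairs that share a sentence at least min_count times."""
    pair_counts = Counter()
    for sentence in sentences:
        terms = sorted(set(sentence.lower().split()) - STOPWORDS)
        # Each unordered pair is counted once per sentence.
        pair_counts.update(combinations(terms, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Tiny made-up corpus, for illustration only.
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "a mouse feared the cat",
]
candidates = cooccurrence_candidates(corpus)
print(candidates)
```

Only the pairs that survive the threshold would be shown to an annotator; all other pairs are never presented.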

9 The cats and dogs example
- cat co-occurrences: dog, her, food, pet, litter, she, burglar, animal, my, mouse, feline, Garfield, like, Cat, bag
- cat order 2: cats, pet, dog, animals, animal, dogs, pets, neutered, her, she, Synindex, like, tabbie, pigs, shelter
- cat order 4: pet, pets, cats, dog, pigs, animals, dogs, animal, owners, zoo, wild, birds, rabbits, puppies, tiger

10 Graphical annotation tool: colourizing co-occurrences

11 Specifying types and relations
A click on a node or an edge opens a context menu restricted to the POS

12 Web-based annotation tool for arbitrary candidate sources

13 Rule-based candidate generation
- If some annotation is already present, rules can be specified to obtain candidates at an even higher rate.
- It is possible to guess the type of candidates.
- Example:
  - Rule 1: If IS-A(A,B) and PROPERTY(B), then PROPERTY(A); yields LIVING(dog) as candidate
  - Rule 2: If IS-A(A,B) and COHYPONYM(A,C), then IS-A(C,B); yields IS-A(cat, animal) as candidate
[Figure: graph over dog, cat and animal with LIVING labels and IS-A and CO-HYPONYM edges]
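The two rules can be sketched as set comprehensions over relation tuples. The three starting facts below are read off the slide's example and are an assumption of this sketch, as is the tuple layout; the slides do not describe the tool's internal data model.

```python
# Existing annotation, read off the slide's example; treating exactly these
# three facts as given is an assumption of this sketch.
IS_A = {("dog", "animal")}          # IS-A(dog, animal)
COHYPONYM = {("dog", "cat")}        # COHYPONYM(dog, cat)
PROPERTY = {("LIVING", "animal")}   # LIVING(animal)


def rule_1(is_a, properties):
    """IS-A(A,B) and PROPERTY(B) suggest PROPERTY(A)."""
    return {(prop, a)
            for a, b in is_a
            for prop, holder in properties
            if holder == b and (prop, a) not in properties}


def rule_2(is_a, cohyponym):
    """IS-A(A,B) and COHYPONYM(A,C) suggest IS-A(C,B)."""
    return {(c, b)
            for a, b in is_a
            for a2, c in cohyponym
            if a2 == a and (c, b) not in is_a}


property_candidates = rule_1(IS_A, PROPERTY)
is_a_candidates = rule_2(IS_A, COHYPONYM)
print(property_candidates)
print(is_a_candidates)
```

The results are only candidates: an annotator still accepts or rejects each one. The sketch treats COHYPONYM as directed exactly as written in Rule 2; a symmetric reading would also need the reversed pairs.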

14 Tool to accept or reject rule-based candidates

15 Case study: Annotating Deutscher Wortschatz
www.wortschatz.uni-leipzig.de
In terms of numbers: in 1,000 hours, annotators, who could choose between 46 semantic types and 57 relations, produced 150,000 type instances and 150,000 relation instances for over 80,000 distinct terms. That is a text coverage of 90%, at a speed of 5 units per minute.

16 Different relations from different sources

17 Example: Query resolution with types and relations
Query: "Find documents mentioning at least two heads of computer companies!"
1. Translate into a formal query:
   Qset = {B | IS-A(A, computer company), HEAD-OF(B, A)}
   b1 ∈ Qset, b2 ∈ Qset, b1 ≠ b2
2. Access the search engine with possible b1, b2
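A minimal sketch of the two steps, assuming an in-memory relation store and a substring match in place of a real search engine. All company names, person names and documents are hypothetical placeholders.

```python
# Hypothetical relation store and documents; all names are placeholders.
IS_A = {("AlphaSoft", "computer company"),
        ("BetaChips", "computer company"),
        ("GammaFoods", "food company")}
HEAD_OF = {("Ann Alpha", "AlphaSoft"),
           ("Bob Beta", "BetaChips"),
           ("Gil Gamma", "GammaFoods")}
DOCUMENTS = {
    "d1": "Ann Alpha and Bob Beta met at a trade fair.",
    "d2": "Ann Alpha had lunch with Gil Gamma.",
    "d3": "Bob Beta announced a new product.",
}


def query_set(category):
    """Step 1: Qset = {B | IS-A(A, category), HEAD-OF(B, A)}."""
    companies = {a for a, b in IS_A if b == category}
    return {person for person, company in HEAD_OF if company in companies}


def documents_with_at_least_two(qset):
    """Step 2: keep documents mentioning two distinct members of Qset."""
    hits = []
    for doc_id, text in sorted(DOCUMENTS.items()):
        mentioned = {b for b in qset if b in text}  # stand-in for a search
        if len(mentioned) >= 2:
            hits.append(doc_id)
    return hits


qset = query_set("computer company")
hits = documents_with_at_least_two(qset)
print(qset, hits)
```

Collecting the mentioned names in a set enforces b1 ≠ b2: a document that names the same person twice does not qualify.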

18 What Google found: Find documents mentioning at least two heads of computer companies! #1 hit 14.08.2005 www.google.com

19 Conclusion
- Typed terms and relations can facilitate processing of electronic documents for a wide range of applications
- Rapid annotation alleviates the acquisition bottleneck by
  - globally annotating
  - local dependencies
- Intuitive annotation tools are highly important for producing large amounts of annotation in a short time

20 QUESTIONS?!? THANK YOU

21 Bonus material
- Co-occurrences
- Co-occurrences of higher orders

22 Statistical Co-occurrences
- Co-occurrence: occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbors)
- Significant co-occurrences reflect relations between words
- Significance measure (log-likelihood) [formula not preserved in the transcript], where
  - k is the number of sentences containing a and b together
  - ab is (number of sentences with a) * (number of sentences with b)
  - n is the total number of sentences in the corpus
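The formula itself is not part of the transcript, so the sketch below is not a reconstruction of it. It shows one measure that uses exactly the quantities k, ab and n: the negative log of the Poisson probability of seeing k joint sentences when ab/n are expected, normalised by log n. The choice of this measure, the function name and the cut-off for k at or below the expectation are all assumptions.

```python
import math


def significance(k, count_a, count_b, n):
    """Poisson-style significance of a and b sharing k of n sentences.

    count_a and count_b are the numbers of sentences containing a and b;
    their product is the quantity called ab on the slide.
    """
    lam = count_a * count_b / n           # expected joint count, ab / n
    if k <= lam:
        return 0.0                        # assumed cut-off: not above chance
    # -log P(X = k) for X ~ Poisson(lam): lam - k*log(lam) + log(k!)
    neg_log_p = lam - k * math.log(lam) + math.lgamma(k + 1)
    return neg_log_p / math.log(n)        # normalise by corpus size


weak = significance(2, 30, 40, 10_000)
strong = significance(20, 30, 40, 10_000)
print(weak, strong)
```

With the word frequencies held fixed, the value grows with the number of joint sentences, so ranking a word's partners by this value gives the ordering a co-occurrence set needs.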

23 Iterating Co-occurrences
- (Sentence-based) co-occurrences of first order: words that co-occur significantly often together in sentences
- Co-occurrences of second order: words that co-occur significantly often in co-occurrence sets of first order
- Co-occurrences of n-th order: words that co-occur significantly often in co-occurrence sets of (n-1)th order
- When calculating a higher order, the significance values of the preceding order are not relevant.
- A co-occurrence set consists of the N highest ranked co-occurrences of a word.

24 Constructed Example I
Ord 1     dog  terrier  cat  mouse  barking  bite  yelp
dog            -        -    -      X        x     X
terrier   -             -    -      x        x     X
cat       -    -             x      -        x     -
mouse     -    -        X           -        x     -
barking   X    X        -    -               -     -
bite      X    X        x    x      -              -
yelp      x    x        -    -      -        -

Ord 2     dog  terrier  cat  mouse  barking  bite  yelp
dog            3        1    1      -        -     -
terrier   3             1    1      -        -     -
cat       1    1             1      -        -     -
mouse     1    1        1           -        1     -
barking   -    -        -    -               2     2
bite      -    -        -    1      2              2
yelp      -    -        -    -      2        2

25 Constructed Example II
Ord 3     dog  terrier  cat  mouse  barking  bite  yelp
dog            -        -    -      -        -     -
terrier   -             -    -      -        -     -
cat       -    -             -      -        -     -
mouse     -    -        -           -        -     -
barking   -    -        -    -               1     1
bite      -    -        -    -      1              1
yelp      -    -        -    -      1        1

Ord 2     dog  terrier  cat  mouse  barking  bite  yelp
dog            x        -    -      -        -     -
terrier   x             -    -      -        -     -
cat       -    -             -      -        -     -
mouse     -    -        -           -        -     -
barking   -    -        -    -               x     x
bite      -    -        -    -      x              x
yelp      -    -        -    -      x        x
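The step from one order to the next in the constructed example can be sketched as counting, for each word pair, how many members their co-occurrence sets share. The first-order sets below are read off the Ord 1 table with X and x both treated as a link, and the threshold of 2 stands in for the significance test; both are assumptions of this sketch.

```python
from itertools import combinations

# First-order sets read off the Ord 1 table (X and x both count as a link).
ORDER_1 = {
    "dog": {"barking", "bite", "yelp"},
    "terrier": {"barking", "bite", "yelp"},
    "cat": {"mouse", "bite"},
    "mouse": {"cat", "bite"},
    "barking": {"dog", "terrier"},
    "bite": {"dog", "terrier", "cat", "mouse"},
    "yelp": {"dog", "terrier"},
}


def next_order_counts(sets):
    """Count shared set members for every word pair (alphabetical keys)."""
    counts = {}
    for a, b in combinations(sorted(sets), 2):
        shared = len(sets[a] & sets[b])
        if shared:
            counts[(a, b)] = shared
    return counts


def counts_to_sets(counts, words, min_count=2):
    """Keep pairs at or above min_count (assumed stand-in for significance)."""
    sets = {word: set() for word in words}
    for (a, b), n in counts.items():
        if n >= min_count:
            sets[a].add(b)
            sets[b].add(a)
    return sets


counts_2 = next_order_counts(ORDER_1)        # second-order counts
order_2 = counts_to_sets(counts_2, ORDER_1)  # thresholded second order
counts_3 = next_order_counts(order_2)        # third-order counts
print(counts_2)
print(counts_3)
```

Under this reading the sketch gives 3 for dog and terrier and 2 for each pair among barking, bite and yelp in the second order, and 1 for each of those three pairs in the third order. It also gives 1 for cat and bite in the second order, where the slide's table shows no entry.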
