Presented By- Shahina Ferdous, Student ID – 1000630375, Spring 2010.

Presentation on theme: "Presented By- Shahina Ferdous, Student ID – 1000630375, Spring 2010."— Presentation transcript:

1 Presented By- Shahina Ferdous, Student ID – 1000630375, Spring 2010

2 SemTag is an application built on the Seeker platform that adds semantic tags to the existing HTML body of the web.
Example: “The Chicago Bulls announced yesterday that Michael Jordan will…” is tagged so that “Chicago Bulls” refers to http://tap.stanford.edu/BasketballTeam_Bulls and “Michael Jordan” refers to http://tap.stanford.edu/AthleteJordan_Michael.
Large-scale automated semantic tagging of this kind will accelerate the creation of the Semantic Web.
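As a rough illustration of what a semantic tag attaches to the text, the sketch below records, for each known label, where it occurs and which canonical reference it points to. The lookup table `LABEL_REFS` and the function `annotate` are hypothetical names chosen for this example, and the URLs are the two from the slide with stray spaces removed. This is not SemTag's actual output format, and it performs no disambiguation.

```python
import re

# Hypothetical label-to-reference table; the URLs are the two shown on the slide.
LABEL_REFS = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan_Michael",
}

def annotate(text, label_refs):
    """Return (start, end, label, reference) for each label occurrence, in text order."""
    annotations = []
    for label, ref in label_refs.items():
        for match in re.finditer(re.escape(label), text):
            annotations.append((match.start(), match.end(), label, ref))
    return sorted(annotations)

sentence = "The Chicago Bulls announced yesterday that Michael Jordan will..."
tags = annotate(sentence, LABEL_REFS)
for start, end, label, ref in tags:
    print(f"{label!r} at [{start}:{end}] -> {ref}")
```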

3 The Semantic Web is a vision of transforming all documents on the web into a machine-understandable format, so that applications or programs can process them without human intervention.
All entities in documents will be canonically annotated, so programs can easily determine what a document is about.

4 To accomplish the Semantic Web vision, we need:
◦ Ontological support in the form of web-available services that maintain metadata about entities and provide it whenever needed.
◦ Large-scale availability of annotations within documents, encoding canonical references to the entities.
We also need to break a circular dependency:
◦ Applications are needed that make extensive use of semantically tagged data.
◦ There must be enough tagged data on the web for these applications to be useful.

5 Tagging is a way to classify entities in written or spoken text.
Any tagging process generally consists of two steps:
◦ Step 1: Identify the entities that should be classified.
◦ Step 2: Classify these instances according to their categories.
In semantic tagging, the categories used to classify the entities are derived from their intentions or meanings (what is being said rather than how it is said).

6 Sense tagging: “He runs the company” (run1 = control) vs. “He runs the marathon” (run2 = run by foot).
Feature tagging: “The speaker coughed” (Human) vs. “The speaker was disconnected” (Non-Human).

7 Ambiguities in a natural-language corpus such as the web need to be resolved.
Maintaining and updating a large-scale corpus requires a scalable infrastructure that most tagging applications are unable to support.
A shared platform is required so that multiple tagging applications can use it.

8 Designed the Seeker platform, which provides highly scalable core functionality to support SemTag and other tagging algorithms.
SemTag uses a new disambiguation algorithm called TBD, for taxonomy-based disambiguation.
Applied SemTag to a collection of approximately 264 million web pages and generated 434 million automatically disambiguated semantic tags.
Published metadata regarding the annotations to the web as a label bureau.

9

10 SemTag runs in three phases:
◦ Spotting pass: generate a window of context surrounding each label (10 words, the label, 10 words).
◦ Learning pass: use a representative sample to determine the distribution of terms in the taxonomy.
◦ Tagging pass: disambiguate references using the TBD algorithm.
Two kinds of ambiguity arise:
◦ The same label appears at multiple locations in the TAP ontology.
◦ Some labels occur in contexts that are missing from the taxonomy.
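The spotting pass can be sketched as follows: find each occurrence of a label in a token stream and keep up to 10 words on either side. The function name `spot_windows`, the whitespace tokenization, and the sample sentence are assumptions made for illustration; the slide does not say how SemTag tokenizes pages or stores windows.

```python
def spot_windows(tokens, label_tokens, width=10):
    """Return (left_context, right_context) for every occurrence of label_tokens."""
    n = len(label_tokens)
    windows = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == label_tokens:
            left = tokens[max(0, i - width):i]      # up to `width` words before the label
            right = tokens[i + n:i + n + width]     # up to `width` words after the label
            windows.append((left, right))
    return windows

# Illustrative sentence; a narrow width keeps the output readable.
text = "The Chicago Bulls announced yesterday that Michael Jordan will return"
tokens = text.split()
windows = spot_windows(tokens, ["Michael", "Jordan"], width=3)
print(windows)
```

With the default `width=10`, the window matches the "10 words, label, 10 words" shape described on the slide; near the start or end of a page the context is simply shorter.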

11 TBD makes use of two classes of training information:
◦ Automatic metadata: helps determine whether the context around a label belongs within a subtree of the taxonomy.
◦ Manual metadata: indicates, for nodes of the taxonomy, whether they contain highly ambiguous or unambiguous labels.

12 An ontology in TBD is defined by four elements:
◦ A set of classes, C
◦ A subclass relation, s(c1, c2)
◦ A set of instances, I
◦ A type relation, t(i, c)
A taxonomy T is defined by three elements:
◦ A set of nodes, V
◦ A root node, r
◦ A parent function, p
An ontology describes relationships in an N-dimensional manner, whereas a taxonomy describes hierarchical relationships.

13 Each node in the taxonomy has a set of labels. For example, the nodes Musician, Singer, and Band Members can all contain the label Mark Knopfler.
An ancestry chain is the path from a node to the root of the taxonomy, obtained by following the parent relationship.
A spot, spot(l, c), is a label l in a context c, e.g. spot(Mark Knopfler, Singer).
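The taxonomy elements (nodes V, root r, parent function p), per-node labels, ancestry chains, and spots can be sketched as a small data structure. The class and method names are hypothetical, and the hierarchy used here (Singer under Musician under Music) is an assumption for the example; the slides do not state how these nodes are arranged in TAP.

```python
from collections import namedtuple

# A spot is a label together with its context, as in spot(l, c).
Spot = namedtuple("Spot", ["label", "context"])

class Taxonomy:
    def __init__(self, root):
        self.root = root
        self.parent = {root: None}    # parent function p; the root has no parent
        self.labels = {root: set()}   # set of labels attached to each node

    def add_node(self, node, parent, labels=()):
        if parent not in self.parent:
            raise KeyError(f"unknown parent: {parent}")
        self.parent[node] = parent
        self.labels[node] = set(labels)

    def ancestry_chain(self, node):
        """Path from node up to the root, following the parent function."""
        chain = []
        while node is not None:
            chain.append(node)
            node = self.parent[node]
        return chain

    def nodes_with_label(self, label):
        """All nodes carrying the label; more than one means the label is ambiguous."""
        return sorted(n for n, ls in self.labels.items() if label in ls)

# Assumed hierarchy, for illustration only.
tax = Taxonomy("Music")
tax.add_node("Musician", "Music", labels=["Mark Knopfler"])
tax.add_node("Singer", "Musician", labels=["Mark Knopfler"])

spot = Spot(label="Mark Knopfler", context="Singer")
print(tax.ancestry_chain("Singer"))
print(tax.nodes_with_label(spot.label))
```

A label returned for several nodes is exactly the first kind of ambiguity the tagging pass has to resolve.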

14 Each internal node in TAP has an associated similarity function that determines whether a particular context is similar to that node.
A good similarity function has the property that the higher the similarity, the more likely it is that the spot contains a reference to an entity belonging to the subtree rooted at that node.
[Figure: Example of a subtree in the taxonomy. Nodes: Music, Musician, Singer; the label Mark Knopfler appears twice. For Spot(Mark Knopfler, Singer), with context c and node u, one node is marked “should have higher similarity value”.]

15 TBD determines whether a particular context is appropriate for a particular node in the taxonomy.

16 TBD uses the manually generated metadata as a training set to calculate m^a_u and m^s_u, where m^a_u is the probability, as measured by human judgment, that spots for the subtree rooted at u are on topic, and m^s_u is the probability that Sim correctly judges whether spots for the subtree rooted at u are on topic.

17 Lexicon generation:
◦ Built a collection of 1.4 million unique words occurring in a random subset of windows containing approximately 90 million total words.
◦ Took the 200,100 most frequent words.
◦ Removed the 100 most frequent of these.
◦ Performed further computations in the 200,000-dimensional vector space defined by the remaining words.
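The lexicon steps above can be sketched with a frequency count: rank words by frequency, keep the top 200,100, then drop the top 100, leaving 200,100 − 100 = 200,000 dimensions. The function name `build_lexicon` and the toy input are assumptions for illustration; the slide does not describe the actual implementation.

```python
from collections import Counter

def build_lexicon(windows, keep=200_100, drop_top=100):
    """Keep the `keep` most frequent words, then drop the `drop_top` most frequent of those."""
    counts = Counter(word for window in windows for word in window)
    ranked = [word for word, _ in counts.most_common(keep)]
    return ranked[drop_top:]

# Toy data with small parameters so the effect is visible.
toy_windows = [
    ["the", "the", "the", "bulls", "bulls", "jordan"],
    ["the", "jordan", "bulls", "played"],
]
lexicon = build_lexicon(toy_windows, keep=3, drop_top=1)
print(lexicon)
```

Dropping the very top of the ranking removes words so common that they carry little information about any particular node.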

18 Each node is associated with a 200,000-dimensional vector.
Four standard candidates for similarity functions were evaluated:
◦ Scheme ‘Prob’
◦ Scheme ‘TF-IDF’
◦ Algorithm ‘IR’
◦ Algorithm ‘Bayes’
According to the authors' results, IR with the TF-IDF scheme gives the best accuracy (82%), which is a significant improvement.
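A minimal sketch of the best-performing combination, assuming that ‘IR’ denotes cosine similarity in the usual vector-space sense and that ‘TF-IDF’ weights each word by term frequency times log(N / document frequency). The slide does not spell out either definition, so both are assumptions, as are the node names and word lists below. Sparse dictionaries stand in for the 200,000-dimensional vectors.

```python
import math
from collections import Counter

def tfidf_vector(words, doc_freq, num_docs):
    """Sparse TF-IDF vector; words outside the lexicon (doc_freq) are ignored."""
    tf = Counter(words)
    return {w: tf[w] * math.log(num_docs / doc_freq[w]) for w in tf if w in doc_freq}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical training words collected for two taxonomy nodes.
node_words = {
    "Singer": ["guitar", "album", "song", "band"],
    "Athlete": ["game", "team", "season", "score"],
}
num_docs = len(node_words)
doc_freq = Counter(w for words in node_words.values() for w in set(words))
node_vectors = {n: tfidf_vector(ws, doc_freq, num_docs) for n, ws in node_words.items()}

# Context window around a spot, compared against every candidate node.
context = ["new", "album", "and", "song"]
context_vector = tfidf_vector(context, doc_freq, num_docs)
sims = {n: cosine(context_vector, v) for n, v in node_vectors.items()}
best = max(sims, key=sims.get)
print(best, sims)
```

The node whose vector is closest to the context vector is the candidate interpretation of the spot; how TBD then combines this with the manual metadata is not covered by this sketch.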

19 Seeker is a platform developed to support SemTag and other sophisticated text analytics applications.
It is designed to achieve the following goals:
◦ Composability
◦ Modularity
◦ Extensibility
◦ Scalability
◦ Robustness

20

21 Seeker is a service-oriented architecture (SOA), built as a local-area, loosely coupled, pull-based distributed computation system.
To address scalability and robustness, Seeker incorporates a component named Infrastructure, which contains a small set of critical services.
Analysis agents process web pages to generate annotations.

22

23

24 Automatic semantic tagging is essential to bootstrap the Semantic Web.
It is possible to achieve good accuracy even with simple disambiguation approaches.

25 Questions?

