
1 LingPipe http://www.alias-i.com/lingpipe/

2 Does a variety of tasks  • Tokenization  • Part of Speech Tagging  • Named Entity Detection  • Clustering  • Identifies Significant Phrases  • Other: Topic Classification, Database Text Mining, Spell Checker, Sentiment Analysis, Chinese Word Segmentation

3 Other Niceties  • It's free  • Plenty of documentation  • Tutorials for every subtask  • Highly configurable  • Source code: very complex but well written, with good comments and examples of how to edit the code  • Can be trained for several languages

4 Tokenization  • Divides text into sentences and words using fairly sophisticated methods.
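To make the tokenization step concrete, here is a minimal Python sketch of sentence splitting and word tokenization. It is not LingPipe's implementation or API; the abbreviation list and both function names are hypothetical, chosen so that a title such as "Mr." does not end a sentence.

```python
import re

# Hypothetical abbreviation list (an assumption for this sketch);
# a real system would use a much larger list or a trained model.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof."}


def split_sentences(text):
    """Split text into sentences at ., ! or ?, skipping known abbreviations."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Return word tokens and single punctuation marks as separate tokens."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)


text = "This is Mr. Bob Smith. Bob lives in Redmond. He works for Microsoft."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

A naive split on every period would break the first sentence after "Mr.", which is the kind of case that more sophisticated methods are meant to handle.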

5 Part of Speech Tagging  • You can output the N-best results.  • You can output a confidence score for each word.  • You can also retrain the Part of Speech Tagger.  • You can also edit how it runs.

6 Named Entity Detection  • The default detection distinguishes between three types of entities: People (distinguishes male and female), Places, and Organizations.  • It can be trained to recognize any type of entity: you can get corpora online, or you can annotate your own corpora using WordFreak, which also comes with LingPipe.

7 Sample Input/Output  Input: This is Mr. Bob Smith. Bob lives in Redmond. He works for Microsoft.  Output:  • This is Mr. Bob Smith.  • Bob lives in Redmond.  • He works for Microsoft.

8 Dictionary  • To increase the accuracy of LingPipe, you can import a dictionary.  • A dictionary forces certain strings to be recognized as certain entity types.  • Common dictionaries include: gazetteers, lists of people's names, and lists of company names.
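The dictionary idea can be sketched as a longest-match lookup over token sequences. This Python example is an illustration only and does not use LingPipe's dictionary classes; the function name, the dictionary entries, and the type labels are all assumptions made for the example.

```python
def tag_with_dictionary(tokens, dictionary):
    """Return (start, end, entity_type) spans, preferring the longest match."""
    max_len = max((len(key) for key in dictionary), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in dictionary:
                match = (i, i + n, dictionary[key])
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched phrase
        else:
            i += 1
    return spans


# Hypothetical dictionary entries for this sketch.
DICTIONARY = {
    ("Bob", "Smith"): "PERSON",
    ("Redmond",): "LOCATION",
    ("Microsoft",): "ORGANIZATION",
}

tokens = "Bob Smith works for Microsoft in Redmond".split()
for start, end, entity_type in tag_with_dictionary(tokens, DICTIONARY):
    print(" ".join(tokens[start:end]), entity_type)
```

Because every listed string is always tagged with its listed type, a dictionary trades flexibility for predictability, which is why it helps most with closed lists such as place or company names.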

9 Coreference  • It identifies different references to the same entity, such as Bob Smith and Bob.  • It does not identify entities across documents.  • It links pronouns to their antecedents.  • It does not do other anaphora resolution, like “Jane was the woman who pulled the trigger.”

10 Clustering  • Single-Link Clustering: chops off the longest link  • Clustering with proximity bounds: merges based on proximity  • Extract for K clusters: you can specify how many clusters you want  • Complete-Link Clustering: a variant of single-link that measures distance using a whole cluster  • Within-Cluster Point Scatter: you don't need to specify the number of clusters; it detects the best breaking point. This is the method used to do NER across documents.
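The difference between single-link and complete-link clustering can be seen in a small Python sketch. It uses the bottom-up (agglomerative) formulation on one-dimensional points, which for single-link gives the same clusters as building the links and chopping off the longest ones. The function name and the sample points are assumptions for the example, not LingPipe's API.

```python
def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters until only k clusters remain.

    linkage="single":   cluster distance = closest pair of members
    linkage="complete": cluster distance = farthest pair of members
    """
    clusters = [[p] for p in points]
    pick = min if linkage == "single" else max

    def distance(a, b):
        return pick(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        pairs = ((i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


points = [0, 3, 7, 12, 18]
print(agglomerative(points, 2, "single"))    # [[0, 3, 7, 12], [18]]
print(agglomerative(points, 2, "complete"))  # [[0, 3], [7, 12, 18]]
```

Single-link chains neighbouring points together and leaves only the most isolated point on its own, while complete-link looks at the whole cluster and so produces more compact groups.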

11 Significant Phrases  • Determines phrases whose words occur together more often than chance would predict  • Seems to find mostly named entities, e.g. Puget Sound, George Bush  • Helps tell the genre of an article
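One common way to score "together more often than chance" is pointwise mutual information (PMI) over adjacent word pairs. The slides do not say which statistic LingPipe uses, so the Python sketch below is an assumption-laden illustration rather than LingPipe's method; the function name, the `min_count` threshold, and the sample text are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by PMI = log2(P(a, b) / (P(a) * P(b)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:  # rare pairs give unreliable scores
            continue
        p_ab = count / n_bigrams
        p_a = unigrams[a] / n_unigrams
        p_b = unigrams[b] / n_unigrams
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text = ("the ferry crossed Puget Sound and the ferry left "
        "Puget Sound before the storm")
for phrase, score in bigram_pmi(text.split()):
    print(" ".join(phrase), round(score, 2))
```

A pair scores high when its words rarely appear apart, which is typical of names such as Puget Sound and fits the observation that the results are mostly named entities.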

12 Questions?

