Text Mining Application Programming Chapter 3 Explore Text

1 Text Mining Application Programming Chapter 3 Explore Text
Manu Konchady, 2006

3 Outline
Words
Zipf's Law
Sentences
Indexing Document Text

4 Extracting words from text
A linguistic definition of a word is the smallest syntactic unit that cannot be broken into smaller segments. Words in a sequence governed by the grammar of the language form sentences.

5 The eight standard parts of speech
Nouns (名詞), Verbs (動詞), Adjectives (形容詞), Adverbs (副詞), Conjunctions (連接詞), Determiners (限定詞), Prepositions (介系詞), Pronouns (代名詞)
The first four are content words; the other four are function words.

6 Five types of phrases
Noun phrases: "A good day"
Verb phrases: "had thought", "was right", and "will be jumping"
Adjective phrases: "A nice shiny"
Preposition phrases: "With very long hair"

7 Words vs. Tokens A token is a more formal definition of a single unit of text. A single word may not be the smallest unit of text, and a token may consist of one or more words. We will use tokens to represent the smallest units of text processed in the higher layers of our model.

8 Complex tokens
Yahoo!, AT&T, Hancock&Co.
Mr. Smith, lb., or 192.168.1.1
New York-New Jersey, small-scale, or x-ray
Web URLs, exponential notation such as E-01, and emoticons such as (:-<)
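Slide 8's examples suggest a tokenizer that recognizes complex forms before falling back on plain words. The book's code is Perl (TextMine/token.pm); the sketch below is an illustrative Python approximation, and its pattern list is an assumption, not the book's actual rule set:

```python
import re

# One pattern per kind of complex token, tried in order before the
# plain-word fallback; the list is illustrative, far from exhaustive.
TOKEN_RE = re.compile(r"""
    \d{1,3}(?:\.\d{1,3}){3}     # IP addresses such as 192.168.1.1
  | [A-Za-z]+&[A-Za-z]+\.?     # internal punctuation: AT&T, Hancock&Co.
  | [A-Za-z]+(?:-[A-Za-z]+)+   # hyphenated forms: small-scale, x-ray
  | (?:Mr|Mrs|Dr|lb)\.         # a few known abbreviations
  | \d+(?:\.\d+)?              # plain numbers
  | [A-Za-z]+!?                # ordinary words, plus a trailing ! as in Yahoo!
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

# tokenize("Mr. Smith uses 192.168.1.1") -> ["Mr.", "Smith", "uses", "192.168.1.1"]
```

Alternation order matters here: the complex-token patterns must precede the plain-word pattern, or "AT&T" would be broken into "AT" and "T".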

9 Vector representations of documents used in clustering and text categorization are made up of a sequence of tokens and weights. Documents can be correctly categorized only when the vector representations accurately reflect the contents of the documents.
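A minimal sketch of such a token-and-weight representation, using raw term frequency as the weight (the weighting scheme here is an assumption for illustration; slide 30's IDF refines it):

```python
from collections import Counter

def to_vector(tokens, vocabulary):
    """Map a tokenized document onto a fixed vocabulary; the weight for
    each vocabulary term is its raw frequency in the document."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]

vocab = ["cat", "dog", "fish"]
# to_vector(["dog", "cat", "dog"], vocab) -> [1, 2, 0]
```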

10 Token Assembly

11 Abbreviations (縮寫): currencies, dimensions, time, places, organizations.

12 Base Words A base word is the root form of a word that can be found in the WordNet dictionary. For example, "jump" is the base word of "jumps".

13 Word Stems A word stem is a root form of a word.
Prevent, prevents, prevented, preventing, prevention
Porter's stemming algorithm (TextMine/token.pm)
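Porter's full algorithm applies several ordered phases of context-sensitive rules; the sketch below is only a toy suffix stripper (the suffix list and the three-letter minimum are assumptions for illustration) that happens to reduce the slide's examples to a common stem:

```python
def simple_stem(word):
    """Toy suffix stripper -- NOT the full Porter algorithm, which uses
    several ordered phases of context-sensitive rewrite rules."""
    word = word.lower()
    for suffix in ("ing", "ion", "ed", "s"):
        # Require at least three letters to remain so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Each of the slide's word forms maps to the stem "prevent":
# simple_stem("prevention") -> "prevent"
```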

14 Word and Meaning Relationships
A thesaurus (詞典) organizes words and word meanings. WordNet 2.0 contains 115,775 word meanings, or synonym sets (synsets), and 152,217 word forms.
Antonyms (反義字) are words with opposite meanings: rich and poor, hot and cold.

15 Organize word meanings into an acyclic hierarchy
Hypernym: the parent node
Hyponyms: the child nodes

16 Meronyms and Holonyms
Meronyms name the parts of a whole: finger is a meronym of hand, and hand is a meronym of the human body.
Holonyms name the whole: hand is the holonym of finger, metacarpus, palm, etc.
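These relations can be pictured as simple mappings. The fragment below is a hand-built toy (WordNet itself stores synsets with relation pointers and is normally accessed through its own API):

```python
# A toy fragment of WordNet-style relations, hand-built for illustration.
hypernym = {"oak": "tree", "tree": "plant"}            # child -> parent
meronyms = {"hand": ["finger", "metacarpus", "palm"],  # whole -> parts
            "arm": ["hand"]}

def holonyms_of(part):
    """The wholes that contain `part` -- the inverse of the meronym map."""
    return [whole for whole, parts in meronyms.items() if part in parts]

# holonyms_of("finger") -> ["hand"]
```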

17 Project Gutenberg (http://www.gutenberg.org)
Reuters, Alice in Wonderland, A Tale of Two Cities, Holy Bible

21 Heaps’s Law Heaps’s Law predicts the size of the vocabulary given the text. If the number of words is n, then the size of the vocabulary is v = Kn^β, where β is between 0 and 1 and K is a constant between 10 and 100. Values of β between 0.4 and 0.6 have been reasonably good approximations for predicting the size of the vocabulary.
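The formula is easy to evaluate; the constants below are illustrative mid-range choices from the slide's stated ranges, not fitted values:

```python
def heaps_vocabulary(n, K=40, beta=0.5):
    """Estimate vocabulary size v = K * n**beta (Heaps's law).
    K and beta must be fitted per collection; K is typically 10-100
    and beta typically 0.4-0.6."""
    return K * n ** beta

# For a 1,000,000-word text with K=40 and beta=0.5:
# heaps_vocabulary(1_000_000) -> 40000.0
```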

23 Word Distribution

24 Zipf’s Law G. K. Zipf first claimed that, by the principle of least effort, we use a few words very often and rarely use most other words.
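Zipf's claim is usually checked by ranking words by frequency and observing that rank × frequency stays roughly constant. A sketch (the input below is synthetic, constructed to follow the law exactly; real text only approximates it):

```python
from collections import Counter

def zipf_table(tokens):
    """Rank words by frequency and report rank * frequency, which
    Zipf's law predicts is roughly constant across ranks."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
# zipf_table(tokens) gives rank * frequency = 12 for every row
```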

27 Sentences A sentence is made up of one or more clauses, and each clause is made up of phrases. The subject, verb, object, complement, and adverbial phrases are arranged in order to make up a clause.
Sentence separators: period (.), exclamation mark (!), question mark (?), and semicolon (;)
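A naive splitter on these separators can be written as a regular expression; real text needs more care (abbreviations such as "Mr." also end with a period), which is what the book's text_split function in TextMine/WordUtil.pm addresses. A minimal sketch:

```python
import re

# Split wherever a separator is followed by whitespace; this deliberately
# ignores the abbreviation problem for brevity.
SENTENCE_END = re.compile(r"[.!?;]\s+")

def split_sentences(text):
    parts = SENTENCE_END.split(text.strip())
    return [p for p in parts if p]

# split_sentences("It rained. We stayed in; we read!")
# -> ["It rained", "We stayed in", "we read!"]
```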

28 TextMine/WordUtil.pm The text_split function

29 Stopwords Since high-frequency words are not generally useful in an index, they can be removed to save space and improve performance. The words that we exclude are called stopwords. High-frequency vs. low-frequency
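A stopword filter is a simple membership test; the list below is a tiny illustrative subset (production lists typically contain a few hundred high-frequency words):

```python
# Tiny illustrative stopword list; real systems use a few hundred entries.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

# remove_stopwords(["The", "size", "of", "the", "index"]) -> ["size", "index"]
```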

30 Inverse Document Frequency (IDF)
f_m = log N - log d_m + 1, where N is the total number of documents and d_m is the number of documents containing word m. The value 1 is added to avoid cases where a word m occurs in every document, leading to a value of 0 for f_m.
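A direct translation of the formula; the logarithm base is not stated on the slide, so the natural log is assumed here (the base only rescales the weights):

```python
import math

def idf(num_docs, doc_freq):
    """f_m = log N - log d_m + 1.  A word occurring in every document
    (doc_freq == num_docs) still gets weight 1 rather than 0."""
    return math.log(num_docs) - math.log(doc_freq) + 1

# idf(4, 4) -> 1.0   (word in every document)
# idf(4, 1) is the largest weight: the rarest words weigh the most.
```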

31 Latent Semantic Indexing
Latent semantic indexing (LSI) is an indexing method based on the Singular Value Decomposition (SVD) of the word-document matrix. The SVD is a mathematical procedure that transforms the word-document matrix so that the major intrinsic associative patterns in the collection are revealed. Minor patterns that are not very important can be removed to identify major global relationships.
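Real LSI implementations compute the SVD with a package such as SVDPACKC (slide 38). As a self-contained illustration of the underlying idea, the sketch below uses power iteration on A^T A to recover just the largest singular value of a small word-document matrix; it is a toy stand-in, not how LSI is implemented in practice:

```python
# Pure-Python power iteration for the leading singular value of a small
# word-document matrix A (rows = words, columns = documents).
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(x):
    return sum(v * v for v in x) ** 0.5

def leading_singular_value(A, iters=100):
    """Power iteration on A^T A converges to the top right singular
    vector v; the largest singular value is then |A v|."""
    v = [1.0] * len(A[0])
    At = transpose(A)
    for _ in range(iters):
        w = matvec(At, matvec(A, v))   # apply A^T A
        n = norm(w)
        v = [x / n for x in w]
    return norm(matvec(A, v))
```

Keeping only the few largest singular values (and their vectors) is exactly the truncation LSI uses to discard minor patterns.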

32 LSI LSI builds relationships based on co-occurring words in multiple documents. These hidden underlying relationships are called the latent semantic structure in the collection.

33 The advantage of LSI LSI does not depend on individual words to locate documents, but rather uses a concept or topic to find relevant documents. Keyword-based methods rely on an exact match between words in a document and a query.

34 LSI A concept or a topic is a group of words that collectively describe similar thoughts, things, places, or people. It need not be as narrow as a single meaning from a dictionary. When a researcher submits a query, it is transformed to LSI space and compared with other documents in the same space.

35 Document relationships based on shared words.

36 SVD of word-document matrix

37 Vector and LSI spaces for three documents

38 Implementation of LSI The SVDPACKC package TextMine/WordUtil.pm
The gen_vectors function

39 Index Maintenance

