Information Retrieval Chapter 2 by Rajendra Akerkar, Pawan Lingras Presented by: Xxxxxx.

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Process: Information Retrieval

[Figure 2.1: Transforming a text document to a weighted list of keywords]

Process: Information Retrieval

1. The first step in transforming a document is simply to list all the words in a document.
2. The second step is removal of some of the most commonly occurring words.
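The two steps above can be sketched in Python as follows. This is a minimal illustration, not the book's implementation: the function names `list_words` and `remove_stop_words`, the regular expression used to split words, and the very short `STOP_WORDS` set are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); practical systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "as", "in", "is", "of", "the", "to"}


def list_words(text):
    """Step 1: list all the words in the document, lowercased, in order."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Step 2: remove the most commonly occurring words."""
    return [w for w in words if w not in stop_words]


sentence = ("Data mining refers to a family of techniques used to detect "
            "interesting nuggets of relationships/knowledge in data.")
words = list_words(sentence)
keywords = remove_stop_words(words)
print(keywords)
```

Lowercasing before matching keeps "Data" and "data" as the same word, which matters once word frequencies are counted.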

Data Mining has emerged as one of the most exciting and dynamic fields in computing science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market itself for data mining is expected to be in excess of $10 billion. Data mining refers to a family of techniques used to detect interesting nuggets of relationships/knowledge in data. While the theoretical underpinnings of the field have been around for quite some time (in the form of pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad-hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. The distributed nature of several databases, their size and the high complexity of many techniques present interesting computational challenges.

• A given word may occur in a variety of syntactic forms
  ◦ plurals
  ◦ past tense
  ◦ gerund forms (a noun derived from a verb)
• The word connect may appear as
  ◦ connector, connection, connections, connected, connecting, connects, preconnection, and postconnection
• A stem is what is left after a word's affixes (prefixes and suffixes) are removed
  ◦ ed, s, or, ing, and ion are suffixes
  ◦ pre and post are prefixes
• Use of stems may arguably improve retrieval performance
• Users rarely specify the exact forms of the word they are looking for
• Reasonable to retrieve documents with similar words
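Affix removal for the connect example can be sketched as below. This is a naive illustration that strips only the prefixes and suffixes named on the slide; it is not Porter's algorithm or any other published stemmer, and the name `naive_stem` and the minimum stem length `MIN_STEM` are assumptions for the example. A rule set this small will mis-stem many ordinary words.

```python
# Affixes taken from the slide; longer suffixes are tried first.
SUFFIXES = ("ing", "ion", "ed", "or", "s")
PREFIXES = ("post", "pre")
MIN_STEM = 3  # assumed lower bound so short words are not stripped to nothing


def naive_stem(word):
    """Strip at most one listed prefix, then listed suffixes repeatedly."""
    word = word.lower()
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
                word = word[:-len(suffix)]
                changed = True
                break
    return word


forms = ["connector", "connection", "connections", "connected",
         "connecting", "connects", "preconnection", "postconnection"]
print({naive_stem(w) for w in forms})  # prints {'connect'}
```

Suffixes are stripped in a loop because some forms carry more than one, for example connections loses s and then ion.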

Term Document Matrix: calculating the frequency of each word

A term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent the documents.
• Columns correspond to the index terms.
• Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row).
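A small frequency-based TDM can be built as in the sketch below. The three toy documents and the function name `build_tdm` are assumptions made for illustration; the documents are taken to be already reduced to lists of index terms, and the values are raw frequencies rather than weights.

```python
from collections import Counter


def build_tdm(documents):
    """Return (terms, matrix): one row per document, one column per term."""
    # Columns: every distinct index term in the collection, sorted.
    terms = sorted({term for doc in documents for term in doc})
    matrix = []
    for doc in documents:
        counts = Counter(doc)  # frequency of each term in this document
        matrix.append([counts[term] for term in terms])
    return terms, matrix


# Hypothetical toy collection, each document given as a list of index terms.
docs = [
    ["data", "mining", "data"],
    ["information", "retrieval"],
    ["data", "retrieval"],
]
terms, tdm = build_tdm(docs)
print(terms)
for row in tdm:
    print(row)
```

Here `tdm[0]` is the row for the first document, and its entry under "data" is 2 because that term occurs twice in it.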

Thank You
