Information Retrieval Systems Info624 – Week 1 Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.


1 Information Retrieval Systems Info624 – Week 1 Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University

2 Self-Introduction
- My Journey in America
  - Atlanta, GA
  - Denton, TX
  - College Park, MD
  - San Jose, CA
  - White Plains, NY
  - Lexington, KY
  - Philadelphia, PA

3 Have you ever asked:
- How could search engines find the information I request so quickly, out of millions and millions of web pages?
- Which statement do you like the best?
  - It is easy to find just about anything on the Web.
  - It’s impossible to find anything on the Web; I always find so many things that I don’t want.
- How about these:
  - I like search engines very much.
  - I hate search engines!

4 More questions
- What kinds of questions are easy (or difficult) to answer on the Web?
  - Why?
- Are there any ways to make it easier?
  - What solutions are we looking for?
    - Technical solution?
    - Cognitive solution?

5 Google is the solution?
- Everyone likes Google.
  - True or False?
- What would happen if Google disappeared from the Web tomorrow morning?
- Better Search Results than Google?
  - CNN report, Jan. 5, 2004
  - vivisimo.com/
  - Grokker2
  - TouchGraph

6 How to defeat Google?
- Microsoft Way
  - I will buy you,
  - Or I will netscape you!
- Open Source Way
  - Watch Nutch
    - Under the leadership of Doug Cutting
    - The “Linux” of search engines

7 Course Overview
- What this course is about
  - How people search and find information.
  - How computers store and retrieve information.
  - How computer systems are designed to help people find information they need.

8 Course Overview
- The course will emphasize understanding of the following for Information Retrieval Systems:
  - Theories
  - Tools
  - Algorithms, and
  - Evaluations

9 Course Overview
- What this course is NOT...
  - An algorithm design course
    - We might use several related algorithms, but will not study them in detail
    - Our textbook could be used for such a course
  - A system development course
    - Except that some assignments may require you to compile some C procedures.
    - We look at an IR system as a whole, not as individual components

10 Required skills
- Know how to create HTML pages
- Have access to a Web server
  - If you don’t, the best way is to apply for a dunx1 account from Drexel.
  - Make sure you request:
    - Web server access.
    - Shell access.
- Have access to a C compiler
  - Having dunx1 shell access will do it.

11 Project Idea 1
- Install and implement an IR system:
  - Index a sample document collection or a Web site
  - Test and evaluate all the functionalities of the system.
  - Compare this IR system with others.
  - Demonstrate the implementation in class.

12 Project Idea 2
- Conduct an evaluation experiment on one or two selected IR systems
  - Identify the systems
  - Install the systems, if necessary
  - Design the experimental methods
  - Test the experimental methods
  - Analyze the data and write the final report.

13 Project Idea 3
- Customize an IR system
  - Using open source retrieval software
    - Apache Lucene
  - Implementing a crawler
    - With some open source code
  - Designing a new retrieval interface

14 What is IR?
- IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information.
- IR involves helping users find information that matches their information needs.
(Slide labels: System-centered View, User-centered)

15 IR Systems
- IR systems contain three components:
  - System
  - People
  - Documents (information items)
(Diagram labels: User, System, Documents)

16 Web brings IR to the Center of the Stage
- IR has become a center of focus in the Web era. Its theories, techniques, and applications have reached many fields where processing large amounts of information is essential.

17 Challenges of IR
(Diagram labels: User, Information, Search/select, Info. Needs, Queries, Stored Information)
- Translating info. needs to queries
- Matching queries to stored information
- Query result evaluation: Does the information found match user’s information needs?

18 Examples:
- Where can I find information needed for my term project?
  - Challenges:
    - How do you translate the question to a query?
    - What info. needs to be stored in the system in order to answer the question?
    - Which system will match the request best?
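
As a rough illustration of the first challenge, translating a question into a query, the sketch below reduces the example question to keywords. The function name `to_query` and the tiny stopword list are assumptions made for this example only; a real system would use a much larger list and possibly stemming or phrase detection.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = {"where", "can", "i", "find", "for", "my",
             "the", "a", "an", "of", "to", "is", "in", "on"}

def to_query(information_need):
    """Reduce a natural-language request to a list of lowercase keywords."""
    # Lowercase the text and keep only runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", information_need.lower())
    # Drop words that carry little meaning for matching.
    return [t for t in tokens if t not in STOPWORDS]

if __name__ == "__main__":
    print(to_query("Where can I find information needed for my term project?"))
```

Note how much of the original need is lost: the keywords no longer say whose project it is or what kind of information would help, which is part of why this step is listed as a challenge.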

19 Examples:
- Which IST course is most useful?
  - Challenges:
    - Information may not exist anywhere
    - It’s personal opinion.
- Where is bin Laden now?
  - Challenges:
    - Intelligence Analysis
    - Need the first-hand information

20 Abstraction Principles
- First Abstraction Principle
  - Abstract data from the “real world”
  - And make them available to the system.
- Second Abstraction Principle
  - Abstract the user’s information needs into a form the system understands.

21 Users
- The user
  - anyone who needs to find some information
- The user groups
  - group by their knowledge of the system
    - novice users vs. experienced users
    - end users vs. information specialists
  - group by their domain knowledge
    - domain experts vs. general public
  - group by information needs
    - need to locate a particular item
    - need some information
    - need all information on a subject

22 User’s Information Needs
- People depend on information to carry out their daily activities.
  - need to accomplish some goals.
  - need to solve some problems.
- People realize a lack of information
  - perceive a gap in their knowledge state
    - ASK -- Anomalous State of Knowledge
  - desire to fill the gap
(Diagram labels: Reality, Goals, ?)

23 User’s information needs
(Diagram labels: Reality, Goals, ? (repeated five times); Info. Needs; Info. Systems; ??)

24 Queries
(Diagram labels: Reality, Goals, ? (repeated five times); Info. Needs; Info. Systems; ??; Request; Problems; Data; ??; First Abstraction Principle; Second Abstraction Principle)

25 Data and Information
- Data
  - String of symbols associated with objects, people, and events
  - Values of an attribute
  - Data need not have meaning to everyone
  - Data must be interpreted with associated attributes.

26 Data and Information
- Information
  - The meaning of the data interpreted by a person or a system
  - Data that changes the state of a person or system that perceives it.
  - Data that reduces uncertainty.
    - If data contain no uncertainty, there is no information in the data.
    - Examples:
      - It snows in the winter.
      - It does not snow this winter.
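
One common way to make "data that reduces uncertainty" concrete is Shannon's self-information, which is brought in here as an illustration and is not named on the slide. It measures how surprising an event is, in bits. The probabilities below are made-up values chosen only to mirror the two snow examples.

```python
import math

def surprisal_bits(probability):
    """Self-information of an event with the given probability, in bits."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # log2(1/p): a certain event (p = 1) carries 0 bits of information.
    return math.log2(1.0 / probability)

# Assumed probabilities, for illustration only.
P_SNOW_IN_WINTER = 0.95       # an expected statement
P_NO_SNOW_THIS_WINTER = 0.05  # an unexpected statement

if __name__ == "__main__":
    print("It snows in the winter:", surprisal_bits(P_SNOW_IN_WINTER), "bits")
    print("It does not snow this winter:", surprisal_bits(P_NO_SNOW_THIS_WINTER), "bits")
```

Under these assumed numbers, the expected statement carries far less information than the unexpected one, and a statement that is certain carries none.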

27 Information and Knowledge
- Knowledge
  - Structured information
    - through structuring, information becomes understandable
  - Processed information
    - through processing, information becomes meaningful and useful
  - Information shared and agreed upon within a community
(Diagram labels: Data, information, knowledge)

28 Text
- Strings of ASCII symbols or Unicode
  - structured by the author
  - indexed by information service providers
- Representation of natural languages people use
  - To convey meanings
  - To communicate between readers and authors.
- Data or information?
  - If it can be understood, it’s information.
  - By whom? A person or a system?

29 Documents
- Logical unit of text
  - articles, books
  - links, web pages
- Other components that come with the text
  - figures, charts, graphics
  - multimedia

30 Textual Data
- Repository of human intellect
  - Rich and diverse resources for all answers.
  - If it is written, it is there (in text)
  - Meaningful and understandable (to users).
- Simple ASCII representation
- Free of pre-formatted structures
  - continuous
  - separated into documents
- Easy to process by the computer
  - Machine intensive (not labor intensive)

31 Problems with Text
- Massive
  - Any IR system needs the capability of large-scale data processing.
  - Use of indexes and various representations is required.
- Inconsistent
  - It’s a human language
  - Syntactic and semantic variances
    - Same information expressed in different ways.
    - Different information expressed in similar ways.
- Incomplete
  - It uses common knowledge.
  - It’s an open system.
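
To show why indexes help with massive text, here is a minimal sketch of an inverted index, a standard IR structure that maps each term to the documents containing it, so a query does not have to scan every document. The sample documents and the function names `tokenize`, `build_inverted_index`, and `search` are hypothetical, and the all-terms-must-match rule is just one simple choice.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Made-up sample collection, for illustration only.
DOCS = {
    1: "Information retrieval systems",
    2: "Data retrieval with SQL",
    3: "Web search engines index web pages",
}

if __name__ == "__main__":
    idx = build_inverted_index(DOCS)
    print(sorted(search(idx, "retrieval systems")))
```

The sketch also exposes the "inconsistent" problem from this slide: a query for "retrieving" would miss all three documents, because the same information can be expressed in different ways.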

32 Retrieval
- Retrieval
  - What do we retrieve?
    - Data
    - Information
    - Knowledge
  - We retrieve documents that contain text, which carries information.
  - Information can be anywhere
    - in the text, in the links, in the process of text.

33 Information Retrieval
- Are they the same?
  - Text retrieval
  - Document retrieval
  - Information retrieval

34 Information Retrieval
- Conceptually, information retrieval is used to cover all related problems in finding needed information
- Historically, information retrieval is about document retrieval, emphasizing document as the basic unit
- Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.

35 Summary
- The goal of IR systems is to help users find information that satisfies their information needs.
- The main process of IR systems is to match data abstracted from the real world to queries abstracted from user’s information needs.
- Information retrieval is much more difficult than data retrieval.

36 Data Retrieval vs. Information Retrieval

                      Data retrieval      Information retrieval
Content               Data                Information
Data object           Table               Document
Matching              Exact match         Partial match, best match
Items wanted          Matching            Relevant
Query language        SQL (artificial)    Natural
Query specification   Complete            Incomplete
Model                 Deterministic       Probabilistic
                      Highly structured   Less structured
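
The "Matching" row can be sketched in a few lines. The records, the function names, and the term-overlap score below are assumptions chosen for illustration; real IR systems use more refined ranking functions. Exact match returns only rows whose field equals the query, while best match ranks every row that shares at least one term with it.

```python
import re

# Made-up records, for illustration only.
RECORDS = [
    {"id": 1, "title": "Information Retrieval Systems"},
    {"id": 2, "title": "Database Systems"},
    {"id": 3, "title": "Retrieval of Information on the Web"},
]

def tokens(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def exact_match(rows, title):
    """Data-retrieval style: a row either equals the query or is excluded."""
    return [row["id"] for row in rows if row["title"] == title]

def best_match(rows, query):
    """IR style: rank rows by how many query terms their title shares."""
    query_terms = set(tokens(query))
    scored = []
    for row in rows:
        overlap = len(query_terms & set(tokens(row["title"])))
        if overlap > 0:
            scored.append((overlap, row["id"]))
    # Highest overlap first; ties are broken by the smaller id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [row_id for _, row_id in scored]

if __name__ == "__main__":
    print(exact_match(RECORDS, "information retrieval"))
    print(best_match(RECORDS, "information retrieval"))
```

With these sample records, the exact match for "information retrieval" finds nothing, while the best match still returns the two titles that mention both terms. This is the sense in which the items wanted are "relevant" and not merely "matching".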

