Online information seeking behaviors and search strategies
The Internet and the Web The Internet has been around since the 1960s. – First major use – e-mail – Content was hard to find unless someone told you where to look. – Archie (1990) and Gopher (1991) – tools to find files online and retrieve them from open FTP sites The Web has been in wide public use since approximately 1994. – Mosaic – web browser – 1993 – Lycos, Yahoo – 1994 – Google – September 1998 See http://www.seoconsultants.com/search-engines/history/#SEH1990 for search engine history
Changing emphasis “Microsoft, Yahoo and Google say they are innovating because people's expectations for a search engine are far higher than they were even five years ago. People no longer search for a Web site; now they expect to find a specific piece of information, like the cheapest airfare to Chicago.” http://www.mercurynews.com/breaking-news/ci_13679086?nclick_check=1 Updated: 11/02/2009 02:45:48 AM PST
How do you search for something? For information about a general topic? For a specific fact or data item? Let’s see – Form groups of 2 (or 3 if needed) – Search for: (1) reliable information about the H1N1 flu vaccine; (2) the number of people who died in Philadelphia from the 1918 flu epidemic – Work together. Notice what you do and be prepared to describe your strategy.
How did you do? Let’s review the strategies – Were they different for the two tasks? – Who found the best results? How did you get there? – Who had the most problems? What caused your difficulties? Can we find some ideas for future success?
Looking at other sources Some papers on information finding and re-finding, and on how search engines work. Why this topic? – The Web is the ultimate digital library! If we understand how to get the best results from the Web, we can use some of those strategies to make our digital libraries more effective. – As we understand the issues faced in web search, we may have a better understanding of why targeted digital libraries have a place.
Web Search Three distinct phases: – Crawling – Indexing – Searching Each has specific challenges to address. Related directly to the articles “Web Search Engines: Part 1” and “Part 2,” IEEE Computer, June and August 2006.
Crawlers Basic process – Open an HTML page that has at least one anchor tag (<a href="URL">link description</a>) – Send HTTP request to the site and receive the page. – Parse the page, looking for other anchor tags – Place anchors on a queue for further processing – Submit the actual page for indexing and storing
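The basic crawler loop can be sketched in a few lines of Python. This is a minimal illustration, not how any production crawler works: the `crawl` function, the `AnchorParser` class, and the in-memory `FAKE_WEB` dictionary (which replaces real HTTP requests so the example runs offline) are all assumptions made for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Collects the href value of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    queue = deque([seed_url])   # anchors waiting for further processing
    seen = {seed_url}           # avoid queueing the same URL twice
    pages = {}                  # url -> html, standing in for "submit for indexing"
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)       # a real crawler would send an HTTP request here
        if html is None:
            continue
        parser = AnchorParser()
        parser.feed(html)       # parse the page, looking for other anchor tags
        for href in parser.hrefs:
            link = urljoin(url, href)   # resolve relative links against the page URL
            if link not in seen:
                seen.add(link)
                queue.append(link)
        pages[url] = html
    return pages


# Hypothetical three-page "web" so the sketch needs no network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B again</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

crawled = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(crawled))
```

Passing `fetch` as a parameter keeps the queueing logic separate from the network, which also makes it easy to see where politeness delays or robots.txt checks would have to be added.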
Indexers Scanning – “For each indexable term … the indexer writes a posting consisting of a document number and a term number to a temporary file.” Parse this sentence: What is an indexable term? Posting? Document number? Term number? What does a posting look like? Invert the file – Sort by term, secondarily by document number – Record start location and list length for each term
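One way to make the scanning and inversion steps concrete is the small Python sketch below. It assumes that an "indexable term" is simply a lowercased, whitespace-separated word, stores each posting as a (term number, document number) pair so that an ordinary sort gives the order the slide describes, and drops repeated postings for the same term in the same document. The function names `scan` and `invert` and the sample documents are invented for illustration.

```python
def scan(docs):
    """Scanning: emit one (term_number, doc_number) posting per term occurrence."""
    term_numbers = {}   # term -> term number, assigned on first sight
    postings = []       # stands in for the temporary file
    for doc_number, text in enumerate(docs):
        for term in text.lower().split():   # assumed notion of "indexable term"
            term_number = term_numbers.setdefault(term, len(term_numbers))
            postings.append((term_number, doc_number))
    return term_numbers, postings


def invert(postings):
    """Inversion: sort by term, then document; record start and length per term."""
    ordered = sorted(set(postings))         # set() drops repeats within a document
    doc_list = [doc for _, doc in ordered]  # the inverted file itself
    dictionary = {}                         # term_number -> (start, length)
    for position, (term_number, _) in enumerate(ordered):
        start, length = dictionary.get(term_number, (position, 0))
        dictionary[term_number] = (start, length + 1)
    return dictionary, doc_list


DOCS = ["flu vaccine safety", "flu epidemic 1918", "vaccine trial"]
term_numbers, postings = scan(DOCS)
dictionary, doc_list = invert(postings)

start, length = dictionary[term_numbers["vaccine"]]
print(doc_list[start:start + length])   # documents containing "vaccine"
```

Looking up a term now costs one dictionary access plus one contiguous slice of `doc_list`, which is the point of recording the start location and list length.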
Searching (Query Processing) Look up query term in term dictionary Get the postings list Find documents that match all search terms – Find documents for each term and merge lists where common documents occur Rank documents and report – As many as required or until end of the list Still possible to find a result on one search and not find that same item on a subsequent search of the same terms
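The query-processing steps can be sketched as follows. The tiny `INDEX` and the per-document `SCORES` are made-up data, and ranking by a single stored score is a simplifying assumption; real engines combine many signals that are not public.

```python
def intersect(a, b):
    """Merge two sorted postings lists, keeping only documents present in both."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def search(query, index, scores, limit=10):
    """Return up to `limit` documents that contain every query term, best first."""
    lists = []
    for term in query.lower().split():
        if term not in index:       # a term with no postings means no matches
            return []
        lists.append(index[term])
    if not lists:
        return []
    lists.sort(key=len)             # start from the shortest list to do less work
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    ranked = sorted(result, key=lambda doc: scores.get(doc, 0.0), reverse=True)
    return ranked[:limit]


# Hypothetical term dictionary (term -> sorted document numbers) and scores.
INDEX = {"flu": [0, 1, 3], "vaccine": [0, 2, 3], "1918": [1]}
SCORES = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7}

print(search("flu vaccine", INDEX, SCORES))
print(search("flu 1918", INDEX, SCORES))
```

Because the postings lists are kept sorted by document number, the merge walks each list once instead of comparing every pair of documents.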
Expanding from the basics Each of the phases of web searching is simple in concept, but complicated by the sheer magnitude of the task. The same ideas applied on a smaller scale (in a company intranet, for example) can be implemented efficiently. The Web presents special challenges.
Crawling A single machine running a simple crawling algorithm would not do well in finding all Web pages. Large data centers – Redundancy and fault tolerance – Parallel operation – (SIGCSE talk by Marissa Mayer of Google)
Crawling reality Speed - amazing numbers: – At 1 sec per HTTP request, max 86,400 pages per day = 634 years for 20 billion pages Politeness - – Avoid overwhelming web servers Excluded content – robots.txt Duplicate content – Identifying duplicates can be tricky - why? Continuous crawling – Keeping current – Priority queue for crawling schedule - why? Spam
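The speed figure is easy to check with a few lines of arithmetic. In the sketch below, the function name `crawl_years` and the `machines` parameter are illustrative assumptions; the 86,400 requests per day and 20 billion pages come from the slide, and 86,400 per day corresponds to one request per second on a single machine.

```python
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
PAGES = 20_000_000_000              # 20 billion pages


def crawl_years(seconds_per_request, pages=PAGES, machines=1):
    """Years needed to fetch `pages` pages with sequential requests per machine."""
    pages_per_day = machines * SECONDS_PER_DAY / seconds_per_request
    return pages / pages_per_day / 365


print(round(crawl_years(1.0)))                       # one machine, 1 s per request
print(round(crawl_years(0.5)))                       # one machine, 0.5 s per request
print(round(crawl_years(1.0, machines=10_000) * 365, 1))  # days with 10,000 machines
```

Halving the request time only halves the total, so the practical answer is parallel operation across many machines rather than a faster single crawler.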
Indexing large collections The Web is the ultimate “large collection” “Estimating 500 terms in each of 20 billion pages” --> 10 trillion entries! Divide and conquer, as the crawler did – Each indexer builds a partial file in memory – Stops when memory is full – Write to disk, clear memory, and start over Merge the partial files to make the full index
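The build-partial-files-then-merge idea can be sketched in Python as shown below. To stay self-contained, the sketch keeps each "partial file" as a sorted list in memory rather than writing to disk, and it treats `memory_limit` as a count of postings; both are simplifying assumptions, as are the function names and sample postings.

```python
import heapq
from itertools import groupby


def build_runs(postings, memory_limit):
    """Split a stream of (term, doc) postings into sorted partial files ("runs")."""
    runs, buffer = [], []
    for posting in postings:
        buffer.append(posting)
        if len(buffer) >= memory_limit:   # "memory is full"
            runs.append(sorted(buffer))   # "write to disk" (kept in a list here)
            buffer = []                   # "clear memory, and start over"
    if buffer:
        runs.append(sorted(buffer))       # flush whatever is left at the end
    return runs


def merge_runs(runs):
    """Merge sorted runs into the full index: term -> sorted document numbers."""
    merged = heapq.merge(*runs)           # streams the runs in sorted order
    return {
        term: [doc for _, doc in group]
        for term, group in groupby(merged, key=lambda posting: posting[0])
    }


POSTINGS = [
    ("flu", 2), ("vaccine", 0), ("flu", 0),
    ("epidemic", 1), ("flu", 1), ("vaccine", 2),
]
runs = build_runs(POSTINGS, memory_limit=2)
full_index = merge_runs(runs)
print(len(runs), full_index)

print(500 * 20_000_000_000)   # the slide's estimate: 10 trillion entries
```

`heapq.merge` only ever holds one pending entry per run, which is why the merge step can handle far more data than fits in memory at once.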
Data structures for indexing Trees, tries, hash tables – Various ways to organize the terms for easy lookup Numbers of terms – More than all words in all languages – Acronyms, proper names, etc. – Must deal with common phrases also Separate index entries (postings) for common word combinations Compression – Saves space; costs CPU time to compress and decompress, though reading less data from disk can speed up query processing overall Anchor text -- fie on those who use “click here”!! Link popularity score – Give a score to a page based on popularity, also on query-independent factors. – Think about the implications of this.
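To give a feel for what postings compression can look like, here is a sketch of one widely taught technique: store the gaps between successive document numbers and write each gap in a variable number of bytes. The slide does not say which scheme any particular search engine uses, so treat this as an illustration only; the function names are invented.

```python
def encode_postings(doc_numbers):
    """Gap-encode a sorted list of document numbers with variable-byte codes."""
    out = bytearray()
    previous = 0
    for doc in doc_numbers:
        gap = doc - previous            # small when documents are close together
        previous = doc
        while gap >= 128:
            out.append(gap & 0x7F)      # low 7 bits; high bit clear means "more follows"
            gap >>= 7
        out.append(gap | 0x80)          # high bit set marks the last byte of a number
    return bytes(out)


def decode_postings(data):
    """Reverse encode_postings: rebuild the original document numbers."""
    docs = []
    value, shift, previous = 0, 0, 0
    for byte in data:
        if byte & 0x80:                 # final byte of this gap
            value |= (byte & 0x7F) << shift
            previous += value           # undo the gap encoding
            docs.append(previous)
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return docs


doc_numbers = [3, 7, 150, 100000, 100001]
encoded = encode_postings(doc_numbers)
print(len(encoded), "bytes instead of", 4 * len(doc_numbers))
print(decode_postings(encoded) == doc_numbers)
```

The trade-off is the one named on the slide: the list takes less space, and the decoder has to do a little arithmetic on every entry it reads.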
Query Processing Most queries are short, do not provide much context Result quality -- use some of the techniques from information retrieval – Once a preliminary list of responses is obtained, treat that as the collection and use IR techniques to improve the quality of the response. Some limitations. No way to judge how complete the initial list is. – Techniques are part of the trade secrets of the companies Speeding things up: – Skipping – Early termination – Document numbering – Caching
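Two of the speed-ups, early termination and caching, can be sketched together. The sketch rests on an assumption about the "document numbering" bullet: documents are numbered in order of decreasing query-independent score, so the best documents appear first in every postings list and the scan can stop as soon as enough matches are found. The `INDEX` data and the `top_k` function are invented for illustration.

```python
from functools import lru_cache

# Assumed numbering: document 0 has the highest query-independent score,
# so every postings list below is already ordered best-first.
INDEX = {
    "flu": (0, 2, 3, 5, 8, 9),
    "vaccine": (2, 3, 4, 8, 9),
}


@lru_cache(maxsize=1024)                # caching: repeated queries skip the work
def top_k(query, k=2):
    """Return the first k documents containing every query term."""
    lists = sorted((INDEX.get(term, ()) for term in query.split()), key=len)
    if not lists:
        return ()
    others = [set(postings) for postings in lists[1:]]
    hits = []
    for doc in lists[0]:                # walk the shortest list, best documents first
        if all(doc in other for other in others):
            hits.append(doc)
            if len(hits) == k:
                break                   # early termination: enough results found
    return tuple(hits)


print(top_k("flu vaccine"))
print(top_k("flu vaccine"))             # second call is answered from the cache
print(top_k.cache_info().hits)
```

Early termination is also one reason the initial list may be incomplete: the scan never looks at documents past the point where it stopped.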
So what to do with the information found? Use it. Sure. Then what? Will I ever need that information again? Do I use it a lot? Should I retain it somehow so it will always be available? Or should I just figure on searching again? What is your approach? Does it vary with the kind of data? What are your decision criteria?
Resource Jones, William and Jaime Teevan. Personal Information Management. University of Washington Press. 2007 Nice set of papers related to finding, keeping, and organizing information The book has a theme running through it – a specific event that requires a number of people to obtain and use information. The papers address various aspects of meeting that need.
The unifying theme A set of characters with a need to accomplish a specific activity – Characters have distinct characteristics in terms of the way they organize and interact with information – The task is to organize a surprise birthday party for one of the characters, with appropriate roles and interactions among the other characters
The characters Alex, male, 27. – Securities analyst – Very well organized, especially work info. – Takes action immediately on all new information received -- email or other Brooke, female, 23. (Alex’s sister) – Software developer at startup company – “spontaneous, dynamic, chaotic” – “Job is too unpredictable and fast changing for … much point to filing information.” – Lots of unsorted piles, 2000 messages in inbox Connie, 58, mother of Alex and Brooke – Prides herself on being organized, mostly paper based. – Has been ill, papers have piled up and organization deteriorated. Derek, male, 23 – Engaged to Brooke – Relies on tablet PC – Would like to banish paper, but it still comes in Edna, female, 74 – Retired, owns a lot of real estate (was real estate broker) – Almost all paper-based information (PC does not work) – Prefers to call or write actual letters to communicate. – No children, close friend of Connie, honorary grandmother of Alex and Brooke Felicia, 20. Derek’s sister – In college, interested in music, photography – Uses laptop for communication, organizing of digital pictures – Has a lot of older print pictures, photo CDs
The theme activity Planning a surprise 75th birthday party for Edna. Out-of-town guests will need hotel rooms. Edna’s favorite restaurant will be the site. Maybe they will do a photo album.
Finding information Opening scenario - Alex needs to find the phone number of his grandmother’s favorite restaurant and make a reservation. – He is the organized one. – This does not fit a category that he uses. If he knows the name of the restaurant - easy Otherwise, knows he has seen it somewhere -- how to find it again – REFINDING - a particular category of information finding. Related to “keeping” - a subject to come later Paper: How People Find Personal Information, in the book Personal Information Management
Finding - a multi-stepped process Importance of browsing Common triggers and stop conditions Users prefer to find information by orienteering -- using small steps guided by their knowledge of the local context -- rather than search -- sudden jump to the destination. Scenario - Alex knows the restaurant name is in an email from his sister. Could do a search in the email client. Instead, goes to a folder, sorts to find all the mail from Brooke, then browses. Is that what you would do? How do you look for information that you believe is in an email message?
Why orienteering? Quality of search tools? – There are studies that show that presenting improved search tools does not noticeably affect the way people seek information. Side benefits? – Orienteering (or navigating) provides a broader look at the information space. Not only do you find what you are looking for, you also see what else is around it that might also be of interest in the current task. Distinction between recognition and recall – Navigating allows use of recognition within context, which may be easier or more comfortable than recall of the right search terms to use. Note always that individuals differ.
Files and piles Relating approaches to finding in physical spaces – Filers -- more comfortable with organized systems, visible structures – Pilers -- more comfortable with loose structures, less formal organization. Characteristics carry over to approaches to finding digital information. Direct connection between ways of organizing information and ability to refind.
Refinding -- different from initial discovery Finding something seen before is different from the initial discovery activity – Know more about it -- metadata that may aid in locating it: author, title, date created, URL, color, style of text, etc. Knowing that Brooke had emailed the name of the restaurant triggers a memory of the subject of the email, for example – Particularly important meta information: people associated with it, path taken to find it originally, temporal aspects. – Some research finds time so important that some argue chronological ordering should be the default ranking
Factors related to Re-finding Initial encounter with the information provided some experience that will influence re-finding – Elapsed time since prior encounter will influence value of that experience – Expected future value will influence how well it is remembered – Similarity of initial reason to have the information and the reason for the new access influence the connection between the prior and current experience.
Re-finding related to keeping, organizing Studies about how people re-find information on the Web show preference for strategies that do not involve any advance planning or keeping. Yet, people do spend time preparing for future access. Shown: Pilers prefer to orienteer with small steps while filers are more likely to use search tools to jump directly to, or close to, the target
Judging value Information is easier to re-find if it was recognized as important the first time it was seen. – What do you do to recognize the potential future use of information In email? In web sites? In Other information sources? – Post-valued recall -- recognizing the value of previously encountered information Some people e-mail information to themselves – Have you ever done this? What does it accomplish? Lack of knowledge of future importance makes it harder to store and organize information effectively
Information fragmentation On how many different devices do you store electronic information? – Phone, pda, desktop computer, laptop … – How do you recall what is where? Do you have any kind of overall index? Do you ever lose something entirely because you cannot recall where it is stored? Do you use online sites such as Google docs to make files accessible from a variety of places? – What are the pros and cons of that approach? – How do you handle multiple e-mail accounts? (Do you?) – How do you know that this version is most recent? Naming conventions help or hinder. How do you name the versions of a file? – Cathy Marshall study at Microsoft
Information Keeping People keep things -- including information -- for a variety of reasons – Expected future need – Reminder of an experience, usually pleasant, but perhaps something significant that should not be forgotten (VT April 16 collection -- see http://www.vt.edu/remember/) – Increasing amounts of information available, but it is hard to know what to keep Paper: How People Keep and Organize Personal Information in the book, Personal Information Management
Define: Information keeping Decision-making and actions relating to the information item currently under consideration that impact the likelihood that the item will be found again later. Decisions can range from: (1) “ignore, this has no relevance to me”; (2) “ignore, I can get back to this later”… (3) “keep this in a special place or way so that I can be sure to use this information later.” This is the keep or don’t keep decision, not related to how to keep anything. Quoted from How People Keep and Organize Personal Information in Personal Information Management
Define: Information organizing Decision-making and actions relating to the selection and implementation of a scheme of organization and representation for a collection of information items. Decisions can include: (1) How should items in this collection be named? (2) What sets of properties make sense for and help to distinguish the items in this collection? (3) How should items within this collection be grouped? Into piles or folders? Note the movement from an item to a collection as we talk about keeping and organizing – “Keeping” response is triggered frequently by ordinary events. – “Organizing” response is less often triggered What triggers the impulse to organize? Quoted from How People Keep and Organize Personal Information in Personal Information Management
Define: Information Maintaining All decisions and actions relating to the composition and preservation of a personal information collection. Decisions involve what kind of new items go into a collection, how information in the collection is stored (Where, in what formats? In what kind of storage? Backed up how?) and when do older items leave the collection (e.g. When are they deleted or archived?) A mixed blessing -- the Apple migration (with firewire) when a new machine replaces an old one. – Easily obtain an exact copy of the old disk system. Is this good, bad, some of both?
Keeping decisions: Multifaceted and Error-prone Some sorting attributes for paper items – Title, author – Disposition (discard, keep, postpone) – Order scheme (group, separate, arrange) – Time (duration, currency) – Value (importance, interest, confidentiality) – Cognitive State (don’t know, want to remember) Heavily influenced by anticipated future use
Other approaches to keeping Collection building, independent of expected future use Packrat Legacy What do you do with something you do not intend to use again? Do you get rid of it or just put it aside? How much effort is required to make that decision? – Alex and the business card scenario
Organizing Little research on how the same person organizes different forms of information Some results – People do not take time to assess their organization – People complain about needing to organize separate types of information and the resulting fragmentation – People are not consistent about the approaches they take, using different schemes on different days – Some people go to great lengths to consolidate types of information -- sending documents by email or storing email in file folders, for example.
Structures Making sense of organization includes both internal representations and external representations – Internal representation requires a cognitive connection, an understanding of where each information item fits into a larger scheme and how it will be retrieved later – External representation is a translation of the internal understanding of the structure needs of organization into a realization that can be seen and used. My definitions, not from the author
Features that would be useful for organization A manual ordering of folders – People force this by strange folder names (AAA…) An ability to set reminders, due dates, and other task-like properties on folders – Subfolders often correspond to tasks, but cannot be treated like tasks An ability to add notes – Some people add a notes document to a folder An ability to use and reuse structures – If an organization of a folder or directory is useful for a variety of activities, it would be nice to be able to reconstitute its structure, ready for new particulars. -- For me, an ABET visit, for example.
Final reflections How does understanding how people treat the information they find influence your organization and presentation of your digital library? Are there services that you might add (if you can) in order to meet the needs of the user? Consider the ACM DL option of binders. Is that something useful? Would it be a benefit to your users? How would you use that option or something similar?