CS276B Text Information Retrieval, Mining, and Exploitation Practical 1 Jan 14, 2003.


The course project
- Building a digital library of academic papers, from those freely available on the web
- This is a great learning context in which to investigate IR, classification and clustering, information extraction, link analysis, various forms of text-mining, textbase visualization, collaborative filtering … really everything we cover in this course (and the last one)
- Project name? We’re looking for a good one!

Organization & scope
- This is a project of reasonable scope; our plan is to have people taking the class work together to implement its components
- So a secondary benefit should be some exposure to software engineering issues…
- But it’s not an impossibly large project
- The main components are a series of stages that map between clearly defined data representations
- Several groups of two built similar components as part of their projects last quarter

Motivations/Predecessors
- Machine Learning Papers [Andrew Ng, defunct]
- Cora [Just Research, Andrew McCallum, defunct]
- CiteSeer/ResearchIndex [NEC Research]
- Highwire [Stanford]
- There are various other online archives, but this kind of service isn’t available for most disciplines

Organization
- Two halves:
  - First half: people will build basic components, infrastructure, and data sets/databases for the project, in two phases:
    - First steps
    - Further development (extensions, needed fixes)
  - Second half: student-designed project, which will focus on a particular issue of interest related to the goals of this project
- In general, work in groups of 2 on projects

Timeline
- Tue Jan 14 [today]: Phase 1a starts
- Mon Jan 27: Phase 1a due [and name suggestion!]
- Course staff integrates, debugs, evaluates
- Thu Jan 30: Phase 1b starts
- Tue Feb 11: Phase 1b due
- Course staff integrates, debugs, evaluates
- Tue Feb 18: Phase 2 project plan due
- Tue Mar 4: Phase 2 project check-in point
- Course staff integrates, debugs, evaluates
- Wed Mar 12: Phase 2 due
- Thu Mar 13: Presentation of projects in class

Grading
- The project will be 40% of the grade, distributed over the phases:
  - Phase 1a: 8%
  - Phase 1b: 8%
  - Phase 2: 24% (4% of it awarded at the check-in point)
- Phase 1 will mainly involve getting parts of a system working and well-implemented. In evaluating it, we’ll value good systems engineering as well as course-related material
- Phase 2 is meant to be a research project: you’ll write up a research report/paper, and it’ll be evaluated largely on the quality of that write-up

Opportunities for improvement
- Much of citation search is fielded search, and a text search interface is awkward
- Citations are not very well parsed
- Duplicates are poorly detected
- Lots of things that you could do with link analysis (important conferences, cliques)
- Getting reference information from HTML pages as well as papers
- Subject classification (esp. if broad domain)
- Visualization
- Using collaborative filtering

Take initiative and ask questions
- You should look to acquire information relevant to solving these problems well
- There are lots of relevant papers on many of these problems
- We’re here to help! We’d like this project to succeed, and would be eager to answer questions and give advice on how to do things
- There may also be things in our rough specification that actually need correcting
- Talk to Teg (and other staff)

Basic processing stages
1. Crawler downloads HTML pages that contain links to papers, and papers
2. Focussed crawler does this intelligently
3. Extract links and context from HTML
4. Convert papers to (marked up) text
5. Decide if they’re really research papers
6. Extract header (author, title, abstract) and references sections
7. Separate citation block into individual citations
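As one illustration of stage 7 (separating the citation block into individual citations), here is a minimal sketch. It assumes every citation starts with a bracketed number such as "[12]", which is only one of several reference styles. The class name CitationSplitter and the sample citations are hypothetical, not part of the project specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits a references section into individual citations.
final class CitationSplitter {
    // Assumption: each citation begins with a marker of the form "[number]".
    private static final Pattern MARKER = Pattern.compile("\\[(\\d+)\\]");

    static List<String> split(String block) {
        List<String> citations = new ArrayList<>();
        Matcher m = MARKER.matcher(block);
        int start = -1; // start of the current citation text, -1 before the first marker
        while (m.find()) {
            if (start >= 0) {
                addIfNotBlank(citations, block.substring(start, m.start()));
            }
            start = m.end();
        }
        if (start >= 0) {
            addIfNotBlank(citations, block.substring(start));
        }
        return citations;
    }

    // Collapses line breaks and repeated spaces left over from text conversion.
    private static void addIfNotBlank(List<String> out, String raw) {
        String text = raw.replaceAll("\\s+", " ").trim();
        if (!text.isEmpty()) {
            out.add(text);
        }
    }

    public static void main(String[] args) {
        // Made-up citations, for illustration only.
        String block = "[1] A. Author. First\n title. 2001.\n[2] B. Writer. Second title. 2002.";
        for (String citation : split(block)) {
            System.out.println(citation);
        }
    }
}
```

A known weakness of this sketch: a bracketed number inside a citation's own text would be treated as the start of a new citation, and author-year or unnumbered reference lists are not handled at all.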

Basic processing stages (continued)
8. Do information extraction of author, title, etc. information in citations
9. Find context(s) of each citation in body of paper
10. Work out sets of variant forms for each person name, conference, paper (de-duping)
11. Normalize citations to unique full form
12. Map citations to papers to which they refer
13. Build Lucene IR system index (with fields)
14. Provide UI for querying, browsing (and visualization)
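Stage 10 (working out sets of variant forms for a person name) could start from something as crude as the sketch below. It groups names by a key made of the lowercased surname plus the first initial. The class NameVariants, the key format, and the assumption that names arrive as "First Last", "F. Last", or "Last, First" are all illustrative choices, not the project's specified method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: groups author-name variants under a crude shared key.
final class NameVariants {
    // Key = lowercase surname + "_" + first initial of the given name(s).
    static String key(String name) {
        String cleaned = name.trim().replaceAll("\\s+", " ");
        String surname;
        String given;
        int comma = cleaned.indexOf(',');
        if (comma >= 0) {
            // "Last, First" form
            surname = cleaned.substring(0, comma).trim();
            given = cleaned.substring(comma + 1).trim();
        } else {
            // "First Last" or "F. Last" form: the last token is taken as the surname
            int lastSpace = cleaned.lastIndexOf(' ');
            if (lastSpace < 0) {
                surname = cleaned;
                given = "";
            } else {
                surname = cleaned.substring(lastSpace + 1);
                given = cleaned.substring(0, lastSpace);
            }
        }
        String initial = given.isEmpty() ? "" : given.substring(0, 1);
        return (surname + "_" + initial).toLowerCase(Locale.ROOT);
    }

    // Groups the input names by key; TreeMap keeps the keys in sorted order.
    static Map<String, List<String>> group(List<String> names) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String n : names) {
            groups.computeIfAbsent(key(n), k -> new ArrayList<>()).add(n);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "Andrew McCallum", "McCallum, A.", "A. McCallum", "Andrew Ng");
        System.out.println(group(names));
    }
}
```

This key over-merges (two different people with the same surname and initial collapse into one group) and breaks on suffixes such as "Jr.", so it is better seen as a blocking step that proposes candidate groups for a more careful comparison.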

Tools
- We don’t need to reinvent the wheel. There are lots of tools that you can and should use for various stages:
  - Lucene IR engine
  - MySQL database
  - PS/PDF to text engines
- We’ll do the project in Java
  - Good URL handling, multithreading, etc.
  - Various packages for all sorts of things (e.g., TouchGraph for visualization)
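To make the "good URL handling" point concrete, here is a small sketch that pulls links and their anchor text out of an HTML page and resolves relative links against the page's URL using the standard java.net.URI class. The class LinkExtractor and the example.edu addresses are hypothetical, and the regular expression is a deliberately crude stand-in for a real HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts (absolute URL, anchor text) pairs from HTML.
final class LinkExtractor {
    // Crude pattern for <a ... href="...">text</a>; it will miss unquoted or unusual markup.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+[^>]*href\\s*=\\s*[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String[]> extract(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String[]> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String absolute;
            try {
                // Resolve relative references such as "papers/ir.pdf" against the page URL.
                absolute = base.resolve(m.group(1).trim()).toString();
            } catch (IllegalArgumentException e) {
                continue; // skip hrefs that are not valid URI references
            }
            // Strip nested tags and collapse whitespace in the anchor text.
            String text = m.group(2).replaceAll("<[^>]*>", "").replaceAll("\\s+", " ").trim();
            links.add(new String[] { absolute, text });
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"papers/ir.pdf\">IR <b>paper</b></a></p>";
        for (String[] link : extract("http://example.edu/people/index.html", html)) {
            System.out.println(link[0] + " | " + link[1]);
        }
    }
}
```

Here only the anchor text is kept as the link's context; capturing surrounding sentences would need a proper parse of the page.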

Computers etc.
- We’re going to start off doing development on Leland systems, with a CVS repository there
- We’ve got some small data sets, and you may make others
- At this stage, keep small: just download a couple of hundred papers, restrict yourself to the Stanford domain, etc.
- Later in the quarter we’ll transition things to a dedicated Linux machine (under construction) and attempt to run it on a larger scale…
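The "keep small" advice can be enforced in the crawler with a simple scope check. The sketch below assumes that "the Stanford domain" means stanford.edu and its subdomains, and reads "a couple of hundred papers" as a cap of 200. Both values, the class name CrawlScope, and the sample URLs are assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: decides whether the crawler may fetch a URL during early development.
final class CrawlScope {
    // Assumption: the allowed domain is stanford.edu, including subdomains.
    static final String ALLOWED_DOMAIN = "stanford.edu";
    // Assumption: "a couple of hundred papers" is taken as 200.
    static final int MAX_PAPERS = 200;

    // True for http/https URLs whose host is the allowed domain or one of its subdomains.
    static boolean inScope(String url) {
        URI uri;
        try {
            uri = new URI(url.trim());
        } catch (URISyntaxException e) {
            return false; // malformed URLs are never in scope
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return false;
        }
        if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
            return false;
        }
        String h = host.toLowerCase(Locale.ROOT);
        // The leading dot prevents matches such as "notstanford.edu".
        return h.equals(ALLOWED_DOMAIN) || h.endsWith("." + ALLOWED_DOMAIN);
    }

    // Combines the domain check with the download cap.
    static boolean mayDownload(String url, int papersSoFar) {
        return papersSoFar < MAX_PAPERS && inScope(url);
    }

    public static void main(String[] args) {
        System.out.println(inScope("http://cs.stanford.edu/papers/example.pdf"));
        System.out.println(inScope("http://stanford.edu.example.com/example.pdf"));
    }
}
```

Checking the parsed host rather than searching the whole URL string for "stanford.edu" avoids being fooled by look-alike hosts or by the domain name appearing in a path or query string.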

Questions?
- Ok, concrete organization time…