Using Memex to archive and mine community Web browsing experience Soumen Chakrabarti Sandeep Srivastava Mallela Subramanyam Mitul Tiwari Indian Institute.

Presentation transcript:

Using Memex to archive and mine community Web browsing experience
Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari
Indian Institute of Technology Bombay

Information sources on the Web (WWW9)
- Web page contents
  - Early keyword search engines
- Hyperlink structure
  - Later engines: Google, Raging Search
- Searching behavior
  - Search sites monitor clicks on search results
- Browsing behavior
  - Easily captured in stand-alone hypermedia
  - Need software infrastructure for the Web

Personal Memex
- Archiving is feasible
  - ~25 GB in a lifetime
- Why archive?
  - Recall past events
  - Create a ‘profile’
  - Correlate with sites, directories, searches
- Challenges
  - Flexible architecture
  - Analysis techniques
"Your husband died, but here is his Memex" (from Jim Gray’s Turing Award Lecture)

Searching the personal Memex
- Keyword search (never lose a page)
- Advanced queries
  - Recreate my recent surfing history w.r.t. the topic ‘bicycling’
  - Extract from the MIT Web site all pages that match my ‘compiler research’ profile
- Topic taxonomy plays a central role
  - Characterized by bookmark folders
  - More familiar than ‘universal’ directories

Archiving architecture choices
- Bookmarks only or all click history
- Installed application or plug-in
  - Closer integration, e.g. with COM
- CGI and Javascript
  - Slow, hard to monitor all clicks
- Applet-servlet
  - Portable, better UI compared to HTML
- Proxy or wiretap
  - Proxy involves configuring browser

Memex block diagram
[Diagram; recoverable labels only]
- Browser: Visit, Download (client JAR), Attach, running client applet
- Memex server: event-handler servlets (Search, Folder, Context, Archive)
- Memex client-server protocol and workload sharing negotiations
- Relational metadata, Text index, Topic models
- Mining demons: Taxonomy synthesis, Resource discovery, Recommendation, Classification, Clustering

Document workflow
[Diagram; recoverable labels only]
- Browser, Memex client: page visit and bookmarking events logged
- NODE table
- X: per-document version queue (push new version; pop and discard old version)
- Demon registry: Crawler, Search indexer, Classifier service, Clustering service, Garbage collector
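
The slide gives only the diagram labels, so the following is a minimal sketch of one way a per-document version queue could work, not the Memex implementation. It assumes that each registered demon acknowledges the versions it has processed and that the garbage collector discards a version once every demon has seen it, always keeping the newest. The class and method names (`DocumentWorkflow`, `push`, `pending`, `ack`, `collect_garbage`) are hypothetical.

```python
from collections import deque


class DocumentWorkflow:
    """Per-document version queues shared by registered demons (hypothetical design)."""

    def __init__(self, demons):
        self.demons = list(demons)   # demon registry: names of the services
        self.queues = {}             # url -> deque of (version_no, content)
        self.done = {}               # url -> {demon: last version processed}
        self.next_version = {}       # url -> last version number issued

    def push(self, url, content):
        """Push a new version of a document, e.g. after a page visit is logged."""
        n = self.next_version.get(url, 0) + 1
        self.next_version[url] = n
        self.queues.setdefault(url, deque()).append((n, content))
        self.done.setdefault(url, {d: 0 for d in self.demons})
        return n

    def pending(self, demon, url):
        """Versions of this document that the given demon has not processed yet."""
        seen = self.done[url][demon]
        return [(n, c) for n, c in self.queues[url] if n > seen]

    def ack(self, demon, url, version):
        """Record that a demon has finished processing up to this version."""
        self.done[url][demon] = max(self.done[url][demon], version)

    def collect_garbage(self, url):
        """Pop versions every demon has processed, always keeping the newest one."""
        floor = min(self.done[url].values())
        queue = self.queues[url]
        dropped = 0
        while len(queue) > 1 and queue[0][0] <= floor:
            queue.popleft()
            dropped += 1
        return dropped


wf = DocumentWorkflow(["search_indexer", "classifier"])
wf.push("http://example.org/page", "first crawl")
wf.push("http://example.org/page", "second crawl")
wf.ack("search_indexer", "http://example.org/page", 2)
dropped_early = wf.collect_garbage("http://example.org/page")  # classifier still behind
wf.ack("classifier", "http://example.org/page", 2)
dropped_late = wf.collect_garbage("http://example.org/page")
```

Under these assumptions, an old version survives until the slowest demon catches up, which is why the first garbage collection pass above discards nothing.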

Folder tab
- Valuable user input and feedback on topics and example documents
- File manager-like interface
- Privacy choice
- ‘?’ indicates automatic placement by Memex classifier
- User cuts and pastes to correct or reinforce the Memex classifier

Context tab
- Choice of topic context
- Replay of recent browsing context restricted to chosen topic
- Active browser monitoring and dynamic layout of new incremental context graph
- Better mobility than one-dimensional history provided by popular browsers

Search tab
- “Find the paper about collaborative filtering I was reading a month back”
- Search using keyword and visit statistics
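
A query such as the one quoted on this slide combines a keyword match with a constraint on when the page was visited. The sketch below is illustrative only: the `Visit` record, the `search` function, the example URLs, and the ranking by visit count are all assumptions, not the Memex schema or ranking.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Visit:
    url: str
    terms: set            # lower-cased terms extracted from the page text
    visited_at: datetime


def search(visits, keywords, start, end):
    """URLs matching all keywords and visited within [start, end], most-visited first."""
    wanted = {k.lower() for k in keywords}
    hits = {}
    for v in visits:
        if start <= v.visited_at <= end and wanted <= v.terms:
            hits[v.url] = hits.get(v.url, 0) + 1
    # Rank by visit count inside the window, then by URL for a stable order.
    return sorted(hits, key=lambda u: (-hits[u], u))


now = datetime(2000, 5, 15)  # hypothetical "today"
cf = {"collaborative", "filtering"}
visits = [
    Visit("http://example.org/cf-paper", cf | {"paper"}, now - timedelta(days=31)),
    Visit("http://example.org/cf-paper", cf | {"paper"}, now - timedelta(days=29)),
    Visit("http://example.org/cf-survey", cf | {"survey"}, now - timedelta(days=30)),
    Visit("http://example.org/cf-new", cf, now - timedelta(days=2)),
    Visit("http://example.org/bikes", {"bicycling"}, now - timedelta(days=30)),
]

# "About a month back" is taken here to mean 15 to 45 days ago.
results = search(visits, ["collaborative", "filtering"],
                 now - timedelta(days=45), now - timedelta(days=15))
```

The page visited two days ago matches the keywords but falls outside the window, so only the two pages read about a month earlier are returned.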

Mining issues
- Two relations
  - occurs_in(term, document)
  - bookmarked_into(document, folder)
  - (Ignore hyperlinks for now)
- Document classification and clustering
  - Exploit ‘bookmarked_into’
- Taxonomy synthesis
  - Reconcile folders from a community of users into coherent themes
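
The two relations can be joined on the document to obtain per-folder term counts, which is the kind of input a classifier or clustering step would need. The sketch below uses the relation names from the slide, but the toy rows, the folder paths, and the helper `folder_term_counts` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows for the two relations named on the slide.
occurs_in = [            # (term, document)
    ("bike", "d1"), ("trail", "d1"),
    ("bike", "d2"), ("shop", "d2"),
    ("movie", "d3"), ("studio", "d3"),
]
bookmarked_into = [      # (document, folder)
    ("d1", "user1/Sports/Cycling"),
    ("d2", "user1/Sports/Cycling"),
    ("d3", "user2/Entertainment/Studios"),
]


def folder_term_counts(occurs_in, bookmarked_into):
    """Join the two relations on document to get term counts per folder."""
    terms_of = defaultdict(list)
    for term, doc in occurs_in:
        terms_of[doc].append(term)
    counts = defaultdict(Counter)
    for doc, folder in bookmarked_into:
        counts[folder].update(terms_of[doc])
    return counts


counts = folder_term_counts(occurs_in, bookmarked_into)
```

Each folder then acts as a labeled bag of terms supplied by the user who created it.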

Taxonomy synthesis: motivation
- Autonomy vs collaboration
  - Personalization [symbol lost] picking folders from Yahoo
  - Complex relations between users’ interests
- Need the “simplest common ground”
[Diagram; recoverable labels only: User1, User2, User3, Yahoo; folders Sports, Hiking, Cycling, Biz, Shops, Bikeshops; relations Subsumption, Tree ‘inversion’]

Taxonomy synthesis: intuition
[Diagram; recoverable labels only]
- Folders: Broadcasting, Entertainment, Media, Studios
- Documents: bbc.co.uk, kpfa.org, channel4.com, kron.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com
- Edge labels: Share documents, Share folder, Share terms

Taxonomy synthesis: intuition
[Diagram; recoverable labels only]
- Themes: Movies, TV, Radio
- Folders: Broadcasting, Entertainment, Media, Studios
- Documents: bbc.co.uk, kpfa.org, channel4.com, kron.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com

Trade-off
- Using theme nodes can simplify graph
  - Shannon encoding of folder or theme ID
- Increases distortion of term distribution
  - Kullback-Leibler (KL) distance of distorted folder w.r.t. ‘true’ folder
- Compare cost in bits
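
Both sides of the trade-off can be expressed in bits, which is what makes them comparable. The sketch below computes the two quantities with their standard definitions: a Shannon code length of -log2(p) for an ID, and the KL divergence between a folder's term distribution and that of the theme it is mapped to. The smoothing constant, the toy counts, and the function names are assumptions. The slide does not say how the two costs are combined, so the sketch does not combine them.

```python
import math
from collections import Counter


def term_distribution(counts, vocabulary, alpha=1.0):
    """Smoothed term distribution over a fixed vocabulary (additive smoothing)."""
    total = sum(counts.get(t, 0) for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocabulary}


def kl_bits(p, q):
    """KL divergence D(p || q) in bits; p and q must share the same support."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


def id_cost_bits(probability):
    """Shannon code length, in bits, of an ID that occurs with this probability."""
    return -math.log2(probability)


vocabulary = ["bike", "trail", "movie"]
folder = Counter(bike=3, trail=1)            # hypothetical 'true' folder
theme = Counter(bike=2, trail=2, movie=4)    # hypothetical broader theme

p = term_distribution(folder, vocabulary)
q = term_distribution(theme, vocabulary)

distortion = kl_bits(p, q)        # bits lost by describing the folder with the theme
mapping = id_cost_bits(1 / 4)     # bits to name one of four equally likely themes
```

With four equally likely themes, naming one costs exactly 2 bits. The distortion is zero only when the theme's term distribution equals the folder's.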

Algorithm BestSingle
- Pool all documents
- Find bottom-up hierarchical clustering (HAC) using text only
- Map each original folder to the one HAC node at the smallest KL distance
- Low mapping cost, high distortion
[Diagram; recoverable labels only: Documents, HAC Tree; folders Broadcasting, Entertainment, Media, Studios]
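
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the slide does not specify the linkage criterion or the smoothing, so the sketch merges the pair of clusters whose summed term counts have the highest cosine similarity and uses additive smoothing before computing KL. The site names come from the earlier slides, but the terms attached to them are invented for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hac_nodes(docs):
    """Bottom-up clustering; returns every node, leaves and merges, as (doc_ids, counts)."""
    active = [(frozenset([d]), Counter(terms)) for d, terms in docs.items()]
    nodes = list(active)
    while len(active) > 1:
        # Merge the most similar pair of currently active clusters.
        i, j = max(
            ((i, j) for i in range(len(active)) for j in range(i + 1, len(active))),
            key=lambda ij: cosine(active[ij[0]][1], active[ij[1]][1]),
        )
        merged = (active[i][0] | active[j][0], active[i][1] + active[j][1])
        active = [c for k, c in enumerate(active) if k not in (i, j)] + [merged]
        nodes.append(merged)
    return nodes


def distribution(counts, vocab, alpha=0.5):
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def kl_bits(p, q):
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


def best_single(docs, folders):
    """Map each folder to the single HAC node at the smallest KL distance."""
    vocab = sorted({t for terms in docs.values() for t in terms})
    nodes = hac_nodes(docs)          # text only: folder membership is not used here
    mapping = {}
    for name, doc_ids in folders.items():
        counts = Counter()
        for d in doc_ids:
            counts.update(docs[d])
        p = distribution(counts, vocab)
        best = min(nodes, key=lambda n: kl_bits(p, distribution(n[1], vocab)))
        mapping[name] = best[0]
    return mapping


docs = {  # hypothetical terms per document
    "bbc.co.uk": ["tv", "radio", "news"],
    "kpfa.org": ["radio", "news"],
    "foxmovies.com": ["movie", "studio"],
    "miramax.com": ["movie", "studio", "film"],
}
folders = {
    "Broadcasting": ["bbc.co.uk", "kpfa.org"],
    "Studios": ["foxmovies.com", "miramax.com"],
}
mapping = best_single(docs, folders)
```

Because each folder gets exactly one node, the mapping is cheap to encode. A folder whose documents are scattered across the tree would be forced onto a high, generic node, which is the high distortion the slide mentions.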

PatchHAC and Bicriteria
- PatchHAC:
  - Start with BestSingle
  - Greedily introduce additional mappings from folders to HAC nodes
- Bicriteria:
  - Start with each document as its own theme
  - Collapse greedily while total code length decreases

Conclusion
- Recording history is feasible and useful
  - Few kilobytes per day per user
- Bookmark taxonomies are a valuable source of information; can be…
  - Integrated into dynamic community-specific taxonomies
  - Used to drive discovery and collaboration
- Memex can guide peer proxy caches
  - Cooperative caching between departments

Software
- Demo:
- Client: Signed Swing/JFC applet
  - Netscape 4.5+ (IE, HotJava planned)
- Server: DB2 + Berkeley DB + Servlets
- Infrastructure for plugging in research prototypes using the Demon API
  - Clustering, classification, visualization
  - Collaborative filtering and recommendation

Related work
- Archiving, searching, categorization
  - Vistabar (Alta Vista)
  - Bookmark organizer (IBM Haifa)
  - PowerBookmarks (NEC)
  - Purple Yogi
  - Netscape roaming access, Backflip
- Mining
  - Attribute similarity via external probes
  - Non-linear dynamical systems