Using Memex to archive and mine community Web browsing experience Soumen Chakrabarti Sandeep Srivastava Mallela Subramanyam Mitul Tiwari Indian Institute.


1 Using Memex to archive and mine community Web browsing experience
Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari
Indian Institute of Technology Bombay

2 Information sources on the Web (WWW9)
- Web page contents
  - Early keyword search engines
- Hyperlink structure
  - Later engines: Google, Raging Search
- Searching behavior
  - Search sites monitor clicks on search results
- Browsing behavior
  - Easily captured in stand-alone hypermedia
  - Need software infrastructure for the Web

3 Personal Memex
- Archiving is feasible
  - ~25 GB in a lifetime
- Why archive?
  - Recall past events
  - Create a ‘profile’
  - Correlate with sites, directories, searches
- Challenges
  - Flexible architecture
  - Analysis techniques
“Your husband died, but here is his Memex” (from Jim Gray’s Turing Award Lecture)

4 Searching the personal Memex
- Keyword search (never lose a page)
- Advanced queries
  - Recreate my recent surfing history w.r.t. the topic ‘bicycling’
  - Extract from the MIT Web site all pages that match my ‘compiler research’ profile
- Topic taxonomy plays a central role
  - Characterized by bookmark folders
  - More familiar than ‘universal’ directories

5 Archiving architecture choices
- Bookmarks only or all click history
- Installed application or plug-in
  - Closer integration, e.g. with COM
- CGI and Javascript
  - Slow, hard to monitor all clicks
- Applet-servlet
  - Portable, better UI compared to HTML
- Proxy or wiretap
  - Proxy involves configuring browser

6 Memex block diagram
[Figure: block diagram. Labels: Browser; Memex server; Client JAR; Visit; Running client applet; Download; Attach; Event-handler servlets; Search; Folder; Context; Archive; Memex client-server protocol and workload sharing negotiations; Relational metadata; Text index; Mining demons; Topic models; Taxonomy synthesis; Resource discovery; Recommendation; Classification; Clustering]

7 Document workflow
[Figure: workflow diagram. Labels: Demon Registry; Per-document version queue; NODE table; Crawler; Search indexer; Classifier service; Clustering service; Garbage collector; Push new version; Pop and discard old version; Browser; Memex client; Page visit and bookmarking events logged]
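One way to read the per-document version queue is sketched below in Python. This is an illustration under assumptions, not the Memex implementation: the class `VersionQueue`, its method names, and the rule "discard an old version once every registered demon has processed it" are all hypothetical, inferred only from the slide labels "Push new version", "Pop and discard old version", and "Garbage collector".

```python
from collections import deque


class VersionQueue:
    """Hypothetical per-document queue of page versions shared by several demons."""

    def __init__(self, demon_names):
        self.demon_names = set(demon_names)
        self.versions = deque()  # oldest on the left, newest on the right

    def push(self, content):
        """Append a new version that no demon has processed yet."""
        self.versions.append({"content": content, "done": set()})

    def process(self, demon_name, handler):
        """Run handler on each version this demon has not processed."""
        for version in self.versions:
            if demon_name not in version["done"]:
                handler(version["content"])
                version["done"].add(demon_name)

    def collect_garbage(self):
        """Pop old versions that every demon has processed; keep the newest one."""
        discarded = 0
        while len(self.versions) > 1 and self.versions[0]["done"] >= self.demon_names:
            self.versions.popleft()
            discarded += 1
        return discarded


# Usage with two hypothetical demons.
queue = VersionQueue(["indexer", "classifier"])
queue.push("page text, first visit")
queue.push("page text, second visit")
indexed = []
queue.process("indexer", indexed.append)
queue.process("classifier", lambda text: None)
removed = queue.collect_garbage()
```

In this sketch the queue decouples the browser-side event logging from the slower demons: each demon catches up at its own pace, and only versions that nobody still needs are dropped.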

8 Folder tab
- Valuable user input and feedback on topics and example documents
- File manager-like interface
- Privacy choice
- ‘?’ indicates automatic placement by Memex classifier
- User cuts and pastes to correct or reinforce the Memex classifier

9 Context tab
- Choice of topic context
- Replay of recent browsing context restricted to chosen topic
- Active browser monitoring and dynamic layout of new incremental context graph
- Better mobility than one-dimensional history provided by popular browsers

10 Search tab
- “Find the paper about collaborative filtering I was reading a month back”
- Search using keyword and visit statistics
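A query like the one quoted on this slide combines a keyword filter with visit statistics. The Python sketch below shows one possible combination. Everything in it is assumed for illustration: the `search` function, the page record layout, the example.org URLs, and the ranking rule (number of visits near the target date) are hypothetical and are not taken from Memex.

```python
from datetime import date, timedelta


def search(pages, keywords, today, days_back, window=7):
    """Return URLs whose text has all keywords and that were visited
    within `window` days of the date `days_back` days before `today`."""
    target = today - timedelta(days=days_back)
    hits = []
    for page in pages:
        text = page["text"].lower()
        if not all(k.lower() in text for k in keywords):
            continue
        near = [v for v in page["visits"] if abs((v - target).days) <= window]
        if near:
            hits.append((len(near), page["url"]))
    # More visits near the target date rank first.
    return [url for _, url in sorted(hits, reverse=True)]


# Hypothetical archive contents.
pages = [
    {"url": "http://example.org/cf-paper",
     "text": "A paper on collaborative filtering",
     "visits": [date(2000, 4, 14), date(2000, 4, 16)]},
    {"url": "http://example.org/cf-news",
     "text": "Collaborative filtering startup news",
     "visits": [date(2000, 5, 14)]},
    {"url": "http://example.org/bikes",
     "text": "Bicycling routes",
     "visits": [date(2000, 4, 15)]},
]
results = search(pages, ["collaborative", "filtering"],
                 today=date(2000, 5, 15), days_back=30)
```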

11 Mining issues
- Two relations
  - occurs_in(term, document)
  - bookmarked_into(document, folder)
  - (Ignore hyperlinks for now)
- Document classification and clustering
  - Exploit ‘bookmarked_into’
- Taxonomy synthesis
  - Reconcile folders from a community of users into coherent themes
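The two relations can be pictured as plain lists of pairs. The following Python sketch is not the Memex schema: the sample rows and the helper `folder_term_counts` are hypothetical, chosen only to show how joining `bookmarked_into` with `occurs_in` on the document yields a per-folder term distribution that a classifier or clusterer could use.

```python
from collections import Counter, defaultdict

# occurs_in(term, document): hypothetical sample rows, one per term occurrence.
occurs_in = [
    ("radio", "kpfa.org"),
    ("news", "kpfa.org"),
    ("news", "bbc.co.uk"),
    ("tv", "bbc.co.uk"),
    ("film", "miramax.com"),
]

# bookmarked_into(document, folder): hypothetical sample rows.
bookmarked_into = [
    ("kpfa.org", "Broadcasting"),
    ("bbc.co.uk", "Broadcasting"),
    ("miramax.com", "Studios"),
]


def folder_term_counts(occurs_in, bookmarked_into):
    """Join the two relations on document and count terms per folder."""
    folders_of = defaultdict(set)
    for doc, folder in bookmarked_into:
        folders_of[doc].add(folder)
    counts = defaultdict(Counter)
    for term, doc in occurs_in:
        for folder in folders_of.get(doc, ()):
            counts[folder][term] += 1
    return counts


counts = folder_term_counts(occurs_in, bookmarked_into)
```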

12 Taxonomy synthesis: motivation
- Autonomy vs collaboration
  - Personalization  picking folders from Yahoo
  - Complex relations between users’ interests
- Need the “simplest common ground”
[Figure: folder trees for User1, User2, User3, and Yahoo. Labels: Sports; Hiking; Cycling; Biz; Shops; Bikeshops. Annotations: Subsumption; Tree ‘inversion’]

13 Taxonomy synthesis: intuition
[Figure: folders linked to documents. Folders: Broadcasting; Entertainment; Media; Studios. Documents: bbc.co.uk; kpfa.org; channel4.com; kron.com; kcbs.com; foxmovies.com; lucasfilms.com; miramax.com. Annotations: Share documents; Share folder; Share terms]

14 Taxonomy synthesis: intuition
[Figure: folders and documents with themes. Themes: Movies; TV; Radio. Folders: Broadcasting; Entertainment; Media; Studios. Documents: bbc.co.uk; kpfa.org; channel4.com; kron.com; kcbs.com; foxmovies.com; lucasfilms.com; miramax.com]

15 Trade-off
- Using theme nodes can simplify graph
  - Shannon encoding of folder or theme ID
- Increases distortion of term distribution
  - Kullback-Leibler (KL) distance of distorted folder w.r.t. ‘true’ folder
- Compare cost in bits
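Both sides of this trade-off can be measured in bits. The Python sketch below uses the standard definitions: an ID that occurs with probability p costs -log2(p) bits under a Shannon code, and the KL distance of a true distribution from its approximation is the sum of p * log2(p / q) over terms. The function names, the `epsilon` floor for missing terms, and the two sample distributions are assumptions made for the example.

```python
import math


def code_length_bits(probability):
    """Shannon code length, in bits, of an ID that occurs with this probability."""
    return -math.log2(probability)


def kl_bits(true_dist, approx_dist, epsilon=1e-9):
    """KL divergence D(true || approx) in bits.

    epsilon is an assumed floor for terms missing from approx_dist,
    so the divergence stays finite.
    """
    total = 0.0
    for term, p in true_dist.items():
        if p > 0.0:
            q = max(approx_dist.get(term, 0.0), epsilon)
            total += p * math.log2(p / q)
    return total


# Hypothetical term distributions: a 'true' folder and a broader theme.
folder = {"tv": 0.5, "radio": 0.5}
theme = {"tv": 0.25, "radio": 0.25, "film": 0.5}

id_cost = code_length_bits(0.25)    # an ID used a quarter of the time
distortion = kl_bits(folder, theme)  # cost of describing the folder by the theme
```

Because both quantities are in bits, they can be added into a single total and compared across candidate taxonomies.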

16 Algorithm BestSingle
- Pool all documents
- Find bottom-up hierarchical clustering (HAC) using text only
- Map each original folder to the one HAC node at the smallest KL distance
- Low mapping cost, high distortion
[Figure labels: Documents; HAC Tree; Broadcasting; Entertainment; Media; Studios]
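The BestSingle steps can be sketched in a few lines of Python. This is a toy version under stated assumptions, not the authors' code: clusters are merged by cosine similarity of raw term counts (the slides do not say which similarity the HAC uses), every leaf and merged cluster counts as a candidate HAC node, and the sample documents and terms are hypothetical.

```python
import math
from collections import Counter


def normalize(counts):
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def kl_bits(p, q, epsilon=1e-9):
    """KL divergence D(p || q) in bits, with an assumed floor for missing terms."""
    return sum(pv * math.log2(pv / max(q.get(t, 0.0), epsilon))
               for t, pv in p.items() if pv > 0.0)


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hac_nodes(doc_terms):
    """Bottom-up clustering; returns every node as (document set, term counts)."""
    active = [(frozenset([d]), Counter(terms)) for d, terms in doc_terms.items()]
    nodes = list(active)
    while len(active) > 1:
        # Merge the most similar pair of active clusters.
        pairs = ((i, j) for i in range(len(active)) for j in range(i + 1, len(active)))
        i, j = max(pairs, key=lambda ij: cosine(active[ij[0]][1], active[ij[1]][1]))
        merged = (active[i][0] | active[j][0], active[i][1] + active[j][1])
        active = [c for k, c in enumerate(active) if k not in (i, j)] + [merged]
        nodes.append(merged)
    return nodes


def best_single(doc_terms, folders):
    """Map each folder to the one HAC node at the smallest KL distance."""
    nodes = hac_nodes(doc_terms)
    mapping = {}
    for name, docs in folders.items():
        folder_counts = Counter()
        for d in docs:
            folder_counts.update(doc_terms[d])
        p = normalize(folder_counts)
        mapping[name] = min(nodes, key=lambda n: kl_bits(p, normalize(n[1])))[0]
    return mapping


# Hypothetical pooled documents and two users' folders.
doc_terms = {
    "bbc.co.uk": ["tv", "news", "radio"],
    "kpfa.org": ["radio", "news"],
    "miramax.com": ["film", "movie"],
    "lucasfilms.com": ["film", "studio"],
}
folders = {
    "Broadcasting": ["bbc.co.uk", "kpfa.org"],
    "Studios": ["miramax.com", "lucasfilms.com"],
}
mapping = best_single(doc_terms, folders)
```

In this toy data each folder happens to coincide with one HAC node, so distortion is zero. In general a folder rarely matches a text-only cluster exactly, which is where the high distortion noted on the slide comes from.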

17 PatchHAC and Bicriteria
- PatchHAC:
  - Start with BestSingle
  - Greedily introduce additional mappings from folders to HAC nodes
- Bicriteria:
  - Start with each document a theme
  - Collapse greedily while total code length decreases
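The Bicriteria loop can be illustrated with a small Python sketch. The cost model here is an assumption made for the example, since the slides do not give the exact formula: each folder pays a uniform-code cost of log2(number of themes) bits per theme it references, plus its term count times the KL distance between its true term distribution and the pooled distribution of the themes it references. The function names and sample data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def normalize(counts):
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def kl_bits(p, q):
    """KL divergence in bits; q covers every term of p in this sketch."""
    return sum(pv * math.log2(pv / q[t]) for t, pv in p.items() if pv > 0.0)


def total_cost(themes, doc_terms, folders):
    """Assumed cost in bits: theme-ID mapping cost plus KL distortion per folder."""
    id_bits = math.log2(len(themes))  # bits per theme ID under a uniform code
    cost = 0.0
    for docs in folders.values():
        used = [t for t in themes if t & set(docs)]
        true_counts = Counter(w for d in docs for w in doc_terms[d])
        approx_counts = Counter(w for t in used for d in t for w in doc_terms[d])
        distortion = sum(true_counts.values()) * kl_bits(
            normalize(true_counts), normalize(approx_counts))
        cost += len(used) * id_bits + distortion
    return cost


def bicriteria(doc_terms, folders):
    themes = [frozenset([d]) for d in doc_terms]  # start: each document is a theme
    cost = total_cost(themes, doc_terms, folders)
    while len(themes) > 1:
        candidates = []
        for a, b in combinations(themes, 2):
            merged = [t for t in themes if t not in (a, b)] + [a | b]
            candidates.append((total_cost(merged, doc_terms, folders), merged))
        best_cost, best = min(candidates, key=lambda c: c[0])
        if best_cost >= cost:
            break  # no merge lowers the total code length
        themes, cost = best, best_cost
    return themes, cost


# Hypothetical documents and folders.
doc_terms = {
    "bbc.co.uk": ["tv", "news", "radio"],
    "kpfa.org": ["radio", "news"],
    "miramax.com": ["film", "movie"],
    "lucasfilms.com": ["film", "studio"],
}
folders = {
    "Broadcasting": ["bbc.co.uk", "kpfa.org"],
    "Studios": ["miramax.com", "lucasfilms.com"],
}
themes, cost = bicriteria(doc_terms, folders)
```

On this data the loop merges documents within each folder, which lowers mapping cost at no distortion, and stops before merging the two resulting themes, because that last merge would add more distortion bits than it saves in theme IDs.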

18 Conclusion
- Recording history is feasible and useful
  - Few kilobytes per day per user
- Bookmark taxonomies are a valuable source of information; can be…
  - Integrated into dynamic community-specific taxonomies
  - Used to drive discovery and collaboration
- Memex can guide peer proxy caches
  - Cooperative caching between departments

19 Software
- Demo: www.cs.berkeley.edu/~soumen
- Client: Signed Swing/JFC applet
  - Netscape 4.5+ (IE, HotJava planned)
- Server: DB2 + Berkeley DB + Servlets
- Infrastructure for plugging in research prototypes using the Demon API
  - Clustering, classification, visualization
  - Collaborative filtering and recommendation

20 Related work
- Archiving, searching, categorization
  - Vistabar (Alta Vista)
  - Bookmark organizer (IBM Haifa)
  - PowerBookmarks (NEC)
  - Purple Yogi
  - Netscape roaming access, Backflip
- Mining
  - Attribute similarity via external probes
  - Non-linear dynamical systems

