Given two randomly chosen web pages p1 and p2, what is the probability that you can click your way from p1 to p2? 30%? >50%? ~100%? (answer at the end)

Presentation transcript:

Given two randomly chosen web pages p1 and p2, what is the probability that you can click your way from p1 to p2? 30%? >50%? ~100%? (answer at the end)
CSE 494/598 Information Retrieval, Mining and Integration on the Internet

6/2/2015 9:29 PM Copyright © 2001 S. Kambhampati

Web as a bow-tie
Region sizes from the bow-tie figure: 39%, 21%, 19%, 14%, 7%
Probability that two pages are connected: (0.39 + 0.21) * (0.39 + 0.19) = 0.348
Reference: Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, Eli Upfal: The Web as a Graph. PODS 2000: 1-10
Given two randomly chosen web pages p1 and p2, what is the probability that you can click your way from p1 to p2? 30%? >50%? ~100%? (answer at the end)
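A minimal sketch of how the bow-tie answer can be worked out. It assumes that 39% is the strongly connected core and that 21% and 19% are the IN and OUT regions (which is which does not change the product); the function name is made up for illustration. Under this simplified reading, p1 must be able to reach the core and p2 must be reachable from it.

```python
# Region sizes as listed on the slide. Assumption: 0.39 is the strongly
# connected core, 0.21 and 0.19 are the IN and OUT regions.
CORE, IN_REGION, OUT_REGION = 0.39, 0.21, 0.19


def reach_probability(core, in_region, out_region):
    """Probability that a click path exists from p1 to p2 (simplified bow-tie reading)."""
    p_source_ok = core + in_region    # p1 sits in IN or in the core
    p_target_ok = core + out_region   # p2 sits in the core or in OUT
    return p_source_ok * p_target_ok


if __name__ == "__main__":
    print(round(reach_probability(CORE, IN_REGION, OUT_REGION), 3))
```

So the answer to the opening question is closer to 30% than to 100%, at least for the crawl these numbers came from.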

Contact Info
Instructor: Subbarao Kambhampati (Rao)
– URL: rakaposhi.eas.asu.edu/rao.html
– Course URL: rakaposhi.eas.asu.edu/cse494
– Class: T/Th 3:15-4:30 (BYAC 190)
– Office hours: TBD (BY 560)
TA: Bhaumik Chokshi
– Office: BY 557BB

Course Outcomes
After this course, you should be able to answer:
– How do search engines work, and why are some better than others?
– Can the web be seen as a collection of (semi)structured data/knowledge bases?
– Can useful patterns be mined from the pages/data of the web?
– Can we exploit the connectedness of the web pages?
What did you think these were going to be??

Main Topics
Approximately three halves plus a bit:
– Information retrieval
– Information integration/aggregation
– Information mining
– Other topics as permitted by time

Week by Week (from Fall 2005)
– Introduction
– Text retrieval; vector-space ranking
– Indexing/Retrieval issues
– Correlation analysis & Latent Semantic Indexing
– Search engine technology
– Anatomy of Google etc.
– Clustering
– Text Classification
– Filtering/Personalization
– Web & Databases: Why do we even care?
– XML and handling semi-structured data
– Semantic web and its standards (RDF/RDF-S/OWL...)
– Information Extraction
– Data/Information Integration/aggregation
– Query Processing in Data Integration: Gathering and Using Source Statistics
– Bridging Information Retrieval and Databases
– Social Networks
– Interactive Review + a (corny) ending (Here are the notes by the TA of the student review comments)

Topics Covered in Fall
1. Introduction (8/22)
2. Text retrieval; vector-space ranking
3. Indexing/Retrieval issues
4. Correlation analysis & Latent Semantic Indexing
5. Search engine technology
6. Anatomy of Google etc.
7. Clustering
8. Text Classification (m)
9. Filtering/Personalization
10. Web & Databases: Why do we even care?
11. XML and handling semi-structured data
12. Semantic web and its standards (RDF/RDF-S/OWL...)
13. Information Extraction
14. Data/Information Integration/aggregation
15. Query Processing in Data Integration: Gathering and Using Source Statistics
16. Bridging Information Retrieval and Databases
17. Social Networks

Books (or lack thereof)
There are no required textbooks
– Primary source is a set of readings that I will provide (see the “readings” button on the homepage)
  Relative importance of readings is signified by their level of indentation
There are some good reference books (which should be available in the bookstore)
– * Modeling the Internet and the Web (Baldi, Frasconi and Smyth)
– Modern Information Retrieval (Baeza-Yates et al.)
– Mining the Web (Soumen Chakrabarti)
– Data on the Web (Abiteboul et al.)

Pre-reqs
Useful course background
– CSE 310 Data structures (also a 4xx course on Algorithms)
– CSE 412 Databases
– CSE 471 Intro to AI
+ some of that math you thought you would never use..
– MAT 342 Linear Algebra: matrices; eigenvalues; eigenvectors; singular value decomposition
  – Useful for information retrieval and link analysis (PageRank/authorities-hubs)
– ECE 389 Probability and Statistics for Engg. Prob solving: discrete probabilities; Bayes rule, long tail, power laws etc.
  – Useful for datamining stuff (e.g. naïve Bayes classifier)
You are primarily responsible for refreshing your memory...
Homework Ready…

What this course is not (intended to be)
This course is not intended to
– Teach you how to be a web master
– Expose you to all the latest x-buzzwords in technology: XML/XSL/XPOINTER/XPATH
  – (okay, maybe a little)
– Teach you web/javascript/java/jdbc etc. programming
“[...] there is a difference between training and education. If computer science is a fundamental discipline, then university education in this field should emphasize enduring fundamental principles rather than transient current technology.” -Peter Wegner, Three Computing Cultures

Neither is this course allowed to teach you how to really make money on the web

Mid-life crisis as a Personal Motivation
My research group is schizophrenic
– Plan-yochan: Planning, Scheduling, CSP, a bit of learning etc.
– Db-yochan: Information integration, retrieval, mining etc. (rakaposhi.eas.asu.edu/i3)
  Involved in the ET-I3 initiative (enabling technologies for intelligent information integration)
  Did a fair amount of publications, tutorials and workshop organization..
  – One student went to Microsoft Research; one to MSN Search; two to Amazon; and a fourth is at IBM India Research Labs

Grading etc.
– Projects/Homeworks (~45%)
– Midterm / final (~40%)
– Participation (~15%)
  Reading (papers, web - no single text)
  Class interaction (***VERY VERY IMPORTANT***)
  – will be evaluated by attendance, attentiveness, and occasional quizzes
Subject to (minor) changes
471 and 598 students are treated as separate clusters while awarding final letter grades (no other differentiation)

Projects (tentative)
One project with 3 parts
– Extending and experimenting with a mini-search engine
  Project description available online (tentative)
Expected background
– Competence in Java programming (Gosling level is fine; fledgling level probably not..). We will not be teaching you Java

Honor Code/Trawling the Web
Almost any question I can ask you is probably answered somewhere on the web!
– May even be on my own website
  Even if I disable access, Google caches!
…You are still required to do all course-related work (homework, exams, projects etc.) yourself
– Trawling the web in search of exact answers is considered academic plagiarism
– If in doubt, please check with the instructor

Sociological issues
Attendance in the class is *very* important
– I take unexplained absences seriously
Active concentration in the class is *very* important
– Not the place for catching up on sleep/State Press reading
Interaction/interactiveness is highly encouraged both in and outside the class
– There will be a class blog…

Occupational Hazards..
Caveat: Life on the bleeding edge
– 494 is midway between a 4xx class & 591 seminars
It is a “SEMI-STRUCTURED” class.
– No required textbook (recommended books, papers)
– Need a sense of adventure..and you are assumed to have it, considering that you signed up voluntarily
Being offered for the fifth time..and it seems to change every time..
– I modify slides until the last minute…
  To avoid falling asleep during lecture…
Silver Lining?

Life with a homepage..
I will not be giving any handouts
– All class-related material will be accessible from the web page
Homeworks may be specified incrementally
– (one problem at a time)
– The slides used in the lecture will be available on the class page
  The slides will be “loosely” based on the ones I used in f02 (these are available on the homepage)
  – However, I reserve the right to modify them until the last minute (and sometimes beyond it).
When printing slides, avoid printing the hidden slides

Readings for next week
The chapter on Text Retrieval, available in the readings list
– (alternate/optional reading) Chapter 2 of Information Retrieval (Models of text)

8/24

Course Overview (take 2)

Web as a collection of information
Web viewed as a large collection of __________
– Text, structured data, semi-structured data
– (multi-media/updates/transactions etc. ignored for now)
So what do we want to do with it?
– Search, directed browsing, aggregation, integration, pattern finding
How do we do it?
– Depends on your model (text/structured/semi-structured)

Structure
How will search and querying on these three types of data differ?
Example data: a generic web page containing text; a movie review; an employee record
Labels from the figure: [English], [SQL], [XML], Semi-Structured

Structure helps querying
Expressive queries
– Give me all pages that have the keywords “Get Rich Quick”
– Give me the social security numbers of all the employees who have stayed with the company for more than 5 years, and whose yearly salaries are three standard deviations away from the average salary
– Give me all mails from people from ASU written this year, which are relevant to “get rich quick”
Efficient searching
– equality vs. “similarity”
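To make the contrast concrete, here is a small sketch of the first two queries. The page texts, the `employees` table, its column names and the fake SSNs are all invented for illustration; the standard deviation is computed in Python because SQLite has no built-in for it, and "three standard deviations away" is read here as "at least three population standard deviations from the mean".

```python
import sqlite3
import statistics

# Hypothetical unstructured data: page id -> text.
pages = {"p1": "Get rich quick with this one trick", "p2": "Course syllabus"}


def keyword_search(pages, phrase):
    """Text model: all we can ask is whether the phrase occurs in the page."""
    phrase = phrase.lower()
    return sorted(pid for pid, text in pages.items() if phrase in text.lower())


def outlier_veterans(conn, min_years=5, k=3.0):
    """Structured model: an expressive query over typed columns."""
    salaries = [row[0] for row in conn.execute("SELECT salary FROM employees")]
    mean = statistics.mean(salaries)
    std = statistics.pstdev(salaries)  # population standard deviation
    rows = conn.execute(
        "SELECT ssn FROM employees "
        "WHERE years > ? AND ABS(salary - ?) >= ? ORDER BY ssn",
        (min_years, mean, k * std),
    )
    return [r[0] for r in rows]


# Toy database: eleven ordinary employees and one extreme salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (ssn TEXT, years INTEGER, salary REAL)")
rows = [(f"000-00-{i:04d}", 6, 50000.0) for i in range(11)]
rows.append(("000-00-9999", 8, 500000.0))
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

if __name__ == "__main__":
    print(keyword_search(pages, "Get Rich Quick"))
    print(outlier_veterans(conn))
```

The keyword query can only test for the presence of words; the second query is answerable at all only because the data has named, typed fields.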

Does Web have Structured data?
Isn’t the web all text?
– The invisible web
  Most web servers have back-end database servers
  They dynamically convert (wrap) the structured data into readable English
  – => The capital of India is New Delhi.
  – So, if we can “unwrap” the text, we have structured data!
    » (un)wrappers, learning wrappers etc…
  – Note also that such dynamic pages cannot be crawled...
– The semi-structured web
  Most pages are at least “semi”-structured
  The XML standard is expected to ease the presentation/on-the-wire transfer of such pages. (BUT…..)
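A toy sketch of wrapping and unwrapping, assuming the server renders each database row through a fixed sentence template. The template, the field names `country` and `capital`, and both function names are hypothetical; real wrappers are usually learned or hand-tuned per site.

```python
import re

# Hypothetical template a back end might use to render one database row.
TEMPLATE = "The capital of {country} is {capital}."
PATTERN = re.compile(r"^The capital of (?P<country>.+) is (?P<capital>.+)\.$")


def wrap(row):
    """Structured row -> readable English sentence."""
    return TEMPLATE.format(**row)


def unwrap(sentence):
    """Readable English sentence -> structured row, or None if it does not fit."""
    match = PATTERN.match(sentence)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sentence = wrap({"country": "India", "capital": "New Delhi"})
    print(sentence)
    print(unwrap(sentence))
```

The unwrapper works only as long as the page keeps following the template, which is one reason wrapper learning is a topic in its own right.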

How to get Structure?
When the underlying data is already structured, do unwrapping
– Web already has a lot of structured data!
– Invisible web…that disguises itself
..else extract structure
– Go from text to structured data (using quasi-NLP techniques)
..or annotate metadata to add structure
– Semantic web idea..

Adapting old disciplines for Web-age
Information (text) retrieval
– Scale of the web
– Hypertext/link structure
– Authority/hub computations
Social Network Analysis
– Ease of tracking/centrally representing social networks
Databases
– Multiple databases
  Heterogeneous, access limited, partially overlapping
– Network (un)reliability
Datamining [Machine Learning/Statistics/Databases]
– Learning patterns from large-scale data

Information Retrieval
Traditional Model
– Given
  a set of documents
  a query expressed as a set of keywords
– Return
  a ranked set of documents most relevant to the query
– Evaluation:
  Precision: fraction of returned documents that are relevant
  Recall: fraction of relevant documents that are returned
  Efficiency
Web-induced headaches
– Scale (billions of documents)
– Hypertext (inter-document connections)
Consequently
– Ranking that takes link structure into account
  Authority/Hub
– Indexing and retrieval algorithms that are ultra fast
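The two evaluation measures above can be sketched in a few lines. The function name and the document ids are illustrative only, and returning 0.0 for an empty set is a convention chosen here, not something the slide specifies.

```python
def precision_recall(returned, relevant):
    """Precision and recall of a result set against a set of relevant documents."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0  # relevant share of what was returned
    recall = hits / len(relevant) if relevant else 0.0     # returned share of what is relevant
    return precision, recall


if __name__ == "__main__":
    # Four documents returned, three documents actually relevant, two in common.
    print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"]))
```

Returning every document drives recall to 1 while precision collapses, which is why the two are reported together.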

Social Networks
Traditional Model
– Given
  a set of entities (humans)
  and their relations (network)
– Return
  Measures of centrality and importance
  Propagation of trust (paths through networks)
– Many uses
  Spread of diseases
  Spread of rumours
  Popularity of people
  Friends circle of people
Web-induced headaches
– Scale (billions of entities)
– Implicit vs. explicit links
  Hypertext (inter-entity connections easier to track)
  Interest-based links
Consequently
– Ranking that takes link structure into account
  Authority/Hub
– Recommendations (collaborative filtering; trust propagation)
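As a rough illustration of authority/hub ranking, here is a plain power-iteration sketch in the style of the HITS algorithm. The tiny link graph, the function name and the fixed iteration count are assumptions made for the example; it is not tuned for web scale.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a graph given as {source: [targets]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {n: sum(hub[s] for s, ts in links.items() if n in ts) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # A good hub points to good authorities.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical graph: a and b both link to c, and c links to d.
links = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub_scores, auth_scores = hits(links)

if __name__ == "__main__":
    print(max(auth_scores, key=auth_scores.get))
```

In this graph `c` ends up as the top authority because two pages point to it, and `a` and `b` score as better hubs than `c`.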

Information Integration: Database Style Retrieval
Traditional Model (relational)
– Given:
  A single relational database
  – Schema
  – Instances
  A relational (SQL) query
– Return:
  All tuples satisfying the query
Evaluation
– Soundness/Completeness
– Efficiency
Web-induced headaches
– Many databases
  all are partially complete
  overlapping
  heterogeneous schemas
  access limitations
– Network (un)reliability
Consequently
– Newer models of DB
– Newer notions of completeness
– Newer approaches for query planning
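A toy sketch of what changes when one query has to be answered over several sources. The two movie sources, their schemas, the mediated `(title, year)` tuple and the function names are all invented; real integration systems also plan which sources to call and in what order.

```python
# Two hypothetical sources describing movies under different schemas.
source_a = [{"title": "Alien", "year": 1979}, {"title": "Heat", "year": 1995}]
source_b = [{"name": "Heat", "released": "1995"}, {"name": "Brazil", "released": "1985"}]


def from_a(rec):
    """Map a source-A record to the mediated (title, year) tuple."""
    return (rec["title"], rec["year"])


def from_b(rec):
    """Map a source-B record to the mediated (title, year) tuple."""
    return (rec["name"], int(rec["released"]))


def integrated_query(min_year, sources):
    """Answer 'movies released in or after min_year' over all reachable sources."""
    answers = set()  # a set removes tuples that overlapping sources both supply
    for fetch, to_mediated in sources:
        try:
            records = fetch()
        except ConnectionError:
            continue  # an unreachable source makes the answer less complete, not wrong
        answers.update(t for t in map(to_mediated, records) if t[1] >= min_year)
    return sorted(answers)


sources = [(lambda: source_a, from_a), (lambda: source_b, from_b)]

if __name__ == "__main__":
    print(integrated_query(1980, sources))
```

Every answer returned is still sound, but completeness now depends on which sources exist, overlap and respond, which is the sense in which newer notions of completeness are needed.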

Further headaches brought on by Semi-structured retrieval
If everyone puts their pages in XML
– Introducing similarity-based retrieval into traditional databases
– Standardizing on shared ontologies...

Learning Patterns (Web/DB mining)
Traditional classification learning (supervised)
– Given a set of structured instances of a pattern (concept)
– Induce the description of the pattern
Evaluation:
– Accuracy of classification on the test data
– (efficiency of learning)
Mining headaches
– Training data is not obvious
– Training data is massive
– Training instances are noisy and incomplete
Consequently
– Primary emphasis on fast classification
  Even at the expense of accuracy
– 80% of the work is “data cleaning”
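To ground the supervised setting, here is a minimal multinomial naive Bayes text classifier with add-one smoothing, evaluated by accuracy on held-out examples. Naive Bayes is chosen only as a familiar fast classifier; the training sentences, labels and function names are made up.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Pick the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_set):
    """Fraction of test examples classified correctly."""
    return sum(classify(model, text) == label for text, label in test_set) / len(test_set)


training = [
    ("get rich quick", "spam"), ("quick money now", "spam"),
    ("lecture notes posted", "ham"), ("homework due friday", "ham"),
]
held_out = [("rich money", "spam"), ("homework notes", "ham")]
model = train(training)

if __name__ == "__main__":
    print(accuracy(model, held_out))
```

Training is a single counting pass and classification is a handful of additions per word, which fits the emphasis on speed over the last bit of accuracy.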

Finding “Sweet Spots” in computer-mediated cooperative work
It is possible to get by with techniques blithely ignorant of semantics, when you have humans in the loop
– All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions”
– …and the human very gratefully does the in-depth analysis on those few potential solutions
Examples:
– The incredible success of the “Bag of Words” model!
  Bag of letters would be a disaster ;-)
  Bag of sentences and/or NLP would be good
  – ..but only to your discriminating and irascible searchers ;-)
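A small sketch of the bag-of-words idea and of why bag of letters would be a disaster. Cosine similarity is used here as one common way to compare bags; the function names and example strings are invented.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Word counts only; word order is thrown away."""
    return Counter(text.lower().split())


def bag_of_letters(text):
    """Letter counts only; even word identity is thrown away."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def cosine(a, b):
    """Cosine similarity between two bags (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Same words in a different order look identical to a bag of words.
    print(cosine(bag_of_words("dog bites man"), bag_of_words("man bites dog")))
    # Different words with the same letters look identical to a bag of letters.
    print(cosine(bag_of_letters("listen"), bag_of_letters("silent")))
    print(cosine(bag_of_words("listen"), bag_of_words("silent")))
```

Bag of words confuses "dog bites man" with "man bites dog", a loss a human reader of the results can repair; bag of letters cannot even tell "listen" from "silent".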

Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs
A lot of exciting research related to the web currently involves “co-opting” the masses to help with large-scale tasks
– It is like “cycle stealing”, except we are stealing “human brain cycles” (the most idle of the computers if there ever was one ;-)
  Remember the mice in The Hitchhiker’s Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..)
– Collaborative knowledge compilation (Wikipedia!)
– Collaborative curation
– Collaborative tagging
– Paid collaboration/contracting
Many big open issues
– How do you pose the problem such that it can be solved using collaborative computing?
– How do you “incentivize” people into letting you steal their brain cycles?
  Pay them! (Amazon mturk.com)
  Make it fun (ESP game)

Tapping into the Collective Unconscious
Another thread of exciting research is driven by the realization that the WEB is not random at all!
– It is written by humans
– …so analyzing its structure and content allows us to tap into the collective unconscious..
  Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness”
Examples:
– Analyzing term co-occurrences in web-scale corpora to capture semantic information (today’s paper)
– Analyzing the link structure of the web graph to discover communities
  DoD and NSA are very much into this as a way of breaking terrorist cells
– Analyzing the transaction patterns of customers (collaborative filtering)
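One very simple reading of “term co-occurrences” is counting how often two terms appear in the same document. The sketch below does only that; the three documents and the function name are invented, and it is not the method of the paper mentioned on the slide.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(documents):
    """Count, for each unordered term pair, the number of documents containing both."""
    counts = Counter()
    for doc in documents:
        terms = sorted(set(doc.lower().split()))  # sorted so each pair has one canonical order
        counts.update(combinations(terms, 2))
    return counts


docs = ["search engine ranking", "search engine index", "pasta recipe"]
pair_counts = cooccurrences(docs)

if __name__ == "__main__":
    print(pair_counts.most_common(1))
```

Here "engine" and "search" co-occur in two documents while "pasta" and "search" never do, a purely syntactic signal that nonetheless hints at which terms are related.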