X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox Associate Dean for.


X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox Associate Dean for Research and Graduate Studies, School of Informatics and Computing Indiana University Bloomington 2013

The Course in One Sentence
Study Clouds running Data Analytics processing Big Data to solve problems in X-Informatics

Document Preparation

Inverted Index

Index Construction

Then sort by termID and then docID
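The sort step above can be illustrated with a minimal sketch of sort-based index construction. It assumes whitespace tokenization, integer docIDs, and termIDs assigned in order of first appearance; the function names `build_inverted_index` and `lookup` and the three toy documents are hypothetical, not taken from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build an inverted index from a dict mapping docID -> text."""
    term_ids = {}  # term -> termID, assigned in order of first appearance
    pairs = []     # one (termID, docID) pair per token occurrence
    for doc_id, text in docs.items():
        for token in text.lower().split():
            term_id = term_ids.setdefault(token, len(term_ids))
            pairs.append((term_id, doc_id))

    # Tuples sort by first element, then second: termID, then docID.
    pairs.sort()

    # Collapse the sorted pairs into postings lists, skipping repeated docIDs.
    postings = defaultdict(list)
    for term_id, doc_id in pairs:
        if not postings[term_id] or postings[term_id][-1] != doc_id:
            postings[term_id].append(doc_id)
    return term_ids, dict(postings)


# Toy collection (illustrative only).
docs = {
    1: "web search ranks pages",
    2: "text mining of web pages",
    3: "search and text mining",
}
term_ids, index = build_inverted_index(docs)


def lookup(term):
    """Return the sorted list of docIDs containing the term."""
    return index.get(term_ids.get(term), [])
```

Because the pairs are sorted before grouping, every postings list comes out ordered by docID, which is what makes merging lists at query time cheap.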

Query Structure and Processing

Link Structure Analysis including PageRank

Size of face proportional to PageRank

PageRank d=0.85

d = 0.85

PageRank
PageRank is the probability that a page will be visited by a surfer who clicks each link on the current page with equal probability, with minor corrections for pages that have no outgoing links.
It is found iteratively: at each iteration, a page receives from every page pointing at it a contribution equal to that page's PageRank divided by the number of links on that page:
$$PR(i) = \sum_{j \to i} \frac{PR(j)}{L(j)}$$
where the sum runs over the pages j pointing at page i and L(j) is the number of pages linked on page j.
One adds to this the chance 1 - d that the surfer types a random URL into the web browser. With N the total number of pages on the web, that takes PageRank to
$$PR(i) = \frac{1 - d}{N} + d \sum_{j \to i} \frac{PR(j)}{L(j)}$$
On general principles, this will converge whatever the starting point. It can be written as an iterative matrix multiplication.
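The iteration described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the lecture: the function name `pagerank`, the four-page graph, the convergence tolerance, and the choice to spread the rank of pages with no outgoing links evenly over all pages are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate PR(i) = (1 - d)/N + d * sum_{j -> i} PR(j)/L(j).

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # any starting point converges
    for _ in range(max_iter):
        # Assumption: rank held by pages without out-links is shared equally.
        dangling = sum(pr[p] for p in pages if not links[p])
        new = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            for q in links[p]:
                # Page p passes PR(p) / (number of links on p) to each target.
                new[q] += d * pr[p] / len(links[p])
        change = sum(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if change < tol:
            break
    return pr


# Hypothetical four-page web: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

In this toy graph, page D has no incoming links, so its rank settles at the random-jump floor (1 - d)/N, while C, which collects links from three pages, ranks highest.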

Related Applications
Thinking of PageRank as reputation:
A version of PageRank has recently been proposed as a replacement for the traditional Institute for Scientific Information (ISI) impact factor, and implemented at eigenfactor.org. Instead of merely counting total citations to a journal, the "importance" of each citation is determined in a PageRank fashion.
– The Impact Factor is the average number of citations received per article in a journal
– The Eigenfactor score of a journal is an estimate of the percentage of time that library users spend with that journal. The Eigenfactor algorithm corresponds to a simple model of research in which readers follow chains of citations as they move from journal to journal.
A similar new use of PageRank is to rank academic doctoral programs based on their records of placing their graduates in faculty positions. In PageRank terms, academic departments link to each other by hiring their faculty from each other (and from themselves).

EF = Eigenfactor; AI = Article Influence over the first five years after publication.
Eigenfactor scores are scaled so that the sum of the Eigenfactor scores of all journals listed in Thomson's Journal Citation Reports (JCR) is 100.
Article Influence scores are normalized so that the mean article in the entire Thomson Journal Citation Reports database has an article influence of 1.00.

None done here!

Summary Issues

Crawling the Web

Web Advertising and Search

CS Technion

Clustering and Topics

Grouping Documents Together
The responses to a search query give you a group of documents.
If we represent documents as points in a space, we can try to identify regions:
– Clustering: Nearby regions of points
– Support Vector Machine: Chop space up into parts
– (Gaussian) Mixture Models: A type of fuzzy clustering
– K-Nearest Neighbors (if you have examples)
Alternatively, we can determine “hidden meaning” with a topic model:
– Latent Semantic Indexing
– Latent Dirichlet Allocation
– With lots of variants of these methods to find “latent factors”
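As one concrete instance of the clustering idea above, here is a minimal k-means sketch. The slides do not name a specific clustering algorithm, so k-means is swapped in here as a common choice; the function name `kmeans`, the two-dimensional toy points, and the hand-picked starting centers are assumptions for illustration (real document vectors would have many more dimensions).

```python
import math


def kmeans(points, centers, iterations=20):
    """Plain k-means: assign each point to its nearest center, then move
    each center to the mean of its assigned points, and repeat."""
    centers = [tuple(c) for c in centers]

    def nearest(p):
        # Index of the center closest to point p (Euclidean distance).
        return min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))

    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p)].append(p)
        # Recompute each center as the coordinate-wise mean of its cluster;
        # an empty cluster keeps its previous center.
        centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[k]
            for k, c in enumerate(clusters)
        ]
    labels = [nearest(p) for p in points]
    return centers, labels


# Toy "documents" as points in a 2-D space (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
centers, labels = kmeans(points, centers=[points[0], points[3]])
```

Each point ends up with a hard label; a (Gaussian) mixture model would instead give every point a probability of belonging to each cluster, which is the "fuzzy" variant the slide mentions.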

Topic Models Illustrated by Google News
These try to group documents by topics such as “Presidential Election” and not by inclusion of particular phrases.
You imagine each document is a set of topics (the latent factors) and each topic is a bag of words.
Find the best set of topics and the best set of words in each topic.
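The picture of a document as a set of topics, each topic being a bag of words, corresponds to the mixture used in probabilistic topic models such as PLSA and LDA: the probability of a word in a document is the sum over topics of P(topic | document) times P(word | topic). The sketch below only evaluates that mixture for fixed numbers; it does not fit a model. The topic names, words, and probabilities are made up for illustration, and `word_distribution` is a hypothetical helper name.

```python
# Hypothetical topics: each is a probability distribution over words.
topics = {
    "election": {"vote": 0.5, "president": 0.4, "game": 0.1},
    "sports": {"vote": 0.1, "president": 0.1, "game": 0.8},
}

# Hypothetical document: a mixture of topics (the latent factors).
doc_topics = {"election": 0.7, "sports": 0.3}


def word_distribution(doc_topics, topics):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    dist = {}
    for topic, weight in doc_topics.items():
        for word, p in topics[topic].items():
            dist[word] = dist.get(word, 0.0) + weight * p
    return dist


dist = word_distribution(doc_topics, topics)
```

Fitting a topic model runs this in reverse: given the observed words in many documents, it searches for the topic mixtures and word distributions that best explain them.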

A Latent Factor Finding Method

An example of DA-PLSA
Top 10 popular words of the AP news dataset for 30 topics, processed by DA-PLSI. Only 5 of the 30 topics are shown.