Web Intelligence, Text Mining, and Web-related Applications

WEB-SOM: a self-organizing map (SOM) algorithm applied to over one million newsgroup posts. See http://websom.hut.fi/websom/milliondemo/html/root.html and play around with it.

Finding similar literature. Two different web documents X and Y might be closely related. If they are, then:
- a user interested in X will probably also be interested in Y
- if X is highly ranked in a search, Y should also be made prominently available to the searcher
- if a user is specifically trying to find documents similar to X, then Y is one of them.
But the problem is: X might turn up in a search, but not Y. There may be no links between X and Y – they may be in widely separated components of the web graph.

Another way of looking at it. Suppose you search on the keyword pasta. Google may retrieve 1,000,000 documents. How can you (or, hopefully, an automated system) usefully organise these documents? If the documents were automatically clustered, so that similar documents were grouped together in the same cluster, we could impose useful organisation: one cluster might be documents about the history of pasta, another might be mainly recipes, and so on. So it will be very useful to have some way of working out the similarity between documents – then we can cluster them, as in the sketch below.
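As a hedged illustration of the idea (not part of the original slides – the documents and the choice of two clusters are invented), scikit-learn can cluster documents by vector similarity in a few lines:

```python
# A minimal sketch: cluster a handful of toy "pasta" documents by similarity.
# Assumes scikit-learn is installed; documents and k=2 are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "pasta was first made in Sicily centuries ago",     # history-flavoured
    "the history of pasta in medieval Italy",           # history-flavoured
    "boil the pasta and stir in the tomato sauce",      # recipe-flavoured
    "a quick recipe: pasta with garlic and olive oil",  # recipe-flavoured
]

# Turn each document into a TFIDF vector (the encoding is explained below).
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Group the vectors into two clusters; similar documents land in the same cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1] – history docs in one cluster, recipes in the other
```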

Applications/motivations for document similarity:
- Recommendations. Many search engines and other sites try to help you manage your bookmarks/favourites; as part of this they offer recommendations, i.e. "if you like that, you might also like these …". On Amazon, or any general product-sales site, this can be based on distances between (e.g.) 200-word summaries or the table of contents of a book, or the text that describes a product in a catalogue.
- Research (scientific, scholarly, for literature review, for market research).
- Mapping for browsing purposes – a 2D visualisation of the web, or a subset of it, where each page is a (clickable) point, and the distance between points is related to document similarity.

But a document is a “bag of words” – to work out distances, we need numbers

How did I get these vectors from these two 'documents'?

Document 1: <h1> Compilers: lecture 1 </h1> <p> This lecture will introduce the concept of lexical analysis, in which the source code is scanned to reveal the basic tokens it contains. For this, we will need the concept of regular expressions (r.e.s).</p>
Vector: (35, 2, 0)

Document 2: <h1> Compilers</h1> <p> The Guardian uses several compilers for its daily cryptic crosswords. One of the most frequently used is Araucaria, and one of the most difficult is Bunthorne.</p>
Vector: (26, 2, 2)

What about these two vectors, for the same two documents?

Document 1 (the compilers lecture): (0, 0, 0, 1, 1, 1)
Document 2 (the Guardian crosswords text): (1, 1, 1, 0, 0, 0)

An unfair question, but I got those by using the following term list: (crossword, cryptic, difficult, expression, lexical, token). If a document contains the word 'crossword', it gets a 1 in position 1 of the vector, otherwise 0. If it contains 'lexical', it gets a 1 in position 5, otherwise 0, and so on. How similar would the vectors be for two pages about crossword compilers? The key to measuring document similarity is turning documents into vectors based on specific terms and their frequencies. The sketch below shows this binary encoding in code.
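Here is a minimal sketch of that binary encoding (the tokeniser, and the crude prefix match standing in for stemming, are my simplifications, not from the slides):

```python
# A minimal sketch of the binary "does the document contain term i?" encoding.
# The six terms are from the slide; a crude prefix match stands in for proper
# stemming, so "expressions" matches "expression" and "tokens" matches "token".
import re

TERMS = ["crossword", "cryptic", "difficult", "expression", "lexical", "token"]

def binary_vector(text: str) -> list[int]:
    words = re.findall(r"[a-z]+", text.lower())
    return [1 if any(w.startswith(t) for w in words) else 0 for t in TERMS]

lecture = ("This lecture will introduce the concept of lexical analysis, in which "
           "the source code is scanned to reveal the basic tokens it contains. "
           "For this, we will need the concept of regular expressions.")
crosswords = ("The Guardian uses several compilers for its daily cryptic crosswords. "
              "One of the most frequently used is Araucaria, and one of the most "
              "difficult is Bunthorne.")

print(binary_vector(lecture))     # [0, 0, 0, 1, 1, 1]
print(binary_vector(crosswords))  # [1, 1, 1, 0, 0, 0]
```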

Turning a document into a vector. We start with a template for the vector, which needs a master list of terms. A term can be a word, a number, or anything else that appears frequently in documents. There are almost 200,000 words in English – it would take much too long to process document vectors of that length. Commonly, vectors are built from a small number (50–1000) of the most frequently occurring words. However, the master list usually does not include words from a stoplist, which contains words such as the, and, there, which, etc. Why?
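A minimal sketch of building such a master list (the stoplist contents, corpus, and helper name are illustrative stand-ins, not from the slides):

```python
# A minimal sketch: build a master term list from the most frequently occurring
# words in a corpus, skipping stoplist words. Stoplist and corpus are stand-ins.
import re
from collections import Counter

STOPLIST = {"the", "and", "there", "which", "of", "a", "in", "is", "to", "it"}

def master_list(corpus: list[str], size: int = 50) -> list[str]:
    counts = Counter()
    for doc in corpus:
        counts.update(w for w in re.findall(r"[a-z]+", doc.lower())
                      if w not in STOPLIST)
    # Keep the `size` most frequently occurring non-stoplist words.
    return [word for word, _ in counts.most_common(size)]
```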

The TFIDF encoding (Term Frequency × Inverse Document Frequency). A term is a word, or some other frequently occurring item. Given some term i and a document j, the term count n(i,j) is the number of times that term i occurs in document j. Given a collection of k terms and a set D of documents, the term frequency is:

tf(i,j) = n(i,j) / ( n(1,j) + n(2,j) + … + n(k,j) )

Considering only the terms of interest, this is the proportion of document j that is made up of term i.

Term frequency is a measure of the importance of term i in document j. Inverse document frequency (which we see next) is a measure of the general importance of the term across the collection. That is: a high term frequency for "apple" means that apple is an important word in a specific document. But a high document frequency (and hence a low inverse document frequency) for "apple", given a particular set of documents, means that apple is not all that important overall, since it appears in all of the documents.

The inverse document frequency of term i is:

idf(i) = log( N / n_i )

where N is the number of documents in the master collection, and n_i is the number of those documents that contain the term.

TFIDF encoding of a document. So, given:
- a background collection of documents (e.g. 100,000 random web pages, all the articles we can find about cancer, or 100 student essays submitted as coursework …)
- a specific ordered list (possibly large) of terms
we can encode any document as a vector of TFIDF numbers, where the ith entry in the vector for document j is:

tfidf(i,j) = tf(i,j) × idf(i)
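Putting tf and idf together, here is a minimal encoder sketch (the function names are mine; log base 2 is chosen to match the worked banana example two slides below):

```python
# A minimal TFIDF encoder sketch: tfidf(i, j) = tf(i, j) * idf(i).
# Helper names are illustrative; log base 2 matches the worked example below.
import math
import re

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vector(doc: str, terms: list[str], collection: list[str]) -> list[float]:
    words = tokens(doc)
    n_docs = len(collection)
    vector = []
    for term in terms:
        tf = words.count(term) / len(words)                      # proportion of doc j
        containing = sum(term in tokens(d) for d in collection)  # document frequency
        idf = math.log2(n_docs / containing) if containing else 0.0
        vector.append(tf * idf)
    return vector
```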

Turning a document into a vector. Suppose our master list is: (banana, cat, dog, fish, read). Suppose document 1 contains only: "Bananas are grown in hot countries, and cats like bananas." And suppose the background frequencies of these words in a large random collection of documents are (0.2, 0.1, 0.05, 0.05, 0.2). The document 1 vector entry for word w is:

entry(w) = freqindoc(w) × log2( 1 / freq_in_bg(w) )

This is just a rephrasing of TFIDF, where freqindoc(w) is the frequency of w in document 1, and freq_in_bg(w) is the 'background' frequency of w in our reference set of documents. (Base-2 logs give the numbers on the next slide.)

Turning a document into a vector. Master list: (banana, cat, dog, fish, read). Background frequencies: (0.2, 0.1, 0.05, 0.05, 0.2). Document 1: "Bananas are grown in hot countries, and cats like bananas." Frequencies are proportions: the background frequency of banana is 0.2, meaning that 20% of documents in general contain 'banana', or 'bananas', etc. (note that read likewise covers reads, reading, reader, etc.). The frequency of banana in document 1 is also 0.2 – why? (It occurs twice in a ten-word document.) The TFIDF encoding of this document is:

(0.464, 0.332, 0, 0, 0)

since the banana entry is 0.2 × log2(1/0.2) ≈ 0.464 and the cat entry is 0.1 × log2(1/0.1) ≈ 0.332. Suppose another document has exactly the same vector – will it be the same document?
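A quick check of those numbers, using only the standard library (the frequencies are the slide's own):

```python
# Verify the slide's TFIDF values: entry(w) = freqindoc(w) * log2(1 / freq_in_bg(w)).
import math

master      = ["banana", "cat", "dog", "fish", "read"]
background  = [0.2, 0.1, 0.05, 0.05, 0.2]
freq_in_doc = [0.2, 0.1, 0.0, 0.0, 0.0]  # "Bananas ... cats like bananas." (10 words)

vector = [round(f * math.log2(1 / bg), 3) if f else 0.0
          for f, bg in zip(freq_in_doc, background)]
print(vector)  # [0.464, 0.332, 0.0, 0.0, 0.0]
```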

Vector representation of documents underpins many areas of automated document analysis, such as:
- automated classification of documents
- clustering and organising document collections
- building maps of the web, and of different web communities
- understanding the interactions between different scientific communities, which in turn will help with automated WWW-based scientific discovery.
All of these rely on some measure of distance (or similarity) between document vectors; one standard choice is sketched below.
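The slides do not fix a particular measure; as a hedged example, cosine similarity is a standard way to compare two TFIDF vectors:

```python
# Cosine similarity between two document vectors: 1.0 means identical direction,
# 0.0 means the documents share no weighted terms.
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine_similarity([0.464, 0.332, 0, 0, 0], [0.464, 0.332, 0, 0, 0]))  # 1.0
```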

What can you say about the TFIDF value for the word "and"? What about the word "cancer"? And what is the TFIDF value of "cancer" when the background collection of documents is a collection of abstracts from a cancer journal?

Stoplists and stemming. Stoplists – we mentioned these already: a stoplist is a list of words that we should ignore when processing documents, since they give no useful information about content. Examples of such words? Stemming – the process of treating a set of words like "fights, fighting, fighter, …" as all instances of the same term; in this case the stem is "fight". Why is this useful? A toy implementation of both is sketched below.
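As a hedged illustration (the slides don't prescribe an implementation), here is a toy stoplist filter and suffix-stripping stemmer; real systems typically use something like the Porter stemmer, which handles many more cases:

```python
# A toy suffix-stripping stemmer and stoplist filter (illustrative only).
STOPLIST = {"the", "and", "there", "which", "a", "of", "in", "is"}
SUFFIXES = ["ing", "ers", "er", "s"]  # checked longest-first

def stem(word: str) -> str:
    # Strip the first matching suffix, keeping at least a 3-letter stem.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(words: list[str]) -> list[str]:
    # Drop stoplist words, then map each remaining word to its stem.
    return [stem(w) for w in words if w not in STOPLIST]

print(preprocess(["the", "fighter", "fights", "fighting", "and", "wins"]))
# ['fight', 'fight', 'fight', 'win']
```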

Examinable reading: the Sinka/Corne paper on my teaching site. I want you to be able to talk clearly about the findings (e.g. how the quality of clustering was affected by whether or not stemming was used).