IAT 814 1 1 Text ______________________________________________________________________________________ SCHOOL OF INTERACTIVE ARTS + TECHNOLOGY [SIAT]

Slides:



Advertisements
Similar presentations
Yansong Feng and Mirella Lapata
Advertisements

Chapter 5: Introduction to Information Retrieval
INFO624 - Week 2 Models of Information Retrieval Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.
8.1 Types of Data Displays Remember to Silence Your Cell Phone and Put It In Your Bag!
Data Mining Techniques: Clustering
CS4670 / 5670: Computer Vision Bag-of-words models Noah Snavely Object
Introduction Information Management systems are designed to retrieve information efficiently. Such systems typically provide an interface in which users.
Information Retrieval Review
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Building an Intelligent Web: Theory and Practice Pawan Lingras Saint Mary’s University Rajendra Akerkar American University of Armenia and SIBER, India.
Chapter 5: Query Operations Baeza-Yates, 1999 Modern Information Retrieval.
1 CS 502: Computing Methods for Digital Libraries Lecture 12 Information Retrieval II.
Information Retrieval Ch Information retrieval Goal: Finding documents Search engines on the world wide web IR system characters Document collection.
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
The Vector Space Model …and applications in Information Retrieval.
14.1 Vis_04 Data Visualization Lecture 14 Information Visualization : Part 2.
Information retrieval: overview. Information Retrieval and Text Processing Huge literature dating back to the 1950’s! SIGIR/TREC - home for much of this.
Recommender systems Ram Akella November 26 th 2008.
Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence Mike Thelwall Professor of Information Science University of Wolverhampton.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Pixel Visualization of keyword search results in large databases. Jay Koven Fall 2013.
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Advanced Multimedia Text Classification Tamara Berg.
Utilising software to enhance your research Eamonn Hynes 5 th November, 2012.
Navigating and Browsing 3D Models in 3DLIB Hesham Anan, Kurt Maly, Mohammad Zubair Computer Science Dept. Old Dominion University, Norfolk, VA, (anan,
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
IAT Text ______________________________________________________________________________________ SCHOOL OF INTERACTIVE ARTS + TECHNOLOGY [SIAT]
Lecture 12: Network Visualization Slides are modified from Lada Adamic, Adam Perer, Ben Shneiderman, and Aleks Aris.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Organizing Your Information
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Glasgow 02/02/04 NN k networks for content-based image retrieval Daniel Heesch.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Apache Mahout. Mahout Introduction Machine Learning Clustering K-means Canopy Clustering Fuzzy K-Means Conclusion.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.
Chapter 6: Information Retrieval and Web Search
1 Computing Relevance, Similarity: The Vector Space Model.
Dataware’s Document Clustering and Query-By-Example Toolkits John Munson Dataware Technologies 1999 BRS User Group Conference.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Document Collections cs5984: Information Visualization Chris North.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Semantic Wordfication of Document Collections Presenter: Yingyu Wu.
Chapter 23: Probabilistic Language Models April 13, 2004.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Externally growing self-organizing maps and its application to database visualization and exploration.
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
Information Retrieval
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
CIS 530 Lecture 2 From frequency to meaning: vector space models of semantics.
Citation-Based Retrieval for Scholarly Publications 指導教授:郭建明 學生:蘇文正 M
Displaying Data  Data: Categorical and Numerical  Dot Plots  Stem and Leaf Plots  Back-to-Back Stem and Leaf Plots  Grouped Frequency Tables  Histograms.
1 CS 430: Information Discovery Lecture 5 Ranking.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Given a set of data points as input Randomly assign each point to one of the k clusters Repeat until convergence – Calculate model of each of the k clusters.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Mining the Data Charu C. Aggarwal, ChengXiang Zhai
Graphical Techniques.
Graphical Techniques.
Representation of documents and queries
Visualizing Document Collections
cs5984: Information Visualization Chris North
2016 REPORT.
cs5984: Information Visualization Chris North
2016 REPORT.
Presentation transcript:

IAT Text ______________________________________________________________________________________ SCHOOL OF INTERACTIVE ARTS + TECHNOLOGY [SIAT] |

Nov 13, 2013 IAT Text is Everywhere We use documents as primary information artifact in our lives Our access to documents has grown tremendously in recent years due to networking infrastructure –WWW –Digital libraries –...

Nov 13, 2013 IAT How Can InfoVis Help? Example Specific Tasks Which documents contain text on topic XYZ? Which documents are of interest to me? Are there other documents that might be close enough to be worthwhile? What are the main themes of a document? How are certain words or themes distributed through a document? IAT 814 3

Nov 13, 2013 IAT Related Fields Information Retrieval –Active search process that brings back particular entities IAT 814 4

Nov 13, 2013 IAT Challenge Text is nominal data –Does not seem to map to geometric presentation as easily as ordinal and quantitative data The “Raw data --> Data Table” mapping now becomes more important IAT 814 5

“Raw” Text Visualization: TextArc Nov 13, 2013 IAT 814 6

Text Arc Sentences around periphery Words inner ring and center –More highly-used words nearer center Nov 13, 2013 IAT 814 7

Nov 13, 2013 IAT Document Collections How to present document themes or contents without reading docs? Who cares? –Researchers –News people –CSIS –Market researchers IAT 814 8

Problems Want to analyze meanings Raw words themselves are not computable Also, some words are unimportant So: –Analyze documents by word usage –Compare documents by similar word usage Nov 13, 2013 IAT 814 9

Nov 13, 2013 IAT Vector Space Analysis How does one compare the similarity of two documents? One model –Make list of each unique word in document Throw out common words (a, an, the, …) Make different forms the same (bake, bakes, baked) –Store count of how many times each word appeared –Alphabetize, make into a vector IAT

Nov 13, 2013 IAT Vectors, Inner Products A The quick brown fox jumped over the lazy dog B The fox found his way into the henhouse C The fox and the henhouse are both words Vector A Vector B = 1 VectorB VectorC = 2 Thus B and C are most similar IAT quickbrownfoxjumplazydogfindhiswayhenhousebothwordSUM A B A.B 1 1 C B.C 1 1 2

Nov 13, 2013 IAT Vector Space Analysis Model (continued) –Want to see how closely two vectors go in same direction, inner product –Can get similarity of each document to every other one –Use a mass-spring layout algorithm to position representations of each document Similar to how search engines work IAT

Nov 13, 2013 IAT Some adjustments Not all terms or words are equally useful Often apply TF/IDF –Term Frequency / Inverse Document Frequency Weight of a word goes up if it appears often in a document, but not often in the collection IAT

Nov 13, 2013 IAT Process IAT

Nov 13, 2013 IAT Smart System Uses vector space model for documents –May break document into chapters and sections and deal with those as atoms Plot document atoms on circumference of circle Draw line between items if their similarity exceeds some threshold value »Salton et al Science ‘95 IAT

Nov 13, 2013 IAT Nov 20, Fall 2007 IAT

Nov 13, 2013 IAT Text Relation Maps Label on line can indicate similarity value Items spaced by length of section IAT

Nov 13, 2013 IAT Text Themes Look for sets of regions in a document (or sets of documents) that all have common theme –Closely related to each other, but different from rest Need to run clustering process IAT

Nov 13, 2013 IAT VIBE System Smaller sets of documents than whole library Example: Set of 100 documents retrieved from a web search Idea is to understand contents of documents relate to each other »Olsen et al Info Process & Mgmt ‘93 IAT

Nov 13, 2013 IAT Focus Points of Interest –Terms or keywords that are of interest to user Example: cooking, pies, apples Want to visualize a document collection where each document’s relation to points of interest is shown Also visualize how documents are similar or different IAT

Nov 13, 2013 IAT Technique Represent points of interest as vertices on convex polygon Documents are small points inside the polygon How close a point is to a vertex represents how strong that term is within the document IAT Term1 Term2Term3

Nov 13, 2013 IAT Example Visualization IAT laser plasma fusion

Nov 13, 2013 IAT VIBE Pro’s and Con’s Effectively communicates relationships Straightforward methodology and vis are easy to follow Can show relatively large collections Not showing much about a document Single items lose “detail” in the presentation Starts to break down with large number of terms (eg. 8 terms: octagon) IAT

InSpire Clusters docs, then reports the most common TF/IDF words Presents docs in “Galaxy” display –Projects from high-dimensional space to 2D Nov 13, 2013 IAT

InSpire Nov 13, 2013 IAT

InSpire Nov 13, 2013 IAT

InSpire Clusters Documents by word vectors K-means Clustering method: 1)Select K random docs (cluster centers) 2)For Each remaining document: Assign it to the closest of the above K docs (Creates K clusters) 3)For each cluster, compute “average” cluster center Repeat 2 and 3 until every doc stops moving from cluster to cluster Nov 13, 2013 IAT

K-means Thanks, Wikipedia! Nov 13, 2013 IAT

Entities Entities are words that have classes of meanings –Places, names, people, times, money, How? –Standards of Grammar help recognize nouns –Nouns are places into categories according to their presence in training texts Nov 13, 2013 IAT

Entities Named Entity Recognition is the act of identifying entities Brings more meaning, and enables richer queries than treating all words equally Eg “show me all the people named John” Nov 13, 2013 IAT

Nov 13, 2013 IAT Thanks: John Stasko IAT