Presentation on theme: "ENV 20066.1 Envisioning Information Lecture 6 – Document Visualization Ken Brodlie"— Presentation transcript:
ENV 20066.1 Envisioning Information
Lecture 6 – Document Visualization
Ken Brodlie
firstname.lastname@example.org
ENV 20066.2 Document Visualization - Challenges
Large collections of electronic text
  –the Web is prime example!
  –E-mail archives
  –Literature collections
Can we use visualization to help us understand:
  –content of groups of documents?
  –relationships between documents?
Powerful search and retrieval engines
  –return documents based on some sort of keyword search
Can we visualize the results of a query?
ENV 20066.3 Views of Documents – 1D View
Documents can be viewed in different dimensions: 1D, 2D, 3D, multidimensional
Linear text
  –Sees document as 1D string of words
  –Split into tiles of similar text
Visualization idea
  –Tilebars
  –Each document a bar, length proportional to document length
  –Shown as set of tiles, with shading indicating strength of relevance of tile to keywords
Hearst, CHI, 1995
ENV 20066.4 2D Document View
This is how we normally think of documents
  –Structure on page is 2D
  –Zooming interfaces have been developed
  –Early one was PAD++: documents visible at different scales
  –(return to zooming interfaces later)
http://www.cs.umd.edu/hcil/pad++/
ENV 20066.5 3D Document Views
Innovative 3D views have been suggested
WebBook: Card et al, CHI, 1996
ENV 20066.6 Approach
Generally the approach is in three steps:
  –Analyse to capture essential features of document (for Tilebars, relative frequency of words in a segment of text)
  –Use algorithms to generate a viable representation of the documents (1D representation in Tilebars)
  –Create an interactive visual representation (clicking on a tile gives a list of the corresponding text with keywords highlighted)
Analysis → Algorithms → Visualization
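The analysis step for Tilebars can be sketched roughly as follows. This is a minimal illustration, not Hearst's implementation: it assumes fixed-size tiles of words in place of "tiles of similar text", and the names `tile_relevance`, `render_bar`, and the `tile_size` and `shades` parameters are hypothetical.

```python
from collections import Counter


def tile_relevance(text, keywords, tile_size=20):
    """Return, for each tile, the relative frequency of every keyword."""
    # Lowercase the words and strip surrounding punctuation.
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    # Assumption: fixed-size tiles stand in for "tiles of similar text".
    tiles = [words[i:i + tile_size] for i in range(0, len(words), tile_size)]
    scores = []
    for tile in tiles:
        counts = Counter(tile)
        # Relative frequency = keyword occurrences / words in this tile.
        scores.append({k: counts[k.lower()] / len(tile) for k in keywords})
    return scores


def render_bar(scores, keyword, shades=" .:#"):
    """Draw one text 'bar': a darker character means a more relevant tile."""
    if not scores:
        return ""
    # Scale against the strongest tile; fall back to 1.0 if keyword is absent.
    top = max(s[keyword] for s in scores) or 1.0
    return "".join(
        shades[round(s[keyword] / top * (len(shades) - 1))] for s in scores
    )


scores = tile_relevance("cat dog cat bird fish fish", ["cat", "fish"], tile_size=3)
print(render_bar(scores, "cat"))
print(render_bar(scores, "fish"))
```

Each character in the printed bar corresponds to one tile, so a longer document gives a longer bar, as on the Tilebars slide.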
ENV 20066.7 Multidimensional Text
Recent research sees text as multi-dimensional
Document collection scanned for distinguishing words
  –Words distinctive to each document (keywords)
  –Gives a mathematical signature for each document as a high-dimensional vector
  –Similarities between documents can then be calculated, so as to create clusters
  –Clusters are mapped down to a 2D space, with similar clusters close together and dissimilar ones far apart
Galaxy – developed at PNNL, part of IN-SPIRE product
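One way to picture "similarities between documents" is to compare their keyword vectors directly. The sketch below uses cosine similarity, which is an assumption here: the slides do not say which similarity measure Galaxy uses. The tiny binary signatures and the helper name `most_similar` are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two keyword vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no keywords is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(name, signatures):
    """Name of the other document whose signature is closest to `name`."""
    others = [d for d in signatures if d != name]
    return max(
        others,
        key=lambda d: cosine_similarity(signatures[name], signatures[d]),
    )


# Illustrative signatures: 1 means the keyword occurs in the document.
signatures = {
    "doc1": [1, 0, 1],
    "doc2": [1, 1, 0],
    "doc3": [0, 0, 1],
    "doc4": [1, 1, 0],
}

print(most_similar("doc2", signatures))
```

Documents with high pairwise similarity would then be grouped into a cluster before the clusters are laid out in 2D.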
ENV 20066.8 How do we transform from multidimensional to 2D space?
Self-organising feature maps (Kohonen maps)
  –Form of neural network
Inputs are the vectors for each document
Output is a 2D grid whose nodes represent clusters of similar documents, with related clusters placed close together
How does it work?
Multilingual information retrieval documents from database
ENV 20066.9 Self-organising maps – A worked example
Set of 311 documents in a database
40 key words extracted from titles
Matrix of documents vs keywords created
Set up rectangular grid (10 x 14 was used)
Each node gets assigned a reference vector with small random values

        kw1  kw2  kw3
  doc1   1    0    1
  doc2   1    1    0
  doc3   0    0    1
  doc4   1    1    0
ENV 20066.10 Self-organising maps – Worked example
Select a document at random
Find the nearest reference vector in N-dimensional space (ie 40-D here)
Adjust the reference vector to be closer to the document…
…and adjust all its neighbours on the grid also
Iterate (here for 2500 iterations)
Finally map each document to nearest node
[Figure: doc2 (1 1 0) and the reference vector Ref(5,7) at grid node 5,7. Values shown for Ref(5,7): 0.6, 0.1, then 0.9, 0.03]
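The training loop above can be sketched in a few lines. This is a minimal illustration rather than the method behind the worked example: the linearly decaying learning rate, the shrinking square neighbourhood, and the names `train_som`, `nearest_node`, and `map_documents` are all assumptions, since the slides do not specify them.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_node(grid, doc):
    """Grid coordinates of the reference vector closest to the document."""
    return min(grid, key=lambda node: sq_dist(grid[node], doc))


def train_som(docs, rows=10, cols=14, iterations=2500, seed=0):
    rng = random.Random(seed)
    dim = len(docs[0])
    # Each node gets a reference vector with small random values.
    grid = {(r, c): [rng.uniform(0.0, 0.1) for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        remaining = 1.0 - t / iterations
        rate = 0.5 * remaining                    # assumed decay schedule
        radius = max(rows, cols) / 2 * remaining  # assumed shrinking neighbourhood
        doc = rng.choice(docs)                    # select a document at random
        br, bc = nearest_node(grid, doc)          # find nearest reference vector
        for (r, c), ref in grid.items():
            # Adjust the winner and its grid neighbours towards the document.
            if abs(r - br) <= radius and abs(c - bc) <= radius:
                for i in range(dim):
                    ref[i] += rate * (doc[i] - ref[i])
    return grid


def map_documents(grid, docs):
    """Finally, map each document to its nearest node."""
    return [nearest_node(grid, doc) for doc in docs]


docs = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 0]]
grid = train_som(docs, rows=4, cols=5, iterations=500)
print(map_documents(grid, docs))
```

Documents with identical keyword vectors always land on the same node; similar documents tend to land on nearby nodes, which is what produces the clustered concept areas.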
ENV 20066.11 Self-organising map – Worked example
Concept areas are clustered: languages; technologies; tools
ENV 20066.12 Multidimensional Text
The Galaxy View is extended by ThemeView
High peaks indicate large number of documents with strong content similarity
Peaks close together suggest themes which are related
http://in-spire.pnl.gov/
ENV 20066.13 Cartographic approach
Cartographic principles are very relevant to document visualization
Landscapes are very easy for us to recognise (cf faces)
Level of detail well understood by cartographers (cf Google maps)
[Figure: 2200 abstracts, clusters formed, 3 different zoom levels]
Skupin, IEEE CG&A, 2002
ENV 20066.14 Case Study: Visualizing results from a search query
Case study from NIST in US
Suppose search returns a keyword strength
  –ie user enters a number of keywords
  –engine returns list of documents
  –each document has a score for each keyword specified (eg number of occurrences)
  –most relevant document has largest total score
How can we visualize this information?
ENV 20066.15 Document Spiral
Arrange docs in spiral, most relevant at centre
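A spiral layout of this kind could be computed as below. This is only a sketch: it assumes an Archimedean spiral and ranks documents by total keyword score, and the names `rank_by_total_score`, `spiral_layout`, `spacing`, and `step` are hypothetical. The NIST display may place documents differently.

```python
import math


def rank_by_total_score(scores):
    """Order documents so the largest total keyword score comes first."""
    return sorted(scores, key=lambda doc: sum(scores[doc]), reverse=True)


def spiral_layout(scores, spacing=1.0, step=0.8):
    """Return {doc: (x, y)} with the most relevant document at the centre."""
    positions = {}
    for rank, doc in enumerate(rank_by_total_score(scores)):
        angle = step * rank                       # radians along the spiral
        radius = spacing * angle / (2 * math.pi)  # Archimedean: r grows with angle
        positions[doc] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


# Illustrative per-keyword scores (eg number of occurrences of each keyword).
scores = {"a": [1, 2], "b": [5, 5], "c": [0, 1]}
for doc, (x, y) in spiral_layout(scores).items():
    print(doc, round(x, 3), round(y, 3))
```

Distance from the centre grows with rank, so relevance can be read off from how far out a document sits.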
ENV 20066.16 Document Three-Keyword Axes Display
One keyword per axis
Plot docs in a scatter plot using keyword strengths to position along axes
Same keyword on all axes lines docs up on X=Y=Z line
ENV 20066.17 Nearest Neighbour Sequence
Choose one doc and place on circle
Find the closest in keyword strength space and place adjacent to it
... and so on
http://zing.ncsl.nist.gov/~cugini/uicd/viz.html
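The sequence can be built greedily, as in the sketch below. It assumes Euclidean distance in keyword strength space and that "closest" means closest to the most recently placed document; neither is stated on the slide. The function names and sample scores are illustrative.

```python
import math


def nearest_neighbour_sequence(scores, start):
    """Greedy ordering: each doc is followed by its closest unplaced doc."""
    remaining = dict(scores)
    order = [start]
    del remaining[start]
    current = start
    while remaining:
        here = scores[current]
        # Closest remaining document to the one placed last.
        current = min(remaining, key=lambda d: math.dist(scores[d], here))
        order.append(current)
        del remaining[current]
    return order


def circle_positions(order, radius=1.0):
    """Place the ordered documents at equal angles around a circle."""
    n = len(order)
    return {
        doc: (radius * math.cos(2 * math.pi * i / n),
              radius * math.sin(2 * math.pi * i / n))
        for i, doc in enumerate(order)
    }


# Illustrative keyword strengths for four documents.
scores = {"a": [0, 0], "b": [10, 10], "c": [1, 0], "d": [9, 10]}
order = nearest_neighbour_sequence(scores, "a")
print(order)
print(circle_positions(order))
```

Because each step only looks at the last document placed, the first and last documents on the circle can end up adjacent without being similar.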
ENV 20066.18 Visualizing Web Searches
www.kartoo.co.uk