Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization Jaegul Choo*, Barry L. Drake †, and Haesun Park*


1 Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization. Jaegul Choo*, Barry L. Drake†, and Haesun Park*. *Georgia Institute of Technology, †Georgia Tech Research Institute. Big Data Innovators Gathering (BIG) 2014

2 What is Visual Analytics? Data Mining: automated; clearly defined tasks; fast computation; >millions of data items. Visualization: interactive (human in the loop); exploratory analysis; deeper understanding; thousands of data items.

3 What is Visual Analytics? Leveraging Both Worlds. Data Mining (automated; clearly defined tasks; fast computation; >millions of data items) + Visualization (interactive, human in the loop; exploratory analysis; deeper understanding; thousands of data items) = Visual Analytics.

4 Visual Analytics for Large-Scale Documents. UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation). VisIRR: Information Retrieval and Personalized Recommender System.

5 Motivation: Too Many Documents to Read. Product reviews: which tablet to buy? iPad (2,000 reviews) vs. Galaxy Tab (1,300 reviews). Research papers: which sub-area in data mining to focus on? >Thousands of new papers every year. Patent search. Many other applications.

6 Topic Modeling: Summarizing Documents. Keywords: gene, dna, life, evolve, organism, brain, neuron, nerve, … Documents: Document 1, Document 2, Document 3, Document 4, …

7 Topic Modeling: Summarizing Documents. Topic: distribution over keywords. Keywords: gene, dna, life, evolve, organism, brain, neuron, nerve, … Documents: Document 1, Document 2, Document 3, Document 4, … Topics: Topic 1, Topic 2, Topic 3.

8 Topic Modeling: Summarizing Documents. Topic: distribution over keywords. Document: distribution over topics. Keywords: gene, dna, life, evolve, organism, brain, neuron, nerve, … Documents: Document 1, Document 2, Document 3, Document 4, … Topics: Topic 1, Topic 2, Topic 3.

9 Nonnegative Matrix Factorization (NMF). Low-rank approximation via matrix factorization: $A \approx WH$, obtained from $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$. Why nonnegativity constraints? Better interpretation (vs. better approximation, e.g., SVD).
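
The factorization on this slide can be tried out with a few lines of NumPy. The sketch below is only an illustration: it uses Lee-Seung multiplicative updates, which are a different and simpler solver than the block-coordinate descent framework mentioned later in the talk, and the function name, toy matrix, and iteration count are all assumptions made for the example.

```python
import numpy as np


def nmf_multiplicative(A, k, n_iters=200, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix A (m x n) as W @ H with W, H >= 0.

    Illustrative solver (Lee-Seung multiplicative updates for the
    Frobenius-norm objective); not the algorithm used in the talk.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iters):
        # Update H with W fixed; multiplying by a nonnegative ratio
        # keeps every entry of H nonnegative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        # Update W with H fixed, in the same way.
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical toy data: a 6 x 4 nonnegative matrix of exact rank 2.
W0 = np.array([[3.0, 0.0], [2.0, 0.0], [1.0, 0.0],
               [0.0, 2.0], [0.0, 3.0], [0.0, 1.0]])
H0 = np.array([[1.0, 2.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 2.0]])
A = W0 @ H0

W, H = nmf_multiplicative(A, k=2)
print("relative error:", np.linalg.norm(A - W @ H) / np.linalg.norm(A))
```

Both factors stay nonnegative throughout, which is the property the slide credits for easier interpretation compared with an SVD.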

10 NMF as Topic Modeling. $A \approx WH$. Topic: distribution over keywords. Document: distribution over topics. Keywords: gene, dna, life, evolve, organism, brain, neuron, nerve, … Documents: Document 1, Document 2, Document 3, Document 4, … Topics: Topic 1, Topic 2, Topic 3.
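
To make the reading of the two factors concrete, here is a small sketch. It assumes that A is a keyword-by-document matrix, so that each column of W is a topic and each column of H describes one document; the slide does not spell out this orientation. The helper names and all numeric values are made up for illustration.

```python
def top_keywords(W, vocab, n_top=2):
    """Return the n_top highest-weighted keywords for each topic (column of W)."""
    n_topics = len(W[0])
    topics = []
    for t in range(n_topics):
        ranked = sorted(range(len(vocab)), key=lambda i: W[i][t], reverse=True)
        topics.append([vocab[i] for i in ranked[:n_top]])
    return topics


def topic_distribution(H, doc):
    """Normalize column `doc` of H so that it sums to 1."""
    col = [row[doc] for row in H]
    total = sum(col)
    return [v / total for v in col] if total > 0 else col


vocab = ["gene", "dna", "life", "evolve", "organism", "brain", "neuron", "nerve"]

# Hypothetical keyword-by-topic weights (8 keywords x 3 topics).
W = [
    [0.9, 0.1, 0.0],  # gene
    [0.8, 0.0, 0.0],  # dna
    [0.1, 0.7, 0.0],  # life
    [0.0, 0.9, 0.0],  # evolve
    [0.0, 0.6, 0.1],  # organism
    [0.0, 0.0, 0.9],  # brain
    [0.0, 0.0, 0.8],  # neuron
    [0.0, 0.1, 0.7],  # nerve
]

# Hypothetical topic-by-document weights (3 topics x 4 documents).
H = [
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 4.0],
]

print(top_keywords(W, vocab))    # top keywords per topic
print(topic_distribution(H, 1))  # topic mixture of the second document
```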

11 Why NMF (instead of LDA)? Consistency from Multiple Runs. Documents’ topical membership changes among 10 runs. [Figure panels: InfoVis/VAST paper data set; 20 newsgroup data set]

12 Why NMF (instead of LDA)? Empirical Convergence. Documents’ topical membership changes between iterations. LDA: 10 minutes; NMF: 48 seconds (InfoVis/VAST paper data set).
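
The slide does not say how a document's topical membership is determined. The sketch below assumes the simplest choice, assigning each document to its highest-weighted topic, and counts how many documents switch topics between two successive iterates. The function names and the two small matrices are hypothetical.

```python
def hard_membership(H):
    """Assign each document (column of H) to its highest-weighted topic."""
    n_docs = len(H[0])
    return [max(range(len(H)), key=lambda t: H[t][d]) for d in range(n_docs)]


def membership_changes(H_prev, H_next):
    """Count documents whose assigned topic differs between two iterates."""
    before = hard_membership(H_prev)
    after = hard_membership(H_next)
    return sum(a != b for a, b in zip(before, after))


# Hypothetical topic-by-document weights from two successive iterations.
H_prev = [[0.9, 0.2, 0.5],
          [0.1, 0.8, 0.4]]
H_next = [[0.8, 0.3, 0.3],
          [0.2, 0.7, 0.6]]

print(membership_changes(H_prev, H_next))  # only the third document switches
```

A count that drops to zero and stays there is one simple signal that the topic assignments have stopped changing.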

13 NMF vs. LDA Topic Summary (Top Keywords), InfoVis/VAST paper data set
NMF run #1: Topic 1: visualization, design; Topic 2: information, user; Topic 3: analysis, system; Topic 4: graph, layout; Topic 5: visual, analytics; Topic 6: data, sets; Topic 7: color, weaving
NMF run #2: Topic 1: visualization, design; Topic 2: information, user; Topic 3: analysis, system; Topic 4: graph, layout; Topic 5: visual, analytics; Topic 6: data, sets; Topic 7: color, weaving
LDA run #1: Topic 1: document, similarities; Topic 2: knowledge, edge; Topic 3: query, collaborative; Topic 4: social, tree; Topic 5: measures, multivariate; Topic 6: tree, animation; Topic 7: dimension, treemap
LDA run #2: Topic 1: document, query; Topic 2: analysts, scatterplot; Topic 3: spatial, collaborative; Topic 4: text, document; Topic 5: multidimensional, high; Topic 6: tree, aggregation; Topic 7: dimension, treemap
Topics are more consistent in NMF than in LDA. Topic quality is comparable between NMF and LDA.

14 UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF [Choo et al., TVCG’13]. Topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation.

15 Visualization Example: Car Reviews Topic summaries are NOT perfect. UTOPIAN allows user interactions for improving them.

16 Weakly Supervised NMF: Supporting User Interactions. Weakly supervised NMF [Choo et al., DMKD, accepted with rev.]: $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$. $W_r$, $H_r$: reference matrices for $W$ and $H$ (user input). $M_W$, $M_H$: diagonal matrices for weighting/masking columns and rows of $W$ and $H$. Algorithm: block-coordinate descent framework.
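
As a rough aid for reading the objective, the sketch below only evaluates it for given factors; it does not implement the block-coordinate descent solver. The slide does not define $D_H$, so the code assumes it is a diagonal scaling matrix applied to the rows of $H_r$. The function name, the argument layout (diagonals passed as 1-D arrays), and the toy values are assumptions.

```python
import numpy as np


def ws_nmf_objective(A, W, H, W_r, H_r, m_w, m_h, d_h, alpha, beta):
    """Evaluate the weakly supervised NMF objective for given W and H.

    m_w, m_h and d_h hold the diagonals of M_W, M_H and D_H as 1-D arrays.
    D_H is treated as a diagonal row scaling of H_r (an assumption).
    """
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    # (W - W_r) M_W: right-multiplying by a diagonal scales columns.
    w_pen = np.linalg.norm((W - W_r) * m_w[np.newaxis, :], "fro") ** 2
    # M_H (H - D_H H_r): left-multiplying by a diagonal scales rows.
    diff_h = H - d_h[:, np.newaxis] * H_r
    h_pen = np.linalg.norm(m_h[:, np.newaxis] * diff_h, "fro") ** 2
    return fit + alpha * w_pen + beta * h_pen


# Hypothetical toy factors with an exact fit, so only the penalties matter.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
H = np.array([[1.0, 2.0], [3.0, 4.0]])
A = W @ H

W_r = W.copy()
W_r[0, 0] += 1.0  # reference differs in topic 0 (weighted by M_W)
W_r[0, 1] += 5.0  # reference differs in topic 1 (masked out by M_W)
H_r = H.copy()

m_w = np.array([1.0, 0.0])  # only topic 0 of W is supervised
m_h = np.array([1.0, 1.0])
d_h = np.array([1.0, 1.0])

obj = ws_nmf_objective(A, W, H, W_r, H_r, m_w, m_h, d_h, alpha=2.0, beta=1.0)
print(obj)
```

Setting a diagonal entry of `m_w` to zero removes that topic from the penalty, which is how masking leaves the unsupervised topics free to change.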

17 Interaction Demo Video (InfoVis-VAST paper data): http://tinyurl.com/UTOPIAN2013. Panels: before interaction; after topic splitting (triangle) and topic merging (circle).

18 VisIRR: Information Retrieval and Personalized Recommender System

19 Features: Efficient Large-scale Data Processing. Document corpus: ~400,000 academic papers in CS. Data management: structured data (author, year, venue, keywords, citation/reference count); unstructured data (bag-of-words vectors of title, abstract, keywords); graph data (content, citation, and co-authorship). Efficient data handling: dynamic loading from disk to memory via a cache-like strategy; scalable data expansion in O(n).

20 Features: Personalized Recommendation. Works based on user preferences on documents, on a scale of 1 (highly dislike) to 5 (highly like). Various recommendation schemes: based on content, citation network, and co-authorship. Algorithm: preference propagation on a graph using a heat kernel, $r_\alpha = \alpha \sum_k (1-\alpha)^k f W^k$, where $r_\alpha$ is a recommendation score vector with a control parameter $\alpha$, $f$ is a user-assigned rating, and $W$ is an input graph.
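
The propagation formula can be approximated by truncating the series. The sketch below makes several assumptions that the slide does not state: the sum starts at k = 0, it is cut off after `n_terms` terms, f is a row vector with one rating per document, and W is a graph matrix scaled so that the series converges (row-normalized, for example). The function name and the two-node graph are hypothetical.

```python
def propagate_preferences(f, W, alpha, n_terms=60):
    """Truncated series r = alpha * sum_{k=0}^{n_terms-1} (1 - alpha)^k * f W^k.

    f is a list of ratings, W a square matrix given as a list of rows.
    """
    n = len(f)
    term = [float(v) for v in f]  # holds f W^k, starting at k = 0
    coeff = alpha                 # holds alpha * (1 - alpha)^k
    r = [0.0] * n
    for _ in range(n_terms):
        r = [ri + coeff * ti for ri, ti in zip(r, term)]
        # Advance to f W^(k+1): row vector times matrix.
        term = [sum(term[i] * W[i][j] for i in range(n)) for j in range(n)]
        coeff *= 1.0 - alpha
    return r


# Hypothetical graph: two documents linked to each other.
W = [[0.0, 1.0],
     [1.0, 0.0]]
f = [1.0, 0.0]  # the user rated only the first document

scores = propagate_preferences(f, W, alpha=0.5)
print(scores)  # the rated document keeps the larger share of the score
```

With α close to 1 the scores stay near the original ratings; smaller values of α let preferences spread further along the graph.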

21 VisIRR Demo: Citation-based Recommendation (http://tinyurl.com/VisIRR). Preference-assigned item as ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’. Most of the recommended items are highly cited. Computational zoom-in shows sub-areas relevant to the article.

22 VisIRR Demo: Co-authorship-based Recommendation (http://tinyurl.com/VisIRR). Preference-assigned item as ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’. It shows other areas of the authors of this paper. Panels: retrieved + recommended items; computational zoom-in on recommended items.

23 Interested in learning about Micro-Financing Analysis in Kiva.org? Check out my presentation in Room 104, Wed 4pm.

24 Thank you! Jaegul Choo, jaegul.choo@cc.gatech.edu (Currently on the Academic Job Market). Selected Papers: Choo et al., Document Topic Modeling and Discovery in Visual Analytics via Nonnegative Matrix Factorization, TVCG, 2013. Choo et al., VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-scale Document Data, Tech Report, Georgia Tech, 2013. UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation). VisIRR: Information Retrieval and Personalized Recommender System. Micro-Financing Analysis in Kiva.org: Room 104, Wed 4pm.

25 UTOPIAN Interactions and Key Techniques. Interactions: refining topic keywords; merging topics; splitting a topic; creating new topics from seed documents/keywords. Key techniques: visualization (supervised t-SNE); topic modeling (NMF); interaction (weakly-supervised NMF); per-iteration visualization framework.

26 Supervised t-SNE: Visualizing Documents. Original t-SNE: documents do not have clear topic clusters. Supervised t-SNE: $d(x_i, x_j) \leftarrow \alpha\, d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic (e.g., $\alpha = 0.3$).
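
The distance adjustment on this slide is a small preprocessing step that can be sketched on its own. The code below only rescales a pairwise distance matrix; running t-SNE on the result is left out. It assumes each document has a single topic label, and the function name and toy distances are hypothetical.

```python
def supervise_distances(D, topics, alpha=0.3):
    """Scale d(x_i, x_j) by alpha when documents i and j share a topic."""
    n = len(D)
    return [
        [D[i][j] * alpha if i != j and topics[i] == topics[j] else D[i][j]
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical pairwise distances between three documents.
D = [[0.0, 4.0, 6.0],
     [4.0, 0.0, 8.0],
     [6.0, 8.0, 0.0]]
topics = [0, 0, 1]  # the first two documents share a topic

D_sup = supervise_distances(D, topics)
print(D_sup)  # only the distance between documents 0 and 1 shrinks
```

Because same-topic pairs are pulled closer before the embedding is computed, documents of one topic tend to land in a tighter cluster, while cross-topic distances are left untouched.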

27 PIVE (Per-Iteration Visualization Environment): Integration Methodology of Iterative Methods for Real-Time Interactive Visualization [Choo et al., VAST’14, to submit]. Standard approach vs. PIVE approach.
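
The slide names the two approaches without describing them, so the following is only one plausible reading, not the PIVE design itself: a standard pipeline renders once after the iterative method finishes, while a per-iteration pipeline renders every intermediate result. All names here are hypothetical, and a Newton iteration for a square root stands in for an iterative method such as NMF or t-SNE.

```python
def iterate(step, state, n_iters):
    """Yield the state after every iteration of an iterative method."""
    for _ in range(n_iters):
        state = step(state)
        yield state


def run_standard(step, state, n_iters, render):
    """Render once, after the final iteration."""
    final = state
    for final in iterate(step, state, n_iters):
        pass
    render(final)
    return final


def run_per_iteration(step, state, n_iters, render):
    """Render every intermediate result as soon as it is available."""
    final = state
    for final in iterate(step, state, n_iters):
        render(final)
    return final


def newton_sqrt2(x):
    """One Newton step toward sqrt(2); a stand-in for a real solver step."""
    return 0.5 * (x + 2.0 / x)


frames_standard, frames_live = [], []
result_standard = run_standard(newton_sqrt2, 1.0, 5, frames_standard.append)
result_live = run_per_iteration(newton_sqrt2, 1.0, 5, frames_live.append)
print(len(frames_standard), len(frames_live))
```

Both runs reach the same final result; the difference is that the per-iteration run produces a frame at every step, so a user could inspect or interrupt the computation before it converges.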

