SlideSeer: A DL of aligned document and presentation pairs Min-Yen Kan WING (Web IR / NLP Group) National University of Singapore.


1 SlideSeer: A DL of aligned document and presentation pairs Min-Yen Kan WING (Web IR / NLP Group) National University of Singapore

2 Scholarly Digital Libraries: what do we use them for?
(Min-Yen Kan, Digital Libraries; Web IR / NLP Group @ NUS; 20 June 2007 - JCDL: Session E)
- Find articles to print, read offline
- Browse, select research work
- Assess authors, publication venues, research groups
Papers (documents) don't store all of the information about a discovery:
- Datasets
- Tools
- Implementation details / conditions
They also don't help a person learn the research:
- Textbooks
- Slide presentations (we'll focus on this)

3 Qualities of slide presentations
Good slide sets complement a document. They often:
- focus and highlight findings in the document
- create a bridge into the document itself
- are a visual and oral summary of a document
How can we leverage slides in a digital library?
What about poor slides?
- "PowerPoint is presenter-oriented, not content-oriented or audience-oriented…"
- The remedy?: "Visual reasoning usually works more effectively when the relevant evidence is shown adjacent in space within the eyespan." (Tufte, 2006)
- "Four score and seven years ago"

4 Documents and presentations as duals
Present identical or highly overlapping materials
- Document: for archival and reference purposes
- Presentation: for introducing and summarizing the work
As the two can be seen as duals, we should allow them to be viewed together.
- Would like random access of the presentation and document pair
Answer: find pairs of documents and presentations.

5 A model: MIT's Open CourseWare
- Slides in context
- Audio of lecture
- Simplified transcript of lecture
A better answer: add fine-grained alignment.

6 Talk Outline
- Motivation
- Architecture
  1. Resource Discovery
  2. Alignment
  3. User Interface
- Demo
- Status and Conclusions
[Architecture diagram. Offline: resource discovery (search engine), converters (pdftohtml, cz-ppt2txt, cz-ppt2gif, convert), data store, aligner. Online: web server, Javascript-enabled browser (sv, dv, pv, ssv, search). Stages: 1. Resource Discovery, 2. Alignment, 3. User Interface]

7 1. Resource Discovery
Algorithm:
- Obtain suitable document metadata
- Web search to find candidate presentations
- Post-process to usable form

8 1. Resource Discovery – Obtaining Metadata
Start with CiteSeer (thanks to IST: CL Giles, I Councill)
- 750K records with parsed header metadata
- Complete with .pdf documents
Enhancement: Merge DBLP snapshot (Aug 2006; 1.2M docs) with CiteSeer
- Large-scale record linkage task, O(nm) complexity unacceptable
- Indexed DBLP into Lucene, use each CiteSeer (CS) record to retrieve DBLP variants, resulting in O(n) complexity
- Result size: 1.5M
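
The indexing idea on this slide can be sketched as follows. This is a minimal illustration, not the SlideSeer implementation: a tiny in-memory inverted index stands in for Lucene, and the record titles, the `min_shared` cutoff, and all function names are made up for the example. The point is that each query record only touches records sharing tokens with it, rather than being compared against every record.

```python
from collections import defaultdict

def tokenize(title):
    """Lowercase a title and split it into a set of alphanumeric tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return set(cleaned.split())

def build_index(records):
    """Map each token to the ids of the records whose title contains it."""
    index = defaultdict(set)
    for rec_id, title in records.items():
        for tok in tokenize(title):
            index[tok].add(rec_id)
    return index

def candidates(index, title, min_shared=2):
    """Return ids of indexed records sharing at least min_shared tokens with title.

    min_shared is a hypothetical cutoff chosen for illustration.
    """
    counts = defaultdict(int)
    for tok in tokenize(title):
        for rec_id in index.get(tok, ()):
            counts[rec_id] += 1
    return {rec_id for rec_id, c in counts.items() if c >= min_shared}

# Toy stand-in for the DBLP side of the merge (made-up records).
dblp = {
    "d1": "Aligned document and presentation pairs",
    "d2": "Toy record linkage example",
}
index = build_index(dblp)
matches = candidates(index, "ALIGNED Document and Presentation Pairs.")
```

Only the candidates returned by the index would then be passed to a finer similarity check.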

9 1. Resource Discovery – Finding presentations
- Google API on title, author to find corresponding presentation
- Use simple Jaccard similarity threshold to decide matches
  - threshold λ3 for title+author similarity
[Diagram labels: CiteSeer + DBLP merge; DBLP Lucene Index; Web (filetype:ppt); Presentations; thresholds λ1, λ2, λ3]
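
A Jaccard threshold test of the kind described on this slide might look like the sketch below. The threshold value 0.5 is a placeholder, since the slides do not give a value for λ3, and the whitespace tokenization and function names are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_match(doc_fields, candidate_text, threshold=0.5):
    """Decide whether a candidate presentation matches a document.

    doc_fields: title and author strings from the document metadata.
    threshold: hypothetical stand-in for the slide's λ3.
    """
    doc_tokens = " ".join(doc_fields).lower().split()
    cand_tokens = candidate_text.lower().split()
    return jaccard(doc_tokens, cand_tokens) >= threshold
```

For example, `is_match(["aligned document pairs", "kan"], "Aligned Document Pairs Kan")` accepts the candidate, while a candidate with no shared tokens is rejected.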

10 1. Resource Discovery – Conversion
Conversion paths:
- Via pdftohtml: text, formatted text
- Via czppt2gif/convert: png, text
Final results: ~85% precision, recall difficult to calculate (~80%)
- 11K pairs after processing 200K of 1.5M records
Many caveats:
- only .pdf and .ppt formats currently handled
- conversion fails often, pdf conversion difficult
- current work: use OCR to redo text extraction

11 2. Alignment – Problem formulation
Q: What are we aligning?
A: Text of slides to document text
- Use paragraphs to delimit text units in documents
- Use document headers to delimit sections
Q: What type of alignment is necessary?
A: Depends. Presentation or document centered view?
- Presentation: 1 slide aligned to 0 or more paragraphs
- Document: 1 section aligned to 0 or more slides
Q: What's the approach?
A: Two stages:
- Basic similarity measure to calculate a similarity matrix
- Alignment schemes to establish alignment mapping
Concentrate on this
[Figure: similarity matrix, slides 1..s by text units 1..p]

12 2. Alignment – Related Work
1. Narration to presentation alignment
   - Usually naturally synchronous: monotonic alignment
2. Multilingual text alignment
   - Used in Machine Translation (MT)
   - Polynomial complexity (~O(n^3)) but heuristics tend to work well
3. Slide/abstract to document alignment
   - Use Hidden Markov Model (HMM) for alignment
   - Doesn't handle missing materials well
Desiderata:
- Should take context into account
- But shouldn't enforce monotonicity
- Nil (zero) alignments needed, when materials don't overlap

13 2. Alignment – Similarity Measures
Take text units, cut into tokens. Then calculate similarity using:
1. Cosine
   - Standard IR metric
   - TF×IDF for token weight
   - Calculate slide, paragraph vector similarity using cosine
2. Jaccard
   - unigram tokens
   - bigram
   - unigram + bigram
   - Use IDF weighting for tokens
For both schemes, use IDF weighting from WebBase corpus
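
The two similarity measures can be sketched as below. This is an illustration under assumptions: the IDF table is passed in as a plain dictionary (the talk derives it from the WebBase corpus, which is not reproduced here), unseen tokens get a hypothetical `default_idf` of 1.0, and the function names are invented for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, idf, default_idf=1.0):
    """Build a sparse TF×IDF vector (token -> weight) for one text unit."""
    tf = Counter(tokens)
    return {t: c * idf.get(t, default_idf) for t, c in tf.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def weighted_jaccard(a_tokens, b_tokens, idf, default_idf=1.0):
    """IDF-weighted Jaccard: weight of shared tokens over weight of all tokens."""
    a, b = set(a_tokens), set(b_tokens)
    def weight(tokens):
        return sum(idf.get(t, default_idf) for t in tokens)
    union = weight(a | b)
    return weight(a & b) / union if union else 0.0

# Unigram + bigram variant: concatenate both token lists before comparing.
slide = "hidden markov model".split()
slide_units = slide + ngrams(slide, 2)
```

Running either function over every slide and text-unit pair fills the similarity matrix used by the alignment schemes.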

14 2. Alignment - Schemes
Using matrix of similarity, align using:
1. Max Similarity
   - Baseline
   - Can't do nil alignment
2. Edit Distance
   - Efficient dynamic programming
   - But outputs only monotonic alignments
3. Local Jump Model
   - Variation on #2 to allow local backward jumps
   - Backward jumps within 5% of text units
   - Still doesn't handle reordered sections
4. Hidden Markov Model
   - Word-based
   - Attempts to find origin of s in p
   - Only handles overlapping information
[Figure: slides s_{i-5} … s_{i+5}, words w_{j-5} … w_{j+5} on slide s_i, and paragraphs p_1 … p_6 ordered p_1 > p_2 > p_3 > p_4 > p_5 > p_6]
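
To make the contrast between the first two schemes concrete, here is a small sketch operating on a similarity matrix (rows are slides, columns are text units). The monotonic version is a simplified dynamic program that maximizes total similarity over non-decreasing paths; it is a stand-in for, not a reproduction of, the edit-distance scheme in the talk. Function names are invented.

```python
def max_similarity_alignment(sim):
    """Baseline: align each slide to its single most similar text unit."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]

def monotonic_alignment(sim):
    """Pick one text unit per slide so that positions never decrease
    and the summed similarity is maximal (dynamic programming)."""
    n_slides, n_units = len(sim), len(sim[0])
    # best[i][j]: best total for slides 0..i when slide i is aligned to unit j
    best = [[0.0] * n_units for _ in range(n_slides)]
    back = [[0] * n_units for _ in range(n_slides)]
    best[0] = list(sim[0])
    for i in range(1, n_slides):
        run_best, run_arg = float("-inf"), 0
        for j in range(n_units):
            # Track the best predecessor among units 0..j (keeps the path monotonic).
            if best[i - 1][j] > run_best:
                run_best, run_arg = best[i - 1][j], j
            best[i][j] = run_best + sim[i][j]
            back[i][j] = run_arg
    j = max(range(n_units), key=best[-1].__getitem__)
    path = [j]
    for i in range(n_slides - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]

sim = [
    [0.1, 0.9, 0.2],
    [0.8, 0.1, 0.3],
    [0.1, 0.2, 0.7],
]
free = max_similarity_alignment(sim)   # may jump backwards
ordered = monotonic_alignment(sim)     # never jumps backwards
```

On this toy matrix the baseline sends the second slide back to the first text unit, while the monotonic scheme is forced to keep moving forward.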

15 2. Alignment – Span Extension
As Maximum Similarity does quite well, let's extend the algorithm
Idea: post-process to extend from points to spans
- Retrieve top n (n=10) most similar paragraphs
- Try all C(n, 2) possible spans for alignment
- alignment_score(x, y) = span_sim × ln(span_length)
  - Slightly favor longer spans
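
The span extension step can be sketched as below. Several details are assumptions: `span_sim` is taken to be the similarity between the slide and the concatenated text of paragraphs x..y, plain Jaccard is used as the similarity, and how a winning span is compared against the original point alignment is left out because the slide does not say.

```python
import math
from itertools import combinations

def jaccard(a, b):
    """Plain Jaccard similarity over token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_span(slide, paragraphs, n=10, sim=jaccard):
    """Return (score, x, y) for the best-scoring span among the top-n paragraphs.

    Each pair x < y of top-n paragraph indices defines a candidate span x..y,
    scored as span_sim * ln(span_length). Returns None if fewer than two
    paragraphs are available.
    """
    ranked = sorted(range(len(paragraphs)),
                    key=lambda i: sim(slide, paragraphs[i]), reverse=True)
    top = sorted(ranked[:n])
    best = None
    for x, y in combinations(top, 2):
        span_tokens = [t for p in paragraphs[x:y + 1] for t in p]
        score = sim(slide, span_tokens) * math.log(y - x + 1)
        if best is None or score > best[0]:
            best = (score, x, y)
    return best

slide = ["a", "b", "c", "d"]
paragraphs = [["x"], ["a", "b"], ["c", "d"], ["y", "z"]]
result = best_span(slide, paragraphs, n=2)
```

With `n=2` only paragraphs 1 and 2 are considered, and together they cover the slide exactly. Note that the logarithmic factor grows with span length, so with a larger `n` a longer but noisier span can outscore a tight one.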

16 2. Alignment – Alignment Correction
Neighboring alignments can help to correct a spurious one
(a) monotonic alignment → ok
(b) s_i jumps back from s_{i-1}, but then proceeds monotonically → probably ok, minor penalty
(c) s_i jumps back, but s_{i+1} jumps forward again → looks more like an error, major penalty applied
Final alignment score: alignment_score × (1 - penalty)
[Figure: panels (a), (b), (c), each showing slides s_{i-1}, s_i, s_{i+1} against paragraphs p_1 … p_n]
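
The three cases can be expressed as a small penalty function. The penalty values (0.1 and 0.5) are placeholders, since the slides give no numbers, and the way cases (b) and (c) are told apart here (whether the next slide returns to or beyond the previous slide's position) is one possible reading of the figure.

```python
def jump_penalty(prev_pos, cur_pos, next_pos, minor=0.1, major=0.5):
    """Penalty for slide i given the aligned positions of slides i-1, i, i+1.

    minor and major are hypothetical values chosen for illustration.
    """
    if cur_pos >= prev_pos:
        return 0.0      # (a) monotonic: no penalty
    if next_pos >= prev_pos:
        return major    # (c) jumps back, then forward again: likely spurious
    return minor        # (b) jumps back and continues from there

def corrected_score(alignment_score, penalty):
    """Final alignment score: alignment_score × (1 - penalty)."""
    return alignment_score * (1 - penalty)
```

For instance, positions (10, 2, 3) fall under case (b), whereas (10, 2, 11) fall under case (c) and the score of the middle alignment is halved.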

17 2. Alignment – Nil classifier
But not all text units should be aligned
Use machine learning (SVM) to learn a binary classifier
Features
1. Similarity score
2. Number of words on slide
   - Few words can indicate figures, pictures with less preference for alignment
3. Words on slide
   - Cue phrases: "outline", "questions", "thanks"
4. Alignment path
   - Jumping alignments (e.g., outline slides)
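
A feature extractor for such a classifier might look like the sketch below. Only the feature families come from the slide; the exact encoding (a binary cue-phrase flag, absolute jump distance for the alignment path) is an assumption, and training the SVM itself is omitted.

```python
CUE_PHRASES = ("outline", "questions", "thanks")

def nil_features(slide_text, best_similarity, prev_pos, cur_pos):
    """Build a feature vector for deciding whether a slide should stay unaligned.

    The encoding of each feature is hypothetical; the slide only names
    the four feature families.
    """
    words = [w.strip(".,:;!?\"'") for w in slide_text.lower().split()]
    return [
        best_similarity,                                    # 1. similarity score
        float(len(words)),                                  # 2. number of words on slide
        1.0 if any(c in words for c in CUE_PHRASES) else 0.0,  # 3. cue phrase present
        float(abs(cur_pos - prev_pos)),                     # 4. jump size on alignment path
    ]

features = nil_features("Thanks! Questions?", 0.05, 40, 3)
```

The resulting vectors, paired with gold nil / non-nil labels, would be the input to any standard binary SVM trainer.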

18 2. Alignment – Evaluation Dataset
- Manually compiled alignment dataset by author and fellow researcher
- Gold standard: annotate all acceptable spans, or nil
- 20 presentation and document pairs from databases
  - Dataset is freely downloadable
Average number of slides in presentation       37.6
Average number of paragraphs in document       277.3
Average number of nil (zero) alignments        6.6 (17.4%)
Average number of span alignments (s, x-y)     8.8 (23.4%)
Average number of point alignments (s, x)      22.2 (59.2%)
Total                                          37.6 (100%)

19 2. Alignment – Evaluation
Use Weighted Jaccard accuracy as metric
- Fractional accuracy for partially correct answers
- Give false positives (extra spurious alignments) less weight
Alignment Method                                               Weighted Jaccard Accuracy
1. Max Similarity (cosine)                                     33.4%
2. Edit Distance (cosine)                                      28.8%
3. Local Jump (cosine)                                         25.1%
4. Jing HMM                                                    28.8%
5. Max Sim + spanning (Jaccard bigram)                         39.9%
6. Max Sim + spanning + nil classification (Jaccard bigram)    41.2%
40%? Why is it so difficult?
- Noise in conversion process. Other studies have used clean data.
- Others have used soft accuracy (any overlap is correct)
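
The slides do not define the metric precisely. One possible reading, shown purely as an assumption, treats the predicted and gold alignments for a slide as sets of paragraph indices and down-weights false positives in the Jaccard denominator. The `fp_weight` value and the handling of nil alignments are hypothetical.

```python
def weighted_jaccard_accuracy(predicted, gold, fp_weight=0.5):
    """Fractional accuracy for one slide under an assumed weighting.

    predicted, gold: sets of paragraph indices (empty set means nil).
    Score = TP / (TP + FN + fp_weight * FP), so partially correct spans get
    partial credit and spurious extra paragraphs cost less than misses.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0                      # correctly predicted nil alignment
    tp = len(predicted & gold)
    fn = len(gold - predicted)
    fp = len(predicted - gold)
    return tp / (tp + fn + fp_weight * fp)
```

Under this reading, predicting paragraphs {3, 4, 5} against gold {3, 4} scores 0.8, while predicting only {3} scores 0.5.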

20 3. User Interface – Rationale
How might fine-grained aligned pairs be utilized in a large DL?
Coordinated Views
- Learning / Comprehension
- Summarization
- Offline Viewing
Collection Interface
- Comparing pairs
- Searching for suitable materials

21 3. UI – Coordinated Views
- Document View
- Slide View
- Slideshow View
- Full Document View
- Print View
- Gallery View
[Axis labels: Slide centric, Document centric]

22 SlideSeer Prototype Demo
Production environment differs from demo

23 3. UI – Collection Interface
Searching
- Lucene indexing of the static print view
- Show title along with the set of results
Spider-friendly
- Main content loaded dynamically by Javascript, not spiderable
- Currently use print view (as it is static) for spiderable interface
URLs
- Most material in the form
- Implies hierarchy of papers
- Constructed URLs to promote browsing access
Simple keyboard shortcuts
- For expert user navigation

24 Conclusion
Alignment of documents to presentations
Simple approach works well thus far
- Tweaks to get more mileage out of simple approach
- Span alignment, nil alignment modifications
- But certainly more models to try!
- 40% best performance, certainly much room to improve
Deployment status
- In Alpha (development)
- Beta hopefully in mid 2008
- Usability testing underway
Interested in digital anthologies?
- Join our mailing list (web: dAnth)
- Current: text extraction project for ACL Anthology

25 Other slides

26 Future Work
Planning to hook up current work in progress
- 2 stage CRF/SVM re-ranking citation segmentation algorithm
- Automatic keyphrase extraction program
- Automatic synthetic image classification
- Automatic de-duplication module
Partnering with Simone Teufel (Cambridge U.) to do argumentative zoning of documents
- What is a citation used for?

27 Poor slides
Often represent a biased view of the full results
- Cherry picking evidence to support claims
- Imply that evidence is independent (when it is statistically correlated)
- May summarize other findings inaccurately (secondary or tertiary sources)

