Crawling and Aligning Scholarly Presentations and Documents from the Web By SARAVANAN.S 09/09/2011 Under the guidance of A/P Min-Yen Kan 10/23/2015 1.


1 Crawling and Aligning Scholarly Presentations and Documents from the Web
By SARAVANAN.S, 09/09/2011
Under the guidance of A/P Min-Yen Kan

2 Motivation
– More articles → more users
– Searching for documents is difficult
– Aim: find pairs of presentations and documents automatically

3 System Architecture
– Search engine wrapper: a “file type” operator (PDF, PPT or PS) is added to the user query before it is sent to Google. Yee Fan’s Search Engine Wrapper is used (Google subsystem only).
– Re-ranking: the top Google results are re-ranked.
– Output (3-way):
  1. Exact URL
  2. Message for non-free files
  3. No result
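The query-expansion step can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the function name `build_queries` is hypothetical, and it assumes the operator is appended in Google's `filetype:` form, once per file type.

```python
from typing import List, Sequence

# File types named on the slide (PDF, PPT or PS).
FILE_TYPES = ("pdf", "ppt", "ps")


def build_queries(user_query: str, file_types: Sequence[str] = FILE_TYPES) -> List[str]:
    """Return one search string per file type (hypothetical helper).

    Assumption: the file-type restriction is expressed with Google's
    `filetype:` operator appended to the unchanged user query.
    """
    cleaned = " ".join(user_query.split())  # collapse stray whitespace
    return [f"{cleaned} filetype:{ft}" for ft in file_types]
```

Each returned string would then be passed to the search engine wrapper, whose interface is not described on the slides.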

4 Methodology (1): Re-Ranking
– Similarity between the user query and each retrieved document is computed and used for re-ranking.
– Similarity measures: Jaccard coefficient and Bilingual Evaluation Understudy (BLEU).
– A threshold keeps the system from considering documents with low similarity scores.
– Pipeline: Google’s top results → similarity score computation → re-ranking of results based on similarity score.
– Similarity is computed between the query title and each Google result’s title, snippet, and URL.
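The re-ranking pipeline above might look like the sketch below. The function `rerank`, the result dictionaries, and the default threshold are assumptions made for illustration; the similarity function is passed in so that either Jaccard or BLEU can be plugged in.

```python
from typing import Callable, Dict, List, Tuple


def rerank(
    query: str,
    results: List[Dict[str, str]],
    similarity: Callable[[str, str], float],
    threshold: float = 0.5,
    field: str = "title",
) -> List[Tuple[float, Dict[str, str]]]:
    """Score each result against the query, drop low scores, sort descending.

    Assumption: each result is a dict with "title", "snippet" and "url"
    keys, and `field` selects which one is compared with the query.
    """
    scored = [(similarity(query, result[field]), result) for result in results]
    kept = [(score, result) for score, result in scored if score >= threshold]
    # Python's sort is stable, so ties keep the search engine's original order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept
```

An empty return value corresponds to the "no result" output case.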

5 Jaccard Measure
– The Jaccard measure is used to compute similarity between the query title and a Google result’s title, snippet, and URL.
– Simple word-by-word matching.
– Problem: snippets have more words than titles, so the union in Jaccard grows while the intersection stays the same, which lowers the score.
– Example:
  Sentence 1: Finding related pages in the world wide web.
  Sentence 2: Finding Related pages using the Link structure of the WWW.
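A worked version of the two example sentences, as a minimal sketch. The tokenisation (lower-casing and splitting on non-alphanumeric characters) is an assumption; the slides only say "simple word by word matching".

```python
import re
from typing import Set


def word_set(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphanumeric words (assumed tokenisation)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| over the two word sets."""
    set_a, set_b = word_set(a), word_set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


sentence1 = "Finding related pages in the world wide web."
sentence2 = "Finding Related pages using the Link structure of the WWW."
# Shared words: finding, related, pages, the (4); union has 13 words.
print(round(jaccard(sentence1, sentence2), 4))  # 4/13, i.e. 0.3077
```

Note that "WWW" and "world wide web" do not match at the word level, so the score stays low even though the two titles are closely related.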

6 BLEU Metric
– Why BLEU? It measures n-gram similarity of words.
– It helps in assessing the sequential order of the words when finding the similarity between two sets of words.
– The sequential order of words matters with snippets, where query terms may appear in random positions.
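To show how n-gram matching captures word order, here is a simplified, single-reference BLEU-style score. It is a sketch under stated assumptions, not the implementation used in the experiments: it has no smoothing, it defaults to bigrams (`max_n=2`) because titles are short (standard BLEU uses up to 4-grams), and it treats the result text as the candidate and the query title as the reference.

```python
import math
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lower-case alphanumeric tokens (assumed tokenisation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(words: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in the word list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = tokens(candidate), tokens(reference)
    if not cand or not ref:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty n-gram level gives zero
        log_precision_sum += math.log(clipped / total)
    # Penalise candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision_sum / max_n)
```

With this sketch, a candidate containing the same words as the reference in reverse order scores zero at the bigram level, whereas Jaccard would score it as a perfect match.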

7 Rules
Special rules are used for better matching:
– Rule 1: Removal of special symbols (On/Off)
– Rule 2: Stop-word removal (On/Off)
– Rule 3: URL filter by .edu (On/Off)
– Rule 4: Stemming with the Porter stemming algorithm (On/Off)
All of these rules are used with both similarity measures.
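The four switches could be wired up roughly as below. Everything here is illustrative: the stop-word list is a small hypothetical sample, the `.edu` test assumes the rule means "host name ends in .edu", and stemming is left as a caller-supplied function because no Porter stemmer is implemented in this sketch.

```python
import re
from typing import Callable, List, Optional
from urllib.parse import urlparse

# Small illustrative stop-word list (assumption; the real list is not given).
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "on", "for", "and", "to"})


def preprocess(
    text: str,
    remove_symbols: bool = True,                   # Rule 1
    remove_stop_words: bool = False,               # Rule 2
    stem: Optional[Callable[[str], str]] = None,   # Rule 4: pass a stemmer to enable
) -> List[str]:
    """Apply the text rules that are switched on and return the word list."""
    if remove_symbols:
        text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    words = text.lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if stem is not None:
        words = [stem(w) for w in words]
    return words


def passes_edu_filter(url: str, enabled: bool = True) -> bool:
    """Rule 3: when enabled, keep only URLs whose host ends in .edu (assumed meaning)."""
    if not enabled:
        return True
    host = urlparse(url).hostname or ""
    return host.endswith(".edu")
```

Running all 16 On/Off combinations then amounts to iterating over the four boolean switches.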

8 Methodology
– MIME types: to differentiate free PDFs from subscription-only ones, I used the MIME type, that is, the content type returned for the URL.
– Dataset collection: queries from computer science, medical science, architecture, and mathematics.
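One way to perform such a check is sketched below. It assumes that a freely available file answers with `Content-Type: application/pdf`, while a subscription page answers with something else (typically an HTML login page); the slides do not state the exact criterion. Both function names are hypothetical.

```python
import urllib.request


def is_pdf_content_type(content_type: str) -> bool:
    """True if the header value names the PDF media type (parameters ignored)."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"


def fetch_content_type(url: str, timeout: float = 10.0) -> str:
    """Send a HEAD request and return the Content-Type header ("" if absent).

    Assumption: the server answers HEAD requests; some servers do not,
    in which case a GET request would be needed instead.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")
```

A URL whose content type is not PDF would then map to the "message for non-free files" output.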

9 Experiment
Experiments on
– Jaccard measure (all special rules are tested On/Off).
– BLEU measure (all special rules are tested On/Off).
– Query set with about 50 queries.
– Threshold is varied from 0.1 to 1.0 in all experiments.
– Highest recall with high threshold is considered.
Experiment results
– Jaccard similarity.
– BLEU similarity.
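The threshold sweep and the precision, recall and F-score columns can be illustrated as follows. The evaluation protocol is an assumption, since the slides do not spell it out: each query contributes one (score, is_correct) pair for its top-ranked result, a result counts as returned when its score reaches the threshold, and recall is taken over all queries.

```python
from typing import Dict, List, Tuple


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(outcomes: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) at one threshold under the assumed protocol."""
    returned = [is_correct for score, is_correct in outcomes if score >= threshold]
    correct = sum(returned)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(outcomes) if outcomes else 0.0
    return precision, recall, f_score(precision, recall)


def sweep(outcomes: List[Tuple[float, bool]]) -> Dict[float, Tuple[float, float, float]]:
    """Evaluate at thresholds 0.1, 0.2, ..., 1.0."""
    thresholds = [round(0.1 * k, 1) for k in range(1, 11)]
    return {t: evaluate(outcomes, t) for t in thresholds}
```

As a consistency check, `f_score(0.9818, 0.9152)` rounds to 0.9473, the best Jaccard F-score reported in the results.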

10 Experiment Result of Jaccard
Rule 1  Rule 2  Rule 3  Rule 4  Google target  Threshold  Precision  Recall  F-score
On      Off     Off     Off     Title          0.7        0.9818     0.9152  0.9473
Off     On      Off     Off     Title          0.6        0.7857     0.3728  0.5057
Off     Off     On      Off     Title          0.5        0.9272     0.8644  0.8947
Off     Off     Off     On      Title          0.5        0.9107     0.8644  0.8869
On      On      Off     Off     Title          0.6        0.7857     0.3728  0.5057
Off     On      On      Off     Title          0.5        0.6641     0.3728  0.5057
Off     Off     On      On      Title          0.4        0.9107     0.8644  0.8869
On      Off     Off     On      Title          0.5        0.9107     0.8644  0.8869
On      On      On      Off     Title          0.5        0.7857     0.3728  0.5057
Off     On      On      On      Title          0.5        0.7667     0.3898  0.5168
On      Off     On      On      Title          0.4        0.9107     0.8644  0.8869
On      On      Off     On      Title          0.7        0.8214     0.3898  0.5287
On      On      On      On      Title          0.5        0.9310     0.4576  0.6136
Off     Off     Off     Off     Title          0.7        0.9259     0.8474  0.8849
On      Off     On      Off     Title          0.5        0.9272     0.8644  0.8947
Off     On      Off     On      Title          0.3        0.9107     0.8644  0.8869
Best F-score achieved: 0.9473 (Rule 1 On, threshold 0.7)

11 Experiment Result of BLEU
#   Rule 1  Rule 2  Rule 3  Rule 4  Google target  Threshold  Precision  Recall  F-score
1   On      Off     Off     Off     Title          0.5        0.9272     0.8644  0.8947
2   Off     On      Off     Off     Title          0.3        0.7667     0.3898  0.5168
3   Off     Off     On      Off     Title          0.6        0.9412     0.8136  0.8727
4   Off     Off     Off     On      Title          0.5        0.9130     0.7118  0.8
5   On      On      Off     Off     Title          0.8        0.8148     0.3729  0.5116
6   Off     On      On      Off     Title          0.5        0.7586     0.3729  0.5
7   Off     Off     On      On      Title          0.5        0.9130     0.7119  0.8
8   On      Off     Off     On      Title          0.8        0.9074     0.8305  0.8672
9   On      On      On      Off     Title          0.8        0.8148     0.3728  0.5116
10  Off     On      On      On      Title          0.2        0.7586     0.3729  0.5
11  On      Off     On      On      Title          0.7        0.9074     0.8305  0.8672
12  On      On      Off     On      Title          0.8        0.8148     0.3728  0.5116
13  On      On      On      On      Title          0.6        0.8076     0.3559  0.4941
14  Off     Off     Off     Off     Title          0.5        0.9090     0.8478  0.8772
15  On      Off     On      Off     Title          0.5        0.9107     0.8644  0.8899
16  Off     On      Off     On      Title          0.5        0.76       0.3220  0.4523
Best F-score achieved: 0.8947 (row 1, threshold 0.5)

12 Related Work
Base reference:
– SlideSeer: a digital library of aligned document and presentation pairs [Kan, JCDL’07].
– Learning to Rank for Information Retrieval [Liu et al., WWW’09].
– Kairos: Proactive Harvesting of Research Paper Metadata from Scientific Conference Web Sites [Hänse, ICADL’09].
Approaches to similarity computation:
– BLEU: a Method for Automatic Evaluation of Machine Translation [Papineni et al., ACL July’02].
– BLEU algorithm for evaluation machine translations implementation [Payson et al.].

13 Conclusion
Matching documents based on similarity score:
– Jaccard measure: Jaccard similarity computed over the query title and document title, with the special-symbol removal rule on, retrieves the best articles. Threshold: 0.7. F-score: 0.9473.
– BLEU metric: BLEU similarity computed over the query title and document title, with the special-symbol removal rule on, retrieves the best articles. Threshold: 0.5. F-score: 0.8947.

14 Thank you
Comments are welcome

