PageRank + Inverted Index. A Search Engine “obama”

Presentation transcript:

1 PageRank + Inverted Index

2 A Search Engine

3 “obama”

4 PageRank Model: Final Version
The Web: a directed graph
– Vertices (pages)
– Edges (links)
[Figure: example graph with vertices a, b, c, d, e, f]
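To make the directed-graph model concrete, here is a minimal sketch of iterative PageRank in Java. It is not the PageRankGraph.java from the lab: the class name, the 4-vertex example graph, the damping factor of 0.85, and the choice to spread the rank of dead-end pages evenly over all vertices are assumptions made for illustration.

```java
import java.util.Arrays;

public class PageRankSketch {
    /**
     * Runs a fixed number of PageRank iterations.
     * outLinks[v] lists the vertices that v links to.
     * Assumptions: damping factor d, and rank held by vertices with no
     * out-links is redistributed uniformly over all vertices.
     */
    public static double[] pageRank(int[][] outLinks, int iters, double d) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int v = 0; v < n; v++) {
                if (outLinks[v].length == 0) {
                    danglingMass += rank[v]; // dead end: no edge to follow
                } else {
                    double share = rank[v] / outLinks[v].length;
                    for (int target : outLinks[v]) {
                        next[target] += share; // split rank across out-links
                    }
                }
            }
            for (int v = 0; v < n; v++) {
                next[v] = (1 - d) / n + d * (next[v] + danglingMass / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, vertex 3 has no links.
        int[][] graph = { {1, 2}, {2}, {0}, {} };
        System.out.println(Arrays.toString(pageRank(graph, 50, 0.85)));
    }
}
```

With this update rule the ranks always sum to 1, so each score can be read as the share of time a random surfer spends on that page.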

5 Input Structure
31.5 million edges, 960,109 nodes
Columns: document-with-link | document-linked

6 Step 0. Start Downloading Datasets
http://aidanhogan.com/teaching/cc5212-1/mdp-lab9-data/
– page_links_es_f.tsv.gz
– wiki_abstracts_es.tsv.gz
– http://aidanhogan.com/teaching/cc5212-1/mdp-lab9.zip

7 Step 1. Dictionary Encode Links
Strings are difficult to fit in memory
Encode strings as OIDs (object IDs = integers)
Input line: http://es.wikipedia.org/wiki/Ciencia_ficción http://es.wikipedia.org/wiki/Robot
Output line: 12039 52673
Dictionary:
12039 http://es.wikipedia.org/wiki/Ciencia_ficción
…
52673 http://es.wikipedia.org/wiki/Robot
…
OIDCompress -i [folder]/page_links_es_f.tsv.gz -igz -o [folder]/page_links_es_f.oid.gz -ogz -d [folder]/page_links_es_f.dict.gz -dgz
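The idea of dictionary encoding can be sketched in a few lines of Java. This is not the lab's OIDCompress: the class and method names are hypothetical, and assigning OIDs sequentially from 0 in order of first appearance is an assumption (the real tool evidently assigns different numbers, such as 12039 above).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OidDictionarySketch {
    private final Map<String, Integer> oids = new HashMap<>(); // string -> OID
    private final List<String> strings = new ArrayList<>();    // OID -> string

    /** Returns the OID of s, assigning the next free integer on first sight. */
    public int encode(String s) {
        Integer id = oids.get(s);
        if (id == null) {
            id = strings.size();
            oids.put(s, id);
            strings.add(s);
        }
        return id;
    }

    /** Maps an OID back to its original string. */
    public String decode(int oid) {
        return strings.get(oid);
    }

    /** Encodes one "source<TAB>target" line as "oid<TAB>oid". */
    public String encodeLine(String line) {
        String[] cols = line.split("\t");
        return encode(cols[0]) + "\t" + encode(cols[1]);
    }

    public static void main(String[] args) {
        OidDictionarySketch dict = new OidDictionarySketch();
        String line = "http://es.wikipedia.org/wiki/Ciencia_ficción\thttp://es.wikipedia.org/wiki/Robot";
        System.out.println(dict.encodeLine(line));
        System.out.println(dict.decode(1));
    }
}
```

Each URL is stored once in the dictionary, and every edge then costs two integers instead of two long strings, which is what makes the full link graph fit in memory.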

8 Step 2. Copy PageRank Code
Copy PageRankGraph.java from mdp-lab8 to mdp-lab9 (same package)
– Use your own code to be marked on it!
– Marked out of 20 for this lab
If you weren’t here last week, copy PageRankGraph.java from http://aidanhogan.com/cc5212-1/mdp-lab9-data/
– Marked out of 10 for this lab

9 Step 3. Rank and sort full data
Run ranking (PageRankGraph.java)
– 50 iterations: ITERS = 50
– Arguments: -i [folder]/page_links_es_f.oid.gz -igz -o [folder]/page_ranks_es_f.oid.tsv.gz -ogz
Sort ranks by rank score (SortByRank.java)
– Arguments: -i [folder]/page_ranks_es_f.oid.tsv.gz -igz -o [folder]/page_ranks_es_f_s.oid.tsv.gz -ogz
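For intuition, sorting the rank file could look roughly like the sketch below. It is not the lab's SortByRank.java: the class name is hypothetical, and the line layout "oid<TAB>rank" with the highest score first is an assumption about the file format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByRankSketch {
    /**
     * Returns a copy of the lines sorted by rank score, highest first.
     * Assumes each line is "oid<TAB>rank".
     */
    public static List<String> sortByRank(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingDouble(
                (String line) -> Double.parseDouble(line.split("\t")[1])).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical OIDs and scores, for illustration only.
        List<String> ranks = Arrays.asList("7\t0.10", "3\t0.50", "9\t0.30");
        for (String line : sortByRank(ranks)) {
            System.out.println(line);
        }
    }
}
```

With under a million nodes, an in-memory sort like this is enough; a much larger graph would need an external sort.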

10 Step 4. Make Predictions & Bets Which will be the highest ranked articles in Wikipedia according to PageRank?

11 Step 5. Decode the ranks
Decode the file (OIDDecompress.java)
– Arguments: -d [folder]/page_links_es_f.dict.gz -dgz -i [folder]/page_ranks_es_f_s.oid.tsv.gz -igz -n 0 -o [folder]/page_ranks_es_f_s.tsv
Open the output in a text editor and have a look

12 Step 6. Copy Inverted Index Code
Copy IndexTitleAndAbstract.java and SearchIndex.java from mdp-lab7 into mdp-lab9 (if you were here)
Otherwise grab them from http://aidanhogan.com/cc5212-1/mdp-lab9-data/

13 Step 7. Rebuild Inverted Index
Run IndexTitleAndAbstract.java
– Arguments: -i [folder]/wiki_abstracts_es.tsv.gz -igz -o [folder]/es_wiki_index/
Try searches using SearchIndex.java
– Copy the top 10 results for 5 searches, including ‘obama’ and ‘universidad’, into a text file somewhere
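As a reminder of what an inverted index stores, here is a tiny in-memory sketch in plain Java. The lab itself builds its index with Lucene, so this is a swapped-in illustration rather than the lab's approach: the class name, the tokenisation on non-letter characters, and the two example documents are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // term -> doc ids
    private final List<String> titles = new ArrayList<>();              // doc id -> title

    /** Indexes every lower-cased term of the title and abstract. */
    public void add(String title, String abstractText) {
        int docId = titles.size();
        titles.add(title);
        String text = (title + " " + abstractText).toLowerCase(Locale.ROOT);
        for (String term : text.split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the titles of all documents containing the term (unranked). */
    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        for (int id : ids) {
            hits.add(titles.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        // Hypothetical documents, not taken from wiki_abstracts_es.tsv.gz.
        index.add("Barack Obama", "Presidente de los Estados Unidos");
        index.add("Universidad de Chile", "Universidad pública en Santiago");
        System.out.println(index.search("obama"));
    }
}
```

This sketch returns matches in insertion order with no scoring at all, which is exactly the gap that relevance scores and the PageRank boost are meant to fill.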

14 Step 8. Add in the boost values
Open BoostRanks.java
Write the code by following the board
Run with: -o [folder]/es_wiki_index/ -i [folder]/page_ranks_es_f_s.tsv
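The slides do not show how BoostRanks.java applies the boost, so the following is only a sketch of the general idea: combine each hit's text-relevance score with a rank-derived factor and re-order. The class name, the multiplicative combination, the boost of 0 for pages without a rank, and the example scores are all assumptions, and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BoostSketch {
    /**
     * Orders pages by textScore * boost, highest first.
     * Assumption: pages with no boost entry get a boost of 0.
     */
    public static List<String> rerank(Map<String, Double> textScores,
                                      Map<String, Double> boosts) {
        List<String> pages = new ArrayList<>(textScores.keySet());
        pages.sort(Comparator.comparingDouble(
                (String p) -> textScores.get(p) * boosts.getOrDefault(p, 0.0)).reversed());
        return pages;
    }

    public static void main(String[] args) {
        // Hypothetical scores: pageA matches the query text better,
        // but pageB has a much higher PageRank.
        Map<String, Double> textScores = new HashMap<>();
        textScores.put("pageA", 2.0);
        textScores.put("pageB", 1.0);
        Map<String, Double> boosts = new HashMap<>();
        boosts.put("pageA", 0.1);
        boosts.put("pageB", 0.5);
        System.out.println(rerank(textScores, boosts));
    }
}
```

In this toy example the better-linked page overtakes the page with the stronger text match, which is the kind of change to look for when re-running the queries.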

15 Step 9. Profit
Re-run the same five queries as before over the boosted index and see if the results improve
http://www.lucenetutorial.com/lucene-query-syntax.html

16

17 Course Marking
– 45% for Weekly Labs (~3% a lab!)
– 35% for Final Exam
– 20% for Small Class Project

18 Class Project
Done in pairs (except Alejandro/Mauricio :P)
Goal: Use what you’ve learned to do something cool (basically)
Expected difficulty: More than a lab’s worth
– But from scratch / without my help!
Marked on: Difficulty, appropriateness, scale, good use of techniques, presentation, coolness
– Ambition is appreciated, even if you don’t succeed: feel free to bite off more than you can chew!
Process:
– Pair up (default random) by Wednesday
– Decide on a topic (by June 9th) or let me assign one
– If you need data or get stuck, I will (try to) help out
Deliverables: 10-minute presentation (June 23rd) & 4-page report
– 2 weeks!

19 Groups
Pairings:
– Catalina Espinoza and Felipe Quintanilla
– Eduardo Acha and Jaime Salas
– Francisca Concha and Nicolás Miranda
Lone agents:
– Alejandro Infante
– Mauricio Quezada

20 Topics
Let’s talk topics
– Catalina Espinoza and Felipe Quintanilla
– Eduardo Acha and Jaime Salas
– Francisca Concha and Nicolás Miranda
– Mauricio Quezada
What’s the idea?
What will be the result of your project?
How much data will you process, and where will you source it?
Which techniques from the class will you use?
How cool is it?

