Information Retrieval: Search Algorithms, PageRank. Dr. Avi Rosenfeld.

1 Information Retrieval: Search Algorithms, PageRank. Dr. Avi Rosenfeld

2 Stages of a search engine
Building the data repository (web crawler)
Building the indexes (indexing)
– cleaning duplicates from the data, stemming
Building the answer
– processing the query (removing stop words)
– ranking the results (PageRank)
Analyzing the results
– false positive / false negative
– recall / precision

3 Indexing Process

4 Indexes
Indexes are data structures designed to make search faster
Text search has unique requirements, which leads to unique data structures
The most common data structure is the inverted index
– a general name for a class of structures
– “inverted” because documents are associated with words, rather than words with documents
– similar to a concordance

5 Inverted Index
Each index term is associated with an inverted list
– contains lists of documents, or lists of word occurrences in documents, and other information
– each entry is called a posting
– the part of the posting that refers to a specific document or location is called a pointer
– each document in the collection is given a unique number
– lists are usually document-ordered (sorted by document number)
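
A minimal sketch of such an index in Python, assuming whitespace tokenization and three made-up documents (the document texts and the helper names `build_inverted_index` and `intersect` are illustrative, not from the slides). It also shows one reason document-ordered lists are convenient: two lists can be intersected in a single merge pass.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a document-ordered list of document numbers."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visiting ids in order keeps lists sorted
        # set() so a document appears at most once in a term's list
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

def intersect(a, b):
    """Merge two document-ordered lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical toy collection, keyed by document number.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in water",
    3: "aquarium fish are popular pets",
}
index = build_inverted_index(docs)
both = intersect(index["fish"], index["in"])  # documents containing both terms
```

Here each posting is just a document number; real postings usually carry more information, such as counts or positions.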

6 Inverted List: Information to be Published
Word (key): Address 1, Address 2, Address 3, Address 4, Address 5, Address 6
a: 111-1111, 111-1112, 111-1113, 111-1114, 111-1115, 111-1116
aardvark: 111-4323
the: 111-1111, 111-1112, 111-1113, 111-1114, 111-1115, 111-1116
zoo: 123-4214, 123-9714, 333-9714
zygote: 548-4342

7 Simple Inverted Index

8 Inverted Index with counts supports better ranking algorithms

9 Inverted Index with positions supports other weights like tf*idf
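
As a rough illustration, the sketch below stores word positions in each posting and derives a tf*idf weight from them. It assumes one common variant of the weight (raw term count times log(N / document frequency)); the slides do not say which variant is intended, and the toy documents and function names are hypothetical.

```python
import math
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions of the term in that document]}"""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id].lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def tf_idf(index, term, doc_id, n_docs):
    """Raw count of the term in the document times log(N / document frequency)."""
    postings = index.get(term, {})
    tf = len(postings.get(doc_id, []))  # the count is the number of positions
    if tf == 0:
        return 0.0
    idf = math.log(n_docs / len(postings))
    return tf * idf

# Hypothetical toy collection.
docs = {1: "fish fish water", 2: "water", 3: "fish tank"}
index = build_positional_index(docs)
weight = tf_idf(index, "fish", 1, len(docs))
```

The count-only index is recoverable from the positional one, which is why the positional form supports the same ranking weights and more.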

10 Indexes and Ranking
Indexes are designed to support search
– faster response time, supports updates
Text search engines use a particular form of search: ranking
– documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm
What is a reasonable abstract model for ranking?
– enables discussion of indexes without details of the retrieval model
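
To make "retrieved in sorted order according to a score" concrete, here is a small sketch. The scoring function (sum of query-term counts in the document) is an assumption chosen for brevity, not the model the slides have in mind, and the documents are made up.

```python
from collections import Counter

def rank(query, docs, k=10):
    """Return up to k (doc_id, score) pairs, highest score first."""
    q_terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        # Assumed score: how often the query terms occur in the document.
        score = sum(counts[t] for t in q_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Sort by descending score; break ties by document number.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hypothetical toy collection.
docs = {1: "tropical fish tank", 2: "fish fish fish", 3: "garden plants"}
results = rank("tropical fish", docs)
```

A real engine would read the counts from the index rather than rescanning the documents for every query.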

11 Abstract Model of Ranking

12 Query Process

13 User interaction
– supports creation and refinement of query, display of results
Ranking
– uses query and indexes to generate ranked list of documents
Evaluation
– monitors and measures effectiveness and efficiency (primarily offline)

14 Content analysis
In ancient history (before Google), ranking used the content, including analysis of the site
– META tags
– load time
After Google, the structure of the web is analyzed together with these things...
– a method called PageRank

15 The History of PageRank
PageRank was developed by Larry Page (hence the name Page-Rank) and Sergey Brin. It was first developed as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998. Shortly after, Page and Brin founded Google. 16 billion…

16 PageRank
PageRank is a link analysis algorithm that assigns a numerical weight to each Web page, with the purpose of "measuring" its relative importance.
It is based on the hyperlink map.
It is an excellent way to prioritize the results of web keyword searches.

17 Link Structure of the Web
150 million web pages, 1.7 billion links
Backlinks and forward links:
– A and B are C’s backlinks
– C is A and B’s forward link
Intuitively, a webpage is important if it has a lot of backlinks.
What if a webpage has only one backlink, but it comes from www.yahoo.com?
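
The relationship between the two kinds of link can be sketched in a few lines of Python: a crawler records forward links, and backlinks are obtained by inverting that map. The three-page graph mirrors the A, B, C example on the slide; the function name `backlinks` is illustrative.

```python
from collections import defaultdict

def backlinks(forward_links):
    """Invert a page -> [pages it links to] map into page -> [pages linking to it]."""
    back = defaultdict(list)
    for src in sorted(forward_links):
        for dst in forward_links[src]:
            back[dst].append(src)
    return dict(back)

# A and B both link to C, as in the slide's example.
forward = {"A": ["C"], "B": ["C"], "C": []}
back = backlinks(forward)
```

Counting the entries in `back["C"]` gives the naive "number of backlinks" importance measure that the yahoo.com question calls into doubt.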

18 Simplified PageRank algorithm
Assume four web pages: A, B, C and D.
Each page begins with an estimated PageRank of 0.25.
L(A) is defined as the number of links going out of page A.
The PageRank of a page A is the sum, over every page that links to A, of that page's PageRank divided by its number of outgoing links.
[Figure: link graph over pages A, B, C and D]
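
One update step of this simplified scheme might look like the sketch below. The slide's link graph is not preserved in the transcript, so the graph here is an assumption: B links to A and C, C links to A, and D links to A, B and C.

```python
def simple_pagerank_step(links, ranks):
    """One update: each page splits its current rank evenly over its out-links."""
    new = {page: 0.0 for page in ranks}
    for src, targets in links.items():
        for dst in targets:
            new[dst] += ranks[src] / len(targets)  # PR(src) / L(src)
    return new

# Assumed example graph (the slide's figure is missing).
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = {page: 0.25 for page in links}  # every page starts at 0.25
new_ranks = simple_pagerank_step(links, ranks)
# Under these assumptions PR(A) = 0.25/2 + 0.25/1 + 0.25/3
```

In this graph A has no outgoing links, so the rank it collects is not passed on and the total shrinks after the step, which is one of the issues the damped version of the algorithm addresses.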

19 But this can be recursive...
Here C is important because it has an incoming link from B, which is itself important because it has incoming links from several sites.
PageRank accumulates, but with a marginal addition (the damping factor), d.
Assume d = 0.85 here; then the PR of A = [equation not preserved]
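
A hedged sketch of the iterative, damped computation follows. It assumes the normalized form PR(p) = (1 - d)/N + d * sum of PR(q)/L(q) over pages q linking to p, and it spreads the rank of pages without outgoing links evenly over all pages; the slide's own equation is not preserved, so both choices, and the example graph, are assumptions.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the damped PageRank update a fixed number of times."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is shared by all pages (assumption).
        dangling = sum(ranks[p] for p in pages if not links[p])
        new = {p: (1 - d) / n + d * dangling / n for p in pages}
        for src in pages:
            for dst in links[src]:
                new[dst] += d * ranks[src] / len(links[src])
        ranks = new
    return ranks

# Assumed example graph: B -> A, C; C -> A; D -> A, B, C.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

With this normalization the ranks always sum to 1, so they can be read as a probability distribution over pages.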

20 The PageRank score can be viewed

21 Promoting sites in search engines: Search Engine Optimization (SEO)
Because PageRank was publicly known, people set out to promote sites (why is Avi Rosenfeld, that is, me, ranked first?)
Building artificial links
– Link Building, Link Farming
Creating junk sites
– blogs, emails and the like that point to the site
Simply adding content in META tags

22 Comparison of the Machon Lev and Bar-Ilan sites
Columns: External Backlinks, Referring Domains, Backlinks EDU, Backlinks GOV, PR Quality
Row 1 [values fused]: 14765141522964, Very Strong
Row 2 [values fused]: 512427968467301311, Very Strong
Backlinks information provided by Majestic SEO, http://checkpagerank.net/
Machon Lev: PageRank = 6/10
Bar-Ilan: PageRank = 7/10

23 Google "Panda"
Not based only on the original PageRank
Not published
Weighs the age of the link
Weighs the source of the link
Weighs the target of the link
Builds machine-learning methods to assign weights to links
PageRank is now one of 200 ranking factors that Google uses to determine a page’s popularity.
http://www.accuracast.com/articles/optimisation/jagger/ (the Jagger update from 2005)

24 Search Engine Optimization (SEO)

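25 Evaluation – False Positive / Negative
Rows are the known label, columns are the predicted label.
Known Positive (A), Predicted Positive (A): True Positive (TP)
Known Positive (A), Predicted Negative (B): False Negative (FN)
Known Negative (B), Predicted Positive (A): False Positive (FP)
Known Negative (B), Predicted Negative (B): True Negative (TN)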

26 Definitions
Measure: Formula. Intuitive Meaning
Precision: TP / (TP + FP). The percentage of positive predictions that are correct.
Recall: TP / (TP + FN). The percentage of positive labeled instances that were predicted as positive.
Specificity: TN / (TN + FP). The percentage of negative labeled instances that were predicted as negative.
Accuracy: (TP + TN) / (TP + TN + FP + FN). The percentage of predictions that are correct.

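27 Example
Rows are the known label, columns are the predicted label.
Known Positive (A): 500 predicted Positive, 100 predicted Negative
Known Negative (B): 500 predicted Positive, 10,000 predicted Negative
Precision = 50% (500/1000)
Recall = 83% (500/600)
Accuracy = 95% (10500/11100)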
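
The arithmetic in this example can be checked with a short sketch. It assumes the false-negative count is 100, which is the value implied by Recall = 500/600 and by the total of 11,100; the function name `evaluate` is illustrative.

```python
def evaluate(tp, fn, fp, tn):
    """Compute the four measures from the cells of a confusion matrix."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Cell values from the example, with FN = 100 inferred from Recall = 500/600.
measures = evaluate(tp=500, fn=100, fp=500, tn=10_000)
```

Accuracy is high here mainly because true negatives dominate the collection, which is why precision and recall are the more informative measures for retrieval.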

28 General form of precision/recall
- Precision changes with respect to recall (it is not a fixed point)
- Systems cannot be compared at a single precision/recall point
- Average precision (over 11 points of recall: 0.0, 0.1, …, 1.0)
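
A sketch of the 11-point average follows. It assumes the usual interpolation rule (precision at a recall level is the best precision achieved at any rank whose recall is at least that level), which the slide does not spell out; the ranking and relevance judgments are made up, and the set of relevant documents is assumed to be non-empty.

```python
def eleven_point_average_precision(ranking, relevant):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    hits = 0
    points = []  # (recall, precision) after each position in the ranking
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        # Best precision at any rank that reaches at least this recall level.
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / len(levels)

# Hypothetical ranked list and relevance judgments.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
avg = eleven_point_average_precision(ranking, relevant)
```

Averaging over fixed recall levels yields a single number per system, so two systems can be compared without picking one precision/recall point.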

29 Effectiveness Measures
A is the set of relevant documents, B is the set of retrieved documents
Recall = |A ∩ B| / |A|
Precision = |A ∩ B| / |B|

30 Classification Errors
False Positive (Type I error)
– a non-relevant document is retrieved
False Negative (Type II error)
– a relevant document is not retrieved
– the false negative rate is 1 − Recall
Precision is used when the probability that a positive result is correct is important

31 Caching
Query distributions are similar to Zipf
– about half of the queries each day are unique, but some are very popular
Caching can significantly improve efficiency
– cache popular query results
– cache common inverted lists
Inverted list caching can help with unique queries
The cache must be refreshed to prevent stale data
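
One way to sketch a query-result cache that keeps popular queries and refuses stale entries is shown below. The least-recently-used eviction policy, the `max_age` rule and the class name `QueryCache` are all assumptions for illustration; the slides do not prescribe a policy. Time is passed in explicitly so the behavior is easy to follow.

```python
from collections import OrderedDict

class QueryCache:
    """Least-recently-used cache of query results with a maximum entry age."""

    def __init__(self, capacity, max_age):
        self.capacity = capacity
        self.max_age = max_age
        self._entries = OrderedDict()  # query -> (time stored, results)

    def get(self, query, now):
        entry = self._entries.get(query)
        if entry is None:
            return None
        stored_at, results = entry
        if now - stored_at > self.max_age:
            del self._entries[query]  # stale entry: force a fresh lookup
            return None
        self._entries.move_to_end(query)  # mark as most recently used
        return results

    def put(self, query, results, now):
        self._entries[query] = (now, results)
        self._entries.move_to_end(query)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

cache = QueryCache(capacity=2, max_age=10)
cache.put("pagerank", [3, 7], now=0)
hit = cache.get("pagerank", now=5)
```

The same structure could hold inverted lists keyed by term instead of result lists keyed by query.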

