
WIRED Week 6 Syllabus Review Readings Overview Search Engine Optimization Assignment Overview & Scheduling Projects and/or Papers Discussion.


1 WIRED Week 6 Syllabus Review Readings Overview Search Engine Optimization Assignment Overview & Scheduling Projects and/or Papers Discussion

2 Web Search Engines Independent of IR model Distributed index and servers - Crawler - Query server - Indexer Crawlers and Spiders - Centralized control, Coordinated, Refresh, Filtering - Not the main problem Queries - Interface, processing, results Indexing - Data normalization, load balancing, data sharing

3 Harvesting Not just Web data - Caching, Duplication, Normalization Armies of crawlers Filtering collected data Gatherers - Collects and extracts on various schedules - Works with several brokers Brokers - Indexes and interfaces to queries - Works with other Brokers and Gatherers Topical Agents?

4 Web Crawling Issues Follow chains of URLs to gather more URLs Extract index (content) from each page Lather-Rinse-Repeat Update crawler to-do list Associate frequency of crawls Breadth or Depth first? Endless looping Duplicate pages/sites Changed page (or not really?) Dynamically generated pages Intranet pages Markup language getting in the way NOROBOTS What should a crawler get?
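The crawl loop on this slide (follow links, update the to-do list, avoid endless looping and duplicates) can be sketched as follows. This is a minimal illustration, not a production crawler: `TOY_WEB`, `crawl`, and `fetch_links` are hypothetical names, and the "Web" here is an in-memory dictionary rather than real HTTP fetches.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],  # links back home (a loop)
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": [],
}


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    todo = deque([seed])  # the crawler's to-do list
    seen = {seed}         # guards against endless looping and duplicate pages
    visited = []
    while todo and len(visited) < max_pages:
        url = todo.popleft()  # popleft() gives breadth first; pop() would give depth first
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                todo.append(link)
    return visited


order = crawl("http://a.example/", lambda url: TOY_WEB.get(url, []))
print(order)
```

Swapping `popleft()` for `pop()` is the whole difference between breadth-first and depth-first here; the `seen` set is what keeps the loop between the home page and the about page from running forever.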

5 Indexing the Web Inverted File Index - Sorted words with pointers to location(s) & page(s) - Pointers are the focus (inversion) What about pages and sites? - Massive redundancy on well-organized sites Navigation Topics Content “State of the art indexing techniques” = 30% of text (not page) size. p 383 How can you tune an index for massively changing documents?
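To make "sorted words with pointers to location(s) & page(s)" concrete, here is a small sketch of an inverted file index. The function name `build_inverted_index` and the sample pages are assumptions for illustration; tokenization is a plain whitespace split, far simpler than anything a real engine would use.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to a list of (page_id, word_position) pointers."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((page_id, position))
    # Return the words in sorted order, as in an inverted file.
    return dict(sorted(index.items()))


sample_pages = {
    "p1": "web search engines",
    "p2": "search the web",
}
index = build_inverted_index(sample_pages)
print(index["web"])
```

The lookup goes from word to pages (the inversion), so a query never has to scan the page text itself.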

6 Ranking Boolean and Vector models mostly used - Why? - Works from the index, not the text Which ranking methods are best? - Datasets - Syntaxes - Users & Testing

7 Ranking Methods TF-IDF - Simple, smaller data sets Boolean Spread - Degrees of match - Within a document - Set of documents - Links between documents (meta docs?) Vector Spread - Standard cosine between query and index (to document) - Links with answer or pointing to answer Most Cited
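One way to picture TF-IDF weighting combined with the "standard cosine between query and index" is the sketch below. It assumes whitespace tokenization and a logarithmic IDF, which is one common variant among several; `tfidf_vectors`, `cosine`, and `rank` are hypothetical helper names.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({doc_name: {term: weight}}, {term: idf}) for a small collection."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(term for words in tokenized.values() for term in set(words))
    idf = {term: math.log(n_docs / count) for term, count in doc_freq.items()}
    vectors = {
        name: {term: tf * idf[term] for term, tf in Counter(words).items()}
        for name, words in tokenized.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Order document names by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    query_tf = Counter(query.lower().split())
    query_vec = {term: tf * idf.get(term, 0.0) for term, tf in query_tf.items()}
    return sorted(docs, key=lambda name: cosine(query_vec, vectors[name]), reverse=True)


docs = {"d1": "web crawler crawler", "d2": "web index", "d3": "ranking links"}
print(rank("crawler", docs))
```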

8 Is Web ranking different? Links are the difference that makes the difference - Internal links on a page - Internal links on a site - Relationships between sites - Link freshness Kleinberg’s HITS method (1998) - Hypertext Induced Topic Search - Number of pages that point to (processed) query - Authorities (relevant content by links) - Hubs (links to varied authorities)
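The hub and authority idea can be sketched as a short iteration: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of the pages it links to. This is a simplified illustration of the mutual-reinforcement step only (it skips building the query-specific page set); the function name `hits` and the toy link graph are assumptions.

```python
import math


def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub_scores, authority_scores)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that point to this page.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of the pages this page points to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


toy_links = {"hub1": ["a", "b"], "hub2": ["a", "b"], "hub3": ["a"]}
hub_scores, auth_scores = hits(toy_links)
print(max(auth_scores, key=auth_scores.get))
```

In the toy graph, page `a` is pointed to by all three hubs and so ends up as the top authority, while `hub3`, which links to only one authority, scores lower as a hub than `hub1`.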

9 Problems with Hubs & Authorities Are more links always better? What about pages without many outgoing links? How do you count multiple links from within one page to another? Do automatically generated sites/pages have an advantage? - CMS systems may have linking “fingerprints” - Metadata How varied are the link weights? - Simple counts - Modified by other IR measures

10 Anatomy of a Large-Scale (LS) Web Search Engine Initial Google Design PageRank - PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages pointing to A and C(T) is the number of links going out of T - “A model of user behavior”: the probability of a random surfer visiting a page is its PageRank; the damping factor d accounts for the surfer getting bored and jumping to a random page - Pages point to a page - Highly ranked pages point to a page - Anchor text is mined (the label for the link) - Proximity included
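The PageRank formula on this slide can be evaluated by simple repeated substitution. The sketch below applies PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) to a three-page toy graph; the function name `pagerank`, the fixed iteration count, and the graph itself are illustrative assumptions, and pages with no outgoing links are not given any special handling.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: PageRank}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page.
            inbound = sum(
                pr[t] / len(links[t]) for t in pages if page in links.get(t, ())
            )
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr


toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page C collects a share from A and all of B's rank, so it comes out on top; A is close behind because C, the highest-ranked page, links only to it.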

11 Anatomy 2 Repository of page content Document index - Forward (sorted) - Inverted (sorter) Lexicon of words & pointers Hit Lists of word occurrence(s) Crawlers Ranking Feedback of selection (~)

12 Popularity? Do you always want the most popular information source? - Talk Radio - New York Times Bestseller List - “Lincoln’s Doctor’s Dog” - “The C.S.I. Diet and Cookbook” Trend or Fad? Blogs, Editorials and Propaganda vs. “Facts”? Result Diversity Death of the Mid-List

13 Metasearch Issues One place for everything? First or Last place to look? Better or different interface? Combined, sorted results would be best - How to sort? - Sorting for different types of queries Syntax Errors State Information (monitoring) Copyright issues (robots) User, content and interface mismatches/challenges
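The slide leaves "How to sort?" open. One possible answer, offered here only as an example and not as the method the slides have in mind, is reciprocal rank fusion: each engine contributes 1/(k + rank) for every result it returns, and the combined list is sorted by total score. The function name `fuse`, the constant `k=60`, and the result lists are assumptions.

```python
from collections import defaultdict


def fuse(result_lists, k=60):
    """Merge ranked result lists from several engines into one ranked list."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


engine_results = [
    ["x", "y", "z"],  # hypothetical engine 1
    ["y", "x"],       # hypothetical engine 2
    ["y", "w"],       # hypothetical engine 3
]
merged = fuse(engine_results)
print(merged)
```

This approach uses only rank positions, so it sidesteps the problem that different engines report scores on incompatible scales.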

14 Web Searching Metaphors How do people visualize the Web? Is Browsing better? Do we need new metaphors for using the Web? - Searching - Browsing - What else?

15 Search Engine Optimization Found by spiders and submissions - More links to and from site - Registration on major directories - Links to and from major directories Real Contact information Helps prove validity - META tag - Header and footer of home page - About Us or Contact Us pages - Location/Map page

16 Good Design is SEO Basic interface Well-structured links - Comprehensive Site Navigation - Updated and accurate links Easy to find (via the Web or on the site itself) Clear labels - TITLEs - Headings - Term consistency - Link consistency Small sizes to download quickly

17 Web Search Tests Perform searches with targeted keywords Compare and contrast top results with your potential site - Similar terms - Links (external and internal) - Popularity (sites that link to the site) Use data to - Build a keyword list - Build an introductory text Blurbs Description (2 sentences max) Any page found via a Web search engine should offer a search of the site itself Regularly monitor Search with your terms

18 Internal Search Robots.txt Log and analyze search results - Measure success and failure - Tune for click-through productivity - Keep list of terms - Match terms to pages Add terms Script terms to certain pages - Provide list (links) of most recent search terms - Provide list (links) of most popular search terms
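Logging and analyzing internal searches might look like the sketch below, which tallies popular terms, zero-result (failed) terms, and click-through per term. The log format, a list of `(query, result_count, clicked)` tuples, is an assumption, as are the names `summarize_search_log` and `search_log`; a real site would read these from its own search logs.

```python
from collections import Counter


def summarize_search_log(entries):
    """entries: iterable of (query, result_count, clicked) tuples."""
    popular = Counter()  # how often each term was searched
    failed = Counter()   # searches that returned no results
    clicks = Counter()   # searches where the user clicked a result
    for query, result_count, clicked in entries:
        term = query.strip().lower()
        popular[term] += 1
        if result_count == 0:
            failed[term] += 1
        if clicked:
            clicks[term] += 1
    click_through = {term: clicks[term] / count for term, count in popular.items()}
    return popular, failed, click_through


search_log = [
    ("Syllabus", 4, True),
    ("syllabus ", 4, False),
    ("parking", 0, False),
]
popular, failed, click_through = summarize_search_log(search_log)
print(popular.most_common(1), dict(failed), click_through)
```

The `popular` counter feeds the "most popular search terms" list directly, and the `failed` counter points at terms that should be matched to pages or added to content.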

19 Page Design Use CSS - Keep content in pages, not CSS templates Put JavaScript, etc. in external files - tag too for alternate content Continually verify external links ALT tags & Accessibility Compliance Index link on Splash page (if needed) Exact consistency on internal links (ending “/”s) Redirects http://www.newsite.com/index.html
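Checking for "exact consistency on internal links (ending “/”s)" can be automated with a few lines. The sketch below groups links that differ only by a trailing slash and reports any group with more than one spelling; `inconsistent_links` and the sample hrefs are hypothetical, and collecting the links from real pages is left out.

```python
from collections import defaultdict


def inconsistent_links(hrefs):
    """Return {normalized_path: [spellings]} for links that differ only by a trailing '/'."""
    groups = defaultdict(set)
    for href in hrefs:
        groups[href.rstrip("/")].add(href)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


site_links = ["/about/", "/about", "/contact/", "/contact/"]
problems = inconsistent_links(site_links)
print(problems)
```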

20 Search and MIME types Flash now supports internal text PDF files - Add comments and authorship info - Modify existing PDFs Check Document Properties > Fonts: a PDF that lists fonts can be indexed (it is not just a group of graphics files) - Provide text abstract or summary of PDF PPT, use text if possible Java interfaces prove difficult Dynamic pages should have key(word) static elements FORMs not always completely indexed

21 Track your Tracking Keep list of sites submitted to - When, Who, Email address, exact URL submitted - Suggested categories, Current site description - Terms and Conditions Keep list of “goal” keywords Keep list of sites where you check keywords - Keywords - Dates - Successes/Failures

22 Assignment Overview & Scheduling Leading WIRED Topic Discussions - # in class = # of weeks left? Web Information Retrieval System Evaluation & Presentation - 5 page written evaluation of a Web IR System - technology overview (how it works) - a brief history of the development of this type of system (why it works better) - intended uses for the system (who, when, why) - (your) examples or case studies of the system in use and its overall effectiveness

23 How can (Web) IR be better? - Better IR models - Better User Interfaces More to find vs. easier to find Scriptable applications New interfaces for applications New datasets for applications Projects and/or Papers Overview

