WWW

What is the Web?
Not the internet
Websites, pages on different computers linked via hyperlinks. An enormous graph.
No central planning: created by independent actions of millions
Sometimes front-ends of databases served via web pages, e.g., car search
Over 1 trillion unique indexed URLs: http://news.softpedia.com/news/Google-Reached-1-Trillion-Indexed-Pages-90864.shtml

History
1980-1991
– Tim Berners-Lee @ European Organization for Nuclear Research (CERN)
– Info for physicists: no uniformity, poor access
– by 1990: HTTP
– took a long time for anybody to pay attention
1992-1995
– Universities get on board
– Initially all text-based [Gopher, Lynx]

more History
1993 Mosaic @ UIUC/NCSA
→ graphical content capabilities, fueled rapid growth
1994
– Web organizations formed: W3C
– “Hot lists”: pages of organized bookmarks
– “Yet another hierarchical officious oracle” (Yahoo!)
1996-1998
– rapid commercialization, dawn of e-commerce
– browser wars: Netscape 80%; by 2001 Explorer 90%

even more History
1999-2001
– dot-com boom
– dot-com bust
2002-present
– shakeup, rise of giants: Amazon, Yahoo, Google, eBay, PayPal
– youth culture: MySpace, Facebook, Napster
– democratization of the web: blogging, Flickr, Wikipedia, YouTube, Twitter
– some of the major players are only a few years old!

Things to know
http(s)
url anatomy
hyperlinks
cookies
caches
plugins
applets
deep web
READ: www.internettutorials.net

Things to know
http(s)
– protocol followed by computers communicating and transferring web pages (the "s": securely)
url anatomy
– example with file path: http://admin.illinois.edu/policy/code/article1_part1_1-101.html
– with dbase query: http://illinois.edu/ricker/CampusMap?buildingID=43&target=displayHighlight
hyperlinks
– url at bottom of window or in address bar
– what happens when you click?
cookies
– sites store text information about you on your computer
– allow customized web sessions, but a privacy or security concern?
– can delete or refuse cookies
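To make the "url anatomy" item concrete, here is a minimal sketch that splits the database-query URL from the slide into its parts using Python's standard `urllib.parse` module. The variable names are illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the slide (the database-query form).
url = "http://illinois.edu/ricker/CampusMap?buildingID=43&target=displayHighlight"

parts = urlsplit(url)
query = parse_qs(parts.query)  # maps each parameter name to a list of values

print("protocol:", parts.scheme)   # the http(s) part
print("host:    ", parts.netloc)   # which computer to contact
print("path:    ", parts.path)     # which resource on that computer
print("query:   ", query)          # parameters handed to the database front-end
```

Everything after the `?` is the query string: a list of `name=value` pairs separated by `&`.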

Things to know
caches
– browser stores copies of what you’ve visited: images, text
– privacy/security/performance concerns
– can delete at expense of performance
plugins
– software extending the browser’s capabilities to view different types of content. Browsers come with some built in.
– Security: somebody else’s program running on your computer
applets
– program transmitted to your computer that you run
– security issues
deep web
– dynamic web pages, dbases, secure pages, multimedia

Crawling and storing
Web crawlers… how do they work?
Google recorded 1 trillion unique URLs
Back-of-envelope calculation for TEXT pages:
1 trillion pages x 10 KB/page = 10 trillion KB
= 10 quadrillion bytes
= 10 petabytes
= 10,000 terabytes
= 10,000 disks (at 1 TB per disk)
x $100/disk
= $1,000,000
Actual Google specs kept secret; estimates around 2006: 450,000 servers.
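The back-of-envelope estimate can be checked with a few lines of arithmetic. This is only a sketch under the slide's own assumptions; the 1 TB disk capacity is not stated on the slide but is implied by equating 10,000 terabytes with 10,000 disks.

```python
# Figures from the slide: 1 trillion pages, 10 KB of text per page, $100 per disk.
PAGES = 10**12
BYTES_PER_PAGE = 10 * 10**3       # 10 KB, using decimal (SI) units
DISK_CAPACITY_BYTES = 10**12      # ASSUMPTION: 1 TB per disk (implied, not stated)
COST_PER_DISK_USD = 100

total_bytes = PAGES * BYTES_PER_PAGE
petabytes = total_bytes // 10**15
terabytes = total_bytes // 10**12
disks = total_bytes // DISK_CAPACITY_BYTES
cost_usd = disks * COST_PER_DISK_USD

print(f"{total_bytes:,} bytes = {petabytes:,} PB = {terabytes:,} TB")
print(f"{disks:,} disks, about ${cost_usd:,}")
```

The estimate covers raw text storage only; it ignores images, indexes, and redundancy.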

Searching the web
What is Lenny Pitt’s phone number?
In database, find credit-worthy consumers
Find web pages relevant to “computer music”
Among cell phone conversations from country X, identify suspicious ones
Search all religion and philosophy books for the meaning of life
Thanks to Sanjeev Arora for this slide

Searching the web
What is Lenny Pitt’s phone number?
→ simple dbase lookup, simple keyword search
In database, find credit-worthy consumers
→ AI/learning problem
Find web pages relevant to “computer music”
Among cell phone conversations from country X, identify suspicious ones
Search all religion and philosophy books for the meaning of life
Thanks to Sanjeev Arora for this slide

Searching the web
Find “computer music”
– Search all pages for the phrase?
– Sort according to number of occurrences?
– Human staff answers questions?
Pitfalls
– Spamming by unscrupulous websites
– Synonyms
– Homographs [homonyms]
– Polysemes [bank = institution or building or verb?]
Thanks to Sanjeev Arora for this slide
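The "search for the phrase, sort by number of occurrences" idea and two of its pitfalls can be sketched on a toy corpus. The page names and texts below are made up for illustration; nothing here reflects how a real search engine ranks pages.

```python
def count_phrase(text: str, phrase: str) -> int:
    """Count case-insensitive, non-overlapping occurrences of phrase in text."""
    return text.lower().count(phrase.lower())

def naive_search(pages: dict[str, str], phrase: str) -> list[tuple[str, int]]:
    """Return (page, count) pairs for pages containing the phrase, most hits first."""
    hits = [(name, count_phrase(text, phrase)) for name, text in pages.items()]
    hits = [(name, n) for name, n in hits if n > 0]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical pages, invented for this example.
pages = {
    "music-dept.example": "Our computer music lab studies synthesis. "
                          "Computer music courses run each fall.",
    "spam.example": "computer music " * 50 + "buy cheap watches",
    "synth.example": "Electronic sound synthesis with software instruments.",
}

results = naive_search(pages, "computer music")
for name, n in results:
    print(name, n)
```

The spam page wins simply by repeating the phrase, and the relevant page that uses different wording (a synonym) is never found.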

Exploit Link Structure!
Example: Hubs and Authorities
Authorities (e.g., NYT) have lots of incoming links
Hubs have lots of outgoing links
Iterate:
– Authority score = sum of the hub scores of the pages linking to it
– Hub score = sum of the authority scores of the pages it links to
– Normalize so they don’t grow without bound
Compute these values at query time, only on most relevant documents.
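The hubs-and-authorities iteration can be sketched in a few lines. This is an illustrative version, not a production implementation: the tiny link graph is hypothetical (only "nyt" echoes the slide's example), and it uses the usual formulation in which a hub's score is summed over the pages it links to.

```python
import math

def hits(links: dict[str, list[str]], rounds: int = 20):
    """Return (hub, authority) score dictionaries after a fixed number of rounds."""
    pages = sorted(set(links) | {dst for outs in links.values() for dst in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(rounds):
        # Authority score: sum of the hub scores of pages that link to this page.
        auth = {p: 0.0 for p in pages}
        for src, outs in links.items():
            for dst in outs:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages this page links to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Normalize both score vectors so they do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical link graph: page -> pages it links to.
links = {
    "portal": ["nyt", "bbc"],
    "blogroll": ["nyt", "bbc", "portal"],
    "nyt": [],
    "bbc": ["nyt"],
}

hub, auth = hits(links)
print("best authority:", max(auth, key=auth.get))
print("best hub:      ", max(hub, key=hub.get))
```

In practice the graph passed in would be the small set of documents most relevant to the query, which is what makes computing this at query time feasible.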

Exploit Link Structure!
Example: PageRank [Google] http://en.wikipedia.org/wiki/PageRank
Ideas:
– PR = how many pages vote for you
– BUT!: some votes are more important (those from pages with higher PR)

[Figure: C has higher PageRank than E]

PageRank and Random Walks
Random Walk Method 1
– Choose equally from outgoing links.
– Walk for a long time
– PR(X) is the probability you end up at page X
Random Walk Method 2
– Choose equally from all pages on the web.
The PR algorithm is an interpolation between both methods: it models a random surfer who gets bored after several clicks and switches to a random page.
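The random-surfer picture can be simulated directly. The sketch below is illustrative: the four-page graph is hypothetical, and the 0.85 probability of following a link (rather than jumping to a random page) is an assumed value for how quickly the surfer "gets bored".

```python
import random

def random_surfer(links, steps=100_000, follow_prob=0.85, seed=0):
    """Estimate PR(X) as the fraction of steps the surfer spends on page X."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        outs = links[current]
        if outs and rng.random() < follow_prob:
            current = rng.choice(outs)     # Method 1: follow a random outgoing link
        else:
            current = rng.choice(pages)    # Method 2: jump to any page on the "web"
    return {p: n / steps for p, n in visits.items()}

# Hypothetical four-page web: page -> pages it links to.
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

freq = random_surfer(links)
for page, share in sorted(freq.items(), key=lambda item: -item[1]):
    print(page, round(share, 3))
```

Setting `follow_prob=1.0` gives pure Method 1, and `follow_prob=0.0` gives pure Method 2, in which every page is visited about equally often.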

Four pages: A, B, C, D
Links (read off from the equations below): A → B, C, D; B → A, C; C → A; D → A, B, C
Assume that A links to all other pages equally
Assume initially that PR of all pages is 1/N = .25
Compute new PR for each node: .15(old PR) + .85(“votes” from other nodes)
B’s votes: ½ of its PR to A, ½ to C.
Round 1: New PRs:
PR(A) = .15 PR(A) + .85 (PR(B)/2 + PR(C)/1 + PR(D)/3) = .15(.25) + .85(.25/2 + .25 + .25/3) = .4271
PR(B) = .15 PR(B) + .85 (PR(A)/3 + 0 + PR(D)/3) = .15(.25) + .85(.25/3 + .25/3) = .1792
PR(C) = .15(.25) + .85(.25/3 + .25/2 + .25/3) = .2854
PR(D) = .15(.25) + .85(.25/3) = .1083

Round 1: New PRs: PR(A) = .4271, PR(B) = .1792, PR(C) = .2854, PR(D) = .1083
Now substitute the new PRs into the equations to get newer PRs:
PR(A) = .15 PR(A) + .85 (PR(B)/2 + PR(C)/1 + PR(D)/3) = .15(.4271) + .85(.1792/2 + .2854 + .1083/3) = .4135
PR(B) = .15(.1792) + .85(.4271/3 + .1083/3) = .1786
Etc. Continue until values “settle down”.
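The hand calculation can be repeated mechanically. The sketch below applies the update rule exactly as the slides state it, .15(old PR) + .85(votes), to the four-page graph implied by the equations (A links to B, C, D; B to A, C; C to A; D to A, B, C). The function names and the stopping tolerance are assumptions made for illustration.

```python
# Link structure implied by the slide's equations: page -> pages it links to.
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

def update(pr, links, keep=0.15, pass_on=0.85):
    """One round: new PR = keep * old PR + pass_on * (votes from linking pages)."""
    new = {}
    for page in links:
        # Each linking page splits its PR equally among its outgoing links.
        votes = sum(pr[src] / len(outs) for src, outs in links.items() if page in outs)
        new[page] = keep * pr[page] + pass_on * votes
    return new

def settle(pr, links, tol=1e-9, max_rounds=1000):
    """Repeat rounds until no value changes by more than tol (assumed tolerance)."""
    for _ in range(max_rounds):
        new = update(pr, links)
        if max(abs(new[p] - pr[p]) for p in pr) < tol:
            return new
        pr = new
    return pr

start = {page: 1 / len(links) for page in links}   # every page starts at 1/N
round1 = update(start, links)
round2 = update(round1, links)
final = settle(start, links)

for label, values in (("round 1", round1), ("round 2", round2), ("settled", final)):
    print(label, {page: round(v, 4) for page, v in values.items()})
```

Since each page hands out all of its PR, the values should sum to 1 after every round, which is a handy check on hand arithmetic.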

A page's PageRank = 0.15/N + 0.85 * (a "share" of the PageRank of every page that links to it)
A link from a page with PR = .4 and 5 outbound links (.4/5 = .08) is worth more than a link from a page with PR = .8 and 100 outbound links (.8/100 = .008).
The PageRank of a page that links to yours is important, but the number of links on that page is also important. The more links there are on a page, the less PageRank value your page will receive from it.
Executed ahead of time (when indexing all documents).
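The formula above can be turned into a short iterative sketch. It is a simplified illustration, not Google's implementation: it assumes every page has at least one outgoing link, uses a hypothetical four-page graph, and the tolerance and round limit are arbitrary choices.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_rounds=1000):
    """Iterate PR(p) = (1 - damping)/N + damping * sum(PR(q)/outlinks(q) for q linking to p).

    ASSUMPTION: every page in `links` has at least one outgoing link.
    """
    n = len(links)
    pr = {page: 1.0 / n for page in links}
    for _ in range(max_rounds):
        new = {page: (1 - damping) / n for page in links}
        for src, outs in links.items():
            share = pr[src] / len(outs)        # src splits its PR among its links
            for dst in outs:
                new[dst] += damping * share
        if max(abs(new[p] - pr[p]) for p in links) < tol:
            return new
        pr = new
    return pr

# Hypothetical four-page web: page -> pages it links to.
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
ranks = pagerank(links)
print({page: round(value, 4) for page, value in ranks.items()})

# The slide's comparison: value passed along by a single link.
share_small_page = 0.4 / 5     # PR .4 split over 5 outbound links
share_big_page = 0.8 / 100     # PR .8 split over 100 outbound links
print(share_small_page > share_big_page)
```

Because the computation depends only on the link graph and not on any query, it can be run once at indexing time and the results stored.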

In praise of XML
HTML example: Introduction
– Tags specify formatting information, not content.
XML example: Introduction, Peg Babcock
– Tags specify content. A separate file gives formatting information for the various content areas.
Advantages:
→ Helps search engines
→ Easy to make global format changes
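A small sketch of why content tags help a program such as a search engine: once the markup says what each piece of text is, it can be extracted by name. The tag names `document`, `title`, and `author` below are hypothetical, chosen only for illustration; the slide's original markup is not shown in this transcript.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: the tag names are illustrative, not taken from the slide.
xml_source = """
<document>
  <title>Introduction</title>
  <author>Peg Babcock</author>
</document>
""".strip()

root = ET.fromstring(xml_source)
title = root.findtext("title")
author = root.findtext("author")

print("title: ", title)
print("author:", author)
```

With formatting-only markup, a program would see the same two strings but could not tell which one is the title and which is the author.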
