Web Characterization Week 11 LBSC 690 Information Technology.


1 Web Characterization Week 11 LBSC 690 Information Technology

2 The Why of the Web (in 1995)
Affordable storage
  – 300,000 words/$
Adequate backbone capacity
  – 25,000 simultaneous transfers
Adequate “last mile” bandwidth
  – 1 second/screen
Display capability
  – 10% of US population
Effective search capabilities
  – Lycos, Yahoo

3 Defining the Web
HTTP, HTML, or URL?
Static, dynamic or streaming?
Public, protected, or internal?
Content or behavior?

4 Number of Web Sites

5 Discussion Topic: What’s a Web “Site”?
OCLC counted any server at port 80
  – Misses many servers at other ports
Some servers host unrelated content
  – Geocities
Some content requires specialized servers
  – rtsp

6 Crawling the Web

7 Link Structure of the Web

8 Web Crawl Challenges
Discovering “islands” and “peninsulas”
Duplicate and near-duplicate content
  – 30-40% of total content
Server and network loads
Dynamic content generation
Link rot
  – Changes at 1% per week
Temporary server interruptions
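To get a feel for the 1% per week figure quoted under link rot, the sketch below compounds it over time. It assumes the rate is constant and applies independently each week, which the slide does not state; the function names are made up for illustration.

```python
import math

# Assumption: a constant 1% of links change every week, independently.
WEEKLY_CHANGE_RATE = 0.01


def fraction_unchanged(weeks: float, rate: float = WEEKLY_CHANGE_RATE) -> float:
    """Fraction of links still unchanged after the given number of weeks."""
    return (1.0 - rate) ** weeks


def weeks_until_fraction(fraction: float, rate: float = WEEKLY_CHANGE_RATE) -> float:
    """Number of weeks until only `fraction` of the links remain unchanged."""
    return math.log(fraction) / math.log(1.0 - rate)


if __name__ == "__main__":
    print(f"Unchanged after one year: {fraction_unchanged(52):.1%}")
    print(f"Weeks until half have changed: {weeks_until_fraction(0.5):.1f}")
```

Under these assumptions, a crawl that is a year old would already be out of date for a large share of its links, which is one reason crawlers revisit pages.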

9 Duplicate Detection
Structural
  – Identical directory structure (e.g., mirrors, aliases)
Syntactic
  – Identical bytes
  – Identical markup (HTML, XML, …)
Semantic
  – Identical content
  – Similar content (e.g., with a different banner ad)
  – Related content (e.g., translated)
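Two of the levels above can be sketched in a few lines. Identical bytes can be detected by comparing hashes, and "similar content" can be approximated by comparing overlapping word sequences (shingles) with Jaccard similarity. The slide does not say which techniques were meant; shingling is a swapped-in, commonly used choice, and the sample pages are invented.

```python
import hashlib


def byte_fingerprint(data: bytes) -> str:
    """Syntactic check: identical bytes always give identical digests."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Invented example: same page body, different banner ad.
page_a = "Welcome to the library. Opening hours are nine to five. Banner: buy books"
page_b = "Welcome to the library. Opening hours are nine to five. Banner: buy music"

if __name__ == "__main__":
    same_bytes = byte_fingerprint(page_a.encode()) == byte_fingerprint(page_b.encode())
    similarity = jaccard(shingles(page_a), shingles(page_b))
    print(f"identical bytes: {same_bytes}, shingle similarity: {similarity:.2f}")
```

A crawler could treat pages above some similarity threshold as near-duplicates; the threshold itself is a tuning choice that this sketch leaves open.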

10 Robots Exclusion Protocol
Requires voluntary compliance by crawlers
Exclusion by site
  – Create a robots.txt file at the server’s top level
  – Indicate which directories not to crawl
Exclusion by document (in HTML head)
  – Not implemented by all crawlers
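A minimal sketch of site-level exclusion from the crawler's side, using Python's standard `urllib.robotparser`. The robots.txt content, the crawler name `ExampleCrawler`, and the example.org URLs are all invented for illustration. Compliance is voluntary, so the check only has an effect if the crawler chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as it might be served from the server's top level.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before fetching each URL.
allowed = parser.can_fetch("ExampleCrawler", "http://example.org/index.html")
blocked = parser.can_fetch("ExampleCrawler", "http://example.org/private/notes.html")

if __name__ == "__main__":
    print("index.html allowed:", allowed)
    print("private/notes.html allowed:", blocked)
```

Document-level exclusion works differently: the page carries a robots `meta` element in its HTML head, and the crawler has to fetch and parse the page before it can honor it.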

11 Hands on: The Internet Archive
alexa.com Web crawls since 1997
  – http://archive.org
Check out the CLIS Web site from 1998!
Check out the history of your favorite site

12 Discussion Point
Can we save everything?
Should we?
Do people have a right to remove things?

13 The “Deep Web”
Dynamic pages, generated from databases
Not easily discovered using crawling
Perhaps 400-500 times larger than surface Web
Fastest growing source of new information

14

15 Content of the Deep Web

16 Deep Web: 60 Deep Sites Exceed Surface Web by 40 Times
Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
Alexa | Public (partial) | http://www.alexa.com/ | 15,860
Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
MP3.com | Public | http://www.mp3.com/ |

17 Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm

18 Native speakers, Global Reach projection for 2004 (as of Sept, 2003) Global Internet Users

19 Native speakers, Global Reach projection for 2004 (as of Sept, 2003) Global Internet Users

20 World Trade in 2001 Source: World Trade Organization

21 European Web Content Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

22 Blogs Doubling
18.9 million weblogs tracked
Doubling in size approx. every 5 months
Consistent doubling over the last 36 months

23 Blue = Mainstream Media Red = Blog Challenge: Fight, or Embrace?

24 Daily Posting Volume
Events marked on the chart: Kryptonite Lock Controversy, US Election Day, Indian Ocean Tsunami, Superbowl, Schiavo Dies, Newsweek Koran, Deepthroat Revealed, Justice O’Connor, Live 8 Concerts, London Bombings, Katrina
1.2 million legitimate posts/day
Spam posts marked in red
On average, an additional 5.8% are spam posts
Some spam spikes as high as 18%

25

26 A Web of Speech?
Factor | Web in 1995 | Speech in 2005
Storage (words per $) | 300K | 1.5M
Internet Backbone (simultaneous users) | 250K | 30M
“Last Mile” (Download time) | 1 second (no graphics) | Streaming
Display Capability (Computers/US population) | 10% | 100%
Search Systems | Lycos, Yahoo |

27 Rethinking the Spoken Word
Speech is better for some things than writing
Spoken bits are as persistent as written bits
Storing speech costs 80 times more than storing text
  – Disk cost falls by a factor of 80 in ~16 years
If speech is searchable, we will keep lots of it

28 A Little Math
Collectable spoken words ≈ 10 Tw/day
  – 1 billion users * 100 words/min * 200 min/day / 2
Compressed speech ≈ 2 words/kiloByte
  – (100/60 w/sec) / (6.5 kb/sec / 8 b/B)
Required storage ≈ 5 PetaBytes/day

29 A Little Math
Collectable spoken words ≈ 10 Tw/day
  – 1 billion users * 100 words/min * 200 min/day / 2
Compressed speech ≈ 2 words/kiloByte
  – (100/60 w/sec) / (6.5 kb/sec / 8 b/B)
Required storage ≈ 5 PetaBytes/day
Storage array sales > 5 PB/day
  – 457 PB in 2Q 2005 (increasing 59% per year)
$22/person/year (decreasing at 31%/year)
Source: IDC Worldwide Disk Storage Systems Tracker, 2Q 2005
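The arithmetic on this slide can be checked step by step. The sketch below uses only the slide's own figures. Two readings are assumptions: "kb" is taken as kilobits and a kiloByte as 1,000 bytes, and the second quarter is taken as 91 days. The slide does not explain the division by 2 in the first step, so it is carried through unchanged.

```python
# Step 1: collectable spoken words per day (figures from the slide).
users = 1_000_000_000
words_per_min = 100
minutes_per_day = 200
words_per_day = users * words_per_min * minutes_per_day / 2  # 1e13 = 10 Tw/day

# Step 2: density of compressed speech in words per kilobyte.
words_per_sec = 100 / 60          # 100 words/min
kbytes_per_sec = 6.5 / 8          # 6.5 kilobits/sec at 8 bits per byte
words_per_kbyte = words_per_sec / kbytes_per_sec  # roughly 2

# Step 3: storage needed per day, in petabytes (1 PB = 1e15 bytes).
bytes_per_day = words_per_day / words_per_kbyte * 1_000
petabytes_per_day = bytes_per_day / 1e15

# Step 4: compare with storage array sales for 2Q 2005.
DAYS_IN_QUARTER = 91              # assumption: April through June
sales_pb_per_day = 457 / DAYS_IN_QUARTER

if __name__ == "__main__":
    print(f"words/day:          {words_per_day:.3g}")
    print(f"words/kB:           {words_per_kbyte:.2f}")
    print(f"required PB/day:    {petabytes_per_day:.2f}")
    print(f"storage sold PB/day: {sales_pb_per_day:.2f}")
```

Both results land close to 5 PB/day, which matches the slide's point that storage sales already kept pace with the estimated volume of speech.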

30 Human History Oral Tradition Writing Human Future Writing and Speech

31 Hands On: Speech on the Web audio.search.yahoo.com blinkx.com ocw.mit.edu podcasts.net

32 View, Listen, Select, Print, Bookmark, Save, Purchase, Delete, Subscribe, Copy / paste, Quote, Forward, Reply, Link, Cite, Mark up, Tag, Publish, Organize, Type, Edit

33 Behavior Category: Examine, Retain, Reference, Annotate, Create
Behaviors: View, Listen, Select, Print, Bookmark, Save, Purchase, Delete, Subscribe, Copy / paste, Quote, Forward, Reply, Link, Cite, Mark up, Tag, Publish, Organize, Type, Edit

34 Minimum Scope: Segment, Object, Class
Behavior Category: Examine, Retain, Reference, Annotate, Create
Behaviors: View, Listen, Select, Print, Bookmark, Save, Purchase, Delete, Subscribe, Copy / paste, Quote, Forward, Reply, Link, Cite, Mark up, Tag, Publish, Organize, Type, Edit

35 Estimating Authority from Links Authority Hub
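The slide names only the two roles, authority and hub. One well-known way to score them is Kleinberg's hubs-and-authorities iteration (HITS), sketched below on the assumption that this is the idea the slide illustrates. The tiny graph and the function name are invented.

```python
import math


def hits(links: dict, iterations: int = 50):
    """Iteratively score pages as hubs and authorities.

    `links` maps each page to the list of pages it links to.
    A good authority is linked to by good hubs; a good hub links
    to good authorities.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: sum(hub[s] for s, targets in links.items() if n in targets)
                for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Invented graph: three hub-like pages pointing at two candidate authorities.
graph = {"h1": ["a", "b"], "h2": ["a", "b"], "h3": ["a"]}
hub_scores, auth_scores = hits(graph)

if __name__ == "__main__":
    print("top authority:", max(auth_scores, key=auth_scores.get))
    print("top hub:", max(hub_scores, key=hub_scores.get))
```

In this example page `a` ends up with the highest authority score because all three hubs link to it, while `h3` scores lower as a hub because it links to only one of the two authorities.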

36 Collecting Click Streams
Browsing histories are easily captured
  – Make all links initially point to a central site
      Encode the desired URL as a parameter
  – Build a time-annotated transition graph for each user
      Cookies identify users (when they use the same machine)
  – Redirect the browser to the desired page
Reading time is correlated with interest
  – Can be used to build individual profiles
  – Used to target advertising by doubleclick.com
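The capture steps above can be sketched without a real web server. In this sketch the central site `tracker.example`, the `url` parameter name, and the user identifier (standing in for a cookie value) are all hypothetical. A real deployment would answer each click with an HTTP redirect to the decoded target.

```python
from urllib.parse import urlencode, urlparse, parse_qs

TRACKER = "http://tracker.example/click"  # hypothetical central site


def tracked_link(target_url: str) -> str:
    """Rewrite a link so it points at the tracker, with the real URL as a parameter."""
    return f"{TRACKER}?{urlencode({'url': target_url})}"


def handle_click(request_url: str, user_id: str, timestamp: float, log: list) -> str:
    """Log the click and return the URL the browser should be redirected to."""
    target = parse_qs(urlparse(request_url).query)["url"][0]
    log.append((user_id, timestamp, target))
    return target


def transitions(log: list, user_id: str) -> list:
    """Time-annotated transitions for one user: (from_url, to_url, seconds between clicks)."""
    visits = sorted((t, url) for uid, t, url in log if uid == user_id)
    return [(a_url, b_url, b_t - a_t)
            for (a_t, a_url), (b_t, b_url) in zip(visits, visits[1:])]


if __name__ == "__main__":
    click_log = []
    handle_click(tracked_link("http://example.org/a?x=1"), "cookie-123", 0.0, click_log)
    handle_click(tracked_link("http://example.org/b"), "cookie-123", 42.0, click_log)
    print(transitions(click_log, "cookie-123"))
```

The gap between successive clicks is only a rough proxy for reading time, since the user may have been interrupted or may have left the page open.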

37 Search Engine Query Logs
A: Southeast Asia (Dec 27, 2004)
B: Indonesia (Mar 29, 2005)
C: Pakistan (Oct 10, 2005)
D: Hawaii (Oct 16, 2006)
E: Indonesia (Aug 8, 2007)
F: Peru (Aug 16, 2007)

38 Search Engine Query Logs http://hannu.biz/aolsearch/

39 AOL User 4417749

40 Gaining Access to Observations
Observe public behavior
  – Hypertext linking, publication, citing, …
Policy protection
  – EU: Privacy laws
  – US: Privacy policies + FTC enforcement
Statistical assurance of privacy
  – Distributed architecture
  – Model and mitigate privacy risks

41 [Figure: Full Text Articles (Telecommunications). Reading time (seconds) by rating: No Interest, Low Interest, Moderate Interest, High Interest. Values shown: 50, 32, 58, 43.]

42 More Complete Observations
User selects an article
  – Interpretation: Summary was interesting
User quickly prints the article
  – Interpretation: They want to read it
User selects a second article
  – Interpretation: Another interesting summary
User scrolls around in the article
  – Interpretation: Parts with high dwell time and/or repeated revisits are interesting
User stops scrolling for an extended period
  – Interpretation: User was interrupted

43 [Figure: Abstracts (Pharmaceuticals). Ratings: No Interest, Low Interest, Moderate Interest, High Interest. Values shown: 42, 55, 52, 51.]

44 Critical Issues
Protecting privacy
  – What absolute assurances can we provide?
  – How can we make remaining risks understood?
Scalable rating servers
  – Is a fully distributed architecture practical?
Non-cooperative users
  – How can the effect of spamming be limited?

