Genesis of the Open Directory Project Rich Skrenta January 21, 2003
March 1998 Work project was winding down Going up and down Sand Hill road trying to get a web-calendar startup funded Read Danny Sullivan’s report on Yahoo’s listing problems on Search Engine Watch
Idea for GnuHoo Yahoo seemed to be ignoring their core asset - the directory How could we build a competitor? Didn't want to pay an editorial staff –even a cheap one Tequila + Brainstorming = GnuHoo
Idea for GnuHoo Use volunteer editors to build a web directory like Yahoo’s Volunteers would do a better job than paid generalists, since they would be experts about their area & have a personal interest Restrict editors to sub-branches of the directory, to limit the harm they could do
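The sub-branch restriction can be pictured as a path-prefix check. This is only a sketch under assumptions: the function name `can_edit` and the idea that an editor's assignment is stored as a list of category paths are hypothetical, not taken from the original code.

```python
from typing import Iterable


def can_edit(assigned_branches: Iterable[str], category: str) -> bool:
    """Return True if `category` lies at or below one of the editor's branches.

    Paths are compared component by component, so an editor of
    "Computers/Software/Linux" cannot touch "Computers/Software/LinuxFoo".
    """
    target = category.strip("/").split("/")
    for branch in assigned_branches:
        prefix = branch.strip("/").split("/")
        if target[:len(prefix)] == prefix:
            return True
    return False
```

Comparing whole path components rather than raw string prefixes is a deliberate choice here, so that similarly named sibling categories stay out of reach.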
Original Goals Thought if we could reach 1,000 editors the directory would be successful Bootstrap problem was key - how to get the first 10,000 sites. The directory had to look “real” from Day 1 Figured we needed 1M sites for a competitive directory Original get-off-the-couch motivational goal: We told ourselves that if we could get a story in Wired out of the effort, it would be worth doing
“Seed” Problem Needed a hierarchy & 10,000 sites to launch the directory Briefly considered Dewey Decimal –good thing we didn’t, it’s not free –didn’t seem to fit the web Original GnuHoo hierarchy mirrored Usenet
Original Homepage Mock-up
ARTS: Movies, Television, Books...
RECREATION: Travel, Food, Outdoors, Humor...
BUSINESS: Jobs, Companies, Investing...
REFERENCE: Education, Libraries, Taxes...
COMPUTERS: Internet, Software, Hardware...
REGIONAL: US, Canada, UK, Australia, Belgium...
GAMES: Video, MUDs, Gambling...
SCIENCE: Engineering, Psychology, Physics...
HEALTH: Fitness, Medicine, Diseases...
SHOPPING: Autos, Clothing, Directories...
HOME: Kids, Houses, Consumers...
SOCIETY: People, Religion, Issues...
NEWS: Online, Media, Newspapers...
SPORTS: Baseball, Football, Skiing...
Category Bootstrapping Scanned URLs mentioned in newsgroups to find seed sites for the corresponding directory category This yielded something that looked pretty good at a casual glance …but a lot of the original seed URLs were bad sites or placed in the wrong category The first editor in a category simply had to delete or move the bad entries, which left behind a good category
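A rough sketch of the seeding idea, not the original scanner: pull URLs out of newsgroup posts and file each one under the category mapped to that newsgroup. The regular expression, the `group_to_category` mapping, and the function name are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Loose URL pattern; an assumption, not the pattern the original scanner used.
URL_RE = re.compile(r"https?://[^\s<>\"')]+")


def seed_categories(
    posts: Iterable[Tuple[str, str]],
    group_to_category: Dict[str, str],
) -> Dict[str, List[str]]:
    """Map each category to the URLs mentioned in its newsgroup.

    `posts` yields (newsgroup, body) pairs. URLs are ordered by how often
    they were mentioned, most frequent first.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for newsgroup, body in posts:
        category = group_to_category.get(newsgroup)
        if category is None:
            continue  # no matching directory category for this group
        for url in URL_RE.findall(body):
            # Drop sentence punctuation that got glued onto the URL.
            counts[category][url.rstrip(".,;:!?")] += 1
    return {cat: [u for u, _ in c.most_common()] for cat, c in counts.items()}
```

As the slide notes, output like this only looks good at a casual glance; nothing here judges site quality, which is why the first editor in each category still had cleanup to do.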
Coding & Launch Coded from April-June, 1998 Perl cgi and flat files Simple HTML forms to add/edit/delete websites in the directory Web pages served from static HTML files in a directory tree HTML files regenerated whenever an edit was made
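The "static files regenerated on every edit" approach might look roughly like the sketch below. It is written in Python rather than the Perl CGI the slide mentions, and the page layout, the `index.html` file name, and the function names are assumptions.

```python
import html
import os
from typing import Dict, List


def render_category(category: str, sites: List[Dict[str, str]]) -> str:
    """Build a complete HTML page for one category."""
    items = "\n".join(
        '<li><a href="{u}">{t}</a> - {d}</li>'.format(
            u=html.escape(s["url"], quote=True),
            t=html.escape(s["title"]),
            d=html.escape(s["description"]),
        )
        for s in sites
    )
    title = html.escape(category)
    return (
        "<html><head><title>{0}</title></head>"
        "<body><h1>{0}</h1>\n<ul>\n{1}\n</ul></body></html>\n"
    ).format(title, items)


def regenerate(root: str, category: str, sites: List[Dict[str, str]]) -> str:
    """Rewrite the static page for `category` under `root`; return its path."""
    path = os.path.join(root, *category.split("/"), "index.html")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(render_category(category, sites))
    return path
```

Calling `regenerate()` from the edit form handler means readers are always served plain files, so the web server does no per-request work.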
Simple Flat File Format
u:
t: NewHoo!
d: The largest human-edited directory of the web
c: Computers/Internet/Web_Directories
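A minimal parser for records of this shape, as a sketch only. The slide shows the four keys `u`, `t`, `d`, `c`; everything else here is an assumption: that each field sits on its own line, that records are separated by blank lines, and the long field names chosen for the output.

```python
from typing import Dict, Iterator

# Long names are illustrative; the slide only shows the one-letter keys.
FIELD_NAMES = {"u": "url", "t": "title", "d": "description", "c": "category"}


def parse_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per record from 'key: value' lines.

    Assumes blank lines separate records; unknown keys are ignored.
    """
    record: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so URLs keep their own colons.
        key, sep, value = line.partition(":")
        if sep and key in FIELD_NAMES:
            record[FIELD_NAMES[key]] = value.strip()
    if record:
        yield record
```

For example, feeding it the record from the slide with a placeholder URL such as `http://example.com/` (hypothetical, since the slide's URL did not survive) yields a single dict with `url`, `title`, `description`, and `category` keys.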
Minimalist Design Minimal locking, last-writer-wins semantics –flock() only used for category counts Write-with-append, rename() only safe operations No big database A few DBM files for minor stuff
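The three file operations named on this slide can be sketched as follows. This is an illustration in Python on a Unix-like system, not the original Perl; the function names and the file layout are assumptions.

```python
import fcntl
import os
import tempfile


def append_record(path: str, record_text: str) -> None:
    """Write-with-append: add a record to the end of a flat file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record_text.rstrip("\n") + "\n\n")


def replace_file(path: str, new_contents: str) -> None:
    """Write a temp file in the same directory, then rename it into place.

    Readers see either the old file or the new one, never a partial write.
    Concurrent writers are not coordinated: the last rename wins.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(new_contents)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def bump_count(path: str, delta: int) -> int:
    """Update a category count under an exclusive flock(); return the new value."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            raw = f.read().strip()
            count = (int(raw) if raw else 0) + delta
            f.seek(0)
            f.truncate()
            f.write(str(count))
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return count
```

Only the counter takes a lock because it is a read-modify-write; a lost update there would make the number drift, whereas a lost listing edit under last-writer-wins is simply redone by the next editor.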
Coding & Launch Used publicly-available software for keyword search of the directory: Originally Glimpse, later Isearch First ran on BSDI, later moved to Linux –filesystem progression: ufs, ext2, vxfs Launched June 5, 1998 Acquired by Netscape in October, 1998
Early Press was Key to Growth About 1% of the visitors to NewHoo applied to become editors Some fraction of those would be accepted The more traffic we got, the more editors we would get We grubbed around for any hits we could in the beginning Initial Slashdot, Netly, Wired, Red Herring stories were vital traffic sources No matter what the story said, “Just spell our URL right”
Social Design of NewHoo Not a free-for-all links page - every editor had to apply & be approved Every edit logged and possible to undo Hierarchy of editors, with senior ones keeping an eye on the new ones Emergent editing guidelines, enforced with peer review
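"Every edit logged and possible to undo" can be illustrated with a small in-memory log that stores the state before and after each change. This is a hypothetical sketch; the class and field names are assumptions, and the real system kept its data in flat files.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Listing = Dict[str, str]


@dataclass
class Edit:
    editor: str
    url: str
    before: Optional[Listing]  # None means the listing did not exist
    after: Optional[Listing]   # None means the listing was deleted


class EditLog:
    def __init__(self) -> None:
        self.listings: Dict[str, Listing] = {}
        self.log: List[Edit] = []

    def apply(self, editor: str, url: str, after: Optional[Listing]) -> None:
        """Record the change, then add, update, or delete the listing."""
        before = self.listings.get(url)
        self.log.append(Edit(
            editor,
            url,
            dict(before) if before is not None else None,
            dict(after) if after is not None else None,
        ))
        if after is None:
            self.listings.pop(url, None)
        else:
            self.listings[url] = dict(after)

    def undo(self, index: int, editor: str) -> None:
        """Restore the state from before log entry `index`.

        The undo is itself logged as a new edit, so it can be reviewed
        and undone like any other change.
        """
        edit = self.log[index]
        self.apply(editor, edit.url, edit.before)
```

Logging the undo as an ordinary edit keeps the history append-only, which suits senior editors reviewing what newer ones have done.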
Why Did You Apply to be a NewHoo Editor? “There is a link to my old warwick uni account that has been dead for two years. As editor I could change it.”
Why Did You Apply to be a NewHoo Editor? I’m already building Linux indexes and sites, better to have them all nicely integrated in computers/software/linux
Why Did You Apply to be a NewHoo Editor? We already maintain a site called CoinLink which lists over 800 coin related sites. We know the coin industry and could easily assist in building and maintaining this section of the index.
Why Did You Apply to be a NewHoo Editor? You have no category in Recreation/Collecting that focuses on Christmas ornament collecting. Ornament collecting is one of the fastest growing hobbies. I've collected ornaments for 25 years and feel I know many of the "best" web sites dealing with this subject.
Motivations to Edit Same urge that makes you straighten a crooked picture you see on the wall People were maintaining link lists on their own manually; they could do so more easily with NewHoo’s web forms Didn’t need to see the whole directory finished to have their category be useful …but knowing they were helping to build the pyramid was a warm fuzzy
Directory Editing is Amenable to Incremental Effort First editor finds a good site and adds it Second fixes a typo in the description Third editor moves it to a more appropriate category Fourth editor later notices the site moved and fixes the URL Not as hard as writing device drivers; many can help If you ask too much, results fall off quickly
The Free Use License Netscape offered the data from the ODP under a free-use license Directory data was adopted by Lycos, AltaVista, Google and other search engines Only requirement was that the Add URL link point back to dmoz.org –helped keep dmoz authoritative & prevent forks
GnuHoo -> NewHoo -> ODP FSF objected to the “Gnu” Yahoo objected to the “Hoo” Netscape renamed it to the Open Directory Project and hosted it on directory.mozilla.org That address was too long to type, so we shortened it to dmoz.org
Robozilla Lloyd Tabb wrote a crawler to visit every site in the ODP to see if it was 404/301/302 Didn’t take action on its own, but alerted editors to potentially bad or moved sites Brought bad sites in the ODP down to 0.25% Our crawl of Yahoo showed 8% bad links
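A sketch of a Robozilla-style check, not Lloyd Tabb's actual crawler: request each listed URL, and report the ones that answer 404, 301, or 302 without changing anything. The function names are assumptions, and the extra "unreachable" label for network failures is an addition that the slide does not mention.

```python
import http.client
from typing import Callable, Iterable, List, Tuple
from urllib.parse import urlsplit

# Status codes the slide says the crawler looked for.
FLAGGED = {404: "not found", 301: "moved permanently", 302: "moved temporarily"}


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the raw status, without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def flag_sites(
    urls: Iterable[str],
    get_status: Callable[[str], int] = fetch_status,
) -> List[Tuple[str, str]]:
    """Return (url, reason) pairs for editors to review; nothing is modified."""
    report = []
    for url in urls:
        try:
            status = get_status(url)
        except (OSError, http.client.HTTPException):
            # Assumption: treat network failures as worth a human look too.
            report.append((url, "unreachable"))
            continue
        if status in FLAGGED:
            report.append((url, FLAGGED[status]))
    return report
```

Passing `get_status` in as a parameter lets the reporting logic be exercised with canned status codes instead of live requests. Like the tool described on the slide, the sketch only produces a report and leaves the decision to editors.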
“That’s a Problem We Want to Have” Design decisions were made in the interest of expediency. Why invest more time in the infrastructure if the site never takes off? Still running much of the 1.0 code today, over 4 years later Zillions of flat files in a gigantic VXFS filesystem Were we wrong? No, I don’t think so.
The ODP Won 55,000 total editors, probably 10,000 active 3.4M sites, 460K categories Largest human-created taxonomy ever Several times larger than competitors Cited in 83 academic research papers (source: citeseer.nj.nec.com)
The ODP “Won” Everyone uses :-) …but directories no longer scale to the web for users: –small web: use a directory –big web: use keywords
“Lost Ark” Ending? The traffic & validation provided by Netscape was key to the ODP’s success Possible future: lost server in an ops farm What new idea can take the ODP to the next level?