About Alexander Street Press
Founded in 2000 by executives who previously worked for Chadwyck-Healey, SilverPlatter, Wolters Kluwer, Gale and Wilson.
Headquartered just outside Washington, DC, USA
Offices in Stevenage, England; Shanghai, China; Kuala Lumpur, Malaysia; Sydney, Australia; Brazil; New Zealand
3,000 customers
2,500 licensors
The Challenge
By 2020 the web will have
> 5 Bn users (currently 2.3 Bn, 37% of the world)
> 90% of published works prior to 1923
> Most works published to 2020
> 4 Billion websites (currently 555m, 71% growth p.a.)
> 1 Trillion photographs (Facebook adds 300m daily)
> 100 Million pages of facsimiles of manuscripts
> 100 Million audio files
> 1 Billion video files (YouTube adds 72 hrs every minute)
Preservation and Access
More than 6,500 endangered languages
Countless cultural artifacts, audio, video, texts
Hidden collections
(Personal) archives
Field notes
Data sets
Little or no cataloging
Mostly undigitized
Decaying film and audio formats
Increasing opportunities to embellish (HD video, 3-D models, social annotation, etc.)
The nature of virtual space…
"You must consult the laws of nature… you say, 'What do you want, brick?' And the brick says to you, 'I like an arch.' And you say to brick, 'Look, I want one too, but arches are expensive…' Brick says, 'I like an arch…' Honor the material you use."
Louis Kahn (1979)
Understanding the medium
Steel – High cost to create, strong, easy to stamp shapes, medium weight…
Wood – Low cost to create, moderately strong, needs to be crafted, light weight…
Glass – Medium cost to create, weak, easy to craft, transparent
The Web – ?
Nature of electronic publications
Atomic
Interconnected
Interdependent
The link matters more than the object
Pliable
Evolving quickly
Unlimited in size
Page
Understanding the medium
0111010011010000101101101000101110100010001110
1010101010101010111110101010101011111010111001
00011101
Binary
Machine Code
Assembly Code
Programming languages: C++, Perl, VB, etc…
Understanding the medium
Communications Protocols – TCP/IP, Modems
Display Standards – Super VGA
Font Standards – PostScript
Plug-in Standards – Java
Browser Standards – IE 7.0
Document Formats – PDF
Mark-up Standards – SGML, XML, HTML
Image Standards – JPG, TIFF, etc.
Understanding the medium
Phone standards – 3G, 4G, 5G
Foursquare
Twitter – local, custom, news
Network protocols – 801
Map standards – Google Maps, Open Map
iOS, Android
Devices – Nook, Kindle, iPad
Video standards – H.264, Silverlight, Flash
Evolving quickly
On current trends…
Processing speed – by 2015 machines will be 4 times more powerful than today's
Storage space – by 2015, 20 Terabytes of storage (8 Bn pages) will cost under $100
More than 90% of the developed world will have Web access
Significant improvements in the developing world
Phone bandwidth > 1.5 Mb/s
Where we're headed…
Who, What, When, Where? Therefore Why?
After "Data, Information, Knowledge, and Wisdom", Gene Bellinger, Durval Castro, Anthony Mills. http://www.systems-thinking.org/
Understanding electronic products
Value in the electronic world is about...
"The manner in which or the efficiency with which something reacts or fulfills its intended purpose"
Webster's Unabridged
What do we need to do?
Comprehensive - everything on the network
Everyone on the network
Local and personal (unique verified identity)
Ubiquitous access (everywhere, all devices)
High quality (peer review)
Workflow integration and analysis (deep links to relevant content and tools)
Maximize efficiencies (easy ingestion and dissemination)
Real time currency
Devices Inbound Discovery Quality Bandwidth Encodes # of pixels Sampling Tools Transcripts Subtitles Chaptering Translation Usage Stats Permissions Privacy Permissions Anonymity Shibboleth Indexing MARC Semantic Controlled vocabularies Outbound Discovery API Harvesting Promotion Conferences AdSense E-mail Mailings Ingestion Scanning Uploading Data Crosswalking Community Peer Review Crowdsource Annotation Playlists Producing Filming Recording Licensing Writing Commissioning
Evolution of tasks Fading Growing Typesetting Printing Compiling Directories Simple, One database Search Rare and unpublished material Inbound discovery Republishing public domain Process integration Workflow tools & apps Warehousing Community Building Outbound discovery Automated ingestion and tagging Human tagging Permissions
Evolution of tasks Fading Growing Typesetting Printing Compiling Directories Simple, One database Search Rare and unpublished material Inbound discovery Licensing? Republishing public domain Process integration Workflow tools & apps Warehousing Community Building Outbound discovery Automated ingestion and tagging Human tagging Commissioning? Editorial? Quality? Selection? Permissions Marketing?
The strain on keyword search…
Questions
Google: Martin Luther King – 8.3m hits (2005), 32.5m (2012)
Google Scholar: 202k hits, options to restrict:
  Article
  Legal document
  Date range (year published)
  Patent or Citation
Semantic Indexing
Collection, Series, Book or Volume, Chapter, Page, Word
Where? When? What? Who?
Traditional indexing > Semantic indexing >
Increases in Utility
Access: Do you have the book titled…
Keyword Search: All mentions of Star Wars
Fielded Search: All mentions of Star Wars in texts about Reagan published in 1985
Semantic Search: All mentions of Star Wars by Reagan in speeches he delivered in 1985
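The difference between the keyword, fielded and semantic levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Passage` record, its field names and the three sample passages are invented for the example (the texts are not real quotations), and a real system would use a proper search engine rather than list filtering.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str          # the words themselves (what keyword search sees)
    doc_subject: str   # document-level field: what the text is about
    doc_year: int      # document-level field: year published
    speaker: str       # semantic tag: who is speaking in this passage
    genre: str         # semantic tag: kind of content element


# Invented sample data, for illustration only.
PASSAGES = [
    Passage("Star Wars returns to theaters.", "Film", 1985, "Critic", "review"),
    Passage("Critics call the Star Wars program costly.", "Reagan", 1985, "Journalist", "article"),
    Passage("Our Star Wars research will continue.", "Reagan", 1985, "Reagan", "speech"),
]


def keyword_search(passages, phrase):
    """All mentions of the phrase, regardless of context."""
    return [p for p in passages if phrase.lower() in p.text.lower()]


def fielded_search(passages, phrase, subject, year):
    """Mentions restricted by document-level fields (subject, year)."""
    return [p for p in keyword_search(passages, phrase)
            if p.doc_subject == subject and p.doc_year == year]


def semantic_search(passages, phrase, speaker, genre, year):
    """Mentions restricted by who said it and in what kind of element."""
    return [p for p in keyword_search(passages, phrase)
            if p.speaker == speaker and p.genre == genre and p.doc_year == year]


keyword_hits = keyword_search(PASSAGES, "Star Wars")
fielded_hits = fielded_search(PASSAGES, "Star Wars", "Reagan", 1985)
semantic_hits = semantic_search(PASSAGES, "Star Wars", "Reagan", "speech", 1985)
```

Each level narrows the previous one: the keyword search returns every passage, the fielded search drops the film review, and the semantic search keeps only the passage tagged as a speech by Reagan.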
What is Semantic Indexing?
Identify and divide texts into content elements (e.g. letter, diary entry…)
Identify key concepts for these elements (e.g. authors, sources, battles, encounters…)
Index both elements and associated concepts
Integrate to form a cohesive whole
Unique ways of browsing through concepts
Unique ways to ask questions
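The first three steps (divide into elements, identify concepts, index both) might look roughly like the sketch below. It rests on simplifying assumptions: elements are taken to be blank-line-separated blocks, concepts are found by matching a small hypothetical controlled vocabulary (`VOCAB`), and the sample text is invented. In practice the tagging is editorial work, not substring matching.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: concept type -> known terms.
VOCAB = {
    "cultural_group": ["Huron", "Iroquois"],
    "topic": ["trade", "battle"],
}


def split_elements(text):
    """Step 1: divide a text into content elements (blank-line separated)."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def tag_concepts(element):
    """Step 2: identify concepts by matching vocabulary terms in the element."""
    lowered = element.lower()
    return {
        (concept_type, term)
        for concept_type, terms in VOCAB.items()
        for term in terms
        if term.lower() in lowered
    }


def build_index(text):
    """Step 3: index the elements and map each concept to element ids."""
    elements = dict(enumerate(split_elements(text)))
    index = defaultdict(set)
    for element_id, element in elements.items():
        for concept in tag_concepts(element):
            index[concept].add(element_id)
    return elements, index


# Invented sample text with three content elements.
SAMPLE = (
    "Letter: We met the Huron to trade furs.\n\n"
    "Diary entry: A battle with the Iroquois.\n\n"
    "Letter: The Huron declined to trade."
)

elements, index = build_index(SAMPLE)

# Browsing by concept is then a set intersection over element ids.
huron_trade = index[("cultural_group", "Huron")] & index[("topic", "trade")]
```

Because both elements and concepts are indexed, a question such as "which elements mention trade with the Huron" becomes an intersection of two concept entries rather than a free-text search.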
Semantic Indexing…
Encounter: Encounter Name, Cultural Groups, Estimated # of people, Start year, Start month, Start day, Location, Expedition, Encounter Type, Fatalities, etc…
Author: Name, Date of birth, Place of birth, Date of death, Place of death, Nationality, Religion, Sexual Orientation, Occupation, etc…
Source: Source, Editor/Translator, Original Language, Publisher, Publication Date, Publication Place, Subject of Work, etc…
Document: Text, Author ID, Encounter ID, Source ID, Date, Subject, Age writing, etc…
Semantic Indexing…
Encounter: Encounter Name, Cultural Groups, Estimated # of people, Start year, month, day, Location, Expedition, Encounter Type, Fatalities, etc…
Author: Name, Date of birth, Place of birth, Date of death, Place of death, Nationality, Religion, Sexual Orientation, Occupation, etc…
Source: Source, Editor/Translator, Original Language, Publisher, Publication Date, Publication Place, Subject of Work, etc…
Document: Text, Author ID, Encounter ID, Source ID, Date, Subject, Age writing, etc…
Show me writings by Jesuits, originally written in French, that discuss trade involving the Huron.
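One way to picture how that question is answered is as a join across the four record types, with the Document record linking to Author, Encounter and Source by ID. The sketch below uses Python's built-in `sqlite3` module. The table and column names are a trimmed, hypothetical rendering of the fields listed on the slide, the rows are placeholders, and treating "Jesuit" as a value of the Occupation field is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, trimmed-down schema based on the slide's field lists.
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT, occupation TEXT);
CREATE TABLE source (id INTEGER PRIMARY KEY, title TEXT, original_language TEXT);
CREATE TABLE encounter (
    id INTEGER PRIMARY KEY, name TEXT, cultural_groups TEXT, encounter_type TEXT
);
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    text TEXT,
    author_id INTEGER REFERENCES author(id),
    encounter_id INTEGER REFERENCES encounter(id),
    source_id INTEGER REFERENCES source(id)
);

-- Placeholder rows, for illustration only.
INSERT INTO author VALUES (1, 'Author A', 'Jesuit'), (2, 'Author B', 'Trader');
INSERT INTO source VALUES (1, 'Source X', 'French'), (2, 'Source Y', 'English');
INSERT INTO encounter VALUES
    (1, 'Encounter 1', 'Huron', 'Trade'),
    (2, 'Encounter 2', 'Huron', 'Battle');
INSERT INTO document VALUES
    (1, 'Placeholder text 1', 1, 1, 1),
    (2, 'Placeholder text 2', 1, 2, 1),
    (3, 'Placeholder text 3', 2, 1, 2);
""")

# "Writings by Jesuits, originally written in French,
#  that discuss trade involving the Huron."
QUERY = """
SELECT d.id, d.text
FROM document AS d
JOIN author AS a ON a.id = d.author_id
JOIN source AS s ON s.id = d.source_id
JOIN encounter AS e ON e.id = d.encounter_id
WHERE a.occupation = ?
  AND s.original_language = ?
  AND e.encounter_type = ?
  AND e.cultural_groups LIKE ?
ORDER BY d.id
"""

rows = conn.execute(QUERY, ("Jesuit", "French", "Trade", "%Huron%")).fetchall()
```

Only document 1 satisfies all four conditions: document 2 describes a battle rather than trade, and document 3 is by a non-Jesuit author in an English source. None of the conditions touches the document text itself, which is why a keyword search could not answer the same question.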
Semantic Indexing…
More than a way to answer questions
A framework by which users can be guided to understand, explore, discover and learn.
A route-map to guide users through data, saving time and effort.
The intellectual fabric by which information should be organized…
Delivers answers to questions that cannot be asked elsewhere
Discipline specific
Oriented towards the user and the content
At the right level
Thoroughly controlled
Metadata should be open
Higher value linkages…
Loosely Held / Tightly Held
Free Websites
Loosely integrated / Tightly integrated
Refuse to License / License widely / License widely and be a Licensor
Higher value linkages…
Higher value links
Semantic indexing and keyword searching of more than 3,000 oral history collections.
Represents the personal histories of some 300,000 people.
Value:
– Context
– Selection
– Search Power
– Licensed material
– Integration
Building the network…
Unhelpful
  Legal warnings not to link
  Changing links constantly
  Disabling links
  No permanent URLs
  No crawling
  Randomly changing URLs
  Insisting on one interface and one access point
  Unattached pages
Helpful
  Visibility
  Permanent URLs
  RSS feeds
  OpenURL, Open Metadata
  Design for multiple interfaces
  Open to crawling
  Published open APIs
  Welcome linking
  Ask others to do the same
Women and Social Movements
Collaboration between the Center for the Historical Study of Women and Gender at SUNY Binghamton and ASP
Original site is free; new content is for fee.
Usage across the free site dipped only slightly; more usage following commercial launch.
Added video, audio, > 200k pages, new functionality.
Summary
We're engaged in a leviathan task
Money is needed
For-fee content can sit alongside open content
Publishers can help
Need for collaboration and openness
Where we're headed…
It will all be available in digital form
It will not cost too much
Many more people will use it
It will be enriched through better display, better integration, better links, better context, etc.
Good for publishers
Good for academics
Good for society