Vocabulary, Statistics, Time and Geography


1 Vocabulary, Statistics, Time and Geography
National Institute of Informatics Seminar, July 20, 2007
Fredric C. Gey
UC Data Archive & Technical Assistance, University of California, Berkeley (not available at NII, firewall)
Institute for Museum and Library Services Grants:
  Seamless search of textual and numeric databases ( )
  Going places in the catalog: Improved Geographic Access ( )
  What Where, When and Why: support for the learner ( )
  Bringing Lives to Light: Biography in Context ( )
  (last two within Electronic Cultural Atlas Initiative, International and Area Studies)
Colleagues: Michael Buckland, Ray Larson, Kim Carl, Jeanette Zerneke, host of students including Vivien Petras
Fredric C. Gey 9/20/2018 Vocabulary, Statistics, Time and Geography

2 Vocabulary, Statistics, Time and Geography
HETEROGENEOUS DIGITAL INFORMATION SEARCH
Current Search Technology (multiple independent searches without search aids)
[Diagram: a single QUERY surrounded by the resource types Numeric Statistical Databases, Bibliography, Full Text, Patents, Maps and other Geospatial data, Music and other media]

3 Searching Statistical Information: One problem statement
Numeric statistical information is often thinly documented or documented with specialized vocabularies:
  Census: welfare --> public assistance
  Foreign trade harmonized commodity classification: computer --> digital ADP machine
  Standard Industrial Classification (SIC codes): automobile --> motor vehicle
In search, the user’s ordinary language term is unlikely to match the limited technical vocabulary used to document the statistical resource.
Can we remedy this situation?

4 Searching Statistical Information: Evidence Poor Data
Compared to searching textual databases, numeric statistical information and its terminology are both evidence-poor and highly technical.
Statistical databases lack the rich set of textual clues which can identify items of numeric information.
However, if we can find a textual resource associated with the numeric data, we may be able to mine the text to improve numeric search.
This is possible for some numeric classification schemes.

5 Searching Statistical Information: Economic Classification Codes
Standard Industrial Classification (SIC) and North American Industry Classification System (NAICS) codes have been used to index trade magazines.
This provides a textual resource of hundreds of thousands of documents and millions of words associated with the numeric data.
Thus mappings can be made between the words from the magazine abstracts and the classifications.
User queries can be matched against the words and phrases most closely associated with a particular numeric data classification.
A ranked list of classifications can be displayed to the user in order to improve the search.
Harmonized commodity classifications can be searched using SIC codes as a search proxy.
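The mapping and ranking steps described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual method: the training records, the class labels, and the function names `build_entry_vocabulary` and `rank_classes` are all hypothetical, and the scoring rule (relative frequency of each query word within a class) stands in for whatever association measure was really used.

```python
from collections import Counter, defaultdict
import re

# Hypothetical training records: (abstract text, classification label).
# Texts and labels are invented for illustration; they are not real SIC data.
TRAINING = [
    ("new automobile models and car dealers report strong sales", "Motor vehicles"),
    ("car makers expand automobile assembly plants", "Motor vehicles"),
    ("lobster and crab landings rise for coastal fishing fleets", "Shellfish"),
    ("desktop computer shipments and laptop computer prices", "Digital ADP machines"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_entry_vocabulary(records):
    """Count how often each word occurs in abstracts indexed under each class."""
    word_class = defaultdict(Counter)  # word -> Counter of class labels
    class_totals = Counter()           # class label -> total words seen
    for text, label in records:
        for word in tokenize(text):
            word_class[word][label] += 1
            class_totals[label] += 1
    return word_class, class_totals


def rank_classes(query, word_class, class_totals):
    """Rank classes by the summed relative frequency of the query words."""
    scores = Counter()
    for word in tokenize(query):
        for label, count in word_class.get(word, {}).items():
            scores[label] += count / class_totals[label]
    return [label for label, _ in scores.most_common()]


word_class, class_totals = build_entry_vocabulary(TRAINING)
print(rank_classes("lobster", word_class, class_totals))
print(rank_classes("automobile", word_class, class_totals))
```

With this toy data, an ordinary-language query such as "lobster" leads to the class the indexers actually used, even though the word never appears in the class name. A word absent from the training abstracts yields an empty list, which is why a large manually indexed collection matters.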

6 SEARCHING UNFAMILIAR METADATA: PROBLEM STATEMENT
Numerous databases are indexed by structured metadata classifications.
Classification schemes are highly specialized.
As digital libraries multiply in size and diversity, there is a need for search engines for non-specialists.
Search engines should translate from ordinary language to specialized classifications.
In U.S. Import-Export Database: “computer” --> No result found

7 Vocabulary, Statistics, Time and Geography
SEARCHING UNFAMILIAR METADATA: EXAMPLES FROM FOREIGN TRADE IMPORTS-EXPORTS
U.S. Foreign Trade Imports-Exports
  Two CD-ROMs/month
  Classified by 16,000 numeric commodity codes in 12 digit hierarchy
  Search for ‘automobile’ or ‘computer’ yields no result
  Search for ‘Car’ yields “Railway or Tramway Stock, Etc.”
  Need to know classifications “Passenger Motor Vehicle” or “Digital ADP Machine w/CPU”
In U.S. Import-Export Database: “Car” --> “Railway or Tramway Stock”

8 Vocabulary, Statistics, Time and Geography
SEARCHING UNFAMILIAR METADATA: U.S. STANDARD INDUSTRIAL CLASSIFICATION SYSTEM
U.S. Standard Industrial Classification System (SIC)
  Used to classify and aggregate industrial activity in the U.S.
  Codes defined by Office of Management and Budget
  County Business Patterns reports annual employment, payroll, firm size by county, SIC code
In U.S. SIC System: “Lobster” --> “Nothing found”

9 SEARCHING UNFAMILIAR METADATA SOLUTION -- ENTRY VOCABULARY TECHNOLOGY
Maps between ordinary language and specialized classifications
Implemented using text categorization techniques
Requires training collections which have been manually indexed
Preserves and leverages investment in creation of complex classification structures
Can also be applied to the task of multi-lingual information access
In U.S. Patent Database: For: “Automobile” Try: 180/280

10 SEARCHING UNFAMILIAR METADATA: ENTRY VOCABULARIES CONSTRUCTED
Ordinary language to U.S. Patent Classification
Ordinary language to INSPEC thesaurus terms
Ordinary language to Library of Congress classification codes
Ordinary language to Standard Industrial Classification system
Ordinary language to NAICS with link to 1997 Economic Census
In U.S. SIC classification: For: “Lobster” Try: “Shellfish”

11 Vocabulary, Statistics, Time and Geography
HETEROGENEOUS DIGITAL INFORMATION SEARCH
Enhanced Search (augmented with Entry Vocabulary Module (EVM) Technology)
[Diagram: QUERY and QUERYplus, with Entry Vocabulary Modules EVMs, EVMp, EVMt, EVMg and EVMm placed among the resource types Numeric Statistical Databases, Bibliography, Full Text, Patents, Maps and other Geospatial data, Music and other media]

12 SEARCHING UNFAMILIAR METADATA -- CONCLUSIONS
Entry Vocabulary Technology proved valuable in multiple applications --
  Searching complex classification schemes
  Searching numeric data
BUT WE FOUND NUMERIC DATA IS INTERTWINED WITH PLACE
  NAICS Prototype
  Need to specify place to retrieve data
In Addition
  BOOKS ARE ALSO USUALLY ABOUT PLACE (e.g. History of Tulare County, California)
NEED UNIFIED SEARCH OF PLACE BETWEEN GENRES

13 Exogenous Research and Development
National Science Foundation grant on methods for text retrieval and document ranking -- turned into Cross-Language Information Retrieval research and evaluation participation
  TREC: English --> Chinese, English --> Arabic
  NTCIR: Asian language retrieval (Chinese-Japanese-Korean)
  CLEF: European language search and question-answering (English, German, Portuguese, Russian, Spanish)
DARPA Grant “Translingual Information Management Using Domain Ontologies” -- Hindi surprise language exercise 2003
CDL’s Counting California project (2000-present), unifying access to statistical data about California
Organization of GeoCLEF: evaluation of Geographic Information Retrieval from multilingual textual sources

14 UNIFIED SEARCH OF PLACE BETWEEN GENRES
Existing geographic search and display mechanisms for statistical data on the web
  University of Virginia (GeoStat Center Historical Census Browser)
  Syracuse University (Paul Bern)
  We created an additional one interfaced to the ECAI Time Map software, with link to Counting California
Exogenous research created prototype geographic search of news stories in time and space
  Hindi news stories (BBC news in Hindi)
  Russian news stories (Izvestia)
These prototypes interface to either text or numeric data, but not both genres

15 UNIFIED SEARCH OF PLACE BETWEEN GENRES (2)
The above prototypes interface to either text or numeric data
Ray Larson has a search interface for world library catalogs
  Uses the Z39.50 distributed search protocol
  Interfaced to the California Digital Library MELVYL catalog
We connected the two to produce demonstration interfaces to
  California counties and cities and towns (2000 census)
  US states and counties ( census)

16 Vocabulary, Statistics, Time and Geography
HETEROGENEOUS DIGITAL INFORMATION SEARCH
Gateway Search between Multiple Information Types
[Diagram: QUERY and QUERYplus, with Entry Vocabulary Modules EVMs, EVMp, EVMt, EVMg and EVMm placed among the resource types Numeric Statistical Databases, Bibliography, Full Text, Patents, Maps and other Geospatial data, Music and other media]

17 Vocabulary, Statistics, Time and Geography
HETEROGENEOUS DIGITAL INFORMATION SEARCH
Direct Mappings and Search Between Multiple Information Types
[Diagram: QUERY and QUERYplus, with Entry Vocabulary Modules EVMs, EVMp, EVMt, EVMg and EVMm placed among the resource types Numeric Statistical Databases, Patents, Bibliography, Full Text, Maps and other Geospatial data, Music and other media]

18 Vocabulary, Statistics, Time and Geography
New directions in research
Biography markup and search ( IMLS grant)
  To develop tools for editors, archivists and compilers of historical papers
    Emma Goldman papers
  To develop display in time/space to facilitate historical discovery
  Congressional Biography – automatic markup of place, date, time-range
<biog source="cong_dict" page_start="19" page_end="19">
  <name> ADAMS, JOHN QUINCY. </name>
  <text> Born in Braintree, now Quincy, Mass., July 11, When ten years of age, he accompanied his father to France ; and when fifteen, was private secretary to the American Minister in Russia. He was graduated at Harvard University in 1787 ; studied law in Newburyport, and settled in Boston. From 1794 to 1801 he was American Minister to Holland, England, Sweden, and Prussia. He was a Senator in Congress from 1803 to 1808 ; </text>
</biog>
  UK Archives Hub
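The automatic markup of dates and time-ranges mentioned on this slide could look roughly like the following Python sketch. It is an illustration only, not the project's tool: the `<date>` and `<time_range>` tag names and their attributes are hypothetical, the pattern covers only bare four-digit years and "from YYYY to YYYY" phrases, and place markup (which would need a gazetteer) is left out.

```python
import re

# Excerpt from the Congressional Biography entry shown on the slide.
TEXT = (
    "He was graduated at Harvard University in 1787 ; studied law in "
    "Newburyport, and settled in Boston. From 1794 to 1801 he was American "
    "Minister to Holland, England, Sweden, and Prussia. He was a Senator in "
    "Congress from 1803 to 1808 ;"
)

# One pattern with two alternatives: a "from YYYY to YYYY" range is tried
# first, so the years inside a range are not tagged again as single dates.
PATTERN = re.compile(r"\b[Ff]rom (\d{4}) to (\d{4})\b|\b(\d{4})\b")


def _tag(match):
    """Wrap a matched range or single year in a (hypothetical) markup tag."""
    if match.group(1):
        start, end = match.group(1), match.group(2)
        return f'<time_range start="{start}" end="{end}">{match.group(0)}</time_range>'
    return f'<date year="{match.group(3)}">{match.group(0)}</date>'


def mark_up_dates(text):
    """Return the text with years and year ranges wrapped in tags."""
    return PATTERN.sub(_tag, text)


print(mark_up_dates(TEXT))
```

Tagging ranges before single years keeps each span in exactly one element, which makes it easier to place events on a timeline later.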


