2007.04.10 - SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.

1 2007.04.10 - SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00 pm Spring 2007 http://courses.ischool.berkeley.edu/i240/s07 Principles of Information Retrieval Lecture 19: DLs and GIR

2 2007.04.10 - SLIDE 2IS 240 – Spring 2007 Today Digital Libraries and IR Image Retrieval in DL From paper presented at the 1999 ASIS Annual Meeting More on Geographic Information Retrieval

3 2007.04.10 - SLIDE 3IS 240 – Spring 2007 UCB Digital Library Project: Research Agenda Funded by NSF/NASA/DARPA Digital Library Initiative (Phases I and II) ~1993-2004 Research agenda –Understand user needs. –Extend functionality of documents. “Enliven” legacy documents. –Improve access to information. –Scale to large systems. –Re-Invent Scholarly Information Access and Use

4 2007.04.10 - SLIDE 4IS 240 – Spring 2007 Testbed: An Environmental Digital Library Collection: Diverse material relevant to California’s key habitats. Users: A consortium of state agencies, development corporations, private corporations, regional government alliances, educational institutions, and libraries. Potential: Impact on state-wide environmental system (CERES)

5 2007.04.10 - SLIDE 5IS 240 – Spring 2007 The Environmental Library - Users/Contributors California Resources Agency, California Environment Resources Evaluation System (CERES) California Department of Water Resources The California Department of Fish & Game SANDAG UC Water Resources Center Archives New Partners: CDL and SDSC

6 2007.04.10 - SLIDE 6IS 240 – Spring 2007 The Environmental Library - Contents Environmental technical reports, bulletins, etc. County general plans Aerial and ground photography USGS topographic maps Land use and other special purpose maps Sensor data “Derived” information Collection data bases for the classification and distribution of the California biota (e.g., SMASCH) Supporting 3-D, economic, traffic, etc. models Videos collected by the California Resources Agency

7 2007.04.10 - SLIDE 7IS 240 – Spring 2007 The Environmental Library - Contents As of mid 1999, the collection represents about three quarters of a terabyte of data, including over 70,000 digital images, over 300,000 pages of environmental documents, and over a million records in geographical and botanical databases.

8 2007.04.10 - SLIDE 8IS 240 – Spring 2007 Botanical Data: The CalFlora Database contains taxonomical and distribution information for more than 8000 native California plants. The Occurrence Database includes over 300,000 records of California plant sightings from many federal, state, and private sources. The botanical databases are linked to our CalPhotos collection of California plants, and are also linked to external collections of data, maps, and photos.

9 2007.04.10 - SLIDE 9IS 240 – Spring 2007 Geographical Data: Much of the geographical data in our collection is being used to develop our web-based GIS Viewer. The Street Finder uses 500,000 Tiger records of S.F. Bay Area streets along with 70,000 records from the USGS GNIS database. California Dams is a database of information about the 1395 dams under state jurisdiction. An additional 11 GB of geographical data represents maps and imagery that have been processed for inclusion as layers in our GIS Viewer. This includes Digital Ortho Quads and DRG maps for the S.F. Bay Area.

10 2007.04.10 - SLIDE 10IS 240 – Spring 2007 Documents: Most of the 300,000 pages of digital documents are environmental reports and plans that were provided by California state agencies. This collection includes documents, maps, articles, and reports on the California environment including Environmental Impact Reports (EIRs), educational pamphlets, water usage bulletins, and county plans. Documents in this collection come from the California Department of Water Resources (DWR), California Department of Fish and Game (DFG), San Diego Association of Governments (SANDAG), and many other agencies. Among the most frequently accessed documents are County General Plans for every California county and a survey of 125 Sacramento Delta fish species.

11 2007.04.10 - SLIDE 11IS 240 – Spring 2007 Documents - cont. The collection also includes about 20 MB of full-text (HTML) documents from the World Conservation Digital Library. In addition to providing online access to important environmental documents, the document collection is the testbed for our Multivalent Document research.

12 2007.04.10 - SLIDE 12IS 240 – Spring 2007 Photographs: The photo collection includes 17,000 images of California natural resources from the state Department of Water Resources, several hundred aerial photos, 17,000 photos of California native plants from St. Mary's College, the California Academy of Science, and others, a small collection of California animals, and 40,000 Corel stock photos.

13 2007.04.10 - SLIDE 13IS 240 – Spring 2007 Testbed Success Stories LUPIN: CERES’ Land Use Planning Information Network –California County General Plans and other environmental documents. –Enter at Resources Agency Server, documents stored at and retrieved from UCB DLIB server. California flood relief efforts –High demand for some data sets only available on our server (created by document recognition). CalFlora: Creation and interoperation of repositories pertaining to plant biology. Cloning of services at Cal State Library, FBI

14 2007.04.10 - SLIDE 14IS 240 – Spring 2007 Research Highlights Documents –Multivalent Document prototype Page images, structured documents, GIS data, photographs Intelligent Access to Content –Document recognition –Vision-based Image Retrieval: stuff, thing, scene retrieval –Natural Language Processing: categorizing the web, Cheshire II, TileBar Interfaces

15 2007.04.10 - SLIDE 15IS 240 – Spring 2007 User Interface Paradigms: Multivalent Documents An approach to new document types and their authoring. Supports active, distributed, composable transformations of multimedia documents. Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents.

16 2007.04.10 - SLIDE 16IS 240 – Spring 2007 Multivalent Documents [Figure labels: Cheshire Layer, OCR Layer, OCR Mapping Layer, GIS Layer, Table Layer, Scanned Page Image, Network Protocols & Resources] Valence: 2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate). Webster’s 7th Collegiate Dictionary

18 2007.04.10 - SLIDE 18IS 240 – Spring 2007 GIS in the MVD Framework Layers are georeferenced data sets. Behaviors are –display semi-transparently –pan –zoom –issue query –display context –“spatial hyperlinks” –annotations Written in Java (to be merged with MVD-1 code line?)

19 2007.04.10 - SLIDE 19IS 240 – Spring 2007 GIS Viewer Example http://elib.cs.berkeley.edu/annotations/gis/buildings.html

20 2007.04.10 - SLIDE 20IS 240 – Spring 2007 Overview of Cheshire II The Cheshire II system is intended to provide an easy-to-use, standards-compliant system capable of retrieving any type of information in a wide variety of settings.

21 2007.04.10 - SLIDE 21IS 240 – Spring 2007 Overview of Cheshire II It supports SGML and XML. It is a client/server application. Uses the Z39.50 Information Retrieval Protocol. Server supports a Relational Database Gateway. Supports Boolean searching of all servers. Supports probabilistic ranked retrieval in the Cheshire search engine. Search engine supports “nearest neighbor” searches and relevance feedback. GUI on X window displays. WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire. Image content retrieval using BlobWorld. Support for the SDLIP (Simple Digital Library Interoperability Protocol) for search and as Z39.50 Gateway.

22 2007.04.10 - SLIDE 22IS 240 – Spring 2007 Cheshire II Searching [Figure labels: Z39.50, Internet, Images, Scanned Text, Local, Remote, Z39.50]

23 2007.04.10 - SLIDE 23IS 240 – Spring 2007 Current Usage of Cheshire II Web clients for: –NSF/NASA/ARPA Digital Library Includes support for full-text and page-level search. Experimental Blob-World image search –SunSite –University of Liverpool. –University of Essex, HDS (part of AHDS) –California Sheet Music Project –Cha-Cha (Berkeley Intranet Search Engine) –Univ. of Virginia Cheshire ranking algorithm is basis for Inktomi (i.e., Yahoo, Hotbot, MSN? and others)

24 2007.04.10 - SLIDE 24IS 240 – Spring 2007 Image Retrieval Research Finding “Stuff” vs “Things” BlobWorld Other Vision Research

25 2007.04.10 - SLIDE 25IS 240 – Spring 2007 Blobworld: use regions for retrieval We want to find general objects → Represent images based on coherent regions

26 2007.04.10 - SLIDE 26IS 240 – Spring 2007 Outline Why regions? Creating Blobworld: segmentation and description Using Blobworld: query experiments Indexing blobs for faster querying Conclusions

27 2007.04.10 - SLIDE 27IS 240 – Spring 2007 Creating and using Blobworld Create: extract features, segment image, describe regions. Use: query.

28 2007.04.10 - SLIDE 28IS 240 – Spring 2007 Extract features for each pixel Color –Take average color (L*a*b*) at the selected scale → ignore local color variations due to texture –“zebra = gray horse + stripes” Texture –Find contrast, anisotropy, polarity at the selected scale Position

29 2007.04.10 - SLIDE 29IS 240 – Spring 2007 Find groups in feature space Model feature distribution as a mixture of Gaussians using Expectation-Maximization (EM)

30 2007.04.10 - SLIDE 30IS 240 – Spring 2007 Find regions in the image Label each pixel based on its Gaussian cluster Find connected components → regions [Figure: image pixels labeled with cluster numbers 1 to 4]
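
The connected-components step can be sketched as follows. This is a minimal illustration, assuming the per-pixel cluster labels are already available as a small 2-D list and that pixels are joined through their four direct neighbours; the function name `connected_regions` and the choice of 4-connectivity are assumptions for this sketch, not details taken from the Blobworld implementation.

```python
from collections import deque


def connected_regions(labels):
    """Split a 2-D grid of cluster labels into 4-connected regions.

    Returns a grid of the same shape holding region ids 0, 1, 2, ...
    (hypothetical helper, not part of Blobworld itself).
    """
    rows, cols = len(labels), len(labels[0])
    region = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if region[r][c] != -1:
                continue  # pixel already belongs to a region
            # Flood-fill every pixel reachable through same-label neighbours.
            region[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and region[ny][nx] == -1
                            and labels[ny][nx] == labels[y][x]):
                        region[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return region
```

Two pixels with the same cluster label end up in different regions when no chain of same-label neighbours joins them, which is why one Gaussian cluster can yield several blobs.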

31 2007.04.10 - SLIDE 31IS 240 – Spring 2007 Describe regions by color, texture, shape Color –Color histogram within region –Quadratic distance: encode similarity between color bins $d^2_{\mathrm{hist}}(x, y) = (x - y)^\top A (x - y)$ Texture –Mean contrast and anisotropy → stripes vs. spots vs. smooth (Basic) Shape –Fourier descriptors of contour

32 2007.04.10 - SLIDE 32IS 240 – Spring 2007 Select appropriate scale for processing Polarity: do all the gradient vectors point in the same direction? Choose scale where polarity stabilizes → include one approximate period

33 2007.04.10 - SLIDE 33IS 240 – Spring 2007 Initialize means using image data Before, we picked random initialization Now, choose initial means based on image tiles Add noise to means and restart EM (4 runs per K) [Figure panels: K = 2, K = 3, K = 4, K = 5]

34 2007.04.10 - SLIDE 34IS 240 – Spring 2007 Grouping: Expectation-Maximization Given class characteristics (μ, Σ), find class membership (update labels) Given class membership, find class characteristics (update μ, Σ) Iterate
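
The alternation between memberships and class characteristics can be sketched in a few lines. The example below is a simplified illustration, assuming one-dimensional features and caller-supplied starting parameters; Blobworld works on multi-dimensional color, texture, and position features, and the names `em_1d` and `hard_labels` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(iterations):
        # E-step: given class characteristics, find soft class membership.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: given class membership, update class characteristics.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
            weights[j] = nj / len(data)
    return means, variances, weights


def hard_labels(data, means, variances, weights):
    """Label each value with its most probable Gaussian."""
    return [max(range(len(means)),
                key=lambda j: weights[j] * gaussian_pdf(x, means[j], variances[j]))
            for x in data]
```

A fixed iteration count keeps the sketch short; a fuller version would stop when the log-likelihood stops improving.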

35 2007.04.10 - SLIDE 35IS 240 – Spring 2007 How many Gaussians? Model selection: Minimum Description Length –Prefer fewer Gaussians if performance is comparable

36 2007.04.10 - SLIDE 36IS 240 – Spring 2007 Find groups in feature space Model feature distribution as a mixture of Gaussians using Expectation-Maximization (EM)

37 2007.04.10 - SLIDE 37IS 240 – Spring 2007 EM math Probability density: Update equations: where
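
The equations on this slide did not survive in the transcript. For reference, the textbook EM formulation for a mixture of K Gaussians is given below; the notation (mixing weights $\alpha_k$, N feature vectors $x_i$ of dimension d) is an assumption and may differ from what the slide showed.

```latex
% Probability density of the mixture:
f(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k \, f_k(x \mid \mu_k, \Sigma_k),
\qquad
f_k(x \mid \mu_k, \Sigma_k) =
  \frac{\exp\!\left(-\tfrac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1} (x-\mu_k)\right)}
       {(2\pi)^{d/2} \, \lvert \Sigma_k \rvert^{1/2}}

% Update equations:
\alpha_k^{\text{new}} = \frac{1}{N} \sum_{i=1}^{N} p(k \mid x_i, \Theta^{\text{old}}),
\qquad
\mu_k^{\text{new}} = \frac{\sum_{i=1}^{N} x_i \, p(k \mid x_i, \Theta^{\text{old}})}
                          {\sum_{i=1}^{N} p(k \mid x_i, \Theta^{\text{old}})},
\qquad
\Sigma_k^{\text{new}} =
  \frac{\sum_{i=1}^{N} p(k \mid x_i, \Theta^{\text{old}})
        \,(x_i-\mu_k^{\text{new}})(x_i-\mu_k^{\text{new}})^\top}
       {\sum_{i=1}^{N} p(k \mid x_i, \Theta^{\text{old}})}

% where the membership probability is
p(k \mid x_i, \Theta) =
  \frac{\alpha_k \, f_k(x_i \mid \mu_k, \Sigma_k)}
       {\sum_{j=1}^{K} \alpha_j \, f_j(x_i \mid \mu_j, \Sigma_j)}
```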

38 2007.04.10 - SLIDE 38IS 240 – Spring 2007 Encode similarity between color bins Quadratic distance Distance between histograms x and y: $d^2_{\mathrm{hist}}(x, y) = (x - y)^\top A (x - y)$ $A_{ij}$ is based on the similarity between bins i and j –Neighboring bins have $A_{ij} = 0.5$
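
A small sketch of the quadratic distance may make the role of A concrete. It assumes, purely for illustration, that the color bins lie along a single axis so that bins i and i+1 are neighbours, with 1 on the diagonal of A, 0.5 for neighbours, and 0 elsewhere; the real Blobworld bins live in L*a*b* space, and the function names are hypothetical.

```python
def similarity_matrix(n_bins, neighbor_weight=0.5):
    """A[i][j] is 1 on the diagonal, neighbor_weight for adjacent bins, else 0.

    Assumes bins are ordered along one axis (an illustrative simplification).
    """
    return [[1.0 if i == j else neighbor_weight if abs(i - j) == 1 else 0.0
             for j in range(n_bins)] for i in range(n_bins)]


def quadratic_distance_sq(x, y, a):
    """Squared quadratic-form distance (x - y)' A (x - y) between histograms."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    return sum(diff[i] * a[i][j] * diff[j]
               for i in range(len(diff)) for j in range(len(diff)))
```

With this A, moving all mass to a neighbouring bin costs less than moving it to a distant bin, which is the point of encoding bin similarity rather than comparing bins independently.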

39 2007.04.10 - SLIDE 39IS 240 – Spring 2007 Fourier descriptors for shape [Zahn & Roskies ’72, Kuhl & Giardina ’82] Find (x,y) representation of outer contour Find Fourier series of (x,y) –Coefficients specify an ellipse (4 parameters): –major axis, minor axis, orientation, starting point Remove starting point ambiguity Store first ten Fourier coefficients
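
The idea of describing a contour by a handful of Fourier coefficients can be illustrated with a simpler variant than the elliptic descriptors cited on the slide. The sketch below swaps in a plain complex discrete Fourier transform of the contour points z = x + iy and keeps only coefficient magnitudes, which removes the starting-point ambiguity; it is not the Kuhl & Giardina formulation, and the function name is hypothetical.

```python
import cmath


def fourier_descriptor_magnitudes(contour, n_coeffs=10):
    """Magnitudes of DFT coefficients 1..n_coeffs of a closed contour.

    contour is a list of (x, y) points in traversal order. Skipping
    coefficient 0 makes the result independent of translation; taking
    magnitudes makes it independent of the starting point.
    """
    z = [complex(x, y) for x, y in contour]
    n = len(z)
    mags = []
    for k in range(1, n_coeffs + 1):
        coeff = sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)) / n
        mags.append(abs(coeff))
    return mags
```

Changing the starting point only multiplies each coefficient by a unit-magnitude phase factor, so the magnitudes stay the same.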

40 2007.04.10 - SLIDE 40IS 240 – Spring 2007 Creating and using Blobworld Create: extract features, segment image, describe regions. Use: query.

41 2007.04.10 - SLIDE 41IS 240 – Spring 2007 Querying: let user see the representation Current systems are unsatisfying –User can’t see what the computer sees –Unclear how parameters relate to the image User should interact with the representation –Helps in query formulation –Makes results understandable –Minimizes disappointment http://elib.cs.berkeley.edu/photos/blobworld

48 2007.04.10 - SLIDE 48IS 240 – Spring 2007 Query experiments Collection of 10,000 Corel stock photos Five query images in each of ten categories (e.g., cheetahs, polar bears, airplanes) Compare Blobworld to global histogram queries Precision (% of retrieved images that are correct) vs. Recall (% of correct images that are retrieved)
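
The two measures used in these experiments can be computed as follows. This is a generic sketch of set-based precision and recall for one query; the image identifiers in any usage are placeholders, not data from the Corel experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved images that are correct
    recall    = fraction of correct images that are retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Evaluating the function at successive cutoffs of a ranked result list gives the points of a precision vs. recall curve.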

49 2007.04.10 - SLIDE 49IS 240 – Spring 2007 Distinctive objects Tigers, cheetahs, and zebras: –Blobworld does better than global histograms [Figure panels: cheetahs, zebras]

50 2007.04.10 - SLIDE 50IS 240 – Spring 2007 Distinctive objects and backgrounds Eagles and black bears: –Blobworld does better than global histograms [Figure panel: black bears]

51 2007.04.10 - SLIDE 51IS 240 – Spring 2007 Distinctive scenes Airplanes and brown bears: –Global histograms do better than Blobworld –But Blobworld has room to grow (shape, etc.) [Figure panel: airplanes]

52 2007.04.10 - SLIDE 52IS 240 – Spring 2007 Index to search huge collections Indexing is trickier than for traditional data We can afford some mistakes: even with full search, we’ll miss some tigers and include some pumpkins Two approaches we have tried: –Store terms and treat image as a document –Store features and index using a tree Final (“correct”) ranking of images from index

53 2007.04.10 - SLIDE 53IS 240 – Spring 2007 Index using conventional IR methods Treat each database blob as a document –Store “terms” (bins) for color, texture, location, and shape –Repeat color terms based on histogram weights Index using Cheshire II Treat each query blob as a document –Repeat “terms” according to query weights
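
Turning a blob into a pseudo-document might look like the sketch below. The term spellings (`color3`, `texture1`, and so on), the `scale` factor that converts histogram weights into repeat counts, and the function name are all assumptions for illustration; the slides do not give the encoding used with Cheshire II.

```python
def blob_to_terms(color_hist, texture_bin, location_bin, shape_bin, scale=10):
    """Encode one blob as a bag of terms for a text-style index.

    Each color bin becomes a term repeated in proportion to its histogram
    weight, so that term frequency carries the weight. Texture, location,
    and shape each contribute a single term.
    """
    terms = []
    for bin_id, weight in enumerate(color_hist):
        terms.extend([f"color{bin_id}"] * round(weight * scale))
    terms += [f"texture{texture_bin}", f"location{location_bin}", f"shape{shape_bin}"]
    return terms
```

A query blob can be encoded the same way, with repeat counts driven by the query weights instead of histogram weights.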

54 2007.04.10 - SLIDE 54IS 240 – Spring 2007 Indexing and Retrieval with Cheshire II Originally used the same probabilistic algorithm used for text –Blobs are not distributed like text words or stems Now using a weighting based on coordination level match with a minimum threshold (must have at least half of the characteristics of the query cluster). Still eyeballing data, but seems much better for many types of queries
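
Coordination level matching with a minimum threshold can be sketched as counting shared terms and discarding candidates that share too few. The data layout (a dict from blob id to term list) and the function name are assumptions for this example; only the "at least half" threshold comes from the slide.

```python
def coordination_level_rank(query_terms, blobs, min_fraction=0.5):
    """Rank blobs by how many distinct query terms they contain.

    blobs maps a blob id to its list of terms. A blob is kept only if it
    matches at least min_fraction of the distinct query terms.
    """
    query = set(query_terms)
    ranked = []
    for blob_id, terms in blobs.items():
        matched = len(query & set(terms))
        if matched >= min_fraction * len(query):
            ranked.append((blob_id, matched))
    # Highest match count first; ties broken by blob id for determinism.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked
```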

59 2007.04.10 - SLIDE 59IS 240 – Spring 2007 Conclusions Image retrieval in general collections requires region segmentation and description Blobworld yields high precision in queries for distinctive objects Blobworld can be indexed to allow fast querying

60 2007.04.10 - SLIDE 60IS 240 – Spring 2007 Further Information Full Cheshire II client and server source is available: http://cheshire.berkeley.edu/ UC Berkeley Digital Library Project –http://elib.cs.berkeley.edu (site no longer live; some parts available from the Internet Archive)

68 2007.04.10 - SLIDE 68IS 240 – Spring 2007 Geographic Information Retrieval and Spatial Browsing Ray R. Larson School of Library and Information Studies University of California, Berkeley

69 2007.04.10 - SLIDE 69IS 240 – Spring 2007 Concerns for Digital Libraries Excellent summary in Distributed Geolibraries from NRC. –Distributed resources –Distributed users –Distributed services Access for a broad population is critical for many Digital Libraries

70 2007.04.10 - SLIDE 70IS 240 – Spring 2007 Concerns for Digital Libraries Georeferenced Information (geoinformation) provides one organizational perspective Other common perspectives include Topical Classification schemes, Temporal/Historical organization (ECAI) DLs can provide multiple views of the same information

71 2007.04.10 - SLIDE 71IS 240 – Spring 2007 Concerns for Digital Libraries Most DLs are intended for a broad user base: –varying levels of expertise in the contents –varying requirements for access methods –simple expressions of interest in natural language should be supported –Mapping NL to controlled vocabularies (including Digital Gazetteers)

72 2007.04.10 - SLIDE 72IS 240 – Spring 2007 Digital Library Needs Geographic and Spatial Querying Spatial Browsing Geographic and Spatial Indexing (Berkeley DL contents and examples)

73 2007.04.10 - SLIDE 73IS 240 – Spring 2007 Overview What is Geographic Information Retrieval? Geographic and Spatial Querying and Browsing. Geographic and Spatial Indexing. Examples of GIR Systems and Geographically Indexed Information.

74 2007.04.10 - SLIDE 74IS 240 – Spring 2007 Introduction What is Geographic Information Retrieval? –GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research with the addition of spatially and geographically oriented indexing and retrieval. –It combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research.

75 2007.04.10 - SLIDE 75IS 240 – Spring 2007 Introduction The need for Geographic and Spatial Information Retrieval. –Digital Libraries Sequoia 2000 UC Berkeley NSF/NASA/ARPA Digital Library Project UC Santa Barbara Alexandria Project NSDI - National Spatial Data Infrastructure –Next-Generation Online Catalogs Cheshire II

76 2007.04.10 - SLIDE 76IS 240 – Spring 2007 Geographic and Spatial Querying Both imply querying on relationships within a particular coordinate system Spatial querying is the more general term Can be defined as queries about the spatial relationships (intersection, containment, boundary, adjacency, proximity) of entities geometrically defined and located in space

77 2007.04.10 - SLIDE 77IS 240 – Spring 2007 Geographic and Spatial Querying Geographical coordinates are geometric relationships (distance and direction can be measured on a continuous scale) –E.g. “5.21 miles north of Champaign” Spatial relations may be both geometric and topological (spatially related but without measurable distance or absolute direction) –E.g.: “inside the city limits” –“left side of Beckman Institute”

78 2007.04.10 - SLIDE 78IS 240 – Spring 2007 Geographic and Spatial Querying Types of spatial queries –Point-in-polygon: “What do we have at this X,Y point?” –Region Queries: “What do we have in this region?” Which point-encoded items lie within the region What lines (borders, etc.) lie within or cross the region What areas overlap the region [Figure: a region area plotted on X and Y axes]
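
Both query types reduce to a containment test. The sketch below uses the common ray-casting approach on planar X,Y coordinates; it is one standard technique rather than the method of any system named in these slides, it ignores points that fall exactly on a boundary, and the function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) strictly inside the polygon?

    polygon is a list of (x, y) vertices in order. A horizontal ray is
    cast to the right and edge crossings are counted; an odd count means
    the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def points_in_region(points, region):
    """Region query: names of the point-encoded items inside the region."""
    return [name for name, (x, y) in points.items()
            if point_in_polygon(x, y, region)]
```

Line and area items need intersection tests as well as containment; a production system would usually prefilter candidates with a spatial index before running exact geometry tests.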

79 2007.04.10 - SLIDE 79IS 240 – Spring 2007 Geographic and Spatial Querying Types of spatial queries, cont. –Distance and Buffer Zone Queries What cities lie within 40 miles of the border of Northern and Southern Ireland? What wetlands lie within 50 miles of London? –Path Queries What is the shortest route from San Francisco to Los Angeles?
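
A distance query around a single point can be sketched with the haversine great-circle formula. This covers the "within 50 miles of London" style of query only; a buffer around a line such as a border needs point-to-line distances, which the sketch omits. The Earth radius constant is an approximate mean value and the function names are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_distance(places, center, max_miles):
    """Names of places whose point location lies within max_miles of center."""
    lat0, lon0 = center
    return sorted(name for name, (lat, lon) in places.items()
                  if haversine_miles(lat0, lon0, lat, lon) <= max_miles)
```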

80 2007.04.10 - SLIDE 80IS 240 – Spring 2007 Geographic and Spatial Querying Types of spatial queries, cont. –Multimedia Queries: Use non-map georeferenced information. What are the names of farmers affected by flooding in Monterey and Santa Cruz Counties? [Figure labels: p123, p127]

81 2007.04.10 - SLIDE 81IS 240 – Spring 2007 Spatial Browsing Combines ad hoc spatial querying with interactive displays HyperMap concept Pseudo-HyperMaps

82 2007.04.10 - SLIDE 82IS 240 – Spring 2007 Spatial Browsing Advantages: –May not need the accuracy of a full GIS –Comprehensible searching metaphor for many materials Problems: –Clutter and differing scales. –Requires good (and preferably accurate) geographical indexing –Assumes that the user knows some geography

83 2007.04.10 - SLIDE 83IS 240 – Spring 2007 Geographic and Spatial Indexing Traditional geographic indexing involves using place names from LCSH and name authorities. These have some problems: –Names are not unique –The places referred to change size, shape and names over time –Spelling variations –Some places are temporary conventions (study areas, etc.)

84 2007.04.10 - SLIDE 84IS 240 – Spring 2007 Digital Gazetteers Geographic names are and will remain the primary Entry Vocabulary for DL spatial queries –The gazetteer must support as many variant forms of the name as possible Including temporal ranges for particular names –querying must support spatial reasoning based on gazetteer and other geographic and temporal information in the system or accessible by network access
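
A gazetteer entry that carries variant names with temporal ranges could be modelled as below. The record layout, the place names "Oldtown" and "Newtown", the years, and the function name are all invented for illustration; they do not describe any real gazetteer.

```python
def lookup(gazetteer, name, year=None):
    """Return ids of places known by `name`, optionally valid in `year`.

    Each gazetteer entry is a dict with an "id" and a list of
    (variant_name, first_year, last_year) tuples. Matching ignores case.
    """
    key = name.casefold()
    results = []
    for entry in gazetteer:
        for variant, start, end in entry["names"]:
            if variant.casefold() == key and (year is None or start <= year <= end):
                results.append(entry["id"])
                break  # one match per place is enough
    return results


# Hypothetical sample data: one place that changed its name in 1900.
SAMPLE_GAZETTEER = [
    {"id": "place-1",
     "names": [("Oldtown", 1000, 1899), ("Newtown", 1900, 9999)],
     "point": (10.0, 20.0)},
]
```

Resolving a name to a place id first, and only then to coordinates, is what lets spatial queries keep working when names change over time.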

86 2007.04.10 - SLIDE 86IS 240 – Spring 2007 Geographic and Spatial Indexing Geographic coordinates have some advantages over names: –They are persistent regardless of name, political boundary or other changes –They can be simply connected to spatial browsing interfaces and GIS data. –They provide a consistent framework for GIR applications and spatial queries. However, the geographic extents and boundaries of entities also change over time –This may be the primary interest of historical scholarship

87 2007.04.10 - SLIDE 87IS 240 – Spring 2007 Geographic and Spatial Indexing GIPSY: Automatic georeferencing of texts (Geographic Info Processing System) –The work of Allison Woodruff and Christian Plaunt - Later DBMS-based version by Jolly Chen -- New version planned –Designed to operate on the full text of documents –Extracts geographic terms and attempts to identify the coordinates of the places discussed in the text using a combination of evidence

88 2007.04.10 - SLIDE 88IS 240 – Spring 2007 Geographic and Spatial Indexing GIPSY cont. –Used the USGS Geographic Names Information System (GNIS) and Geographic Information Retrieval and Analysis System (GIRAS) to associate names with coordinates of named places, geographic features and land use characteristics.

89 2007.04.10 - SLIDE 89IS 240 – Spring 2007 Geographic and Spatial Indexing GIPSY cont. –Identified places are added as “elevations” with each place adding a weight based on its frequency in the text and database characteristics –The resulting map is analysed to identify the most likely locations, and coordinates for those locations are extracted
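
The "elevation" overlay can be illustrated on a coarse grid. In this sketch each identified place is simplified to a rectangle of grid cells with a precomputed weight, the weights are summed cell by cell, and the highest cells are read off as the most likely location; GIPSY itself works with real map geometry, and the rectangle encoding, weights, and function names here are assumptions.

```python
def overlay_places(places, grid_size):
    """Stack weighted place footprints into one elevation surface.

    places is a list of ((row0, col0, row1, col1), weight) pairs, where the
    rectangle bounds are inclusive grid cells. Returns a rows x cols grid
    of summed weights.
    """
    rows, cols = grid_size
    surface = [[0.0] * cols for _ in range(rows)]
    for (r0, c0, r1, c1), weight in places:
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                surface[r][c] += weight
    return surface


def highest_cells(surface):
    """Grid cells holding the maximum accumulated weight."""
    peak = max(max(row) for row in surface)
    return [(r, c) for r, row in enumerate(surface)
            for c, value in enumerate(row) if value == peak]
```

Where several place mentions overlap, their weights reinforce each other, so the peak tends to mark the area the text is mostly about.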

90 2007.04.10 - SLIDE 90IS 240 – Spring 2007 Geographic and Spatial Indexing GIPSY Map Overlay “The proposed project is the construction of a new State Water Project facility, the coastal branch... by water purveyors of northern Santa Barbara County... delivering water to San Luis Obispo...”

91 2007.04.10 - SLIDE 91IS 240 – Spring 2007 Geographic and Spatial Indexing To be useful for the range of cultural and humanities materials being collected in digital libraries, the GIPSY gazetteer must –Support many different time ranges, location and boundary changes –Support synonymous and variant names with differing locations for the same entity –Support names in multiple languages, scripts and usages

92 2007.04.10 - SLIDE 92IS 240 – Spring 2007 ECAI The Electronic Cultural Atlas Initiative is a collaboration between IT professionals and humanities scholars ECAI is developing a globally distributed spatio-temporal library of cultural and historical resources with a centralized metadata catalogue and a GIS viewer Currently the ECAI consortium includes over 250 projects

93 2007.04.10 - SLIDE 93IS 240 – Spring 2007 ECAI Projects range from small works by individual scholars to large nationally and internationally funded efforts. E.g.: –geography of Greco-Roman culture (Perseus project) –toponym locations for over 300,000 images of Buddhist art and architecture –Seals of the Sasanian Empire –historical trade routes of Eurasia –the map of Hideyoshi’s invasion of Korea –historical GIS projects for China, Great Britain, the United States, the Black Sea and Tibet

94 2007.04.10 - SLIDE 94IS 240 – Spring 2007 Perseus

95 2007.04.10 - SLIDE 95IS 240 – Spring 2007 The Sasanian Empire

96 2007.04.10 - SLIDE 96IS 240 – Spring 2007 Opening shot of the Sasanian Empire ECAI project, showing a map with diverse resources, a timeline, and a menu of available map layers.

97 2007.04.10 - SLIDE 97IS 240 – Spring 2007 Users may zoom in to see resources that are only visible at a higher level of detail.

98 2007.04.10 - SLIDE 98IS 240 – Spring 2007 Spatial objects on the map are linked to a table of attributes, which may include any information about the objects. Note that this is a scholarly tool. By creating a “name quality” field, the author has noted that there is disagreement about the locations and names of places in the Sasanian Empire.

99 2007.04.10 - SLIDE 99IS 240 – Spring 2007 Sites on the map may be linked to resources elsewhere on the internet. In this case, important archaeological sites on the map are linked to web-based tours.

100 2007.04.10 - SLIDE 100IS 240 – Spring 2007 The map interface may be used to show change over time. The “Sasanian Empire ca. 270s” resource is highlighted, and the “Sasanian Empire ca. 570s” is greyed out. If a user slides the timeline bar, the new boundary of the empire will appear.

101 2007.04.10 - SLIDE 101IS 240 – Spring 2007 In a different time range, not only do the boundaries of the empire appear different, but the sites that were active during the earlier era (the red dots) have moved as well.

102 2007.04.10 - SLIDE 102IS 240 – Spring 2007 TimeMap is a user authoring tool, not merely a viewer. Users can control the look of the icons, the map layers that comprise a project, and, as shown here, the map scale at which different layers will become visible.

103 2007.04.10 - SLIDE 103IS 240 – Spring 2007 This screen displays the metadata for a part of the Sasanian Empire project. The metadata includes functional (tm.) metadata to enable connection to the map interface in addition to cataloguing (dc. and ecai.) metadata. Using the menu on the left, users may choose to map individual map layers or packaged projects.

104 2007.04.10 - SLIDE 104IS 240 – Spring 2007 Historic Sydney

105 2007.04.10 - SLIDE 105IS 240 – Spring 2007 Google Earth GIR - Demo

106 2007.04.10 - SLIDE 106IS 240 – Spring 2007 The Mongol Empire

107 2007.04.10 - SLIDE 107IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00 pm Spring 2007 http://courses.ischool.berkeley.edu/i240/s07 Principles of Information Retrieval Lecture 23: GIR Continued

108 2007.04.10 - SLIDE 108IS 240 – Spring 2007 Today Review –Geographic Information Retrieval Parts of this lecture were presented at the invitational conference “The ‘I’ in Geographic Information Science”, Manchester, U.K., July 2001 GIR Algorithms and evaluation based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.

109 2007.04.10 - SLIDE 109IS 240 – Spring 2007 Introduction What is Geographic Information Retrieval? –GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research with the addition of spatially and geographically oriented indexing and retrieval. –It combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research.

110 2007.04.10 - SLIDE 110IS 240 – Spring 2007 Introduction The need for Geographic and Spatial Information Retrieval. –Digital Libraries Sequoia 2000 UC Berkeley NSF/NASA/ARPA Digital Library Project UC Santa Barbara Alexandria Project NSDI - National Spatial Data Infrastructure –Next-Generation Online Catalogs Cheshire II

111 2007.04.10 - SLIDE 111IS 240 – Spring 2007 Geographic and Spatial Querying Both imply querying on relationships within a particular coordinate system Spatial querying is the more general term Can be defined as queries about the spatial relationships (intersection, containment, boundary, adjacency, proximity) of entities geometrically defined and located in space

112 2007.04.10 - SLIDE 112IS 240 – Spring 2007 Geographic and Spatial Querying Geographical coordinates are geometric relationships (distance and direction can be measured on a continuous scale) –E.g. “5.21 miles north of Champaign” Spatial relations may be both geometric and topological (spatially related but without measureable distance or absolute direction) –E.g.: “inside the city limits” –“left side of Beckman Institute”

113 2007.04.10 - SLIDE 113IS 240 – Spring 2007 Geographic and Spatial Querying Types of spatial queries –Point-in-polygon: “What do we have at this X,Y point?” –Region Queries: “What do we have in this region?” Which point-encoded items lie within the region What lines (borders, etc.) lie within or cross the region What areas overlap the region area
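
The two query types on this slide can be sketched in a few lines. This is a minimal illustration only, assuming planar X,Y coordinates and a simple (non-self-intersecting) polygon; the names `point_in_polygon`, `points_in_region`, `square`, and `items` are hypothetical and do not come from the lecture. The test used is the common ray-casting technique, which may differ from what any particular GIR system implements.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def point_in_polygon(pt: Point, polygon: List[Point]) -> bool:
    """Ray casting: count how many polygon edges a ray going in +X from pt crosses."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            # X coordinate at which this edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def points_in_region(points: Dict[str, Point], region: List[Point]) -> List[str]:
    """Region query over point-encoded items: names of items inside the region."""
    return sorted(name for name, pt in points.items() if point_in_polygon(pt, region))


# Hypothetical data: a square query region and three point-encoded items.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
items = {"a": (1.0, 1.0), "b": (5.0, 2.0), "c": (3.5, 3.9)}
print(points_in_region(items, square))
```

Points that fall exactly on an edge are not handled specially here; a production system would need an explicit boundary rule.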

114 2007.04.10 - SLIDE 114IS 240 – Spring 2007 Geographic and Spatial Querying Types of spatial queries, cont. –Distance and Buffer Zone Queries What cities lie within 40 miles of the border of Northern and Southern Ireland? What wetlands lie within 50 miles of London? –Path Queries What is the shortest route from San Francisco to Los Angeles?
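
A distance query around a single point (the "within 50 miles of London" case) can be sketched with the haversine great-circle formula. This is an illustrative sketch, not the lecture's method: the function names, the place coordinates, and the Earth radius constant are assumptions, and a buffer around a line such as a border would additionally need a point-to-line distance, which is not shown.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumed constant)


def haversine_miles(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in miles between two longitude/latitude points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_distance(places: Dict[str, Tuple[float, float]],
                    center: Tuple[float, float],
                    max_miles: float) -> List[str]:
    """Names of places whose (lon, lat) lies within max_miles of center."""
    return sorted(
        name for name, (lon, lat) in places.items()
        if haversine_miles(center[0], center[1], lon, lat) <= max_miles
    )


# Hypothetical example: two made-up sites relative to central London.
london = (-0.1276, 51.5072)
places = {"near": (-0.5, 51.6), "far": (-3.0, 53.0)}
print(within_distance(places, london, 50.0))
```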

115 2007.04.10 - SLIDE 115IS 240 – Spring 2007 Geographic and Spatial Querying Types of spatial queries, cont. –Multimedia Queries: Use non-map georeferenced information. What are the names of farmers affected by flooding in Monterey and Santa Cruz Counties?

116 2007.04.10 - SLIDE 116IS 240 – Spring 2007 Spatial Browsing Combines ad hoc spatial querying with interactive displays HyperMap concept Pseudo-HyperMaps

117 2007.04.10 - SLIDE 117IS 240 – Spring 2007 Geographic and Spatial Indexing GIPSY Map Overlay “The proposed project is the construction of a new State Water Project facility, the coastal branch... by water purveyors of northern Santa Barbara County... delivering water to San Luis Obispo...”

118 2007.04.10 - SLIDE 118IS 240 – Spring 2007 Geographic and Spatial Indexing To be useful for the range of cultural and humanities materials being collected in digital libraries, the GIPSY gazetteer must –Support many different time ranges, location and boundary changes –Support synonymous and variant names with differing locations for the same entity –Support names in multiple languages, scripts and usages

119 2007.04.10 - SLIDE 119IS 240 – Spring 2007 The map interface may be used to show change over time. The “Sasanian Empire ca. 270s” resource is highlighted, and the “Sasanian Empire ca. 570s” is greyed out. If a user slides the timeline bar, the new boundary of the empire will appear.

120 2007.04.10 - SLIDE 120IS 240 – Spring 2007 Historic Sydney

121 2007.04.10 - SLIDE 121IS 240 – Spring 2007 The Mongol Empire

122 2007.04.10 - SLIDE 122IS 240 – Spring 2007 ECDL 2004 Ray R. Larson and Patricia Frontiera University of California, Berkeley Spatial Ranking Methods for Geographic Information Retrieval (GIR) in Digital Libraries

123 2007.04.10 - SLIDE 123IS 240 – Spring 2007 Geographic Information Retrieval (GIR) Geographic information retrieval (GIR) is concerned with spatial approaches to the retrieval of geographically referenced, or georeferenced, information objects (GIOs) –about specific regions or features on or near the surface of the Earth. –Geospatial data are a special type of GIO that encodes a specific geographic feature or set of features along with associated attributes: maps, air photos, satellite imagery, digital geographic data, etc. (Image source: USGS)

124 2007.04.10 - SLIDE 124IS 240 – Spring 2007 Georeferencing and GIR Within a GIR system, e.g., a geographic digital library, information objects can be georeferenced by place names or by geographic coordinates (i.e. longitude & latitude). Example: San Francisco Bay Area, -122.418, 37.775

125 2007.04.10 - SLIDE 125IS 240 – Spring 2007 GIR is not GIS GIS is concerned with spatial representations, relationships, and analysis at the level of the individual spatial object or field. GIR is concerned with the retrieval of geographic information resources (and geographic information objects at the set level) that may be relevant to a geographic query region.

126 2007.04.10 - SLIDE 126IS 240 – Spring 2007 Spatial Approaches to GIR A spatial approach to geographic information retrieval is one based on the integrated use of spatial representations, and spatial relationships. A spatial approach to GIR can be qualitative or quantitative –Quantitative: based on the geometric spatial properties of a geographic information object –Qualitative: based on the non-geometric spatial properties.

127 2007.04.10 - SLIDE 127IS 240 – Spring 2007 Spatial Matching and Ranking Spatial similarity can be considered as an indicator of relevance: documents whose spatial content is more similar to the spatial content of the query will be considered more relevant to the information need represented by the query. Need to consider both: –Qualitative, non-geometric spatial attributes –Quantitative, geometric spatial attributes Topological relationships and metric details We focus on the latter…

128 2007.04.10 - SLIDE 128IS 240 – Spring 2007 Spatial Similarity Measures and Spatial Ranking Three basic approaches to spatial similarity measures and ranking Method 1: Simple Overlap Method 2: Topological Overlap Method 3: Degree of Overlap

129 2007.04.10 - SLIDE 129IS 240 – Spring 2007 Method 1: Simple Overlap Candidate geographic information objects (GIOs) that have any overlap with the query region are retrieved. Included in the result set are any GIOs that are contained within, overlap, or contain the query region. The spatial score for all GIOs is either relevant (1) or not relevant (0). The result set cannot be ranked –topological relationship only, no metric refinement

130 2007.04.10 - SLIDE 130IS 240 – Spring 2007 Method 2: Topological Overlap Spatial searches are constrained to only those candidate GIOs that either: –are completely contained within the query region, –overlap with the query region, –or, contain the query region. Each category is exclusive and all retrieved items are considered relevant. The result set cannot be ranked –categorized topological relationship only, –no metric refinement
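
The three exclusive categories of Method 2 can be illustrated for the simplest case, where both the candidate GIO and the query region are axis-aligned minimum bounding rectangles (MBRs). This is a sketch under stated assumptions: the `MBR` type and `topological_relation` function are hypothetical names, rectangles that only touch along an edge are treated as disjoint, and identical rectangles are reported as "within".

```python
from typing import NamedTuple


class MBR(NamedTuple):
    """Axis-aligned minimum bounding rectangle (hypothetical helper type)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def _inside(inner: MBR, outer: MBR) -> bool:
    """True if inner lies completely within outer (shared edges allowed)."""
    return (inner.min_x >= outer.min_x and inner.max_x <= outer.max_x and
            inner.min_y >= outer.min_y and inner.max_y <= outer.max_y)


def topological_relation(gio: MBR, query: MBR) -> str:
    """Classify a candidate GIO against the query region into one exclusive category."""
    if (gio.max_x <= query.min_x or gio.min_x >= query.max_x or
            gio.max_y <= query.min_y or gio.min_y >= query.max_y):
        return "disjoint"   # no shared area, so not retrieved
    if _inside(gio, query):
        return "within"     # GIO completely contained in the query region
    if _inside(query, gio):
        return "contains"   # GIO contains the query region
    return "overlaps"       # partial overlap only


query = MBR(0.0, 0.0, 10.0, 10.0)
print(topological_relation(MBR(2.0, 2.0, 4.0, 4.0), query))
```

Because every retrieved item receives a category rather than a score, the result set can be grouped but not ranked, which matches the limitation stated on the slide.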

131 2007.04.10 - SLIDE 131IS 240 – Spring 2007 Method 3: Degree of Overlap Candidate geographic information objects (GIOs) that have any overlap with the query region are retrieved. A spatial similarity score is determined based on the degree to which the candidate GIO overlaps with the query region. The greater the overlap with respect to the query region, the higher the spatial similarity score. This method provides a score by which the result set can be ranked –topological relationship: overlap –metric refinement: area of overlap
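
A minimal sketch of Method 3, assuming both regions are represented by MBRs stored as `(min_x, min_y, max_x, max_y)` tuples. The score used here is the area of overlap divided by the area of the query region, which is one reading of "overlap with respect to the query region"; the function names and sample rectangles are hypothetical.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two axis-aligned rectangles (0.0 if none)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def rank_by_overlap(query: Rect, candidates: Dict[str, Rect]) -> List[Tuple[str, float]]:
    """Retrieve candidates with any overlap, ranked by overlap area / query area."""
    q_area = area(query)
    scored = [(name, overlap_area(query, r) / q_area) for name, r in candidates.items()]
    retrieved = [(name, s) for name, s in scored if s > 0]
    return sorted(retrieved, key=lambda pair: pair[1], reverse=True)


query_region = (0.0, 0.0, 10.0, 10.0)
gios = {
    "A": (0.0, 0.0, 5.0, 10.0),      # covers half of the query region
    "B": (8.0, 8.0, 12.0, 12.0),     # clips one corner
    "C": (20.0, 20.0, 30.0, 30.0),   # no overlap, so not retrieved
}
ranking = rank_by_overlap(query_region, gios)
print(ranking)
```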

132 2007.04.10 - SLIDE 132IS 240 – Spring 2007 Example: Results display from CheshireGeo: http://calsip.regis.berkeley.edu/pattyf/mapserver/cheshire2/cheshire_init.html

133 2007.04.10 - SLIDE 133IS 240 – Spring 2007 Geometric Approximations The decomposition of spatial objects into approximate representations is a common approach to simplifying complex and often multi-part coordinate representations Types of Geometric Approximations –Conservative: superset –Progressive: subset –Generalizing: could be either –Concave or Convex Geometric operations on convex polygons are much faster

134 2007.04.10 - SLIDE 134IS 240 – Spring 2007 Other convex, conservative Approximations 1) Minimum Bounding Circle (3) 2) MBR: Minimum aligned Bounding rectangle (4) 3) Minimum Bounding Ellipse (5) 4) Rotated minimum bounding rectangle (5) 5) 4-corner convex polygon (8) 6) Convex hull (varies) Presented in order of increasing quality. Number in parentheses denotes number of parameters needed to store representation. After Brinkhoff et al, 1993b

135 2007.04.10 - SLIDE 135IS 240 – Spring 2007 Our Research Questions Spatial Ranking –How effectively can the spatial similarity between a query region and a document region be evaluated and ranked based on the overlap of the geometric approximations for these regions? Geometric Approximations & Spatial Ranking: –How do different geometric approximations affect the rankings? MBRs: the most popular approximation Convex hulls: the highest quality convex approximation

136 2007.04.10 - SLIDE 136IS 240 – Spring 2007 Spatial Ranking: Methods for computing spatial similarity

137 2007.04.10 - SLIDE 137IS 240 – Spring 2007 Proposed Ranking Method Probabilistic Spatial Ranking using Logistic Inference Probabilistic Models –Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query –Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle) –Rely on accurate estimates of probabilities

138 2007.04.10 - SLIDE 138IS 240 – Spring 2007 Probability of relevance is based on Logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained by: For the m X attribute measures (on the following page) Logistic Regression
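
The formula itself did not survive in this transcript. As a hedged sketch, the standard logistic regression form estimates the log odds of relevance as an intercept plus a weighted sum of the m attribute measures, and converts that to a probability with the logistic function. The coefficient values below are invented purely for illustration; they are not the coefficients fitted in the study.

```python
import math
from typing import Sequence


def log_odds(intercept: float, coefs: Sequence[float], xs: Sequence[float]) -> float:
    """Linear predictor: intercept + sum of coefficient * attribute value."""
    if len(coefs) != len(xs):
        raise ValueError("need one coefficient per attribute")
    return intercept + sum(c * x for c, x in zip(coefs, xs))


def probability_of_relevance(intercept: float,
                             coefs: Sequence[float],
                             xs: Sequence[float]) -> float:
    """Logistic transform of the log odds into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds(intercept, coefs, xs)))


# HYPOTHETICAL coefficients, chosen only to make the example run.
c0, c = -2.0, [3.0, 1.5]
weak = probability_of_relevance(c0, c, [0.1, 0.1])    # little overlap
strong = probability_of_relevance(c0, c, [0.9, 0.8])  # substantial overlap
print(round(weak, 3), round(strong, 3))
```

Candidates are then ranked by this estimated probability, in line with the Probability Ranking Principle mentioned on the previous slide.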

139 2007.04.10 - SLIDE 139IS 240 – Spring 2007 Probabilistic Models: Logistic Regression attributes X1 = area of overlap(query region, candidate GIO) / area of query region X2 = area of overlap(query region, candidate GIO) / area of candidate GIO X3 = 1 – abs(fraction of overlap region that is onshore – fraction of candidate GIO that is onshore) Where: Range for all variables is 0 (not similar) to 1 (same)

140 2007.04.10 - SLIDE 140IS 240 – Spring 2007 Probabilistic Models Advantages: –Strong theoretical basis –In principle should supply the best predictions of relevance given available information –Computationally efficient, straightforward implementation (if based on LR) Disadvantages: –Relevance information is required -- or is “guesstimated” –Important indicators of relevance may not be captured by the model –Optimally requires ongoing collection of relevance information

141 2007.04.10 - SLIDE 141IS 240 – Spring 2007 Test Collection California Environmental Information Catalog (CEIC) http://ceres.ca.gov/catalog. Approximately 2500 records selected from collection (Aug 2003) of ~ 4000.

142 2007.04.10 - SLIDE 142IS 240 – Spring 2007 Test Collection Overview 2554 metadata records indexed by 322 unique geographic regions (represented as MBRs) and associated place names. –2072 records (81%) indexed by 141 unique CA place names 881 records indexed by 42 unique counties (out of a total of 46 unique counties indexed in CEIC collection) 427 records indexed by 76 cities (of 120) 179 records by 8 bioregions (of 9) 3 records by 2 national parks (of 5) 309 records by 11 national forests (of 11) 3 records by 1 regional water quality control board region (of 1) 270 records by 1 state (CA) –482 records (19%) indexed by 179 unique user defined areas (approx 240) for regions within or overlapping CA 12% represent onshore regions (within the CA mainland) 88% (158 of 179) offshore or coastal regions

143 2007.04.10 - SLIDE 143IS 240 – Spring 2007 CA Named Places in the Test Collection – complex polygons Counties Cities National Parks National Forests Water QCB Regions Bioregions

144 2007.04.10 - SLIDE 144IS 240 – Spring 2007 CA Counties – Geometric Approximations (maps: MBRs and Convex Hulls) Ave. False Area of Approximation: MBRs: 94.61%; Convex Hulls: 26.73%

145 2007.04.10 - SLIDE 145IS 240 – Spring 2007 CA User Defined Areas (UDAs) in the Test Collection

146 2007.04.10 - SLIDE 146IS 240 – Spring 2007 Test Collection Query Regions: CA Counties 42 of 58 counties referenced in the test collection metadata 10 counties randomly selected as query regions to train LR model 32 counties used as query regions to test model

147 2007.04.10 - SLIDE 147IS 240 – Spring 2007 Test Collection Relevance Judgements Determine the reference set of candidate GIO regions relevant to each county query region: Complex polygon data was used to select all CA place named regions (i.e. counties, cities, bioregions, national parks, national forests, and state regional water quality control boards) that overlap each county query region. All overlapping regions were reviewed (semi-automatically) to remove sliver matches, i.e. those regions that only overlap due to differences in the resolution of the 6 data sets. –Automated review: overlaps where overlap area/GIO area >.00025 considered relevant, else not relevant. –Cases manually reviewed: overlap area/query area <.001 and overlap area/GIO area <.02 The MBRs and metadata for all information objects referenced by UDAs (user-defined areas) were manually reviewed to determine their relevance to each query region. This process could not be automated because, unlike the CA place named regions, there are no complex polygon representations that delineate the UDAs. This process resulted in a master file of CA place named regions and UDAs relevant to each of the 42 CA county query regions.
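
The semi-automatic sliver review can be sketched as a small decision rule using the thresholds quoted on the slide. How the automated and manual rules were combined is not spelled out, so the ordering below (reject tiny overlaps first, then flag borderline cases for manual review, otherwise accept) is an assumption, as are the function and label names.

```python
def classify_overlap(overlap: float, gio_area: float, query_area: float) -> str:
    """Label an overlapping (query, GIO) pair using the slide's thresholds.

    ASSUMPTION: the order in which the two rules are applied is a guess;
    the slide only lists the thresholds themselves.
    """
    ratio_gio = overlap / gio_area      # overlap area / GIO area
    ratio_query = overlap / query_area  # overlap area / query area
    if ratio_gio <= 0.00025:
        return "not relevant"           # sliver match, removed automatically
    if ratio_query < 0.001 and ratio_gio < 0.02:
        return "manual review"          # borderline case checked by hand
    return "relevant"


# Hypothetical areas in arbitrary units.
print(classify_overlap(overlap=0.0001, gio_area=1.0, query_area=1.0))
print(classify_overlap(overlap=0.0005, gio_area=0.1, query_area=1.0))
print(classify_overlap(overlap=0.5, gio_area=1.0, query_area=1.0))
```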

148 2007.04.10 - SLIDE 148IS 240 – Spring 2007 LR model X 1 = area of overlap(query region, candidate GIO) / area of query region X 2 = area of overlap(query region, candidate GIO) / area of candidate GIO Where: Range for all variables is 0 (not similar) to 1 (same)

149 2007.04.10 - SLIDE 149IS 240 – Spring 2007 Some of our Results Mean Average Query Precision: the average precision values after each new relevant document is observed in a ranked list. For metadata indexed by CA named place regions: For all metadata in the test collection: These results suggest: –Convex Hulls perform better than MBRs. Expected result given that the CH is a higher quality approximation. –A probabilistic ranking based on MBRs can perform as well if not better than a non-probabilistic ranking method based on Convex Hulls. Interesting: since any approximation other than the MBR requires great expense, this suggests that the exploration of new ranking methods based on the MBR is a good way to go.
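
The evaluation measure named on this slide can be sketched as follows. This is the common non-interpolated form, in which precision is recorded at the rank of each relevant document and divided by the total number of relevant documents for the query; whether the study normalized in exactly this way is an assumption, and the document identifiers are made up.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of the precision values observed at each relevant document's rank."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           judgements: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision over all query regions."""
    scores = [average_precision(runs[q], judgements[q]) for q in runs]
    return sum(scores) / len(scores)


# Hypothetical ranked lists and relevance judgements for two query regions.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["a", "b"]}
judgements = {"q1": {"d1", "d3"}, "q2": {"b"}}
print(mean_average_precision(runs, judgements))
```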

150 2007.04.10 - SLIDE 150IS 240 – Spring 2007 Some of our Results Mean Average Query Precision: the average precision values after each new relevant document is observed in a ranked list. For metadata indexed by CA named place regions: For all metadata in the test collection: BUT: The inclusion of UDA indexed metadata reduces precision. This is because coarse approximations of onshore or coastal geographic regions will necessarily include much irrelevant offshore area, and vice versa

151 2007.04.10 - SLIDE 151IS 240 – Spring 2007 Results for MBR - Named data (precision-recall plot)

152 2007.04.10 - SLIDE 152IS 240 – Spring 2007 Results for Convex Hulls - Named data (precision-recall plot)

153 2007.04.10 - SLIDE 153IS 240 – Spring 2007 Offshore / Coastal Problem California EEZ Sonar Imagery Map – GLORIA Quad 13 PROBLEM: the MBR for GLORIA Quad 13 overlaps with several counties that are completely inland.

154 2007.04.10 - SLIDE 154IS 240 – Spring 2007 Adding Shorefactor Feature Variable Shorefactor = 1 – abs(fraction of query region approximation that is onshore – fraction of candidate GIO approximation that is onshore) Candidate GIO MBRs: A) GLORIA Quad 13: fraction onshore = .55 B) WATER Project Area: fraction onshore = .74 Query Region MBR: Q) Santa Clara County: fraction onshore = .95 Computing Shorefactor: Q – A Shorefactor: 1 – abs(.95 – .55) = .60 Q – B Shorefactor: 1 – abs(.95 – .74) = .79 Even though A & B have the same area of overlap with the query region, B has a higher shorefactor, which would weight this GIO’s similarity score higher than A’s. Note: geographic content of A is completely offshore, that of B is completely onshore.
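
The shorefactor arithmetic on this slide can be checked directly. The sketch below takes the onshore fractions quoted on the slide as given inputs; computing those fractions from geometry is not shown, and the function name `shorefactor` is an illustrative choice.

```python
def shorefactor(query_onshore: float, gio_onshore: float) -> float:
    """1 - |onshore fraction of query approximation - onshore fraction of GIO approximation|."""
    return 1.0 - abs(query_onshore - gio_onshore)


# Onshore fractions quoted on the slide.
Q = 0.95  # Santa Clara County (query region MBR)
A = 0.55  # GLORIA Quad 13 (candidate GIO MBR)
B = 0.74  # WATER Project Area (candidate GIO MBR)

print(f"Q-A shorefactor: {shorefactor(Q, A):.2f}")
print(f"Q-B shorefactor: {shorefactor(Q, B):.2f}")
```

The value is symmetric in its two arguments and equals 1 only when both approximations have the same onshore fraction.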

155 2007.04.10 - SLIDE 155IS 240 – Spring 2007 About the Shorefactor Variable Characterizes the relationship between the query and candidate GIO regions based on the extent to which their approximations overlap with onshore areas (or offshore areas). Assumption: a candidate region is more likely to be relevant to the query region if the extent to which its approximation is onshore (or offshore) is similar to that of the query region’s approximation.

156 2007.04.10 - SLIDE 156IS 240 – Spring 2007 About the Shorefactor Variable The use of the shorefactor variable is presented as an example of how geographic context can be integrated into the spatial ranking process. Performance: Onshore fraction for each GIO approximation can be pre-indexed. Thus, for each query only the onshore fraction of the query region needs to be calculated using a geometric operation. The computational complexity of this type of operation is dependent on the complexity of the coordinate representations of the query region (we used the MBR and Convex hull approximations) and the onshore region (we used a very generalized concave polygon w/ only 154 pts).

157 2007.04.10 - SLIDE 157IS 240 – Spring 2007 Shorefactor Model X1 = area of overlap(query region, candidate GIO) / area of query region X2 = area of overlap(query region, candidate GIO) / area of candidate GIO X3 = 1 – abs(fraction of query region approximation that is onshore – fraction of candidate GIO approximation that is onshore) –Where: Range for all variables is 0 (not similar) to 1 (same)

158 2007.04.10 - SLIDE 158IS 240 – Spring 2007 Some of our Results, with Shorefactor These results suggest: Addition of Shorefactor variable improves the model (LR 2), especially for MBRs Improvement not so dramatic for convex hull approximations – b/c the problem that shorefactor addresses is not that significant when areas are represented by convex hulls. For all metadata in the test collection: Mean Average Query Precision: the average precision values after each new relevant document is observed in a ranked list.

159 2007.04.10 - SLIDE 159IS 240 – Spring 2007 Results for All Data - MBRs (precision-recall plot)

160 2007.04.10 - SLIDE 160IS 240 – Spring 2007 Results for All Data - Convex Hull (precision-recall plot)

161 2007.04.10 - SLIDE 161IS 240 – Spring 2007 Future work Improve test collection –Add to the set of queries + relevance judgements (i.e. so query regions are not just based on counties). –Remove/decrease subjectivity of relevance judgements for GIOs referenced by UDAs. –Add metadata to test collection –Add random selection of queries & metadata Test other geometric approximations –5-corner convex polygon –Concave approximations Test other spatial feature variables

