Part 3B: Text Indexing, Term Lists & Taxonomies

1 Part 3B: Text Indexing, Term Lists & Taxonomies

2 Value space continuum of expressivity…
Less → More: increasing control over form, relationships and meaning…
Value spaces placed along the continuum: text indexing, thesauri, ontology, term lists, faceted classification, taxonomies, analytico-synthetic classification, tagging, enumerated classification.

3 Text Indexing ▪ Full-text and inverted files/indexes

4 Inverted files… The primary form of index developed for use in information systems for full-text retrieval. It is called an “inverted file” because the normal rows (documents) and columns (words) of a database are inverted, with rows representing words and columns representing documents.

5 Example inverted file…
For example, assume you have a database of houses (rows), and one field (column) for each of those houses is the price. If you want to search rapidly by house price, build an inverted file with the rows being the prices and the columns the houses. You look up the price once and harvest the row’s columns for the houses.

Main Data File
ID  HOUSE                          PRICE
1   1208 Twin Oaks Way             $100,000
2   100 Sutton Heights             $200,000
3   10 Pine Street                 $150,000
4   8539 Billings Circle           $100,000
5   9537 Highway 101 North         $100,000
6   10 Capitol Hill Avenue North   $150,000

Inverted File or Inverted Index
PRICE      HOUSE IDs
$100,000   1 4 5
$150,000   3 6
$200,000   2
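The house-price lookup can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the original slides: the names `HOUSES` and `build_price_index` are made up for the example, and the prices for houses 4 to 6 are read off the rows of the inverted file.

```python
from collections import defaultdict

# Main data file: house ID -> (address, price in dollars).
# Prices for houses 4-6 are taken from the rows of the inverted file.
HOUSES = {
    1: ("1208 Twin Oaks Way", 100_000),
    2: ("100 Sutton Heights", 200_000),
    3: ("10 Pine Street", 150_000),
    4: ("8539 Billings Circle", 100_000),
    5: ("9537 Highway 101 North", 100_000),
    6: ("10 Capitol Hill Avenue North", 150_000),
}


def build_price_index(houses):
    """Invert the main file: map each price to the sorted IDs of houses at that price."""
    index = defaultdict(list)
    for house_id, (_address, price) in sorted(houses.items()):
        index[price].append(house_id)
    return dict(index)


PRICE_INDEX = build_price_index(HOUSES)

# One lookup on the price yields every matching house ID.
for house_id in PRICE_INDEX.get(100_000, []):
    print(house_id, HOUSES[house_id][0])
```

The index is built once; after that, a price query is a single dictionary lookup instead of a scan over every house record.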

6 Inverted file (document level)…
Document  Text
1         Gold silver truck
2         Shipment of gold damaged in a fire
3         Delivery of silver arrived in a silver truck
4         Shipment of gold arrived in a truck

Number  Term      <Times; Documents>
1       a         <3; 2,3,4>
2       arrived   <2; 3,4>
3       damaged   <1; 2>
4       delivery  <1; 3>
5       fire      <1; 2>
6       gold      <3; 1,2,4>
7       of        <3; 2,3,4>
8       in        <3; 2,3,4>
9       shipment  <2; 2,4>
10      silver    <2; 1,3>
11      truck     <3; 1,3,4>
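A document-level inverted file like the one on this slide could be built roughly as follows. This is a minimal sketch under simple assumptions (lower-casing and whitespace tokenization); `DOCS` and `build_document_index` are illustrative names, not something the slides define.

```python
# The four example documents from the slide, keyed by document number.
DOCS = {
    1: "Gold silver truck",
    2: "Shipment of gold damaged in a fire",
    3: "Delivery of silver arrived in a silver truck",
    4: "Shipment of gold arrived in a truck",
}


def build_document_index(docs):
    """Map each term to (number of documents containing it, sorted document IDs)."""
    postings = {}
    for doc_id in sorted(docs):
        # set() so a term repeated within one document is recorded once.
        for term in set(docs[doc_id].lower().split()):
            postings.setdefault(term, []).append(doc_id)
    return {term: (len(ids), ids) for term, ids in sorted(postings.items())}


DOC_INDEX = build_document_index(DOCS)

for term, (times, doc_ids) in DOC_INDEX.items():
    print(f"{term} <{times}; {','.join(map(str, doc_ids))}>")
```

Note that "silver" occurs twice in document 3 but is listed once, because a document-level index records only which documents contain a term.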

7 Inverted file (term-level)…
Document  Text
1         Gold silver truck
2         Shipment of gold damaged in a fire
3         Delivery of silver arrived in a silver truck
4         Shipment of gold arrived in a truck

Number  Term      <Times; (Document; Words)>
1       a         <3; (2;6),(3;6),(4;6)>
2       arrived   <2; (3;4),(4;4)>
3       damaged   <1; (2;4)>
4       delivery  <1; (3;1)>
5       fire      <1; (2;7)>
6       gold      <3; (1;1),(2;3),(4;3)>
7       of        <3; (2;2),(3;2),(4;2)>
8       in        <3; (2;5),(3;5),(4;5)>
9       shipment  <2; (2;1),(4;1)>
10      silver    <2; (1;2),(3;3,7)>
11      truck     <3; (1;3),(3;8),(4;7)>

Recording word positions provides proximity operator support.
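To make the proximity point concrete, here is a small sketch of a term-level (positional) index and a NEAR-style query over it. The function names `build_positional_index` and `near` and the `max_distance` parameter are assumptions made for this example; real retrieval systems define their own proximity operators.

```python
DOCS = {
    1: "Gold silver truck",
    2: "Shipment of gold damaged in a fire",
    3: "Delivery of silver arrived in a silver truck",
    4: "Shipment of gold arrived in a truck",
}


def build_positional_index(docs):
    """Map each term to {document ID: [1-based word positions]}."""
    index = {}
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id].lower().split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


def near(index, term_a, term_b, max_distance):
    """Return documents where the two terms occur within max_distance words of each other."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    # Only documents containing both terms can match.
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(
            abs(pos_a - pos_b) <= max_distance
            for pos_a in postings_a[doc_id]
            for pos_b in postings_b[doc_id]
        ):
            hits.append(doc_id)
    return hits


POS_INDEX = build_positional_index(DOCS)

print(POS_INDEX["silver"])                    # positions of "silver" per document
print(near(POS_INDEX, "silver", "truck", 1))  # adjacent occurrences
print(near(POS_INDEX, "gold", "truck", 2))    # at most two words apart
```

A document-level index could only say that "gold" and "truck" share documents 1 and 4; the stored positions are what let the query tell those two documents apart.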

8 Inverted file (document level)…
Document  Text
1         Gold silver truck
2         Shipment of gold damaged in a fire
3         Delivery of silver arrived in a silver truck
4         Shipment of gold arrived in a truck

Number  Term      <Times; Documents>
1       a         <3; 2,3,4>
2       arrived   <2; 3,4>
3       damaged   <1; 2>
4       delivery  <1; 3>
5       fire      <1; 2>
6       gold      <3; 1,2,4>
7       of        <3; 2,3,4>
8       in        <3; 2,3,4>
9       shipment  <2; 2,4>
10      silver    <2; 1,3>
11      truck     <3; 1,3,4>

Stop words

With very sophisticated full-text retrieval systems, the aggregate size of the inverted files necessary to support search can be larger than the text files they index.
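One common way to keep an inverted file smaller is to skip stop words at indexing time. The sketch below assumes the stop list is `a`, `of` and `in`; that choice is an assumption for illustration, since the transcript does not say which terms the slide marked. `build_index` and `posting_count` are likewise hypothetical helper names.

```python
DOCS = {
    1: "Gold silver truck",
    2: "Shipment of gold damaged in a fire",
    3: "Delivery of silver arrived in a silver truck",
    4: "Shipment of gold arrived in a truck",
}

# Assumed stop list for this example only.
STOP_WORDS = frozenset({"a", "of", "in"})


def build_index(docs, stop_words=frozenset()):
    """Document-level inverted index that ignores any term in stop_words."""
    index = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()) - stop_words:
            index.setdefault(term, []).append(doc_id)
    return index


def posting_count(index):
    """Total number of (term, document) entries stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())


FULL = build_index(DOCS)
FILTERED = build_index(DOCS, STOP_WORDS)

print(len(FULL), posting_count(FULL))          # 11 terms, 24 postings
print(len(FILTERED), posting_count(FILTERED))  # 8 terms, 15 postings
```

On this tiny collection the three stop words account for 9 of the 24 postings, which hints at why high-frequency function words are a frequent target when index size matters. The trade-off is that queries containing those words can no longer be matched exactly.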

9 Term Lists

10 Term lists… The simplest forms of controlled value spaces are term lists: lists of controlled terms ordered by some principle (frequently alphabetical).
Examples:
▪ The list of authorized U.S. state abbreviations
▪ An alphabetic list of enumerated subject terms
Infants (preferred term)
Ankle biters
Rug rats
Don’t underestimate the power of these simple, controlled lists.
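A controlled term list with a preferred term can be sketched as a lookup from variant wording to the authorized form. The example below reads the slide as treating "Ankle biters" and "Rug rats" as variants of the preferred term "Infants"; the names `PREFERRED_TERMS`, `VARIANTS` and `to_preferred` are invented for the illustration.

```python
# Authorized (preferred) terms in the controlled list.
PREFERRED_TERMS = {"Infants"}

# Non-preferred variants, stored in normalized form, pointing at their preferred term.
VARIANTS = {
    "ankle biters": "Infants",
    "rug rats": "Infants",
}


def to_preferred(term):
    """Return the preferred term for the input, or None if it is not in the list."""
    # Normalize case and collapse runs of whitespace before matching.
    key = " ".join(term.split()).casefold()
    for preferred in PREFERRED_TERMS:
        if preferred.casefold() == key:
            return preferred
    return VARIANTS.get(key)


print(to_preferred("Rug rats"))
print(to_preferred("infants"))
print(to_preferred("toddlers"))
```

Returning `None` for unknown input is the point of the control: anything outside the list is rejected or sent for review rather than silently accepted as a new value.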

11 Simple (yet powerful) lists…
A list (also sometimes called a pick list) is a limited set of terms arranged as a simple alphabetical list or in some other logically evident way. Lists are used to describe aspects of entities that have a limited number of possibilities. Examples include geography (e.g., country, state, city), language (e.g., English, French, Swedish), or format (e.g., text, image, sound).

Simple alphabetical list: Alabama, Alaska, Arkansas, California, Connecticut, Delaware
Simple logical list: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto*

12 Taxonomies ▪ Yahoo! Directory

13 Dominant form on the Web…
▪ Hierarchical tree structure
▪ Example: Yahoo! Directory
▪ Frequently permit polyhierarchy (multiple parents)
▪ No general principles guiding design of taxonomies

“A collection of controlled vocabulary terms organized into a hierarchical structure. Each term in a taxonomy is in one or more parent/child (broader/narrower) relationships to other terms in the taxonomy.” [NISO/Z39.19] [emphasis added]

This is not intended to imply that Web taxonomies are necessarily unprincipled. It just means that, as a form, they do not have the same guiding principles that we will see with thesauri and classifications. Individual taxonomy designs can be rigorously structured, with consistent, intelligent, well-thought-out forms of cross-referencing and other devices.

19 Polyhierarchy

21 Polyhierarchy… [NISO/Z39.19]
Based on generic relationship: musical instruments; stringed instruments, percussion instruments; piano
Based on whole-part relationship: biology, chemistry; biochemistry
Based on multiple types of relationship: bones, head; skull
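Because a term in a polyhierarchy can have more than one parent, a plain tree is not enough; one simple representation is a mapping from each term to a list of its broader terms. The sketch below uses the three examples from the slide. Placing "musical instruments" above both instrument families is a reading of the flattened diagram, and `PARENTS`, `ancestors` and `polyhierarchical_terms` are illustrative names.

```python
# Each term maps to its broader (parent) terms; more than one parent means polyhierarchy.
PARENTS = {
    "stringed instruments": ["musical instruments"],    # assumed from the diagram layout
    "percussion instruments": ["musical instruments"],  # assumed from the diagram layout
    "piano": ["stringed instruments", "percussion instruments"],
    "biochemistry": ["biology", "chemistry"],
    "skull": ["bones", "head"],
}


def ancestors(term, parents=PARENTS):
    """Collect every broader term reachable from the given term."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def polyhierarchical_terms(parents=PARENTS):
    """Terms that sit under more than one parent."""
    return sorted(term for term, broader in parents.items() if len(broader) > 1)


print(sorted(ancestors("piano")))
print(polyhierarchical_terms())
```

The `found` set matters here: "musical instruments" is reachable from "piano" by two different paths, and it should be reported only once.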

22 Node Labels
Non-indexable concepts used for purposes of organizing other concepts in meaningful ways

milk
. <milk by source animal>
. . buffalo milk
. . cow milk
. . goat milk
. . sheep milk
. <milk by region>
. . United States
. . India
. . China
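The distinction between node labels and indexable terms can be illustrated with a small nested structure. This is a sketch only: it assumes node labels are recognizable by their angle brackets, as in the milk example, and the names `MILK`, `is_node_label`, `indexable_terms` and `render` are hypothetical.

```python
# Each node is (label, children). Labels in angle brackets are node labels.
MILK = (
    "milk",
    [
        (
            "<milk by source animal>",
            [("buffalo milk", []), ("cow milk", []), ("goat milk", []), ("sheep milk", [])],
        ),
        (
            "<milk by region>",
            [("United States", []), ("India", []), ("China", [])],
        ),
    ],
)


def is_node_label(label):
    """Node labels are written in angle brackets and are not used for indexing."""
    return label.startswith("<") and label.endswith(">")


def indexable_terms(node):
    """Walk the hierarchy and return only the terms an indexer may assign."""
    label, children = node
    terms = [] if is_node_label(label) else [label]
    for child in children:
        terms.extend(indexable_terms(child))
    return terms


def render(node, depth=0):
    """Produce the dotted-indent display, node labels included."""
    label, children = node
    lines = [". " * depth + label]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines


print("\n".join(render(MILK)))
print(indexable_terms(MILK))
```

The display keeps the node labels because they organize the hierarchy for the reader, while the list handed to indexers leaves them out.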

23 End • Part 3B: Text Indexing, Term Lists & Taxonomies

