Digital Libraries: Models and Content
Goals for tonight Finish up from last week – the 5 S model more formally – Status of the systems available Obtaining, describing, indexing content – XML – Dublin Core – Introducing content exchanges (OAI)
Applying the 5S model, informally
Choose a subject area, then answer the questions:
– Stream - What types of data? gif, jpg, avi, docx, pdf, html?
– Structure - How are the elements organized? Is there a hierarchy? Are there multiple structures?
– Spaces - How will we index the items? How will we divide them into related groups?
– Scenarios - What services will we provide? What information do we need to provide those services? What events might happen that we need to plan for?
– Societies - Who is the library intended to serve? Remember to include agents and other processes as well as users.
This is the first deliverable for your first project.
More formally: Definitions Definition: A stream is a sequence whose codomain is a nonempty set. Definition: A structure is a tuple (G, L, F), where G = (V, E) is a directed graph with vertex set V and edge set E, L is a set of label values, and F is a labeling function, F : (V ∪ E) → L. See http://www.mathsisfun.com/sets/domain-range-codomain.html for a nice description of domain, range, and codomain if you need it.
Structure illustration [Figure: a Collection node with "includes" edges to Images, Audio files, and Books] A very simple structure. How might it be enhanced? How would an index be included? What substructures might be added? What are the G, L, F, V, E parts of this example?
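One possible reading of the Collection example as a structure (G, L, F) is sketched below. The label values in L and the choice of labels for each vertex are assumptions made for illustration; other answers to the slide's questions are equally valid.

```python
# Hypothetical encoding of the Collection example as a structure (G, L, F).
V = {"Collection", "Images", "Audio files", "Books"}   # vertex set
E = {("Collection", "Images"),
     ("Collection", "Audio files"),
     ("Collection", "Books")}                           # directed edge set
L = {"collection", "item type", "includes"}            # label values (assumed)

# F assigns a label from L to every vertex and every edge.
F = {v: ("collection" if v == "Collection" else "item type") for v in V}
F.update({e: "includes" for e in E})


def is_structure(V, E, L, F):
    """Check that edges join known vertices and that F : (V ∪ E) → L is total."""
    edges_ok = all(src in V and dst in V for src, dst in E)
    labeling_ok = set(F) == V | E and all(label in L for label in F.values())
    return edges_ok and labeling_ok


print(is_structure(V, E, L, F))
```

Adding an index or a substructure would mean adding vertices and edges to G and, if needed, new label values to L.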
Definitions, cont’d Definition: A space is a measurable space, measure space, probability space, vector space, topological space, or metric space – A vector space is a representation for the set of elements in a collection. The vector representing each element lists the characteristics of that element; those characteristics both connect the element to others that are similar and distinguish it from those that are different. – We will do an exercise to illustrate.
Vector space illustration Consider a car. What are the characteristics that you associate with a car? – If you want to compare one car to another, what characteristics would you choose? – If you wanted to distinguish a car from another type of vehicle, what characteristics would you need? distinguish from a snowmobile distinguish from a truck Make a vector of those characteristics. Then, fill in the vector for several specific cars.
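A minimal sketch of the exercise, assuming four made-up characteristics and illustrative values (not measured data). Cosine similarity is one common way to compare such vectors; the slide does not prescribe a particular measure.

```python
import math

# Assumed characteristics; the class exercise may choose different ones.
FEATURES = ["wheels", "seats", "cargo_capacity_m3", "runs_on_snow"]

# Illustrative values only, one vector per vehicle, in FEATURES order.
vehicles = {
    "compact car":  [4, 5, 0.4, 0],
    "sedan":        [4, 5, 0.5, 0],
    "pickup truck": [4, 3, 1.8, 0],
    "snowmobile":   [0, 2, 0.05, 1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for name, vector in vehicles.items():
    score = cosine_similarity(vehicles["compact car"], vector)
    print(f"compact car vs {name}: {score:.3f}")
```

With these values the two cars come out most similar to each other, the truck less so, and the snowmobile least, which is the kind of grouping and separation the vector is meant to support.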
Definitions - 3 Definition: A scenario is a sequence of related transition events $(e_1, e_2, \ldots, e_n)$ on state set $S$ such that $e_k = (s_k, s_{k+1})$ for $1 \le k \le n$. – More easily visualized, a scenario is a path in a directed graph $G = (S, \Sigma_e)$, where vertices correspond to states in the state set $S$, and directed edges are equivalent to events in a set of events $\Sigma_e$ and correspond to transitions between states. – Scenarios must be implemented to make a working system.
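The path view of a scenario can be sketched as follows. The states and events are hypothetical, chosen to resemble a simple search-and-deliver service; only the chaining rule, that each event must start in the state where the previous event ended, comes from the definition.

```python
# Hypothetical state set S and event set for a search-and-deliver service.
STATES = {"start", "query entered", "results shown", "item delivered"}
EVENTS = {
    ("start", "query entered"),
    ("query entered", "results shown"),
    ("results shown", "query entered"),   # the user refines the query
    ("results shown", "item delivered"),
}


def is_scenario(events):
    """True if events is a non-empty list of known transitions that chain,
    i.e. each event starts in the state where the previous one ended."""
    if not events or any(event not in EVENTS for event in events):
        return False
    return all(prev[1] == nxt[0] for prev, nxt in zip(events, events[1:]))


search_and_get = [
    ("start", "query entered"),
    ("query entered", "results shown"),
    ("results shown", "item delivered"),
]
print(is_scenario(search_and_get))
```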
Definitions - 4 Definition: A society is a tuple $(C, R)$ where – $C = \{c_1, c_2, \ldots, c_n\}$ is a set of conceptual communities, each community referring to a set of individuals of the same class or type (e.g. actors, activities, components, hardware, software, data); – $R = \{r_1, r_2, \ldots, r_m\}$ is a set of relationships, each relationship being a tuple $r_j = (e_j, i_j)$, where $e_j$ is a Cartesian product $c_{k_1} \times c_{k_2} \times \cdots \times c_{k_{n_j}}$, $1 \le k_1 < k_2 < \cdots < k_{n_j} \le n$, which specifies the communities involved in the relationship, and $i_j$ is an activity.
Projects in our DL laboratory Mendel 289 is the center of activity for digital library projects and related work. Summary of the projects under way, which may present opportunities for class projects or for independent study: NSDL, CITIDEL, CSTA, Ensemble, Distributed Expertise, Computing Ontology, Interdisciplinary Computing and its relationship to the libraries ….
Our systems Now available – Fedora Linux machines, remotely accessible (use the gateway) – Bare machines with just the basic system – We can install Drupal either from the Drupal site (doing things for ourselves) or from the Bitnami site (builds the stack for us). I just heard that Drupal may already be installed. Feel free to uninstall and reinstall if you wish. If you have a computer of your own and want to use it: – Fine, but you must be able to demonstrate it to the class at the end of the semester. I will need to be able to see what you are doing from time to time during the semester. – That means you need a static IP address.
The Digital Library Content Essential elements for a digital library – Users – Content – Services
Content - requirements Obtain Store – Organize – Describe Find Deliver
Describing the content How to describe content – Metadata Machine readable description of anything What description – Machine readable requires standard descriptive elements Dublin Core (http://dublincore.org/) – International standard – “a standard for cross-domain information resource description.” – 15 descriptive elements Other metadata schemes – IEEE-LOM
Metadata What does metadata look like? Metadata is data about data – Information about a resource, encoded in the resource or associated with the resource. The language of metadata: XML – eXtensible Markup Language
XML XML is a markup language XML describes features There is no standard set of XML tags; you define your own Use XML to create a resource type Separately develop software to interact with the data described by the XML codes. Source: tutorial at w3schools.com
XML rules Easy rules, but very strict First line is the version and character set used: – The rest is user defined tags Every tag has an opening and a closing
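The two rules can be checked with Python's standard library parser. This is a minimal sketch: the tag names are made up, and ISO-8859-1 is assumed as the character set because the recipe example mentions ISO 8859.

```python
import xml.etree.ElementTree as ET

# First line: XML version and character set. The remaining tags are user defined.
WELL_FORMED = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<note><to>Caf\xe9 staff</to></note>"
)
# Same document, but the <to> tag is never closed.
NOT_WELL_FORMED = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<note><to>Caf\xe9 staff</note>"
)


def is_well_formed(document: bytes) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(WELL_FORMED), is_well_formed(NOT_WELL_FORMED))
```

The parser uses the declared character set to decode the byte `\xe9` as "é", which is why the declaration has to come first.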
Element naming XML elements must follow these naming rules: –Names can contain letters, numbers, and other characters –Names must not start with a number or punctuation character –Names must not start with the letters xml (or XML or Xml..) –Names cannot contain spaces
Elements and attributes Use elements to describe data Use attributes to present information that is not part of the data – For example, the file type or some other information that would be useful in processing the data, but is not part of the data.
Repeating elements In a DTD content model, naming an element with no suffix means it appears exactly once. Name+ means it appears one or more times. Name* means it appears 0 or more times. Name? means it appears 0 or one time.
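The occurrence rules behave like regular expression quantifiers, which suggests a small checker. This is a sketch that handles only a flat, comma-separated list of names (no nested groups or alternatives); the element names are hypothetical.

```python
import re


def content_model_to_regex(model):
    """Translate a flat content model such as 'title, ingredient+, note*'
    into a regex over a space-terminated list of child element names."""
    parts = []
    for item in model.split(","):
        item = item.strip()
        if item[-1] in "+*?":
            name, indicator = item[:-1], item[-1]
        else:
            name, indicator = item, ""   # no suffix: exactly once
        parts.append(f"(?:{re.escape(name)} ){indicator}")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """True if the children appear in the order and counts the model allows."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None


MODEL = "title, ingredient+, note*, source?"
print(children_match(MODEL, ["title", "ingredient", "ingredient"]))
print(children_match(MODEL, ["title"]))
```

Element names cannot contain spaces, so a trailing space is a safe separator between names in the matched string.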
Parts of an XML document Elements – The components of an XML document – Some contain other parts, some are empty Ex in HTML: “br” or “table” in XML “ingredient” Attributes – Information about elements, not data Ex in HTML “src=” in XML “scale=” Entities – Special characters or strings with pre-assigned meaning Ex in HTML &nbsp; for non-breaking space PCDATA – Parsed Character data: text that will be parsed and interpreted by the reader. Tags and entities will be expanded and used in presentation. CDATA – Character data: text that will not be parsed and interpreted. It will be displayed exactly as provided. The HTML examples are familiar; the XML examples are made up – dependent on the specific XML scheme used
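The difference between parsed text and a CDATA section can be seen by parsing a small document. The element names here are made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<recipe>
  <note>Beat for 5 &lt; 10 minutes &amp; cool</note>
  <raw><![CDATA[Use <b>stiff</b> peaks & no yolk]]></raw>
</recipe>"""

root = ET.fromstring(DOC)

# Parsed character data: the entities &lt; and &amp; are expanded by the parser.
parsed_text = root.find("note").text
# CDATA section: markup characters are passed through exactly as written.
cdata_text = root.find("raw").text

print(parsed_text)
print(cdata_text)
```

Note that XML itself predefines only five entities (`&lt;`, `&gt;`, `&amp;`, `&apos;`, `&quot;`); HTML entities such as `&nbsp;` must be declared before they can be used in an XML document.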
Using XML - an example Define the fields of a recipe collection: ISO 8859 is a character set. See http://www.bbsinc.com/iso8859.html
Processing the XML data How do we know what to do with the information in an XML file? – Document Type Definition (DTD) Put in the same file as the data -- immediate reference Put a reference to an external description Provides the definition of the legitimate content for each element
Document Type Definition [DTD listing not preserved; callout: "Repeat 0 or more times"]
[XML recipe listing; markup not preserved. Callout: External reference to DTD] Content of the entry: Meringue cookies. 3 egg whites; 1 cup sugar; 1 teaspoon vanilla; 2 cups mini chocolate chips. Beat the egg whites until stiff. Stir in sugar, then vanilla. Gently fold in chocolate chips. Place in warm oven at 200 degrees for an hour. Alternatively, place in an oven at 350 degrees. Turn oven off and leave overnight. Not the way that I want to see a recipe in a magazine! What could we do with a large collection of such entries? How would we get the information entered into a collection?
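Since the original markup did not survive, the following is only a guess at what such an entry could look like. The element and attribute names and the DTD are assumptions, not the slide's actual listing. The DTD is shown for reference only, because Python's ElementTree does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD for the recipe collection (reference only, not enforced here).
RECIPE_DTD = """
<!ELEMENT collection (recipe*)>
<!ELEMENT recipe (title, ingredient+, directions)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT ingredient (#PCDATA)>
<!ATTLIST ingredient amount CDATA #REQUIRED
                     unit   CDATA #IMPLIED>
<!ELEMENT directions (#PCDATA)>
"""

# Hypothetical encoding of the meringue cookie entry.
RECIPE_XML = """<collection>
  <recipe>
    <title>Meringue cookies</title>
    <ingredient amount="3">egg whites</ingredient>
    <ingredient amount="1" unit="cup">sugar</ingredient>
    <ingredient amount="1" unit="teaspoon">vanilla</ingredient>
    <ingredient amount="2" unit="cups">mini chocolate chips</ingredient>
    <directions>Beat the egg whites until stiff. Stir in sugar, then vanilla.
    Gently fold in chocolate chips.</directions>
  </recipe>
</collection>"""


def shopping_list(xml_text):
    """Collect 'amount unit name' strings for every ingredient in the collection."""
    root = ET.fromstring(xml_text)
    items = []
    for ingredient in root.iter("ingredient"):
        parts = (ingredient.get("amount"), ingredient.get("unit"), ingredient.text)
        items.append(" ".join(part for part in parts if part))
    return items


for line in shopping_list(RECIPE_XML):
    print(line)
```

Once entries are structured this way, software can do things a magazine layout cannot, such as building a shopping list, searching by ingredient, or scaling amounts.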
XML exercise Design an XML schema for an application of your choice. Keep it simple. Examples -- address book, TV program listing, DVD collection, …
Another example A paper with content encoded with XML: http://tecfaseed.unige.ch/staf18/modules/ePBL/uploads/proj3/paper81.xml First few lines (markup not preserved): Standards E-learning and their possible support for a rich pedagogic approach in a 'Integrated Learning' context; Rodolophe Borer; http://tecfa.unige.ch/perso/staf/borer/ The referenced DTD, "ePBLpaper11.dtd", is shown on the next slide.
Vocabulary Given the need for processing, do you want free text or restricted entries? Free text gives more flexibility for the person making the entry Controlled vocabulary helps with – Consistent processing – Comparison between entries Controlled vocabulary limits – Options for what is said
Vocabulary example Recipe example – What text should be controlled? – What should be free text? Ingredients – Ingredient-amount – Ingredient-name – Should we revise how we coded ingredient amount? Directions
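One way to explore the question about ingredient amounts is to split each amount into a free number and a unit drawn from a controlled vocabulary. The sketch below assumes a short, made-up list of allowed units; a real collection would agree on its own list.

```python
# Assumed controlled vocabulary for units; "each" covers countable items.
ALLOWED_UNITS = {"cup", "teaspoon", "tablespoon", "gram", "each"}


def parse_amount(text):
    """Split an amount such as '2 cups' into (quantity, unit).

    The quantity is free (any number); the unit must come from ALLOWED_UNITS.
    Raises ValueError for a unit outside the vocabulary.
    """
    quantity, _, unit = text.strip().partition(" ")
    unit = unit.strip().lower() or "each"
    if unit not in ALLOWED_UNITS and unit.endswith("s"):
        unit = unit[:-1]                 # accept simple plurals such as "cups"
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit not in controlled vocabulary: {unit!r}")
    return float(quantity), unit


print(parse_amount("2 cups"))
print(parse_amount("3"))
```

Ingredient names and directions would more naturally stay free text, while units are a good candidate for control because consistent units make comparison and scaling possible.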
Dublin Core Standard set of metadata fields for entries in digital libraries: – Title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, rights
Dublin Core elements see: http://dublincore.org/documents/dces/
C = controlled vocabulary recommended.
– Title
– Creator: Entity primarily responsible for making content of the resource
– Subject (C)
– Description
– Publisher: Entity making the resource available
– Contributor: Contributor to content of the resource
– Date: YYYY-MM-DD
– Type (C): Ex: collection, dataset, event, image
– Format (C): What is needed to display or operate the resource
– Identifier: Unambiguous ID
– Source
– Language: Standards RFC 3066, ISO639
– Relation: Ref. to related resource
– Coverage (C): Space, time, jurisdiction
– Rights: Rights Management information
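A Dublin Core description can be written as XML using the element names above. This is a minimal sketch: the `record` wrapper element and the sample values are assumptions for illustration, while the namespace URI is the one published for the 15-element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields):
    """Build a <record> element (hypothetical wrapper) with one dc:* child per field."""
    record = ET.Element("record")
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name!r}")
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Sample values are illustrative only.
record = dublin_core_record({
    "title": "Meringue cookies",
    "type": "Text",
    "format": "text/xml",
    "language": "en",
})
print(ET.tostring(record, encoding="unicode"))
```

Rejecting names outside the 15 elements is a simple form of the standardization that makes the description machine readable.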
Dublin Core Terms An update to the original DC elements – Adds the concept of range and domain
Each term has this minimal set of attributes:
– Name: A token appended to the URI of a DCMI namespace to create the URI of the term.
– Label: The human-readable label assigned to the term.
– URI: The Uniform Resource Identifier used to uniquely identify a term.
– Definition: A statement that represents the concept and essential nature of the term.
– Type of Term: The type of term as described in the DCMI Abstract Model [DCAM].
DC Terms Additional attributes possible:
– Comment: Additional information about the term or its application.
– See: Authoritative documentation related to the term.
– References: A resource referenced in the Definition or Comment.
– Refines: A Property of which the described term is a Sub-Property.
– Broader Than: A Class of which the described term is a Super-Class.
– Narrower Than: A Class of which the described term is a Sub-Class.
– Has Domain: A Class of which a resource described by the term is an Instance.
– Has Range: A Class of which a value described by the term is an Instance.
– Member Of: An enumerated set of resources (Vocabulary Encoding Scheme) of which the term is a Member.
– Instance Of: A Class of which the described term is an instance.
– Version: A specific historical description of a term.
– Equivalent Property: A Property to which the described term is equivalent.
DC terms See http://dublincore.org/documents/dcmi-terms/ Review the list and see what has been added
A Drupal example Ensemble: www.computingportal.org
IEEE - LOM Example of a specialized metadata scheme – Learning Object Metadata Specifically for collections of educational materials Includes all of Dublin Core See http://projects.ischool.washington.edu/sasutton/IEEE1484.html
Computing systems Linux machines Introduction to Unix: http://www.csc.villanova.edu/~lab/unix/ DSpace: http://www.dspace.org/ – Documentation, including installation - http://www.dspace.org/index.php?option=com_content&task=view&id=151&Itemid=116 Najib Nadi, our system administrator, is setting up the machines. He will send a message to the class by the middle of the week with details of machine location and login. Remember - you have the option to use your own machine, but must meet the criteria described last week.
This session Defined metadata and its role in digital libraries. Introduced XML as a language for describing a collection of content. Described the computing resources and how to get ready for the first DL setup.