Union Catalog and Knowledge Engineering for TELDAP Keh-Jiann Chen Principal Investigator Core Platforms for Digital Contents Project, TELDAP Research Fellow.

1 Union Catalog and Knowledge Engineering for TELDAP
Keh-Jiann Chen
Principal Investigator, Core Platforms for Digital Contents Project, TELDAP
Research Fellow, Research Center for Information Technology Innovation & Institute of Information Science, Academia Sinica
2012.04.20

2 Outline
- Introduction
- Union catalog
- Databases and metadata for digital contents and websites
- Knowledge engineering
- Future perspective

3 Introduction
The integration and management of digital content have become important issues as the amount of digital content produced by different projects and institutions increases rapidly. The goal of our project is to achieve optimized preservation, retrieval, and presentation of digital collections.

4 Outline
- Introduction
- Union catalog
- Databases and metadata for digital contents and websites
- Knowledge engineering
- Future perspective

5 What is the union catalog?
- It is a catalog and portal for all digital collections of TELDAP.
- It is an integrated platform for browsing and searching the entire digital content of TELDAP.
- Metadata provides the core description and licensing information of each digital collection.

6 Home Page of Union Catalog: browsing by topics; search by keywords

7 Some improved functions for IR
- Keyword suggestion
- Keyword extension
- Recommendation of related collections

8 Keyword suggestion

9 Keyword extension

10 Digital image; recommendation of related collections; hyperlink to database; metadata; citation; social networking service; licensing information

11 Outline
- Introduction
- Union catalog
- Databases and metadata for digital contents and websites
- Knowledge engineering
- Future perspective

12 Metadata models for different types of objects
- Archived digital items: union catalog metadata model (Dublin Core+)
- Web sites: DCCAP (Dublin Core Collections Application Profile)
  - Fields for internal use only: Unique Identifier, Format, Evaluation, Cataloging History
- Documents: document metadata (Dublin Core)

13 Metadata for digital items: over 4 million digital items and still increasing

Element: Definition
Title: A name given to the resource
Creator: An entity primarily responsible for making the content of the resource
Subject and Keywords: The topic of the content of the resource
Description: An account of the content of the resource
Publisher: An entity responsible for making the resource available
Contributor: An entity responsible for making contributions to the content of the resource
Date: A date associated with an event in the life cycle of the resource
Resource Type: The nature or genre of the content of the resource
Format: The physical or digital manifestation of the resource
Resource Identifier: An unambiguous reference to the resource within a given context
Source: A reference to a resource from which the present resource is derived
Language: A language of the intellectual content of the resource
Relation: A reference to a related resource
Coverage: The extent or scope of the content of the resource
Rights Management: Information about rights held in and over the resource
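
As a rough illustration of the element set listed on this slide, the sketch below represents one item's metadata as a Python dictionary keyed by the standard Dublin Core element names, and checks which elements are still empty. The sample record and both helper functions are hypothetical and are not taken from the TELDAP system.

```python
# The 15 Dublin Core element names that correspond to the slide's table.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Hypothetical sample record; every value is made up for illustration.
sample_record = {
    "title": "Sample landscape scroll",
    "creator": "Unknown",
    "subject": ["painting", "landscape"],
    "type": "Image",
    "format": "image/jpeg",
    "identifier": "example-0001",
    "language": "zh",
    "rights": "See licensing information",
}


def missing_elements(record):
    """Return the Dublin Core elements that are absent or empty in a record."""
    return [name for name in DC_ELEMENTS if not record.get(name)]


def unknown_fields(record):
    """Return record keys that are not Dublin Core elements (e.g. internal fields)."""
    return sorted(set(record) - set(DC_ELEMENTS))
```

A check like `missing_elements(sample_record)` could be run before a record is loaded into a catalog, so that incomplete descriptions are flagged early.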

15 Metadata for websites
- Over 690 websites and still increasing
- Metadata: DCCAP (Dublin Core Collections Application Profile)
- The standard is combined with our requirements: 19 data fields

16 The Website Homepage Picture URL, Project Information Type, Name, Author, Subject, Description, Language, Item Type, Target; Archived Information: URL, time, authorization; Copyright, Purpose, Other Information; Social networking service. Figure: http://digitalarchives.tw

17 Uses of Metadata
- Search collections by matching keywords and features
- Provide basic information on each collection
- Dynamic categorization
- Provide information to compute the similarity or relatedness of two collections
- Extract keywords

18 (1) Chinese Keyword Search
Data flow: Keyword + (Features) -> Synonyms, hyponyms -> Matched Collections -> Collections + Weights -> Display Results
Processing steps: Keyword Extension, Keyword Matching, Ranking, Filtering
Resources: AAT-Taiwan & TELDAP Thesaurus, Keyword Dictionary

19 English Keyword Search
Data flow: English Keyword + (Features) -> Translations, Synonyms, Hyponyms -> Matched Collections -> Collections + Weights -> Display Results
Processing steps: Keyword Translation & Extension, Keyword Matching, Ranking, Filtering
Resources: AAT-Taiwan & TELDAP Thesaurus, Keyword Dictionary
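
The search flow on slides 18 and 19 (extension, matching, ranking, filtering) could be sketched roughly as follows. This is only a toy illustration: the thesaurus entries, the collection keyword sets, the 1.0/0.5 match weights, and the function names are all assumptions, not the TELDAP implementation.

```python
# Hypothetical toy thesaurus: keyword -> synonyms and hyponyms.
THESAURUS = {
    "painting": {
        "synonyms": ["picture"],
        "hyponyms": ["ink painting", "oil painting"],
    },
}

# Hypothetical collections: collection id -> set of indexed keywords.
COLLECTIONS = {
    "c1": {"ink painting", "landscape"},
    "c2": {"picture", "portrait"},
    "c3": {"bronze", "vessel"},
}


def extend_keyword(keyword):
    """Keyword extension: the keyword plus its synonyms and hyponyms."""
    entry = THESAURUS.get(keyword, {})
    return [keyword] + entry.get("synonyms", []) + entry.get("hyponyms", [])


def search(keyword, min_weight=0.0):
    """Match, weight, filter, and rank collections for one keyword."""
    terms = extend_keyword(keyword)
    weights = {}
    for cid, indexed in COLLECTIONS.items():
        # Assumed weighting: 1.0 for the original keyword, 0.5 for an extended term.
        weight = sum(1.0 if t == keyword else 0.5 for t in terms if t in indexed)
        if weight > min_weight:  # filtering step
            weights[cid] = weight
    # Ranking step: highest weight first, ties broken by collection id.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
```

With this toy data, `search("painting")` finds `c1` and `c2` only through the extended terms, which is the point of the keyword extension step. For English queries, a translation lookup would presumably run before `extend_keyword`.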

20 Ranking Algorithm
Rank Value(item) = W1 * Association(Keyword, item) + W2 * Quality(item)
- Association(Keyword, item) = W1 * Topical Similarity(Topic(keyword), Topic(item)) + W2 * Importance of relation(Keyword, item)
- Quality(item) = W1 * Image quality(item) + W2 * Qualification of provider(item) + W3 * Metadata(item)
Topical Similarity(Topic(keyword), Topic(item)) = Ontology Distance(Topic(keyword), Topic(item))
Importance of relation(Keyword, item) = W1 * Keyword-from Value + W2 * Mutual Information(keyword, Topic(item))
Keyword-from Value = 1 if keyword is contained in title(item); 0.5 if keyword is contained in description(item)
Mutual Information(keyword, Topic(item)) = P(Keyword, Topic(item)) / {P(Keyword) * P(Topic(item))}
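
The ranking formulas on this slide can be sketched in code as below. Several things are assumptions here: the slide reuses the names W1, W2, W3 in each formula, so the sketch treats them as separate parameters with arbitrary default values; the topical similarity and mutual information scores are taken as precomputed inputs; and the Keyword-from Value is set to 0 when the keyword appears in neither title nor description, a case the slide does not define.

```python
from dataclasses import dataclass


@dataclass
class Item:
    title: str
    description: str
    topical_similarity: float      # precomputed from the ontology distance
    mutual_information: float      # P(k, t) / (P(k) * P(t)), precomputed
    image_quality: float
    provider_qualification: float
    metadata_quality: float


def keyword_from_value(keyword: str, item: Item) -> float:
    if keyword in item.title:
        return 1.0
    if keyword in item.description:
        return 0.5
    return 0.0  # assumption: the slide leaves this case unspecified


def importance_of_relation(keyword: str, item: Item, w1=0.5, w2=0.5) -> float:
    return w1 * keyword_from_value(keyword, item) + w2 * item.mutual_information


def association(keyword: str, item: Item, w1=0.5, w2=0.5) -> float:
    return w1 * item.topical_similarity + w2 * importance_of_relation(keyword, item)


def quality(item: Item, w1=0.5, w2=0.25, w3=0.25) -> float:
    return (w1 * item.image_quality
            + w2 * item.provider_qualification
            + w3 * item.metadata_quality)


def rank_value(keyword: str, item: Item, w1=0.5, w2=0.5) -> float:
    return w1 * association(keyword, item) + w2 * quality(item)
```

Sorting matched items by `rank_value` in descending order would then give the display order; in practice the weights would need tuning.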

21 Algorithm for Recommending Related Collections
- i-th Item Vector = {Topic, Institute, Keyword1, Keyword2, ...}
- Similar(i-th item, j-th item) = W1 * Topic Similar(i-th item, j-th item) + W2 * Institute Similar(i-th item, j-th item) + Weight(Keyword1) * Delta(Keyword1) + Weight(Keyword2) * Delta(Keyword2) + ..., where Delta(Keyword1) = 1 if Keyword1 of the i-th item is also a keyword of the j-th item, and 0 otherwise
- Recommendation = Similar(i-th item, j-th item) + Evaluation(j-th item)
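
A minimal sketch of this recommendation score is given below. The slide does not say how Topic Similar and Institute Similar are computed, so the sketch assumes a simple 1-or-0 equality test for both; the dictionary layout, the default weights, and the function names are likewise hypothetical.

```python
def similar(item_i, item_j, keyword_weights, w_topic=1.0, w_institute=0.5):
    """Similarity of two items: topic, institute, and shared weighted keywords."""
    score = 0.0
    if item_i["topic"] == item_j["topic"]:          # assumed 0/1 topic similarity
        score += w_topic
    if item_i["institute"] == item_j["institute"]:  # assumed 0/1 institute similarity
        score += w_institute
    for keyword in item_i["keywords"]:
        if keyword in item_j["keywords"]:           # Delta(keyword) = 1
            score += keyword_weights.get(keyword, 0.0)
    return score


def recommend(item_i, candidates, keyword_weights, top_n=3):
    """Rank candidate items by Similar(i, j) + Evaluation(j)."""
    scored = [
        (similar(item_i, c, keyword_weights) + c["evaluation"], c["id"])
        for c in candidates
        if c["id"] != item_i["id"]
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:top_n]]
```

Adding `Evaluation(j)` to the similarity means that a well-evaluated item can outrank a slightly more similar one, which matches the formula on the slide.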

22 (2) Dynamic categorization
- User-oriented categorization: general, elementary school students, high school students, researchers, etc.
- Topical-based categorization: archaeology, painting, animal, plant, document, etc.
- Functional-based categorization: research, education, business, technology, ...
- Categorization based on institutions: Academia Sinica, Taiwan U., Palace Museum, ...
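
One way to read "dynamic" categorization is that the same collection records are grouped on demand by whichever facet the user picks (user group, topic, function, or institution). The sketch below illustrates that reading only; the facet names and sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical collection records; each facet holds a list of values.
SAMPLE_COLLECTIONS = [
    {"id": "c1", "topic": ["painting"], "institution": ["Academia Sinica"],
     "audience": ["researchers"]},
    {"id": "c2", "topic": ["painting", "document"], "institution": ["Palace Museum"],
     "audience": ["general", "researchers"]},
]


def categorize(collections, facet):
    """Group collection ids by the values of one metadata facet."""
    groups = defaultdict(list)
    for collection in collections:
        for value in collection.get(facet, []):
            groups[value].append(collection["id"])
    return dict(groups)
```

Calling `categorize(SAMPLE_COLLECTIONS, "topic")` and `categorize(SAMPLE_COLLECTIONS, "institution")` yields two different category trees from the same metadata, without storing either tree in advance.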

23 (3) Multi-purposes of Core IR System and Databases
- TELDAP
  - Whole collections
  - Searched by institutes, domains, and media types (documents, images, videos, and web sites)
  - Monolingual
- Digital Shop
  - Whole collections or only fine arts
  - General search and search by licensing types
  - Relies on multilingual thesaurus
- Taiwan Academy
  - Fine arts
  - Searched by institutes and domains
  - Multilingual
  - Relies on multilingual thesaurus

24 Digitalarchives.tw. Figure: http://digitalarchives.tw

25 Digitalarchives.tw: Purpose: Education; Target: Elementary school student, Junior high school student, Teacher...; Purpose: Creative applications; Purpose: Academic research; Subject: Animal, Archaeology, Anthropology...

26 Taiwan Academy. Figure: http://taiwanacademy.tw

27 Taiwan Academy: categorization based on institutions; topical-based categorization

28 Outline
- Introduction
- Union catalog
- Databases and metadata for digital contents and websites
- Knowledge engineering
- Future perspective

29 Plans of making knowledge structures for TELDAP
- Construct metadata models for different objects.
- Establish hyperlinks between contents and objects.
- Develop keyword extraction tools.
- Design automatic tagging tools.
- Construct TELDAP ontology and thesaurus.
  - Art & Architecture Thesaurus by Getty
  - Chinese WordNet

30 (1) Metadata models for different objects
- Digital collections: union catalog metadata model (Dublin Core+)
- Web sites: DCCAP (Dublin Core Collections Application Profile)
  - Public fields
  - Private fields: Unique Identifier, Format, Evaluation, Cataloging History
- Documents: document metadata (Dublin Core)

31 (2) Create keyword dictionary
- Extract from metadata
- Collect from Google search terms
- By social tagging
- Collect manually while tagging hyperlinks

32 Lexical Entry of Keyword Dictionary
- Keyword id
- Keyword
- Synset id
- Hypernym id
- Hyponym id
- Features
- Related Collections + Association Strengths
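
The lexical entry listed on this slide maps naturally onto a small record type. The sketch below is one possible layout, not the actual dictionary schema: the field types are guesses, hyponyms are stored as a list on the assumption that a keyword may have several, and `top_collections` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KeywordEntry:
    keyword_id: int
    keyword: str
    synset_id: Optional[int] = None
    hypernym_id: Optional[int] = None
    hyponym_ids: List[int] = field(default_factory=list)   # assumed to be plural
    features: Dict[str, str] = field(default_factory=dict)
    # Related collections: collection id -> association strength.
    related_collections: Dict[str, float] = field(default_factory=dict)

    def top_collections(self, n: int = 5) -> List[str]:
        """Collection ids with the strongest association, strongest first."""
        ranked = sorted(self.related_collections.items(),
                        key=lambda kv: (-kv[1], kv[0]))
        return [collection_id for collection_id, _ in ranked[:n]]
```

For example, `KeywordEntry(1, "landscape", related_collections={"c1": 0.9, "c2": 0.4}).top_collections(1)` picks the single most strongly associated collection for that keyword.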

33 (2) Establish hyperlinks between contents and objects Identify keywords in contents. Tag keywords with related object hyperlinks.

34 Develop hyperlink tagging tools
- Word segmentation tools
  - Resolve word segmentation ambiguities and identify keywords.
  - CKIP word segmentation system: http://ckipsvr.iis.sinica.edu.tw/

35 Develop hyperlink tagging tools
- TELDAP keyword dictionary
  - Extract keywords from metadata and establish object-keyword relations. Extract text from the XML data for each object. The text is classified by topic, title, description, author, location, era, etc. From each class of text file, extract keywords by automatic word segmentation, keyword extraction, and manual post-editing.
  - The current dictionary contains more than 120,000 keywords.
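
The first step described on this slide, pulling text out of each object's XML and sorting it into classes, might look roughly like the sketch below. The element names in `FIELDS` and the record layout are assumptions for illustration; the real TELDAP XML schema is not shown in the slides.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical element names standing in for topics, titles, descriptions, etc.
FIELDS = ("title", "subject", "description", "creator", "coverage", "date")


def texts_by_field(xml_records):
    """Collect non-empty text from each record, grouped by metadata field."""
    grouped = defaultdict(list)
    for xml_text in xml_records:
        root = ET.fromstring(xml_text)
        for name in FIELDS:
            for element in root.iter(name):
                if element.text and element.text.strip():
                    grouped[name].append(element.text.strip())
    return dict(grouped)
```

Each resulting list (all titles, all descriptions, and so on) could then be passed to word segmentation and keyword extraction separately, followed by manual post-editing.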

36 Prototype system for hyperlink tagger: identify and select keywords from the input text

37 Prototype system for hyperlink tagger Produce text with hyperlinks

38 Prototype system for hyperlink tagger Hyperlinks point to the related digital collections
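
The tagging step shown in the prototype (find dictionary keywords in a text and wrap them in links to related collections) can be sketched as below. Note the substitution: the prototype relies on the CKIP word segmentation system to identify keywords, whereas this sketch uses a plain greedy longest-match scan as a stand-in. The function name, the HTML output format, and the URLs in the usage note are hypothetical.

```python
import html


def tag_hyperlinks(text, keyword_urls):
    """Wrap each dictionary keyword found in text in an HTML link.

    keyword_urls maps a keyword to the URL of its related collection.
    Longer keywords are tried first so that "ink painting" wins over "ink".
    """
    keywords = sorted((k for k in keyword_urls if k), key=len, reverse=True)
    pieces = []
    i = 0
    while i < len(text):
        for keyword in keywords:
            if text.startswith(keyword, i):
                url = html.escape(keyword_urls[keyword], quote=True)
                pieces.append(f'<a href="{url}">{html.escape(keyword)}</a>')
                i += len(keyword)
                break
        else:
            # No keyword starts here: copy one character and move on.
            pieces.append(html.escape(text[i]))
            i += 1
    return "".join(pieces)
```

For instance, with the made-up mapping `{"ink": "http://example.org/ink", "ink painting": "http://example.org/ip"}`, the text `"ink painting and ink"` gets one link for the longer keyword and one for the shorter. A character-level scan like this cannot resolve segmentation ambiguities, which is why a real word segmenter is used in the prototype.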

39 (3) Construct TELDAP ontology and thesaurus Establish association links between Chinese keywords and Getty AAT. Merge TELDAP keywords with Chinese AAT.

40 AAT Browsing trees of Taiwan Academy

41 AAT subject search of Taiwan Academy

42 Recommendation of related items

43 Outline
- Introduction
- Union catalog
- Databases and metadata for digital contents and websites
- Knowledge engineering
- Future perspective

44 Future Perspective
Technology development
- Construct multilingual thesauri (Getty AAT).
- Maintain the TELDAP keyword-and-object relation database.
- Construct name authority files, gazetteers, and universal calendars.
- Design hyperlink taggers and keyword extension tools.
- Design an authoring tool that automatically provides hyperlinks to keyword-related digital content.
- Design a knowledge-based content retrieval system.

45 Future Perspectives
Content enrichment
- Within TELDAP:
  - Standardize the object metadata model and data format.
  - Provide object metadata in controlled vocabulary.
  - Write scripts and stories for different topics with a Wiki-like knowledge structure.
  - Enrich the digital collections.
  - Establish hyperlinks between textbooks and TELDAP collections.
- Extend the knowledge sources, e.g. Wikipedia
