1 M-CAST in libraries. National Library of the Czech Republic. Marie Balíková@nkp.cz
2 Multilingual Content Aggregation System based on TRUST Search Engine
M-CAST is a European project. The aim of the project is to develop a multilingual system that will be applied in large digital collections of multilingual data:
–libraries: hybrid, digital (internet)
–publishing houses
–press agencies
–scientific databases
The system is tested by two libraries, for which a Multimedia Content Aggregation Portal (M-CAP) is created. The portal allows users to find answers to natural language queries.
5 Application, target user groups
M-CAP portals are developed and tested in two libraries:
–Polish Internet Library (PBI)
–Czech National Library (CNL)
The portals make the libraries' digital resources available online so that users can find answers to natural language queries in multilingual digital collections. The target users of the M-CAST system can be categorized into 2 main classes:
–internet users
–library users
6 Library users can search metadata about documents, entire documents, or parts of documents (zones and fields) by entering words or phrases in a database search. One of the main objectives of the M-CAST system is to enable library and internet users to pose questions in natural language by offering them a QA method. According to the prevalent information resources contained in library collections, two kinds of libraries are defined in the current online environment:
–hybrid libraries
–digital libraries, with a subcategory of internet libraries
7 Hybrid libraries
'New' electronic information resources and 'traditional' hardcopy resources co-exist, brought together in an integrated information service that is accessed via electronic gateways and available both on-site and remotely via the Internet or local computer networks. Users of hybrid libraries intend to get information about a document, or a piece of information extracted from the metadata or from the document itself. The portion of digitized or born-digital documents in hybrid libraries is expected to grow, so the demand of library users to formulate natural language queries will increase as well.
10 Digital Libraries
Digital libraries are organized collections of multimedia and other types of resources in electronic form. Acquisition, storage, preservation, and retrieval are carried out through the use of digital technology. Access to the entire collection is globally available, directly or indirectly, across a network. A DL supports users in dealing with information objects and helps them organize and present the objects via electronic/digital means. Internet Libraries are a subcategory of Digital Libraries.
11 Polish Internet Library
The Polish Internet Library is intended to become a full presentation of Polish (and world) literature, containing works of fiction and non-fiction. It will constitute the basis for the creation of Polish educational and cultural resources on the Internet, the lack of which is one of the barriers to the development of an information society.
15 Users’ requirements at PBI
A user survey was performed to assess some of the users’ requirements and expectations for the M-CAST system. Significant survey results useful for the M-CAST system requirements:
–85 % of users are interested in receiving responses in foreign languages, among which
  – 94 % in English
  – 28 % in French
  – 18 % in Italian
  – 12 % in Czech
  – 6 % in Portuguese
–74 % of users would like to receive a simplified translation of responses into Polish
16 Users’ requirements at CNL
A user survey conducted by the CNL assessed the following user requirements regarding foreign languages:
–70 % of library users are interested in performing searches in foreign languages, among which
  – 80 % in English
  – 25 % in French
  – 13 % in Polish (North Moravia and Silesia)
  – 10 % in Italian
  – 5 % in Portuguese
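The per-language figures in the two surveys are shares of the interested subgroup, not of all respondents. The short calculation below converts them to shares of all respondents. It assumes that "among which" means "of the users who said they were interested"; the function and variable names are illustrative only.

```python
def overall_share(interested_pct, language_pct):
    """Share of ALL respondents, assuming language_pct is a share of the interested subgroup."""
    return round(interested_pct * language_pct / 100, 1)

# Percentages copied from the PBI and CNL survey slides.
PBI = {"English": 94, "French": 28, "Italian": 18, "Czech": 12, "Portuguese": 6}
CNL = {"English": 80, "French": 25, "Polish": 13, "Italian": 10, "Portuguese": 5}

pbi_overall = {lang: overall_share(85, pct) for lang, pct in PBI.items()}
cnl_overall = {lang: overall_share(70, pct) for lang, pct in CNL.items()}

for name, table in (("PBI", pbi_overall), ("CNL", cnl_overall)):
    for lang, pct in table.items():
        print(f"{name}: {pct} % of all respondents want {lang}")
```

Under this reading, roughly 80 % of all PBI respondents and 56 % of all CNL respondents are interested in English.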
17 Survey results - conclusions
Current library and internet users prefer
–to receive responses in different languages
–to perform searches in foreign language literature
–to receive a simplified translation of query results
The M-CAST system will enable
–choosing a response language
–searching among different repositories using peer-to-peer communication
–choosing to translate query results to the query language
18 Question answering (QA) system
A QA system is a type of information retrieval based on sophisticated natural language processing (NLP) techniques. It provides direct answers to user questions posed in natural language by consulting its knowledge base. The three main components of an automated QA system are:
–a retrieval/search engine that handles retrieval requests
–a query formulation mechanism that translates natural-language questions into queries for the IR engine in order to retrieve relevant documents from the collection
–an answer extraction component that analyses these documents and extracts answers from them
19 QA system based on TRUST search engine
The system searches a set of plain text documents on a local hard disk and returns a ranked list of sentences containing the answer to a given natural language question.
–in future, a unique exact answer will be extracted from the retrieved sentence
When a question is submitted, it is categorized according to a special question typology. Through an internal query, a set of potentially relevant documents is retrieved. Each document contains a list of sequences which are assigned to the same category as the question. Sentences are weighted according to their semantic relevance and similarity with the question. Through specific answer patterns, the sentences are examined again, and the parts containing possible answers are extracted and weighted. A single answer is chosen from among all candidates.
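The flow described on this slide (categorize the question, retrieve candidate documents, keep passages of a matching category, weight them, return a ranked list) can be sketched as a toy pipeline. This is not the TRUST engine: the three categories, the digit heuristic, and the word-overlap score are stand-ins chosen for illustration, and all function names are hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for real linguistic analysis."""
    return set(re.findall(r"\w+", text.lower()))

def classify(question):
    """Toy typology with three hypothetical categories (not the real typology)."""
    q = question.lower()
    if q.startswith(("how many", "how much")):
        return "QUANTITY"
    if q.startswith("when"):
        return "TIME"
    return "OTHER"

def could_answer(sentence, category):
    """Assumption: QUANTITY and TIME answers contain a digit."""
    if category in ("QUANTITY", "TIME"):
        return any(ch.isdigit() for ch in sentence)
    return True

def rank_sentences(question, documents):
    category = classify(question)
    q_tokens = tokens(question)
    if not q_tokens:
        return []
    candidates = []
    for doc in documents:
        # Retrieval step: skip documents that share no word with the question.
        if not q_tokens & tokens(doc):
            continue
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sentence and could_answer(sentence, category):
                # Weighting step: share of question words found in the sentence.
                score = len(q_tokens & tokens(sentence)) / len(q_tokens)
                candidates.append((score, sentence))
    return sorted(candidates, key=lambda c: c[0], reverse=True)

# First sentences follow the example slides; the second sentences are made-up filler.
documents = [
    "8 families returned to the city of Loket. The castle stands on a rock.",
    "23 families returned to the city of Cheb. The town was rebuilt.",
    "A camel costs 120 guilders.",
]
ranked = rank_sentences("How many families returned to the city of Cheb?", documents)
best_sentence = ranked[0][1]
print(best_sentence)
```

Word overlap is used here only because it is easy to follow; the slide speaks of semantic relevance and answer patterns, which this sketch does not attempt.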
20 Questions
factoid/factual/fact-based questions
–fact-based, short-answer questions such as "How much does the camel cost?"
–answer: typically a noun phrase
opinionoid/opinion-oriented questions
–involve opinions, evaluations, judgments, emotions, sentiments, or speculations
specificity of questions
–questions should be general enough to apply to more than one document on the topic
–questions shouldn't be too specific, asking for details not likely to appear in other documents
documents provided to the automatic system will be from the same period of time
86 types of questions
21 Formulating questions, answer string
For each of the defined questions, a set of answer strings/sentences is needed. An answer string/sentence is a piece of text from a document that contains some words that correspond to the question. Each answer string/sentence should appear explicitly in the text and MUST be wholly contained in a single sentence. "Explicit" means that the answer string/sentence need not contain the same words as used in the question, but it is NOT possible to bring in extra background knowledge to interpret the string as an answer. There should be at least one document in the collection that contains an answer to each defined question. The answer string can NOT be longer than a whole sentence. For a single question, there may be more than one answer string in the document collection.
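The two formal constraints on an answer string (it appears explicitly in the text, and it is wholly contained in a single sentence) can be checked mechanically. The sketch below is a minimal illustration: the function names are hypothetical, and the sentence splitter is a naive one that breaks on ".", "!" and "?" followed by whitespace. The third constraint, that no background knowledge is needed, cannot be checked this way.

```python
import re

def split_sentences(text):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def is_valid_answer_string(answer, document):
    """True if the answer occurs verbatim inside one single sentence of the document."""
    answer = answer.strip()
    if not answer:
        return False
    return any(answer in sentence for sentence in split_sentences(document))

# Illustrative two-sentence document built from the example slides.
document = "23 families returned to the city of Cheb. 8 families returned to the city of Loket."

print(is_valid_answer_string("23 families", document))        # inside one sentence
print(is_valid_answer_string("Cheb. 8 families", document))   # spans two sentences
print(is_valid_answer_string("42 families", document))        # not in the text
```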
22 M-CAST – searching scenario
simple search: the user enters a query/question in a simple search form
advanced search: the user enters a query/question in an advanced search form
settings: the user can define
–results list size – maximum number of query results displayed on a single page
–query language – language of the query entered by the user
–response language – language of the resources in which M-CAST will perform the search and display results
–repositories – repositories among which M-CAST will perform the search
–full-text and metadata option – if full-text is checked, the search uses the full text of resources; if not checked, the search uses only the resources’ metadata
–best answers – if checked, only the best search results are displayed
–spell checking – if checked, M-CAST spell-checks the query entered by the user
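The settings listed above map naturally onto a small configuration record. The following sketch shows one possible shape. The class name, field names, and default values are all assumptions made for illustration; the slides do not describe how M-CAST stores these settings.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SearchSettings:
    """Hypothetical container for the user-definable settings listed on the slide."""
    results_list_size: int = 10          # max results per page (default is an assumption)
    query_language: str = "cs"           # language of the query entered by the user
    response_language: str = "cs"        # language of the resources searched and displayed
    repositories: List[str] = field(default_factory=list)
    full_text: bool = False              # False means metadata-only search
    best_answers: bool = False           # True means show only the best results
    spell_checking: bool = False         # True means spell-check the query

    def search_scope(self):
        """Describe what the search runs against, per the full-text option."""
        return "full text" if self.full_text else "metadata"

settings = SearchSettings(
    query_language="en",
    repositories=["Kramerius", "Manuscriptorium"],
    full_text=True,
)
print(settings.search_scope())
```

Using `field(default_factory=list)` gives every settings object its own repository list instead of one shared mutable default.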
23 Welcome page of M-CAST at CNL (screenshot; labelled elements: Settings, Simple search, Search term)
24 Simple search - settings (screenshot; labelled elements: Response language, Results list size, Query language, Repositories, Highlight, Keyword, Full text)
25 Results list page (screenshot; labelled elements: Question; Number of results; Ranked result list with author, title, and fragment containing the answer)
26 CNL resources
CNL makes the following resources available for M-CAST search purposes:
–ALEPH library catalogue system – about 100 000 catalogue records with full-text abstracts of the documents and contents of the documents – to be integrated in December
–Manuscriptorium – about 50 000 catalogue records with documents’ metadata
–Kramerius – OCRed old monographs and periodicals
27 M-CAST at CNL - limits of testing
few digitized and born-digital documents are freely available
questions/types of questions are defined for retrieving information using contemporary language
"modern" queries are less effective for retrieving documents containing historical terms (used in historical texts)
when formulating questions, we have to overcome difficulties with
–spelling
–complex syntax
–historical vocabulary
factoid questions
28 Kolik rodin se vrátilo do Chebu? How many families returned to the city of Cheb?
29 Do Chebu se vrátilo 23 rodin. 23 families returned to the city of Cheb.
30 Kolik rodin se vrátilo do Lokte? How many families returned to the city of Loket?
31 Do Lokte se vrátilo 8 rodin. Eight families returned to the city of Loket.
32 How much does a camel cost? How much can a camel carry?
33 A camel can carry about 8-10 cents (old). A camel costs 120 guilders (zlaty).
34 Kolik starostů má Cařihrad? How many mayors are there in Constantinople?
35 …mají 4 starosty. There are four mayors in Constantinople.
36 Kdy se mohou Turci ženit? When can Turks get married?
37 Turci mohou se již ve třináctém roce ženiti. Turks can get married at the age of 13.