M-CAST in libraries National library of the Czech Republic Marie

M-CAST: Multilingual Content Aggregation System based on TRUST Search Engine
European project
the aim of the project is to develop a multilingual system
the system will be applied in large digital collections of multilingual data
–libraries: hybrid, digital (internet)
–publishing houses
–press agencies
–scientific databases
the system is tested by two libraries, for which a Multimedia Content Aggregation Portal (M-CAP) is created
the portal allows users to find answers to natural language queries

Application, target user groups
M-CAP portals are developed and tested in two libraries
–Polish Internet Library (PBI)
–Czech National Library (CNL)
the portals make the libraries' digital resources available online so that users can find answers to natural language queries in multilingual digital collections
the target users of the M-CAST system fall into 2 main classes
–internet users
–library users

Library users
can search metadata about documents, entire documents, or parts of documents (zones and fields) by entering words or phrases in a database search
one of the main objectives of the M-CAST system is to let library and internet users pose questions in natural language by offering them a QA method
according to the prevalent information resources in their collections, two kinds of libraries are defined in the current online environment
–hybrid libraries
–digital libraries, with a subcategory of internet libraries

Hybrid libraries
'new' electronic information resources and 'traditional' hardcopy resources co-exist
they are brought together in an integrated information service
accessed via electronic gateways, available both on-site and remotely via the Internet or local computer networks
users of hybrid libraries want to get information about a document, or a piece of information extracted from the metadata or from the document itself
the share of digitized and born-digital documents in hybrid libraries is expected to grow, so the demand from library users to formulate natural language queries will increase as well

Digital Libraries
organized collections of multimedia and other types of resources in electronic form
acquisition, storage, preservation, and retrieval are carried out through the use of digital technology
access to the entire collection is globally available, directly or indirectly, across a network
a DL supports users in dealing with information objects and helps them organize and present the objects via electronic/digital means
Internet Libraries: a subcategory of Digital Libraries

Polish Internet Library
intended to become a full presentation of Polish (and world) literature, containing works of fiction and non-fiction
the Polish Internet Library will form the basis for creating Polish educational and cultural resources on the Internet, the lack of which is one of the barriers to the development of an information society

Example of searching at PBI

Users' requirements at PBI
a user survey was performed to assess users' requirements and expectations for the M-CAST system
significant survey results relevant to the M-CAST system requirements:
–85 % of users are interested in receiving responses in foreign languages, of which
  94 % in English
  28 % in French
  18 % in Italian
  12 % in Czech
  6 % in Portuguese
–74 % of users would like to receive a simplified translation of responses into Polish

Users' requirements at CNL
a user survey conducted by the CNL found the following user requirements regarding foreign languages:
–70 % of library users are interested in performing searches in foreign languages, of which
  80 % in English
  25 % in French
  13 % in Polish (North Moravia and Silesia)
  10 % in Italian
  5 % in Portuguese
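The language figures in both surveys are given relative to the users who are interested in foreign languages at all. As a rough illustration, the sketch below converts them into shares of all respondents. It assumes that each language figure is a percentage of the interested subgroup (85 % at PBI, 70 % at CNL), which the slides suggest but do not state explicitly; the function name is hypothetical.

```python
def share_of_all_users(interested_pct, by_language_pct):
    """Convert 'percent of interested users' into 'percent of all users'.

    Assumption: every language percentage refers to the interested subgroup.
    """
    return {
        language: round(interested_pct * pct / 100, 1)
        for language, pct in by_language_pct.items()
    }


# Figures copied from the two survey slides.
pbi = share_of_all_users(
    85, {"English": 94, "French": 28, "Italian": 18, "Czech": 12, "Portuguese": 6}
)
cnl = share_of_all_users(
    70, {"English": 80, "French": 25, "Polish": 13, "Italian": 10, "Portuguese": 5}
)

for name, shares in (("PBI", pbi), ("CNL", cnl)):
    print(name, shares)
```

Under this reading, English is requested by roughly four out of five of all PBI respondents and by a little over half of all CNL respondents.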

Survey results - conclusions
current library and internet users prefer
–to receive responses in different languages
–to perform searches in foreign language literature
–to receive a simplified translation of query results
the M-CAST system will enable
–choosing a response language
–searching across different repositories using peer-to-peer communication
–choosing to translate query results into the query language

Question answering (QA) system
a type of information retrieval based on sophisticated natural language processing (NLP) techniques
provides direct answers to user questions posed in natural language by consulting its knowledge base
three main components of an automated QA system
–a retrieval/search engine that handles retrieval requests
–a query formulation mechanism that translates natural-language questions into queries for the IR engine in order to retrieve relevant documents from the collection
–an answer extraction module that analyses these documents and extracts answers from them
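The three components can be sketched as three small functions chained together. This is only a toy illustration with keyword overlap, not the M-CAST implementation: the stop list, the function names, and the two-document collection (built from example sentences in this talk) are assumptions made for the sketch.

```python
import re

# Toy collection; the sentences come from the example slides of this talk.
DOCUMENTS = {
    "doc1": "Do Chebu se vrátilo 23 rodin. Do Lokte se vrátilo 8 rodin.",
    "doc2": "Athény jsou hlavním městem Řecka.",
}

STOPWORDS = {"se", "do", "je", "má"}  # hypothetical, deliberately tiny stop list


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def formulate_query(question):
    """Query formulation: turn a natural-language question into keywords."""
    return [t for t in tokenize(question) if t not in STOPWORDS]


def retrieve(query, documents):
    """Retrieval engine: ids of documents sharing at least one keyword."""
    return [doc_id for doc_id, text in documents.items()
            if set(query) & set(tokenize(text))]


def extract_answer(query, doc_ids, documents):
    """Answer extraction: the sentence with the largest keyword overlap."""
    best, best_score = None, 0
    for doc_id in doc_ids:
        for sentence in re.split(r"(?<=[.!?])\s+", documents[doc_id]):
            score = len(set(query) & set(tokenize(sentence)))
            if score > best_score:
                best, best_score = sentence, score
    return best


def answer(question, documents=DOCUMENTS):
    """Run the three components in sequence."""
    query = formulate_query(question)
    return extract_answer(query, retrieve(query, documents), documents)


print(answer("Kolik rodin se vrátilo do Chebu?"))
```

A real system would replace each function with a far richer component, but the division of labour stays the same.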

QA system based on TRUST search engine
the system searches a set of plain text documents on a local hard disk
returns a ranked list of sentences containing the answer to a given natural language question
–in the future, a unique exact answer will be extracted from the retrieved sentence
when a question is submitted, it is categorized according to a special question typology
through an internal query, a set of potentially relevant documents is retrieved
each document contains a list of sentences which are assigned to the same category as the question
sentences are weighted according to their semantic relevance and similarity to the question
using specific answer patterns, the sentences are examined again and the parts containing possible answers are extracted and weighted
a single answer is chosen from among all candidates
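The sentence-weighting step can be illustrated with a very simple similarity measure. The sketch below ranks candidate sentences by Jaccard word overlap with the question. Jaccard overlap is a stand-in chosen for this example; the slides do not say how TRUST computes semantic relevance, and the function names are hypothetical.

```python
import re


def tokens(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def rank_sentences(question, sentences):
    """Return (score, sentence) pairs, best first, scored by Jaccard overlap."""
    q = tokens(question)
    scored = []
    for sentence in sentences:
        t = tokens(sentence)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        scored.append((score, sentence))
    # sorted() is stable, so equally scored sentences keep their input order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


candidates = [
    "Do Lokte se vrátilo 8 rodin",
    "Do Chebu se vrátilo 23 rodin",
    "Athény jsou hlavním městem Řecka",
]
for score, sentence in rank_sentences("Kolik rodin se vrátilo do Chebu?", candidates):
    print(f"{score:.2f}  {sentence}")
```

The sentence about Cheb ranks first because it shares the most words with the question, while the sentence about Athens shares none.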

Questions
factoid/factual/fact-based questions
–short-answer questions such as "How much does the camel cost?"
–answer: typically a noun phrase
opinionoid/opinion-oriented questions
–involve opinions, evaluations, judgments, emotions, sentiments, or speculations
specificity of questions
–questions should be general enough to apply to more than one document on the topic
–questions shouldn't be too specific, asking for details not likely to appear in other documents
documents provided to the automatic system will be from the same period of time
86 types of questions
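Assigning a question to a type can be illustrated with a lookup on the opening question word. The four coarse categories below are hypothetical and chosen only to cover the factoid examples in this talk; the 86 types of the actual typology are not listed in the slides.

```python
# Hypothetical coarse categories keyed by the opening words of a question.
# Czech and English openers are listed side by side.
CATEGORY_BY_PREFIX = [
    ("how many", "QUANTITY"), ("how much", "QUANTITY"), ("kolik", "QUANTITY"),
    ("when", "TIME"), ("kdy", "TIME"),
    ("where", "PLACE"), ("kde", "PLACE"), ("kam", "PLACE"),
    ("who", "PERSON"), ("kdo", "PERSON"),
]


def categorize(question):
    """Return a coarse category for a factoid question, or 'OTHER'."""
    q = question.strip().lower()
    for prefix, category in CATEGORY_BY_PREFIX:
        if q.startswith(prefix):
            return category
    return "OTHER"


for example in ("Kdy zemřel Jan Hus?",
                "How much does a camel cost?",
                "Kdo jel v čele průvodu?",
                "Co nosí bulharské vojsko?"):
    print(categorize(example), "-", example)
```

Questions that open with a word outside the table, such as "Co" (what), fall into OTHER here; a full typology would need finer rules than a prefix match.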

Formulating questions, answer string
for each of the defined questions, a set of answer strings/sentences is needed
an answer string/sentence is a piece of text from a document that contains some words that correspond to the question
each answer string/sentence should appear explicitly in the text and MUST be wholly contained in a single sentence
"explicit" means that the answer string/sentence need not contain the same words as used in the question, but it is NOT possible to bring in extra background knowledge to interpret the string as an answer
there should be at least one document in the collection that contains an answer to each defined question
the answer string can NOT be longer than a whole sentence
for a single question, there may be more than one answer string in the document collection
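The containment rule for answer strings can be checked mechanically. The sketch below accepts an answer string only if it appears verbatim inside one sentence of a document. It assumes a naive sentence splitter (a break after ".", "!" or "?" followed by whitespace), which would need refinement for abbreviations and historical punctuation; the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_valid_answer_string(answer, document):
    """True if the answer string is wholly contained in a single sentence."""
    answer = answer.strip()
    return bool(answer) and any(answer in s for s in split_sentences(document))


document = "Do Chebu se vrátilo 23 rodin. Do Lokte se vrátilo 8 rodin."

print(is_valid_answer_string("vrátilo 23 rodin", document))    # inside one sentence
print(is_valid_answer_string("23 rodin. Do Lokte", document))  # spans two sentences
```

Because the check works sentence by sentence, an accepted answer string can never be longer than a whole sentence, which matches the length rule above.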

M-CAST – searching scenario
simple search
–user enters a query/question in a simple search form
advanced search
–user enters a query/question in an advanced search form
settings: the user can define
–results list size: maximum number of query results displayed on a single page
–query language: language of the query entered by the user
–response language: language of the resources in which M-CAST will perform the search and display results
–repositories: repositories in which M-CAST will perform the search
–full-text and metadata option: if full-text is checked, the search uses the full text of resources; if not checked, it uses only the resources' metadata
–best answers: if checked, only the best search results are displayed
–spell checking: if checked, M-CAST spell-checks the query entered by the user
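The settings listed above map naturally onto a small configuration record. The sketch below is an illustration only: the class name, field names, and default values are assumptions, not the actual M-CAST configuration. The repository names are taken from the CNL resources mentioned in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class SearchSettings:
    """Hypothetical container for the user-definable search settings."""
    results_list_size: int = 10        # max results per page (assumed default)
    query_language: str = "cs"         # language of the entered query
    response_language: str = "en"      # language of the searched resources
    repositories: list = field(
        default_factory=lambda: ["ALEPH", "Manuscriptorium", "Kramerius"]
    )
    full_text: bool = True             # False means metadata-only search
    best_answers: bool = False         # show only the best results
    spell_checking: bool = False       # spell-check the query first

    def search_scope(self):
        """Which part of the resources the search is run against."""
        return "full text" if self.full_text else "metadata"


def paginate(results, settings, page=1):
    """Return the slice of results shown on the given 1-based page."""
    start = (page - 1) * settings.results_list_size
    return results[start:start + settings.results_list_size]


settings = SearchSettings(results_list_size=2, full_text=False)
print(settings.search_scope())
print(paginate(["a", "b", "c", "d", "e"], settings, page=2))
```

Keeping the settings in one record makes it easy to pass the same choices to both the simple and the advanced search form.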

Welcome page of M-CAST at CNL (screenshot labels: Settings, Simple search, Search term)

Simple search - settings (screenshot labels: Response language, Results list size, Query language, Repositories, Highlight Keyword, Full text)

Results list page (screenshot labels: Question; Number of results; Ranked result list with author, title, and fragment containing the answer)

CNL resources
CNL makes the following resources available for M-CAST search purposes:
–ALEPH library catalogue system: about catalogue records with full text abstracts of the documents and contents of documents; to be integrated in December
–Manuscriptorium: about catalogue records with document's metadata
–Kramerius: OCRed old monographs and periodicals

M-CAST at CNL - limits of testing
few digitized and born-digital documents are freely available
questions/types of questions are defined for retrieving information using contemporary language
"modern" queries are less effective for retrieving documents containing historical terms (used in historical texts)
when formulating questions we have to overcome difficulties with
–spelling
–complex syntax
–historical vocabulary
factoid questions

Kolik rodin se vrátilo do Chebu?
How many families returned to the city of Cheb?

Do Chebu se vrátilo 23 rodin
23 families returned to the city of Cheb

Kolik rodin se vrátilo do Lokte?
How many families returned to the city of Loket?

Do Lokte se vrátilo 8 rodin
Eight families returned to the city of Loket

How much does a camel cost?
How much can a camel carry?

A camel can carry about 8-10 cents (old)
A camel costs 120 guilders (zlaty)

Kolik starostů má Cařihrad?
How many mayors are there in Constantinople?

…mají 4 starosty
There are four mayors in Constantinople

Kdy se mohou Turci ženit?
When can Turks get married?

Turci mohou se již ve třináctém roce ženiti
Turks can get married at the age of 13

Kolik let je králi?
How old is the king?

…jest mu přes 40 let
The king is over 40 years old

Co nosí bulharské vojsko?
What do Bulgarian troops wear?

Vojsko bulharské nosí přižloutlý oblek
The Bulgarian troops wear yellowish clothes

Kdo jel v čele průvodu?
Who rode at the head of the procession?

V čele průvodu jel Václav Vilém z Roupova
Václav Vilém z Roupova rode at the head of the procession

Kdy se bude konat pohřeb?
When will the funeral take place?

…o druhé hodině po polednách
…at 2 pm

Kde je jeskyně?
Where is the cave?

Jeskyně se nachází na vrchu Eulinus
The cave is situated on the hill Eulinus

Kam chodí Smyrčané v neděli?
Where do the citizens of Smyrna go on Sundays?

Smyrčané chodí v neděli neb ve svátek na pivo
On Sundays or holidays, the citizens of Smyrna go for a beer

Čeho jsou Athény hlavním městem?
Which country is Athens the capital of?

Athény jsou hlavním městem Řecka
Athens is the capital of Greece

How much does the palace in the suburbs of Tofano cost?

The palace in the suburbs of Tofano costs 70 million

Co udělají Turci křesťanům?
What will Turks do for Christians?

Turci udělají křesťanům všechno, co mohou
Turks will do everything they can for Christians

Kdy byla dobyta Roudnice?
When was the city of Roudnice captured?

Kde se sjeli přívrženci Gustava Adolfa?
Where did the adherents of Gustav Adolf II meet?

Všichni přívrženci Gustava Adolfa se sjeli v Halle
All the adherents of Gustav Adolf II met in the city of Halle

Co píše kancléř Vilém Slavata?
What does chancellor Vilém Slavata write?

Which church have the Servites abandoned?

Kolik žen je povoleno Turkovi?
How many women are allowed to a Turkish man?

Four women are allowed to a Turkish man

Kdy zemřel Jan Hus?
When did Jan Hus die?

Thank you for your attention!
Special thanks to my colleagues: Irena Pilíková, Narcisa Podhradská, Magdalena Servítová, Jana Vejražková

