Language Technology Research Serving eHumanities New Ways of Accessing the USC Shoah Foundation Archive in the Center for Visual History Malach Jan Hajič.


1 Language Technology Research Serving eHumanities
New Ways of Accessing the USC Shoah Foundation Archive in the Center for Visual History Malach
Jan Hajič
Institute of Formal and Applied Linguistics, Computer Science School, Charles University in Prague, Czech Republic
malach@knih.mff.cuni.cz | http://www.malach-centrum.cz

2 From Testimonies to Flexible Access
21.11.2012 J. Hajic: CVHM & Language Technology
• The USC VHI Archive
  • Testimonies of Holocaust survivors
• Center for Visual History Malach
  • Access Point to the USC Archive
  • Activities of CVHM
• Access using New Technology
  • Fulltext (transcript) Search
  • Cross-lingual Access
  • Thesaurus Translation
• Status and Future Plans

3 Center for Visual History Malach
• Access Point to the USC VHI’s Archive, http://www.usc.edu/vhi

4 Contents of the Archive
• Testimonies recorded in the 1990s

5 Recording the Testimonies
• Visual History Foundation
  • California, USA (Universal Studios)
  • 1990s
  • Analog video recording technology, 30-minute tapes
  • Teams of 3 people (moderator, video, audio)
• Volume
  • 56 countries, over 105,000 hours of video
  • 32 languages, ~52,000 testimonies
  • Half of them in English

6 Archiving the Testimonies
• Digitization
  • 100s of terabytes of data (NTSC/PAL quality)
• Cataloguing (indexing)
  • Thesaurus (55,000 keywords)
  • hierarchical, timeline, places
• Goal:
  • Access (search)
  • Material for projects

7 Access (Search)
• Search by keywords
  • At 1-minute segments, beginning of topic
• Search by particular people, relations
• Filter search by
  • Language spoken
  • Country of survivor
  • Experience (survivor/liberator/...)
• Not possible: “fulltext” search
• Video access: locally available, or on order
• Player: usual controls, also by segment, search within video
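The keyword search with filters described on this slide can be illustrated with a small sketch. It is not the VHA implementation: the `Segment` record, its field names, and the sample data are all hypothetical, and the filter attributes are stored on each segment only to keep the example short.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Segment:
    """One indexed 1-minute segment (hypothetical data model)."""
    testimony_id: str          # made-up identifier, not a real archive ID
    minute: int                # position of the 1-minute segment in the video
    keywords: FrozenSet[str]   # thesaurus keywords assigned during indexing
    language: str              # language spoken
    country: str               # country of the interviewee
    experience: str            # e.g. "survivor" or "liberator"


def search(segments: List[Segment], keyword: str,
           language: Optional[str] = None,
           country: Optional[str] = None,
           experience: Optional[str] = None) -> List[Segment]:
    """Return segments tagged with `keyword` that pass every filter that is set."""
    return [
        seg for seg in segments
        if keyword in seg.keywords
        and (language is None or seg.language == language)
        and (country is None or seg.country == country)
        and (experience is None or seg.experience == experience)
    ]


# Invented sample data, for illustration only.
SEGMENTS = [
    Segment("T1", 12, frozenset({"deportation", "family"}), "Czech", "Czechoslovakia", "survivor"),
    Segment("T1", 13, frozenset({"camp"}), "Czech", "Czechoslovakia", "survivor"),
    Segment("T2", 5, frozenset({"deportation"}), "English", "Poland", "survivor"),
    Segment("T3", 40, frozenset({"liberation", "camp"}), "English", "USA", "liberator"),
]

hits = search(SEGMENTS, "deportation", language="Czech")
```

Because only the assigned thesaurus keywords are matched, a word that was spoken but never indexed cannot be found this way, which is the "fulltext" limitation the slide notes.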

8 Access Points
• Internet: only limited access so far
  • Throughput (technical limitations), legal & ethical issues, ...
• → Access Points
  • ~30 worldwide (USA; EU: Berlin, Budapest, Prague, Warsaw; secondary access available)
  • 2-20% of full archive locally
  • Fast “Internet2” connection
  • Additional Services
• Search and view: standard Internet browser

9 Center for Visual History Malach
• Charles University in Prague, est. 2009, coordinator: Jakub Mlynář

10 Center for Visual History Malach
• Supported by Charles University
  • Faculty of Mathematics and Physics, CS School
  • CS School Library & Institute of Formal and Applied Linguistics
• Part of LINDAT-Clarin, Language Data Infrastructure
  • Clarin ERIC – Pan-European network of LTH Centers
• 12 workplaces, AV technology, materials
• Technology (by Inst. of Formal and Applied Linguistics):
  • 1 Gbit network locally, dataserver (for video cache)
  • 2000 testimonies locally (all Czech, Slovak, Polish, many in English)
  • Geant connection, 5-10 min. for 30 min. video from USC

11 Center for Visual History Malach: Activities
• Seminars
  • Anniversary seminar (January)
  • Seminars for students, teachers
  • Also: foreign visitors (Ukraine – summer 2012)
• Workshops
  • Co-organization of Raoul Wallenberg 100th Anniversary workshop, Nov. 2012
  • w/Czech Parliament, Jewish Museum in Prague, Embassies
• Tutorials
  • Using the Archive, How-To-...; Research on Language Technology (with Institute of Formal and Applied Linguistics)

12 Center for Visual History Malach: Activities
• Newsletter
• Web:

13 Center for Visual History Malach
[Chart: Visitors, mid-2010 – fall 2012. Categories: Students; Teachers, Researchers; Journalists, Writers, Filmmakers; Other (personal reasons, etc.)]

14 Why “Malach”?
• Technology and UI Research Project
  • 2002-2007
  • Multilingual Access to Large Audio arCHives
  • malach – “angel” in Hebrew
• Support: NSF (National Science Foundation)
  • Visual History Foundation (predecessor of SFI/USC)
  • IBM Research, Yorktown Heights, NY, USA
  • Johns Hopkins Univ., Baltimore, MD, USA
  • Univ. of Maryland, College Park, MD, USA
  • Charles University in Prague, CZ (IFAL MFF UK)
  • Univ. of West Bohemia, Pilsen, CZ (Dept. of Cybernetics)

15 Research in the Malach Project
• Research in the area of
  • Automatic Speech Recognition (of the testimonies)
    • English, Czech, Slovak, Russian, Polish, Hungarian
  • Automatic Translation of Thesaurus
    • Keyword translation
    • Czech, English
  • Cross-lingual Audio/Voice Search
    • Part of the world-wide CLEF 2006, 2007 competition
  • User interfaces → current VHA search interface


17 Automatic Speech Recognition
• Core “Front-end” Technology
• Current State-of-the-Art: 95% in controlled conditions
• Problems:
  • English: non-native speakers (virtually all 26,000!)
  • Czech: colloquial speech
  • All: emotions, elderly people, imperfect recording
• Technology issues: not enough in-domain texts
• Some improvement reached by 2007


19 The AMALACH Project
• Applied research project, 2012-2015
  • Implement and integrate (some) MALACH project results
  • Czech National Cultural Heritage Funding
  • Partners: Charles Univ., Univ. of West Bohemia (and USC)
• Selling point: improved access for local (Czech) researchers
  • USC Archive: 558 Czech-language testimonies
    • only a fraction (~12%) of 4613 Czech survivors!
    • Rest: mostly English spoken
  • Also: 12500 segments containing keyword “Czech”
• Solution: cross-lingual fulltext-like search
  • Needs speech recognition, automatic translation, thesaurus

20 Cross-lingual Search Scheme
• Archive transcript & query translation
[Diagram. USER QUERY PROCESSING: Query in E, or Query in A → Translation to E; then Monolingual Search. OFFLINE: The archive (all audio) → ASR in multiple lang. → Transcr. A, B, C, ... Z (Seg. 1 in A, Seg. 2 in A, …, Seg. N in A; Seg. 1 in B, Seg. 2 in B, …) → Translation to E → Archive Transcript, E]
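The scheme on this slide (translate all transcripts into one language E offline, translate the query into E at search time, then search monolingually) can be sketched as follows. This is a toy illustration under stated assumptions: the word-for-word dictionary `TO_PIVOT` stands in for a real machine translation system, the segment texts stand in for ASR output, and all names and data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy word-for-word dictionaries standing in for real MT (hypothetical data).
# "en" plays the role of the common language E from the slide.
TO_PIVOT: Dict[str, Dict[str, str]] = {
    "cs": {"škola": "school", "rodina": "family", "válka": "war"},
    "en": {},  # already in the pivot language, nothing to translate
}


def translate_to_pivot(text: str, lang: str) -> str:
    """Translate word by word; unknown words are passed through unchanged."""
    table = TO_PIVOT.get(lang, {})
    return " ".join(table.get(word, word) for word in text.lower().split())


def build_index(segments: List[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """OFFLINE step: translate each (id, lang, transcript) segment into the
    pivot language and build an inverted index from word to segment ids."""
    index: Dict[str, Set[str]] = {}
    for seg_id, lang, text in segments:
        for word in set(translate_to_pivot(text, lang).split()):
            index.setdefault(word, set()).add(seg_id)
    return index


def search(index: Dict[str, Set[str]], query: str, query_lang: str) -> Set[str]:
    """QUERY step: translate the query into the pivot language, then run a
    monolingual search (here: segments containing every query word)."""
    words = translate_to_pivot(query, query_lang).split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented transcript segments in two languages.
ARCHIVE = [
    ("A-1", "cs", "moje rodina a škola"),
    ("A-2", "cs", "válka"),
    ("B-1", "en", "my family before the war"),
]
INDEX = build_index(ARCHIVE)

czech_query_hits = search(INDEX, "rodina", "cs")
```

A Czech query for "rodina" is mapped to "family" and therefore matches both the Czech segment and the English one, which is the point of translating archive and query into the same language.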

21 Phonetic and Word Search (monolingual)
• Automatic Speech Recognition (Univ. of WB)
[Diagram labels: Automatic Speech Recognition; Word and Phonetic Lattice; Transcript Database (VHF04106-0047.18, VHF04167-0146.32, VHF05103-0192.98, …); Search System]

22 Machine Translation
• State-of-the-Art
  • Cf. Google (currently best for most language pairs)
  • Still imperfect (applications need varying levels of quality)
• Machine translation of speech transcripts
  • Big challenge: VERY noisy input
    • Speech recognition errors
    • Ungrammatical, non-native, emotional language
• Good news
  • Used in search only (will probably never be shown to users)

23 Statistical Machine Translation Technology
• The idea (1940s/1990s) - imagine this:
  [Diagram labels: Czech text; “Coding”; English text]
• Translation by the reverse process: “decoding”
  • Probabilistic model of the translation process
  • And probabilistic model of the target language
  • Probabilities learned from (human) translations
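The “decoding” idea on this slide is commonly written as the noisy-channel formulation of statistical machine translation. As a sketch, with symbols that are assumptions of this note rather than the slides' notation: let s be the observed source-language sentence and t a candidate target-language sentence. The best translation is then

```latex
\hat{t}
  = \arg\max_{t} P(t \mid s)
  = \arg\max_{t} \frac{P(s \mid t)\, P(t)}{P(s)}
  = \arg\max_{t} P(s \mid t)\, P(t)
```

Here P(s | t) corresponds to the slide's probabilistic model of the translation process and P(t) to the probabilistic model of the target language. The denominator P(s) can be dropped because it does not depend on t.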

24 Speech and Language Technology in Search
[Diagram. USER QUERY PROCESSING: Query in E, or Query in A → Translation to E; then Monolingual Search. OFFLINE: The archive (all audio) → ASR in multiple lang. → Transcr. A, B, C, ... Z (Seg. 1 in A, Seg. 2 in A, …, Seg. N in A; Seg. 1 in B, Seg. 2 in B, …) → Translation to E → Archive Transcript, E]

25 Status and Future Plans
• Czech testimonies
  • Monolingual Fulltext Search System operational
  • in CVHM, users can use both VHA and the UWB UI
• English speech recognition of the testimonies
  • Work has started: data preparation ongoing
• Translation to Czech
  • Thesaurus: manually (high quality necessary)
    • Will be used in the current interface as well
  • Data: work ongoing, data preparation
    • “Lattice” translation experiments underway
• Cross-lingual search: work starts in 2013

26 Thank you!
• VHI http://www.usc.edu/vhi
• Institute of Formal and Applied Linguistics http://ufal.mff.cuni.cz
• Center for Visual History Malach http://malach-centrum.cz
• Dept. of Cybernetics, Univ. of West Bohemia, Pilsen, CZ http://www.kky.zcu.cz
• The project “Malach” http://malach.umiacs.umd.edu

27 Closing
• Presented at Preserving Survivors’ Memories: Digital Testimony Collections about Nazi Persecution: History, Education and Media
• Wednesday, Nov 21, 2012, 11:00, Section A
• http://www.preserving-survivors-memories.org

