TEI Workshop Digitization of Text 文字數位化 Reasons, Methods, Stages.


1 TEI Workshop Digitization of Text 文字數位化 Reasons, Methods, Stages

2 Humanities Computing 人文資訊學 - Digital Humanities 1 Main applications (so far): ● Digitization of known tools: – encyclopedias 百科全書 – dictionaries & glossaries 辭典 – bibliographies, indices 參考書目, 索引 ● But also new types of knowledge bases (datasets, interfaces etc.)

3 Humanities Computing 人文資訊學 - Digital Humanities 2 ● New forms of information production & dissemination: wikis, blogs, social network applications, Twitter... ● New methods and research questions: – authorship attribution & stylistic analysis – literary analysis – linguistic analysis, corpus linguistics

4 Example: authorship attribution 1 ● Mosteller and Wallace (1964): Inference and Disputed Authorship: The Federalist ● The Federalist, 1787-8: 85 papers by Hamilton, Madison and Jay ● 12 papers of disputed authorship: either Hamilton or Madison

5 Example: authorship attribution 2 ● Counting sentence length 句子長度: does not separate the two authors ☹ ● Quantitative analysis of vocabulary usage 詞彙使用量化性分析: works ☺ – compare frequencies for 30 marker words – e.g. “upon”: Hamilton (2.93 per 1000 words), Madison (0.16 per 1000 words)
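
The marker-word comparison on this slide comes down to a rate per 1000 words. The following is a minimal sketch of that calculation only, not Mosteller and Wallace's actual statistical method; the function name, the simple tokenizer and the two sample sentences are assumptions made up for illustration.

```python
import re


def rate_per_thousand(text: str, marker: str) -> float:
    """Return occurrences of `marker` per 1000 word tokens (case-insensitive)."""
    # Very rough tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * tokens.count(marker.lower()) / len(tokens)


# Hypothetical sample sentences, not taken from The Federalist.
sample_a = "Upon reflection, the power rests upon the consent of the people."
sample_b = "On reflection, the power rests on the consent of the people."

for label, sample in (("sample_a", sample_a), ("sample_b", sample_b)):
    print(label, round(rate_per_thousand(sample, "upon"), 2))
```

On real data the rates would be computed over whole papers of known authorship and compared with the rates in the disputed papers, for each of the marker words.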

6 Example: analysing literary texts 分析文學 ● Estrella Irizarry (1992) compares two Mexican writers, Octavio Paz (♂) and Rosario Castellanos (♀), on language use & gender ● Castellanos (♀) uses more and longer questions ● Paz (♂) uses more expressions of certitude, words like ‘always’ and ‘absolutely’ ● Words of compassion (taken from a thesaurus) appear only in Castellanos's (♀) work

7 Example: corpus linguistics 語言資料庫語言學 1 ● British National Corpus (BNC) (http://www.natcorp.ox.ac.uk/) ● 100 million words (一億詞), in samples of up to 45,000 words ● Markup with TEI (P3) ● Automated Part of Speech (PoS) tagging

8 Example: corpus linguistics 語言資料庫語言學 2 ● The BNC is: ● balanced 平衡的: written and spoken material from diverse sources ● monolingual 單語的: only English ● synchronic 同時的/同步的: 20th century

9 Core Technologies ● Relational databases ● For the Humanities: XML-technologies (XSLT, XQUERY, SVG...) & Mark-up (TEI) ● Web interfaces (HTML, JS, CSS, AJAX, RSS...)

10 4 stages in the production of high-quality digital texts 1. Input 輸入 2. Add Value 加值: Basic Markup 基本標記, Deep Markup 詳細標記 3. Content Delivery 內容發行 4. Archiving 保存

11 1. Input 輸入 ● Basic data input ● Texts: – Keyboarding (double keying, possibly with Access TEI) – Scanning 掃描 (OCR: Optical Character Recognition 光學字元辨識) ⇨ a file (perhaps a .txt file)
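
In double keying, two people type the same text independently and the two versions are compared; places where they differ are likely typing errors. A minimal sketch of that comparison step with Python's standard `difflib` follows. The function name and the two sample transcriptions are hypothetical, and real keying services use their own tooling.

```python
import difflib


def keying_discrepancies(first: str, second: str):
    """Return (tag, lines_from_first, lines_from_second) for every differing span."""
    a = first.splitlines()
    b = second.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the spans where the two keyers disagree
    ]


# Hypothetical output of two independent keyers for the same page.
keyer_1 = "The quick brown fox\njumps over the lazy dog\n"
keyer_2 = "The quick brown fox\njumps ovcr the lazy dog\n"

for tag, left, right in keying_discrepancies(keyer_1, keyer_2):
    print(tag, left, right)
```

Each reported span still needs a human check against the source page, since the comparison only shows that the keyers disagree, not which of them is right.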

12 2a. Basic Markup 基本標記 ● File management (formats, file names etc.) 檔案處理系統 (格式, 檔名 etc.) ● Metadata management: metadata about the digitization process 關於數位化過程的 Metadata (e.g. teiHeader) ● Basic structural content markup 基本架構的內容標記 (e.g. with TEI) ⇨ probably an .xml file
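
As an illustration of what basic markup produces, here is a minimal sketch that builds a small TEI document with a teiHeader and a body of paragraphs, using Python's standard `xml.etree.ElementTree`. The element layout follows the usual minimal TEI P5 skeleton (fileDesc with titleStmt, publicationStmt and sourceDesc); the helper names and all sample values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag: str) -> str:
    """Return the namespace-qualified name ElementTree expects."""
    return f"{{{TEI_NS}}}{tag}"


def build_minimal_tei(title: str, source_note: str, paragraphs: list) -> ET.Element:
    root = ET.Element(tei("TEI"))
    # teiHeader: metadata about the text and its digitization
    file_desc = ET.SubElement(ET.SubElement(root, tei("teiHeader")), tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Unpublished working draft."
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = source_note
    # text/body: basic structural markup, one <p> per paragraph
    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tei("p")).text = paragraph
    return root


# Hypothetical sample values.
doc = build_minimal_tei(
    "Sample text",
    "Keyed from a printed edition.",
    ["First paragraph.", "Second paragraph."],
)
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

The sketch only guarantees well-formed XML; checking the result against a TEI schema is a separate step.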

13 2b. Scholarly in-depth markup & Documentation 學術標準標記與使用說明 ● Value adding through encoding 以標記加值 ● Encode (with TEI) what you wish to say about the text ⇨ a valid TEI .xml file ● Document your markup procedures ● Documentation is indispensable
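
One way to see the value added by in-depth markup: once names are tagged, they can be counted and indexed without guessing from the raw text. The sketch below counts the TEI elements persName and placeName in a small fragment. The fragment is a hypothetical illustration and has no teiHeader, so it is well-formed XML but not a valid TEI document; the function name is also an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment for illustration only (no teiHeader, so not valid TEI).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><persName>Hamilton</persName> and <persName>Madison</persName>
       published in <placeName>New York</placeName>.</p>
    <p><persName>Hamilton</persName> wrote most of the papers.</p>
  </body></text>
</TEI>"""


def count_names(xml_text: str, element: str = "persName") -> Counter:
    """Count the text content of every TEI element with the given name."""
    root = ET.fromstring(xml_text)
    return Counter(
        "".join(el.itertext()).strip()
        for el in root.iterfind(f".//tei:{element}", NS)
    )


print(count_names(SAMPLE))
print(count_names(SAMPLE, "placeName"))
```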

14 3. Content Delivery 內容發行 ● Making the content available. E.g. – online-interface – webservice – as DVD or FlashDrive – Source data as archive ● This needs skills beyond markup.
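
For an online interface, the TEI source is usually transformed into HTML, typically with XSLT. As a stand-in for an XSLT stylesheet, the following minimal sketch does a very reduced version of that transformation with Python's standard library only: it takes the title from the teiHeader and turns each body paragraph into an HTML paragraph. The sample document and the function name are hypothetical, and inline markup is flattened to plain text.

```python
import html
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Hypothetical sample document.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title>Notes &amp; Drafts</title></titleStmt>
    <publicationStmt><p>Hypothetical sample.</p></publicationStmt>
    <sourceDesc><p>Born digital.</p></sourceDesc>
  </fileDesc></teiHeader>
  <text><body>
    <p>First paragraph.</p>
    <p>Second paragraph with <hi>emphasis</hi>.</p>
  </body></text>
</TEI>"""


def tei_to_html(xml_text: str) -> str:
    """Render the title and body paragraphs of a TEI document as a bare HTML page."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//tei:titleStmt/tei:title", default="Untitled", namespaces=NS)
    body = root.find(".//tei:body", NS)
    paragraphs = []
    if body is not None:
        # Only paragraphs inside <body>; header paragraphs are metadata.
        paragraphs = ["".join(p.itertext()).strip() for p in body.iter(f"{{{TEI_NS}}}p")]
    lines = [
        "<!DOCTYPE html>",
        "<html>",
        f'<head><meta charset="utf-8"><title>{html.escape(title)}</title></head>',
        "<body>",
        f"<h1>{html.escape(title)}</h1>",
    ]
    lines += [f"<p>{html.escape(text)}</p>" for text in paragraphs]
    lines += ["</body>", "</html>"]
    return "\n".join(lines)


page = tei_to_html(SAMPLE_TEI)
print(page)
```

Keeping the TEI file as the single source and generating the HTML from it means the same source can later feed other delivery formats.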

15 4. Archiving 保存 ● Stability = Referenceability ● Make sure your edition finds its way into larger collections, repositories or archives ● E.g.: – Internet Archive – GRETIL, Project Gutenberg ● Be happy if other projects transform and reuse your content.

16 © marcus bingenheimer 2006-2010

