Constructing Your Own Corpus from Written Language.

1 Constructing Your Own Corpus from Written Language

2 Some likely sources for your corpus: 1. From MS Word files; 2. From the World Wide Web; 3. From scanned books; 4. From speech audio files

3 What you need: MS Word; Notepad; a PDF-to-text conversion program (Simpo PDF to Text is a very good one)

4 Convert your files into plain text files. Prefer UTF-8 encoding, because it can encode every Unicode character and so handles languages such as Chinese, Russian, and Turkish. Give your files and folders meaningful names (consistent and systematic).
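
As a rough illustration of the two points on this slide, the sketch below re-saves a text file as UTF-8 and builds a systematic file name. The source encoding (`cp1252`, a common Windows default) and the naming scheme `group_number_label.txt` are assumptions made for the example, not something the presentation prescribes.

```python
from pathlib import Path


def corpus_filename(group: str, number: int, label: str) -> str:
    """Build a systematic file name (hypothetical scheme: group_NNN_label.txt)."""
    return f"{group}_{number:03d}_{label}.txt"


def save_as_utf8(src, dst, source_encoding: str = "cp1252") -> None:
    """Read a text file in its original encoding and write it back as UTF-8.

    source_encoding is an assumption; check what your files actually use.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```

For example, `corpus_filename("students", 7, "essay1")` gives `students_007_essay1.txt`, which sorts cleanly next to its neighbours in a folder listing.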

5 1. From MS Word Files to Text: Open your MS Word document. Choose File (or the MS Word symbol in the top left corner) > Save As > Other Formats, and set Save as Type = Plain Text. Click Save. The File Conversion window will pop up: select Other Encoding, highlight Unicode (UTF-8), and click OK. Close MS Word. Go back to the folder and find the text file you just saved. Double-click it and it will open in Notepad. Check how it looks.
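
If you have many documents, the same export can be approximated in a script. The sketch below is one possible approach, not the procedure from the slide: it assumes the files are in the newer `.docx` format (a zip archive containing `word/document.xml`), not the legacy `.doc` format, and it only collects paragraph text, ignoring tabs, footnotes, and headers.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path) -> str:
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

The returned string can then be written out with `Path("paper.txt").write_text(text, encoding="utf-8")` (the file name here is only a placeholder). It is still worth opening a few results in Notepad to check how they look.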

6 An easier way: Use a Word-to-text converter, for example Zilla: http://www.pdfzilla.com/zilla_word_to_text_converter.html

7 Clean the tables: Check each file and remove the parts that you do not want to include in your research. For example, you might want to exclude the names of the students, tables, figures, and references.
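
Part of this cleaning can be scripted. The sketch below is only a starting point under simple assumptions: the reference list starts at a line that says just "References", table and figure captions start with "Table 1", "Figure 2", and so on, and you supply the student names yourself. Real papers vary, so check the output by hand.

```python
import re


def clean_paper(text: str, names=()) -> str:
    """Remove the reference list, table/figure caption lines, and given names."""
    # Keep only what comes before a line consisting of the word "References"
    body = re.split(r"^[ \t]*references[ \t]*$", text, maxsplit=1,
                    flags=re.IGNORECASE | re.MULTILINE)[0]
    kept = []
    for line in body.splitlines():
        # Skip caption lines such as "Table 1. Scores" or "Figure 2: ..."
        if re.match(r"\s*(Table|Figure)\s+\d+", line):
            continue
        kept.append(line)
    cleaned = "\n".join(kept).strip()
    # Replace each supplied name with a neutral placeholder
    for name in names:
        cleaned = re.sub(re.escape(name), "[NAME]", cleaned)
    return cleaned
```

Note that this removes only the caption lines, not the table contents themselves; those usually still need manual deletion.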

8 2. From the World Wide Web to Notepad, and Notepad to Text: Find an article on the internet, maybe from an online newspaper. Using the mouse, left-click and drag to highlight the part of the text you want, then press Ctrl + C. Open Notepad and press Ctrl + V to paste it. Then choose File > Save As, set Encoding = UTF-8, and click Save.
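
When you have saved web pages as HTML files instead of copying by hand, a script can strip the markup for you. The sketch below uses Python's standard `html.parser` module; the class and function names are made up for this example. It keeps every piece of visible text, including menus and footers, so it does not replace selecting the article text yourself.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text chunks of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Because each text chunk goes on its own line, a sentence containing inline tags such as bold or links will be split across lines. Save the result with UTF-8 encoding, as in the Notepad steps.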

9 3. From scanned books: Scan every page and save the pages as searchable PDF files. Convert your PDF files to text files (you can use Simpo PDF to Text, Adobe Reader, or PDF Creator). Correct the mistakes (sometimes there are tons of them). Save the text files in UTF-8 encoding.
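
Some of the most common conversion mistakes are mechanical and can be fixed before you proofread. The sketch below shows three such fixes, chosen for illustration: rejoining words hyphenated at a line break, dropping lines that hold only a page number, and collapsing repeated spaces. It is an assumption that your files show these particular problems, and the hyphen rule will also wrongly join genuine compounds such as "well-" / "known".

```python
import re


def tidy_ocr_text(text: str) -> str:
    """Apply a few mechanical fixes to text converted from scanned pages."""
    # Join words split by a hyphen at a line break: "cor-\npus" -> "corpus"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain nothing but digits (likely page numbers)
    text = re.sub(r"^[ \t]*\d+[ \t]*\n", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces or tabs into a single space
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text
```

Misread characters (for example "rn" read as "m") still have to be corrected by hand or with a spell checker.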

10 Tag Your Corpus for Other Information: You may want to tag your corpus for information other than POS, for example hedges, pauses, disagreement, metaphors, grammar mistakes, and so on. You can enter the annotations by hand, or you can use a software program designed to make this process faster.
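
For features that can be spotted from a word list, a short script can make a first pass that you then correct by hand. The sketch below wraps hedging words in `<hedge>` tags; both the word list and the tag format are assumptions for this example, not a standard annotation scheme.

```python
import re

# Illustrative list only; a real study would need a motivated inventory
HEDGES = ["perhaps", "maybe", "might", "possibly", "seems"]


def tag_hedges(text: str, hedges=HEDGES) -> str:
    """Wrap each whole-word match from the hedge list in <hedge>...</hedge>."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(h) for h in hedges) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(r"<hedge>\1</hedge>", text)


print(tag_hedges("Perhaps this might work."))
# <hedge>Perhaps</hedge> this <hedge>might</hedge> work.
```

Categories such as metaphor or disagreement depend on context and cannot be found this way, which is why manual annotation or a dedicated annotation tool is still needed.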

11 Scenario 1: You have decided to create a corpus out of your students' papers. You asked your students to email their papers to you in MS Word format and they did. You want to study the types of contexts in which they prefer the passive voice.

12 Scenario 2 You have decided to create a corpus out of the applied linguistics books and articles that you have read. You want to compare lexical bundles in them with the ones you use in your academic papers. Luckily some of the articles were already in PDF format but you had to scan some of your books.

13 Scenario 3: You want to create a corpus of newspaper headlines from The New York Times and USA Today to compare their lengths.
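
Once the headlines are saved as text, the comparison itself is a small calculation. The sketch below assumes one headline per item and measures length in whitespace-separated words, which is only one possible definition (characters would be another). The function name is made up for the example.

```python
from statistics import mean


def mean_headline_length(headlines) -> float:
    """Average number of whitespace-separated words per headline."""
    return mean(len(headline.split()) for headline in headlines)
```

You would call it once per newspaper, for instance on the lines read from one text file for each paper, and compare the two averages.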

14 Scenario 4 You have decided to create a corpus out of your own writing. You want to use all of the academic papers you wrote during your MA years.

15 Is there a faster way to follow these procedures? Yes! If you know a programming language, such as Perl, you can write a script that automates most of the above-mentioned procedures.
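
As a small example of such automation, written in Python here instead of Perl, the sketch below re-saves every `.txt` file in one folder as UTF-8 in another folder. The source encoding `cp1252` and the function name are assumptions; adjust them to your own files.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, source_encoding: str = "cp1252"):
    """Re-save every .txt file in src_dir as UTF-8 in dst_dir.

    Returns the list of files written. source_encoding is an assumption.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.txt")):
        text = src.read_text(encoding=source_encoding)
        dst = dst_dir / src.name  # keep the original file name
        dst.write_text(text, encoding="utf-8")
        written.append(dst)
    return written
```

The same loop is the natural place to call any other per-file step, such as cleaning or tagging, so that the whole corpus is processed in one run.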
