1 Corpus Processing. Kristina Kocijan, Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb. Tutorial Part 1

2 Outline: 1. Text (.not): create a new text; import a text in any file format (e.g. pdf, xml); open a text file. 2. Corpus (.noc): create a corpus; open a corpus. 3. Query: NooJ RegEx. 4. Concordances: build concordances; export concordances. 5. Statistical analysis. LREC NooJ Tutorial: Corpus Processing

3 Text Units = TUs: independent areas inside which linguistic resources are applied. *.not

4 Text: Open

5 Text: Import - PDF

6 Text: Import - XML

7 Text: Import - XML

8 Text: Create New. Type new text, or copy and paste text from another document.

9 Corpus: a collection of text files that share the same linguistic resources. *.noc

10 Corpus: Constructing

11 Corpus: Constructing

12 Corpus: Open Existing

13 Corpus: New

14 Locate Pattern: 4 types of patterns

15 Locate Pattern: 1. String of characters. Example: lady

16 Locate Pattern: 2. Perl RegEx. Examples: [0-9] [1-2][0-9][0-9][0-9] [aeiou] [aeiou] [aeiou] [aeiou]
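
The character classes on this slide can be tried outside NooJ as well. The following is a minimal sketch using Python's `re` module, whose syntax for these simple classes matches Perl's. The sample sentence and the variable names are assumptions made for illustration; they do not come from the tutorial corpus.

```python
import re

# Sample sentence invented for illustration; not taken from the tutorial corpus.
text = "The young lady arrived in 1998 and left in 2004."

digit = re.compile(r"[0-9]")                 # any single digit
year = re.compile(r"[1-2][0-9][0-9][0-9]")   # four digits, the first being 1 or 2
vowel = re.compile(r"[aeiou]")               # any single lowercase vowel

digits = digit.findall(text)
years = year.findall(text)
vowel_count = len(vowel.findall(text))

print(digits)
print(years)
print(vowel_count)
```

Each pattern matches one character per bracketed class, so a four-class pattern such as the year expression only matches runs of exactly four consecutive digits.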

17 Locate Pattern: 3. NooJ RegEx. lady; young lady (AND, concatenation); ( a | the ) lady (parenthesis); lady | girl (OR, disjunction |)
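
To make the three operators concrete, here is a rough sketch that translates a tiny NooJ-style expression (literal words, `|` and parentheses only) into a Python regular expression over whitespace-separated words. The helper `nooj_like_to_regex` is hypothetical and is not part of NooJ; it ignores NooJ's special symbols and dictionaries and only mimics concatenation, disjunction and grouping.

```python
import re

def nooj_like_to_regex(expr):
    """Hypothetical helper: turn an expression made of literal words,
    '|' and parentheses into a Python regex over whitespace-separated words."""
    tokens = re.findall(r"[()|]|[^\s()|]+", expr)
    parts = []
    need_space = False  # True when the previous token ended a word or a group
    for tok in tokens:
        if tok == "|":
            parts.append("|")            # disjunction (OR)
            need_space = False
        elif tok == ")":
            parts.append(")")            # close a parenthesised group
            need_space = True
        else:
            if need_space:
                parts.append(r"\s+")     # concatenation (AND): words in sequence
            parts.append("(?:" if tok == "(" else re.escape(tok))
            need_space = tok != "("
    return re.compile(r"\b(?:" + "".join(parts) + r")\b")

# Sample sentence invented for illustration.
text = "A young lady and the lady met a girl."

concat = nooj_like_to_regex("young lady").findall(text)
grouped = nooj_like_to_regex("( a | the ) lady").findall(text)
either = nooj_like_to_regex("lady | girl").findall(text)

print(concat, grouped, either)
```

The match is case-sensitive, so the capitalised "A" at the start of the sample sentence is not treated as the word "a".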

18 Examples: write a NooJ RegEx that will find all the Mr, Mrs and Miss followed by a name; that will find all the words written in upper case followed by any string of digits. If = empty string, write a NooJ RegEx that will find all the examples where ‘is’ is followed by 0, 1 or 2 words of any kind, followed in turn by ‘the’, ‘this’ or ‘that’; instead of ‘is’, recognize any form of the verb ‘to be’; between the ‘to be’ forms and ‘the’, ‘this’, ‘that’ there can be any number of word forms. (Mr.|Mrs.|Miss) ( )* is ( | | ) (the|this|that) ( | | ) (the|this|that) * (the|this|that)
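
The last three exercises can be approximated with ordinary Python regular expressions, which may help when checking what a query is expected to return. This is only a sketch under stated assumptions: it is not NooJ syntax, the explicit `BE_FORMS` list stands in for the dictionary lookup NooJ would perform for the verb 'to be', and the sample text is invented.

```python
import re

# Assumption: a hand-written list replaces NooJ's dictionary-based lemma lookup.
BE_FORMS = r"(?:am|is|are|was|were|be|been|being)"
DETERMINERS = r"(?:the|this|that)"
WORD = r"(?:\s+\w+)"  # one arbitrary word, preceded by whitespace

# 'is' + 0, 1 or 2 words + the/this/that
is_then_det = re.compile(r"\bis" + WORD + r"{0,2}\s+" + DETERMINERS + r"\b")
# any form of 'to be' + 0, 1 or 2 words + the/this/that
be_then_det = re.compile(r"\b" + BE_FORMS + WORD + r"{0,2}\s+" + DETERMINERS + r"\b")
# any form of 'to be' + any number of words + the/this/that
be_any_gap = re.compile(r"\b" + BE_FORMS + WORD + r"*\s+" + DETERMINERS + r"\b")

# Sample text invented for illustration.
text = ("This is the end. It is really not that hard. "
        "They were certainly not going to accept that.")

short_is = is_then_det.findall(text)
short_be = be_then_det.findall(text)
long_be = be_any_gap.findall(text)

print(short_is)
print(short_be)
print(long_be)
```

Because `\w+` does not cross punctuation, none of the patterns can run from one sentence into the next in this sample.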

20 Locate Pattern: 4. NooJ Grammar

21 Locate Pattern? Will probably see; Shall probably never see; Is probably going to see; Are probably about to see
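
One way to think about a single pattern covering all four phrases is sketched below as a Python regular expression. It is an illustration only, not the NooJ grammar the slide asks for, and the optional middle part is limited to the three variants that appear on the slide.

```python
import re

# Auxiliary + 'probably' + optional 'never' / 'going to' / 'about to' + 'see'
pattern = re.compile(
    r"\b(?:will|shall|is|are)\s+probably"
    r"(?:\s+(?:never|going\s+to|about\s+to))?"
    r"\s+see\b",
    re.IGNORECASE,
)

phrases = [
    "Will probably see",
    "Shall probably never see",
    "Is probably going to see",
    "Are probably about to see",
]

all_found = all(pattern.search(p) is not None for p in phrases)
no_false_hit = pattern.search("will see") is None

print(all_found, no_false_hit)
```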

22 Statistical Analysis

23 Statistical Analysis

24 Statistical Analysis

25 Linguistic Units and Annotations. Max Silberztein, University of Besançon. Next: Tutorial Part 2

