
1 Digital Text and Data Processing: Tokenisation

2 Today’s class
□ Tokenisation and creation of frequency lists
□ Keyword-in-context lists
□ Moretti and distant reading
□ Research projects and assignment 1

3 Revision
□ Regular expressions
□ Simple sequences of characters
□ Character classes, e.g. \w, \d or the wildcard .
□ Quantifiers, e.g. {2,4} or ?, +, *
□ Anchors, e.g. \b, ^, $
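As a quick refresher, the sketch below applies each of the revised constructs to one sample sentence. The sentence and the variable names are made up for illustration and do not come from the slides.

```perl
use strict;
use warnings;

# Hypothetical sample sentence, chosen so that "in" occurs both as a
# separate word and inside "printed".
my $text = "Page 322 is printed in chapter 12.";

# Simple sequence of characters: matches the literal string "Page".
my $has_page = ( $text =~ /Page/ ) ? 1 : 0;

# Character class plus quantifier: \d{2,4} matches a run of two to four digits.
my @numbers = ( $text =~ /\b(\d{2,4})\b/g );

# Anchors: ^ ties the pattern to the start of the string, $ to the end.
my $starts_with_page = ( $text =~ /^Page/ ) ? 1 : 0;
my $ends_with_stop   = ( $text =~ /\.$/ )   ? 1 : 0;

# \b marks a word boundary, so this finds "in" only as a separate word,
# not the "in" inside "printed".
my @in_words = ( $text =~ /\bin\b/g );

print "numbers: @numbers\n";
print "standalone 'in': ", scalar(@in_words), "\n";
```

Note that the full stop in the last anchor example is escaped as \. because an unescaped . would match any character.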

4 Match variables
□ Parentheses create substrings within a regular expression
□ In Perl, this substring is stored in the variable $1
□ Example:
    $keyword = "quick-thinking" ;
    if ( $keyword =~ /(\w+)-\w+/ ) {
        print $1 ;   # This will print "quick"
    }

5 Three types of variables
□ Scalars: a single value; start with $
□ Arrays: multiple values; start with @
    @titles = ("Ulysses", "Dubliners", "Finnegans Wake") ;
□ Hashes: multiple values which can be referenced with 'keys'; start with %
    my %isbn ;
    $isbn{"9782070439713"} = "Ulysses" ;

6 Basic tokenisation
    $line = "If music be the food of love, play on" ;
    @array = split(" ", $line ) ;
    # $array[0] contains "If"
    # $array[4] contains "food"
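One thing worth noticing about splitting on whitespace is that punctuation stays attached to the neighbouring word. The sketch below uses the same line of text; the clean-up step (lower-casing and trimming non-word characters from both ends) is an assumed extra, not something shown on the slide.

```perl
use strict;
use warnings;

my $line = "If music be the food of love, play on";

# split with a single-space string is a special case in Perl: it splits on
# any run of whitespace and ignores leading whitespace.
my @tokens = split( " ", $line );

print scalar(@tokens), " tokens\n";
print "token 6 is '$tokens[6]'\n";    # the comma is still attached

# Assumed clean-up step: lower-case each token and strip non-word
# characters from its start and end.
my @clean = map {
    my $t = lc $_;
    $t =~ s/^\W+|\W+$//g;
    $t;
} @tokens;

print "cleaned token 6 is '$clean[6]'\n";
```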

7 Looping through an array
    foreach my $w ( @words ) {
        print $w ;
    }

8 Creating a hash; assigning / updating a value
    my %freq ;                    # Creating a hash
    $freq{"if"}++ ;               # Assigning / updating a value
    $freq{"music"}++ ;
    print $freq{"if"} . "\n" ;

9 Calculation of frequencies
    my %freq ;
    foreach my $w ( @words ) {
        $freq{ $w }++ ;
    }

10 Looping through a hash
    foreach my $f ( keys %freq ) {
        print $f . "\t" . $freq{$f} . "\n" ;
    }

11 Sorting a hash
    foreach my $f ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
        print $f . "\t" . $freq{$f} . "\n" ;
    }
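Putting the counting and sorting steps together, a minimal sketch of a complete frequency list might look like this. The word list is invented for illustration, and the alphabetical tie-break is an assumed addition: because Perl returns hash keys in no fixed order, words with equal counts would otherwise come out in an unpredictable order.

```perl
use strict;
use warnings;

# Hypothetical token list standing in for the words of a real text.
my @words = qw(the cat and the dog and the bird);

# Count how often each word occurs.
my %freq;
$freq{$_}++ for @words;

# Highest count first (numeric comparison with <=>); words with the same
# count are ordered alphabetically (string comparison with cmp).
my @ranked = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;

foreach my $w (@ranked) {
    print "$w\t$freq{$w}\n";
}
```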

12 But she returned to the writing-table, observing, as she passed her son, "Still page 322?" Freddy snorted, and turned over two leaves. For a brief space they were silent. Close by, beyond the curtains, the gentle murmur of a long conversation had never ceased.

13 Is it actually a word?
    foreach my $w ( @words ) {
        if ( $w =~ /(\w+)/ ) {
            $freq{ $1 }++ ;
        }
    }
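To see the word check at work, the sketch below runs it over the first two sentences of the passage on slide 12. Two details are assumptions rather than part of the slide: the pattern is widened to allow internal hyphens and apostrophes, so that "writing-table" is kept as one token, and tokens are lower-cased before counting.

```perl
use strict;
use warnings;

my $text = 'But she returned to the writing-table, observing, as she passed '
         . 'her son, "Still page 322?" Freddy snorted, and turned over two leaves.';

my %freq;
foreach my $w ( split( " ", $text ) ) {
    # Keep the first run of word characters, optionally continued by
    # "-" or "'" plus more word characters; surrounding punctuation and
    # quotation marks are left out of the capture.
    if ( $w =~ /(\w+(?:[-']\w+)*)/ ) {
        $freq{ lc $1 }++;
    }
}

print "she: ", $freq{she}, "\n";
print "writing-table: ", $freq{'writing-table'}, "\n";
```

Because \w also matches digits, the page number 322 is counted as a token here; whether that is desirable depends on the research question.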

