
1 Practical Text Mining With Perl. Database Laboratory, 김민흠

2 3.7 Two Text Applications This section discusses two applications that are easy to program in Perl thanks to hashes. The first illustrates an important property of most texts, one that has consequences later in this book. The second develops some tools that are useful for certain types of word games.

3 3.7.1 Zipf’s Law for A Christmas Carol Program 3.2: A concordance program that finds matches for a regular expression. The file name, regex, and text extract radius are given as command-line arguments.
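The slide's code image did not survive in this transcript, so the following is only a minimal sketch of a concordance of that shape, not the book's Program 3.2. The helper name `concordance` and the whitespace flattening are assumptions made for illustration.

```perl
use strict;
use warnings;

# Return one extract per regex match: the match itself plus up to
# $radius characters of context on each side (hypothetical helper).
sub concordance {
    my ($text, $regex, $radius) = @_;
    my @extracts;
    while ($text =~ /$regex/g) {
        my $start = $-[0] - $radius;          # $-[0] is the match start offset
        $start = 0 if $start < 0;
        my $end = $+[0] + $radius;            # $+[0] is the match end offset
        $end = length($text) if $end > length($text);
        push @extracts, substr($text, $start, $end - $start);
    }
    return @extracts;
}

# Command-line use mirrors the slide: file name, regex, radius.
if (@ARGV == 3) {
    my ($file, $regex, $radius) = @ARGV;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };        # slurp the whole file
    close($fh);
    $text =~ s/\s+/ /g;                       # assumption: flatten line breaks
    print "$_\n" for concordance($text, $regex, $radius);
}
```

Saved under a name such as 78ex.pl, it would be run the way the later slides show: `perl 78ex.pl A_Christmas_Carol.txt -- 30`.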

4 3.7.1 Zipf’s Law for A Christmas Carol As discussed in sections 2.4.2 and 2.4.3, hyphens and apostrophes cause problems. Using Program 3.2, we can find all instances of potentially problematic punctuation. These cases let us decide how to handle the punctuation so that the words in the novel change as little as possible.

5 3.7.1 Zipf’s Law for A Christmas Carol First, dashes. Using command-line arguments: C:\>perl 78ex.pl A_Christmas_Carol.txt -- 30

6 3.7.1 Zipf’s Law for A Christmas Carol Second, single hyphens. C:\>perl 78ex.pl A_Christmas_Carol.txt "\w-\w" 30

7 3.7.1 Zipf’s Law for A Christmas Carol Third, apostrophes. Apostrophes are used for quotes within quotations as well as for possessive nouns. The latter produces one ambiguity: possessives of plural nouns ending in s, for example, seven years'. Another possible ambiguity is a contraction with an apostrophe at either the beginning or the end of a word.

8 3.7.1 Zipf’s Law for A Christmas Carol
perl 78ex.pl A_Christmas_Carol.txt "\w'\W" 30
perl 78ex.pl A_Christmas_Carol.txt "\W'\w" 30

9 Program 3.3: This program counts the frequency of each word in A Christmas Carol. The output is sorted by decreasing frequency and printed as a CSV (comma-separated values) file.
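A word count of this kind can be sketched with a hash keyed by word. The tokenizing rule below (lowercase, letters with internal hyphens or apostrophes kept together) and the helper names `word_counts` and `csv_lines` are assumptions, not the book's Program 3.3.

```perl
use strict;
use warnings;

# Count words in a string; returns a hash reference of word => count.
sub word_counts {
    my ($text) = @_;
    my %freq;
    # Assumption: lowercase everything and treat runs of letters
    # (with internal apostrophes or hyphens) as single words.
    for my $word (lc($text) =~ /[a-z]+(?:['-][a-z]+)*/g) {
        $freq{$word}++;
    }
    return \%freq;
}

# Produce CSV lines "word,count", most frequent first; ties alphabetical.
sub csv_lines {
    my ($freq) = @_;
    return map { "$_,$freq->{$_}" }
           sort { $freq->{$b} <=> $freq->{$a} or $a cmp $b } keys %$freq;
}

# Made-up sample sentence standing in for the novel's text.
my $sample = "Scrooge knew he was dead. Of course he did. Scrooge and he were partners.";
print "$_\n" for csv_lines(word_counts($sample));
```

Sorting by count and then alphabetically keeps the output stable when several words share a frequency, which is common in the long tail that Zipf's law describes.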

10 3.7.2.1 An Aid to Crossword Puzzles Finds words that fit a crossword puzzle entry. The regex works because CROSSWD.TXT holds one word per line. C:\>perl 85ex.pl "^\w{2}j\w{2}n\w$" The ^ and $ anchors in the regex restrict matches to words of exactly 7 characters.
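The same pattern can be tried against a small in-memory list. This is a sketch only: the helper `crossword_matches` and the five sample words are assumptions standing in for the program and CROSSWD.TXT.

```perl
use strict;
use warnings;

# Return the words that match a crossword pattern (hypothetical helper).
sub crossword_matches {
    my ($pattern, @words) = @_;
    my $re = qr/$pattern/i;               # compile once, ignore case
    return grep { $_ =~ $re } @words;
}

# Tiny stand-in for CROSSWD.TXT, one word per entry.
my @words = qw(adjoint adjourn rejoins majesty conjoin);

# Two unknown letters, j, two unknown letters, n, one unknown letter.
print "$_\n" for crossword_matches('^\w{2}j\w{2}n\w$', @words);
```

Without the anchors, the pattern would also match inside longer words, so ^ and $ are what limit the answer to 7-letter entries.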

11 3.7.2.2 Word Anagrams Builds an anagram dictionary. Each word is listed under an index string made of its letters sorted alphabetically. Example: bdac has the index string abcd.
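An anagram dictionary of this kind is a hash from index string to the words that share it. The sketch below uses hypothetical helper names (`index_string`, `anagram_dictionary`) and a made-up word list.

```perl
use strict;
use warnings;

# Index string: the word's letters sorted alphabetically, e.g. bdac -> abcd.
sub index_string {
    my ($word) = @_;
    my @letters = split //, lc $word;
    return join '', sort @letters;
}

# Map each index string to the list of words that share it.
sub anagram_dictionary {
    my (@words) = @_;
    my %dict;
    push @{ $dict{ index_string($_) } }, $_ for @words;
    return \%dict;
}

my $dict = anagram_dictionary(qw(stop post pots tops carol coral));
print index_string('bdac'), "\n";                               # prints abcd
print join(' ', @{ $dict->{ index_string('spot') } }), "\n";    # anagrams of spot
```

Looking up a word's anagrams is then a single hash access on its index string, with no searching through the word list.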

12 3.7.2.3 Finding Words in a Set of Letters Considers not only the whole group of letters but also its subgroups. Example: 8 letters yield 255 (2^8 - 1) nonempty subsets. Program 3.6: This program finds all words formed from subsets of a group of letters.
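One way to enumerate the subsets is to let each bit of a counter decide whether a letter is included, then look each subset up in an anagram dictionary. This is a sketch under that assumption, not the book's Program 3.6; `subset_keys` and `words_from_letters` are hypothetical names.

```perl
use strict;
use warnings;

# All distinct index strings for the nonempty subsets of a group of letters.
sub subset_keys {
    my ($letters) = @_;
    my @chars  = split //, lc $letters;
    my @sorted = sort @chars;
    my $n = @sorted;
    my %seen;
    for my $mask (1 .. 2**$n - 1) {           # each bit selects one letter
        my $key = join '', map { $sorted[$_] }
                           grep { $mask & (1 << $_) } 0 .. $n - 1;
        $seen{$key} = 1;                      # repeated letters collapse here
    }
    return sort keys %seen;
}

# Words from @wordlist that can be formed from some subset of the letters.
sub words_from_letters {
    my ($letters, @wordlist) = @_;
    my %dict;                                 # index string => words
    for my $word (@wordlist) {
        my @c = split //, lc $word;
        push @{ $dict{ join '', sort @c } }, $word;
    }
    my @found = map { @{ $dict{$_} || [] } } subset_keys($letters);
    return sort @found;
}

my @keys = subset_keys('abcdefgh');
print scalar(@keys), "\n";                    # 255 for 8 distinct letters
print join(' ', words_from_letters('tca', qw(cat act at a dog))), "\n";
```

Because the letters are sorted before the subsets are taken, every subset is already a valid index string for the dictionary lookup.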

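13 3.8.1 References and Pointers Example: $wordref stores the memory location of the. Reference: the location where a variable or other value is stored. Dereference: retrieving the value stored at that location, either by putting $ in front of the reference or by using ->.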
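The slide's example image is missing from this transcript, so here is a minimal sketch of the idea. Only $wordref comes from the slide; the other variable names and values are assumptions.

```perl
use strict;
use warnings;

my $word    = "the";
my $wordref = \$word;          # reference: holds the location of $word

print "$$wordref\n";           # dereference with a leading $, prints the
$$wordref = "a";               # writing through the reference changes $word
print "$word\n";               # prints a

my @words  = ("the", "carol");
my $arrref = \@words;          # reference to an array
print $arrref->[1], "\n";      # -> dereferences and indexes, prints carol
```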

14 3.8.1 References and Pointers How to create a reference: a backslash, or square brackets [ ] (anonymous array). Dereferencing an array or associative array: put @ or %, respectively, in front of the reference.
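Both ways of making a reference, and the @ and % dereferences, can be sketched as follows. All names and sample values here are made up for illustration.

```perl
use strict;
use warnings;

my @ghosts = ("Past", "Present", "Future");
my %count  = (scrooge => 3, marley => 1);   # arbitrary sample values

my $aref = \@ghosts;               # backslash: reference to a named array
my $href = \%count;                # backslash: reference to a named hash
my $anon = ["Tiny", "Tim"];        # square brackets: anonymous array

my @copy = @$aref;                 # @ prefix dereferences to the whole array
my %copy = %$href;                 # % prefix dereferences to the whole hash

print scalar(@copy), "\n";                # prints 3
print join(" ", @{$anon}), "\n";          # prints Tiny Tim
print join(",", sort keys %copy), "\n";   # prints marley,scrooge
```

Note that @copy and %copy are copies: changing them leaves @ghosts and %count untouched, whereas changes made through $aref or $href affect the originals.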

15 3.8.1 References and Pointers Hashes (associative arrays): => can be used in place of a comma. Anonymous hashes use curly braces { }.
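A short sketch of both points, with made-up keys and counts:

```perl
use strict;
use warnings;

my %named = (the => 2, carol => 1);        # => in place of a comma
my %same  = ('the', 2, 'carol', 1);        # plain commas build the same hash
my $anon  = { the => 2, carol => 1 };      # curly braces: anonymous hash reference

print $anon->{the}, "\n";                  # arrow form, prints 2
print ${$anon}{carol}, "\n";               # prefix form, prints 1
print "$_=$anon->{$_}\n" for sort keys %$anon;
```

Besides reading better for key/value pairs, => quotes a bareword on its left, which is why `the` and `carol` need no quotation marks in the first form.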

16 3.8.2 Arrays of Arrays and Beyond An array of arrays is a list of anonymous arrays. All three access forms are equivalent. Putting $data[0] inside @{ } dereferences it. An arrow may be placed between the [ ] [ ] subscripts.
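The slide's code did not survive, so the three forms below are an assumption about which equivalent expressions were shown. They are the standard Perl ways to reach one element of an array of arrays; the numbers in @data are arbitrary.

```perl
use strict;
use warnings;

my @data = ( [1, 2, 3], [4, 5, 6] );   # a list of anonymous arrays

my $x1 = ${ $data[0] }[1];    # explicit dereference of $data[0]
my $x2 = $data[0]->[1];       # arrow between the subscripts
my $x3 = $data[0][1];         # arrow omitted; allowed between subscripts

my @first_row = @{ $data[0] };    # @{ } dereferences the whole first row

print "$x1 $x2 $x3\n";        # prints 2 2 2
print "@first_row\n";         # prints 1 2 3
```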

17 3.8.2 Arrays of Arrays and Beyond Code 3.31 $#data gives the last index of @data.
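Code 3.31 itself is not in this transcript. As a rough illustration of $#data, the sketch below walks an array of arrays whose rows have different lengths; the data and the output format are assumptions.

```perl
use strict;
use warnings;

my @data = ( [1, 2, 3], [4, 5], [6] );    # rows of unequal length

my @lines;
for my $i (0 .. $#data) {                 # $#data is the last row index
    my $row = $data[$i];
    for my $j (0 .. $#{$row}) {           # $#{$row} is the last index of that row
        push @lines, sprintf("row %d, col %d: %s", $i, $j, $row->[$j]);
    }
}
print "$_\n" for @lines;
```

Since $#data is one less than the number of elements, the range 0 .. $#data visits every index exactly once, and an empty array gives an empty range.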

18 3.8.2 Arrays of Arrays and Beyond Code 3.32

19 3.8.2 Arrays of Arrays and Beyond Code 3.33

20 Thank you

