R&D Group: Development Puts People First, Communication Creates Value. Liqi Gao, Text Operations.

1 Text Operations, Liqi Gao (R&D Group)

2 Utilities
   - C/C++ libraries
   - Perl (ActivePerl)
   - Regular expressions
   - EditPlus / UltraEdit
   - Excel

3 C/C++ Language
   Standard library:
   - Read a line
   - Remove a CR or LF
   - Split a line
   C++ Boost library:
   - Case conversion
   - Trimming
   - Replace algorithms
   - Find algorithms
   - Split

4 C/C++: Read a Line
   Simple, but useful! Three methods (shown as code on the original slide; not preserved in this transcript).

5 C/C++: Remove CR/LF
   Reading a line on Windows and on Linux: Windows text files end each line with CR LF, Linux text files with LF alone.

6 C/C++: Remove CR/LF (cont.)
   The annoying CR (carriage return): a line read on Linux from a Windows-format file still ends with a CR once the LF has been removed.
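The slide's own code is not preserved. One way to handle both line-ending styles is to strip every trailing CR and LF after reading a line, sketched below. The name chomp is borrowed from Perl and is an assumption, not the author's function.

```cpp
#include <string>

// Remove all trailing '\r' and '\n' characters from a line, in place.
// Works for LF (Linux), CR LF (Windows), and lines with no terminator.
void chomp(std::string& line) {
    while (!line.empty() && (line.back() == '\n' || line.back() == '\r')) {
        line.pop_back();
    }
}
```

Calling chomp on a line obtained from std::getline or std::fgets gives the same text regardless of which platform produced the file.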

7 C/C++: Split a Line
   Split a line by a specific character.
   (Figure: the string HELLOWORLD! shown before and after the split; the spacing and the delimiter are lost in this transcript.)

8 C/C++: Split a Line (cont.)
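The slide's code is missing from the transcript. A common standard-library way to split a line on one character is std::getline with a delimiter argument, sketched here. The function name split_line is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split `line` at every occurrence of `delimiter`.
// Empty fields in the middle are kept ("a,,b" gives "a", "", "b");
// a single trailing delimiter does not produce a trailing empty field.
std::vector<std::string> split_line(const std::string& line, char delimiter) {
    std::vector<std::string> fields;
    std::istringstream stream(line);
    std::string field;
    while (std::getline(stream, field, delimiter)) fields.push_back(field);
    return fields;
}
```

For example, split_line("HELLO WORLD!", ' ') returns the two fields "HELLO" and "WORLD!".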

9 C++ Boost: Case Conversion
   to_upper: convert a string to upper case
   to_lower: convert a string to lower case

10 C++ Boost: Trimming & Replace

11 C++ Boost: Split
   split(): splits the input into parts

12 Regular Expression
   Regular expressions are a powerful tool for string operations.

   Operator  Explanation                        Example
   *         0 or more times                    be*     matches b, be, bee, beee, ...
   ?         0 or 1 time                        be?     matches b, be
   +         1 or more times                    be+     matches be, bee, beee, ...
   []        any one of the enclosed characters [A-Z]
   [^]       any one character not enclosed     [^a-z]
   ()        group                              (abc)+
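To try the operators in the table from C++, a small helper around std::regex (C++11, default ECMAScript syntax) is enough. This is a sketch; the helper name matches is hypothetical.

```cpp
#include <regex>
#include <string>

// True when the WHOLE of `text` matches `pattern`.
bool matches(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, matches("b", "be*") and matches("abcabc", "(abc)+") are true, while matches("b", "be+") and matches("bee", "be?") are false, because + needs at least one e and ? allows at most one.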

13 An Example
   Find (the pattern starts with a space):  *\([0-9/ ]+\) *[0-9\.\?]+%
   Replace with: (empty)

   Find: ^( *)([0-9]+)( *)
   Replace with: \2\t
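The two find-and-replace rules on this slide use editor-style syntax. The sketch below applies the same two rules with std::regex_replace. Two things differ from the slide: std::regex writes the second group as $2 rather than \2, and the escapes inside the bracket expression are dropped because . and ? are literal there. The function name clean_line and the sample input in the usage note are hypothetical, since the slide does not show the text being edited.

```cpp
#include <regex>
#include <string>

// Rule 1: delete a parenthesised run of digits, slashes and spaces that is
//         followed by a percentage, e.g. " (3/40) 7.5%".
// Rule 2: replace a leading number and the spaces around it
//         with the number followed by a tab.
std::string clean_line(const std::string& line) {
    static const std::regex bracketed_percent(R"( *\([0-9/ ]+\) *[0-9.?]+%)");
    static const std::regex leading_number(R"(^( *)([0-9]+)( *))");
    const std::string without_percent =
        std::regex_replace(line, bracketed_percent, "");
    return std::regex_replace(without_percent, leading_number, "$2\t");
}
```

For the made-up input "  12  some text (3/40) 7.5%", the first rule leaves "  12  some text" and the second turns it into "12", a tab, then "some text".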

14 An Introduction to Perl
   Perl (Practical Extraction and Report Language) excels at pattern search and text manipulation.
   Open source / free software:
   - Cheap! Free and available for all systems
   - Can be used and installed without restriction
   - Open source promotes portability
   - Vastly expandable through freely available modules (add-on libraries in the CPAN repository)
   - Fewer restrictions and lower cost for commercial use
   - Fancy development tools can be bought if desired
   - Centralized source and a linear development path avoid vendor vicissitudes and incompatibilities!

15 Perl is not compiled
   C (compiled): source code -> C compiler -> binary executable

      #include <stdio.h>
      int main() {
          float x;
          x = 6e9;
          printf("Hello world!\n");
          printf("All %.0f of you!\n", x);
          return 0;
      }

   Perl (interpreted): source code -> Perl interpreter -> output

      #!/usr/bin/perl
      $x = 6e9;
      print "Hello world!\n";
      printf "All %d of you!\n", $x;

   Output:
      Hello world!
      All 6000000000 of you!

   Source code: plain text (ASCII), human readable, human editable, platform independent.
   Binary executable: NOT human readable, NOT human editable, NOT platform independent!

16 A Taste of Perl: print a message
   perltaste.pl: Greet the entire world.

      #!/usr/bin/perl -w
      # The line above is the command interpretation header.
      $x = 6e9;                          # variable assignment statement
      print "Hello world!\n";            # function call (output statement)
      printf "All %d of you!\n", $x;     # function call (output statement)

17 Scalar Values
   Numerical values:
   - integer: 5, "3", 0, -307
   - floating point: 6.2e9, -4022.33
   - hexadecimal/octal: 0x0d4f, 0477
   NOTE: numerical values are usually stored as double-precision floating-point numbers (Perl keeps native integers internally where it can).

18 String Values
   Double-quoted: interpolates (replaces a variable name or control character with its value)
   Single-quoted: no interpolation (taken as-is)
   Quoting operators: qq//, qw//, etc.

      $day = "Monday";

      Expression            Prints
      "Happy Monday!\n"     Happy Monday! (followed by a newline)
      "Happy $day!\n"       Happy Monday! (followed by a newline)
      'Happy Monday!\n'     Happy Monday!\n
      'Happy $day!\n'       Happy $day!\n

19 String Manipulation: Concatenation
      $dna1 = "ACTGCGTAGC";
      $dna2 = "CTTGCTAT";
   Juxtapose in a string assignment or print statement:
      $new_dna = "$dna1$dna2";
   Use the concatenation operator '.':
      $new_dna = $dna1 . $dna2;
   Add segments serially using incremental concatenation:
      $new_dna = $dna1;
      $new_dna .= $dna2;     # shorthand for: $new_dna = $new_dna . $dna2;

20 Substitution
   DNA transcription: T -> U
   Substitution operator s///:
      $dna = "GATTACATACACTGTTCA";
      $rna = $dna;
      $rna =~ s/T/U/g;     # "GAUUACAUACACUGUUCA"
   Exercise: Start with $dna = "gattACataCACTgttca"; and do the same as above. Print $rna to the screen.

21 transcribe.pl:
      $dna = "gattACataCACTgttca";
      $rna = $dna;
      $rna =~ s/T/U/g;
      print "DNA: $dna\n";
      print "RNA: $rna\n";
   Does it do what you expect? If not, why not?
   Patterns in substitution are case-sensitive! What can we do?
   - Convert all letters to upper (or lower) case (preferred when possible)
   - To retain mixed case, use the transliteration operator tr///:
      $rna =~ tr/tT/uU/;

22 Case conversion
      $string = "acCGtGcaTGc";
   Upper case:
      $dna = uc($string);     # "ACCGTGCATGC"
      or $dna = uc $string;
      or $dna = "\U$string";
   Lower case:
      $dna = lc($string);     # "accgtgcatgc"
      or $dna = "\L$string";
   Sentence case:
      $dna = ucfirst(lc($string));     # "Accgtgcatgc"
      or $dna = "\u\L$string";

23 Perl in NLP
   - Dictionary lookup
   - Word frequency
   - Chinese word segmentation
   - Part-of-speech (POS) tagging
   - ... and whatever else you might need

24 Case study

25 Thanks for your attention

