Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.

Slides:



Advertisements
Similar presentations
Session 3BBK P1 ModuleApril 2010 : [#] Regular Expressions.
Advertisements

Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)
Searching using regular expressions. A regular expression is also a ‘special text string’ for describing a search pattern. Regular expressions define.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Regular Expressions in Perl By Josue Vazquez. What are Regular Expressions? A template that either matches or doesn’t match a given string. Often called.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Regular Expression Original Notes by Song Guo. What Regular Expressions Are Exactly - Terminology a regular expression is a pattern describing a certain.
ISBN Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl –(on reserve.
Asp.NET Core Vaidation Controls. Slide 2 ASP.NET Validation Controls (Introduction) The ASP.NET validation controls can be used to validate data on the.
LING 388: Language and Computers Sandiway Fong Lecture 2: 8/23.
CS 330 Programming Languages 10 / 10 / 2006 Instructor: Michael Eckmann.
Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl Linux editors and commands (e.g.
Regular Expressions In ColdFusion and Studio. Definitions String - Any collection of 0 or more characters. Example: “This is a String” SubString - A segment.
Scripting Languages Chapter 8 More About Regular Expressions.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Social Web Design & Research 社群網站設計 & 研究 Social Web Design & Research 1.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1.
Social Web Design 社群網站設計 1 Darby Chang 張天豪 Social Web Design 社群網站設計.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Social Web Design & Research 社群網站設計 & 研究 Social Web Design & Research 1.
Regex Wildcards on steroids. Regular Expressions You’ve likely used the wildcard in windows search or coding (*), regular expressions take this to the.
More on Regular Expressions Regular Expressions More character classes \s matches any whitespace character (space, tab, newline etc) \w matches.
Regular Expressions in ColdFusion Applications Dave Fauth DOMAIN technologies Knowledge Engineering : Systems Integration : Web.
Regular Language & Expressions. Regular Language A regular language is one that a finite state machine (fsm) will accept. ‘Alphabet’: {a, b} ‘Rules’:
Last Updated March 2006 Slide 1 Regular Expressions.
Regular Expressions Week 07 TCNJ Web 2 Jean Chu. Regular Expressions Regular Expressions are a powerful way to validate and format text strings that may.
Tutorial 14 Working with Forms and Regular Expressions.
Overview of the grep Command Alex Dukhovny CS 265 Spring 2011.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
 Text Manipulation and Data Collection. General Programming Practice Find a string within a text Find a string ‘man’ from a ‘A successful man’
Computer Programming for Biologists Class 5 Nov 20 st, 2014 Karsten Hokamp
INFO 320 Server Technology I Week 7 Regular expressions 1INFO 320 week 7.
1 Regular Expressions CIS*2450 Advanced Programming Techniques Material for this lectures has been taken from the excellent book, Mastering Regular Expressions,
WDV 331 Dreamweaver Applications Find and Replace Dreamweaver CS6 Chapter 20.
Regular Expression JavaScript Web Technology Derived from:
RegExp. Regular Expression A regular expression is a certain way to describe a pattern of characters. Pattern-matching or keyword search. Regular expressions.
Regular Expressions Regular expressions are a language for string patterns. RegEx is integral to many programming languages:  Perl  Python  Javascript.
Perl and Regular Expressions Regular Expressions are available as part of the programming languages Java, JScript, Visual Basic and VBScript, JavaScript,
Social Web Design 社群網站設計 1 Darby Chang 張天豪 Social Web Design 社群網站設計.
Regular Expressions in PHP. Supported RE’s The most important set of regex functions start with preg. These functions are a PHP wrapper around the PCRE.
Overview A regular expression defines a search pattern for strings. Regular expressions can be used to search, edit and manipulate text. The pattern defined.
Working with Forms and Regular Expressions Validating a Web Form with JavaScript.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Regular Expressions. Overview Regular expressions allow you to do complex searches within text documents. Examples: Search 8-K filings for restatements.
Regular Expressions for PHP Adding magic to your programming. Geoffrey Dunn
ECA 225 Applied Interactive Programming1 ECA 225 Applied Online Programming regular expressions.
CS346 Regular Expressions1 Pattern Matching Regular Expression.
20-753: Fundamentals of Web Programming 1 Lecture 10: Server-Side Scripting II Fundamentals of Web Programming Lecture 10: Server-Side Scripting II.
CSC 2720 Building Web Applications PHP PERL-Compatible Regular Expressions.
Copyright © Curt Hill Regular Expressions Providing a Search Pattern.
Unit 11 –Reglar Expressions Instructor: Brent Presley.
NOTE: To change the image on this slide, select the picture and delete it. Then click the Pictures icon in the placeholder to insert your own image. ADVANCED.
Introduction to Programming the WWW I CMSC Winter 2004 Lecture 13.
Regular Expressions In Javascript cosc What Do They Do? Does pattern matching on text We use the term “string” to indicate the text that the regular.
Regular Expressions Copyright Doug Maxwell (
RE Tutorial.
Looking for Patterns - Finding them with Regular Expressions
Advanced Find and Replace with Regular Expressions
Selenium WebDriver Web Test Tool Training
ADVANCE FIND & REPLACE WITH REGULAR EXPRESSIONS
PHP –Regular Expressions
Presentation transcript:

Molecular Biomedical Informatics 分子生醫資訊實驗室 Web Programming 網際網路程式設計 1

Bot 機器人 2 Social Web Design 社群網站設計

Bot, spider or crawler

Bot is useful Extract data from other sources –e.g. random numbers from random.org –e.g. word correction by google Commit data to other destinations –e.g. vote cheating or Search Engine Optimization ( 搜尋引 擎最佳化, SEO) Automation –e.g. backup (daily archive user data and ftp it out) It could be performed by-request or periodically Social Web Design 社群網站設計 4

You found a good service

You want to automate it

What’s Social Web Design 社群網站設計 7 the next step you do

Watch here

Just get the content In Perl –use LWP::Simple; my $content = get " In PHP –$content = file_get_contents(" In JavaScript (node.js) –http.request({ host: ' path: '/integers/?num=100...' }, function(response){}) Social Web Design 社群網站設計 9

A more general way curl -s ' –try the powerful Unix commands –-s indicates silent mode (up to you) In Perl, the backquote –`curl...` In PHP –exec('curl...') In JavaScript, everything is asynchronous –require('child_process').exec('curl', function(){}) Social Web Design 社群網站設計 10

After this, what’s your next step?

How many Social Web Design 社群網站設計 12 code you need

One line Social Web Design 社群網站設計 13 /"data">(.+?)</s and print $1;

Regular Expression 正規表示式 14 學會它,一輩子受用無窮 Social Web Design 社群網站設計

Regular expression A regular expression, or regex for short, is a pattern describing a set of strings that satisfy the pattern –e.g. ^A means all strings that start with an ‘A’ –in this slide, a string and its matched part is highlighted as Adam –in most regex engines (JavaScript, Perl, vi…), a regex is specified in between two slashes like /^A/ Social Web Design 社群網站設計 15

Literal character a matches a a matches Jack is a boy –match the first occurrence –not Jack is a boy Characters with special meanings –[ ] \. ^ $ | ? * { } + ( ) –meta-characters –use backslash to escape them –matches 1+1=2 by 1\+1=2 Social Web Design 社群網站設計 16

Character class/set Matches only one out of several characters –[ae] matches a or e –gr[ae]y matches gray or grey –gr[ae]y does not match graey A hyphen inside a character class indicates range –[0-9] matches a single digit between 0 and 9 –combine with single characters [0-9a-fxA-FX] A caret after the opening square bracket indicates not –q[^x] matches question but not Iraq Social Web Design 社群網站設計 17

Shorthand character classes \d matches a digit character \w matches a word character (alphanumeric characters and underscore) \s matches a whitespace character (space, tab and line break) \D, \W and \S are the complement sets Social Web Design 社群網站設計 18

The dot. matches (almost) any char, except the line break –short for [^\n] Most regex engines have a single line mode that makes dot also match line break –/regex/s Social Web Design 社群網站設計 19

Anchors Anchors do not match any characters but match a position ^ matches at the start of the string $ matches at the end of the string \b matches at a word boundary –\bis\b matches This island is nice Social Web Design 社群網站設計 20

Alternation The regular expression equivalent of “or” –cat|dog matches About cats and dogs –can be more than two cat|dog|mouse|fish Social Web Design 社群網站設計 21

Repetition The question mark indicates optional (zero or one time) –colou?r matches colour or color The asterisk indicates zero or more times – matches an HTML tag without any attributes The plus indicates one or more times – is easier to write but matches invalid tags such as Curly brace indicates specific amount of repetition –\b[1-9][0-9]{3}\b matches 1000 to 9999 –\b[1-9][0-9]{2,4}\b matches 100 to Social Web Design 社群網站設計 22

Greedy and lazy repetition The repetition is greedy by default – matches This is good A question mark make it lazy – matches This is good Social Web Design 社群網站設計 23

Grouping and back-references Round bracket indicates group –Set(Value)? matches Set or SetValue Round bracket creates back-references –how to access the back-references depends on the regex engine –$0 = SetValue and $1 = Value or $0 = Set and $1 is null in Perl Social Web Design 社群網站設計 24

Any Questions? Social Web Design 社群網站設計 25 about regex

How many Social Web Design 社群網站設計 26 meta-characters you remember [ ] \. ^ $ | ? * { } + ( )

One line Social Web Design 社群網站設計 27 /"data">(.+?)</s and print $1;

Content in between "data"> and < Any characters, use lazy repetition to prevent spanning tags The single line mode is required for spanning lines Use brackets to create the back-reference and print it

Painful? 29 But, before parsing the content Social Web Design 社群網站設計

Parse the request m=100&min=1&max=100&col=5&base=10& format=html&rnd=new Social Web Design 社群網站設計 30

Parse the request m=100&min=1&max=100&col=5&base=10& format=html&rnd=new This is usually done manually Social Web Design 社群網站設計 31

Occurrences in Google my $kw = 'color'; my $ua = '-A "Mozilla/5.0"'; $kw =~ s/\s+/+/g; my $url = " $_ = `curl $ua -s '$url'`; /"resultsStats".*? (\S+)/ and print $1; Social Web Design 社群網站設計 32 The keyword should come from command line. Here we just keep the code simple.

Occurrences in Google my $kw = 'color'; my $ua = '-A "Mozilla/5.0"'; $kw =~ s/\s+/+/g; my $url = " $_ = `curl $ua -s '$url'`; /"resultsStats".*? (\S+)/ and print $1; Social Web Design 社群網站設計 33 Replace spaces to +, globally. Figure out this rule manually. Google’s advanced web techniques make this harder. Watch the substitute operator with three slashes. The global mode indicates replace all occurrences rather than the first one.

Occurrences in Google my $kw = 'color'; my $ua = '-A "Mozilla/5.0"'; $kw =~ s/\s+/+/g; my $url = " $_ = `curl $ua -s '$url'`; /"resultsStats".*? (\S+)/ and print $1; Social Web Design 社群網站設計 34 Get the results. The user agent string $ua is required to cheat Google.

Occurrences in Google my $kw = 'color'; my $ua = '-A "Mozilla/5.0"'; $kw =~ s/\s+/+/g; my $url = " $_ = `curl $ua -s '$url'`; /"resultsStats".*? (\S+)/ and print $1; Social Web Design 社群網站設計 35 Design the regex to parse the result.

Tip A demo, the backup of zoro crontab –if you want to perform some works periodically – 例行性工作排程的建立 例行性工作排程的建立 Net::Telnet –if you want to deal with BBS rather than web platforms –Re: 又出現一個 PTT 備份站了Re: 又出現一個 PTT 備份站了 Social Web Design 社群網站設計 36

Today’s assignment 今天的任務 Social Web Design 社群網站設計 37

Extract something from other sites Adopt any way mentioned today to enrich the content your site. If you have no ideas, try providing hot news from news sites or hot keywords from search sites. Use regular expression Reference –Regular-Expressions.infoRegular-Expressions.info Your web site ( ex9) will be checked not before 23:59 6/18 (Mon). You may send a report (such as some important modifications) to me in case I did not notice your features. Or, even better, just explain your modifications in the homepage. Social Web Design 社群網站設計 38