CSCE 590 Web Scraping Lecture 5
Topics: Regular Expressions II (sub, findall, finditer, replace). Readings: Python tutorial, section 6.2. January 24, 2017
Overview
Last Time (Lecture 3, slides 1-27): Reading and writing files; CSV files; regular expressions: search and match.
Today: Regular expressions again. References: course webpage.
Regular Expressions again
Algebraic expressions such as 3*x + y represent numeric values. Regular expressions such as a(b|c)*a: what do regular expressions represent?
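A regular expression denotes a set of strings (a language). A quick sketch with Python's re module (the test strings are chosen for illustration) shows which strings a(b|c)*a stands for:

```python
import re

# a(b|c)*a: an 'a', then any mix of 'b's and 'c's, then a closing 'a'.
# The expression denotes the set of all strings of that shape.
pattern = re.compile(r"a(b|c)*a")

for s in ["aa", "aba", "abcba", "abc", "axa"]:
    print(s, bool(pattern.fullmatch(s)))
# aa True, aba True, abcba True, abc False, axa False
```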
Unicode UTF - Unicode Transformation Format
UTF-32: four bytes for each character (room for about 4 billion characters).
UTF-16: two bytes per character (65,536 = 2^16 characters).
UTF-8: a variable-length encoding system for Unicode. That is, different characters take up a different number of bytes. For ASCII characters (A-Z, &c.) UTF-8 uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0-127) in UTF-8 are indistinguishable from ASCII.
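A small sketch (the sample characters are chosen for illustration) showing how byte counts vary under UTF-8 but stay fixed under UTF-32:

```python
# Different characters occupy different numbers of bytes in UTF-8,
# while UTF-32 always uses four bytes per character.
for ch in ["A", "é", "€", "𝄞"]:
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-32-be")))
# A 1 4 / é 2 4 / € 3 4 / 𝄞 4 4
```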
Unicode (default) vs 8-bit strings
Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa. Similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string. As the str and bytes types cannot be mixed, you must always explicitly convert between them: str.encode() to convert str to bytes, and bytes.decode() to convert bytes to str.
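A minimal sketch of the conversions and of the mixing error (the string 'café' is just an example):

```python
import re

s = "café"
b = s.encode("utf-8")          # str -> bytes
print(b.decode("utf-8") == s)  # True: bytes -> str round-trips

# A str pattern cannot search a bytes string (and vice versa):
try:
    re.search("caf", b)
except TypeError as err:
    print("TypeError:", err)

# Match types consistently instead:
print(re.search(b"caf", b).group())  # bytes pattern on bytes: b'caf'
print(re.search("caf", s).group())   # str pattern on str: caf
```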
Match objects: match.start, match.end, match.group, match.groups, match.expand
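A short illustration of these match-object methods (the "Isaac Newton" example is modeled on the Python docs):

```python
import re

m = re.match(r"(\w+) (\w+)", "Isaac Newton")
print(m.start(), m.end())      # 0 12: span of the whole match
print(m.group())               # group 0 is the whole match: Isaac Newton
print(m.group(1), m.group(2))  # one group at a time: Isaac Newton
print(m.groups())              # all groups: ('Isaac', 'Newton')

# expand() fills backreferences in a template with the matched groups:
print(m.expand(r"\2, \1"))     # Newton, Isaac
```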
Named groups - (?P<name>...) syntax
With the (?P<name>...) syntax, the groupN arguments may also be strings identifying groups by their group name. A moderately complicated example:
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'
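A few related named-group features worth a sketch: groupdict(), the (?P=name) in-pattern backreference, and \g<name> in replacement strings:

```python
import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print(m.groupdict())  # {'first_name': 'Malcolm', 'last_name': 'Reynolds'}

# (?P=name) reuses a named group inside the pattern itself:
print(bool(re.fullmatch(r"(?P<ch>\w)(?P=ch)", "aa")))  # True

# \g<name> refers to a named group in a replacement string:
print(re.sub(r"(?P<first>\w+) (?P<last>\w+)",
             r"\g<last>, \g<first>", "Malcolm Reynolds"))
# Reynolds, Malcolm
```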
re.sub(pat, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
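A short sketch of both replacement styles, a backreference string and a function (the email strings are made up for illustration):

```python
import re

# A string replacement with a backreference: \1 reinserts group 1.
# (The addresses here are made-up examples.)
print(re.sub(r"(\w+)@example\.com", r"\1@example.org",
             "write to alice@example.com"))
# write to alice@example.org

# A function replacement: called once per match with the match object.
print(re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 7 pears"))
# 6 apples, 14 pears
```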
re.sub(pattern, replacement, text)
sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to "munge" text, or randomize the order of all the characters in each word of a sentence except for the first and last characters. Because the shuffle is random, the output differs between runs (two sample runs shown):
>>> import random
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
Pythex.org: an online tool for interactively testing Python regular expressions.
Displaymatch helper function
def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())
Displaymatch examples
>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("akt"))    # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"
>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))   # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))   # No pairs.
>>> displaymatch(pair.match("354aa"))   # Pair of aces.
"<Match: '354aa', groups=('a',)>"
re.findall
findall() matches all occurrences of a pattern, not just the first one as search() does.
Find the Adverbs example:
>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']
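One caveat worth a sketch: when the pattern contains capturing groups, findall() returns the groups rather than the whole match:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# One capturing group: findall returns the group text, not the whole match:
print(re.findall(r"(\w+)ly", text))    # ['careful', 'quick']

# Two or more groups: findall returns tuples of the groups:
print(re.findall(r"(\w+)(ly)", text))  # [('careful', 'ly'), ('quick', 'ly')]
```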
re.finditer
>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
Raw strings (again)
r"string" helps keep regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. Both spellings below match; the raw-string versions are easier to read:
>>> re.match(r"\W(.)\1\W", " ff ")
<_sre.SRE_Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<_sre.SRE_Match object; span=(0, 4), match=' ff '>
>>> re.match(r"\\", r"\\")
<_sre.SRE_Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<_sre.SRE_Match object; span=(0, 1), match='\\'>
Phonebook example
>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""
Using split to remove extra lines
>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
 'Frank Burger: 925.541.7625 662 South Dogwood Way',
 'Heather Albrecht: 548.326.4584 919 Park Place']
maxsplit parameter of split()
>>> [re.split(":? ", entry, 3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
 ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
 ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
 ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
Now maxsplit = 4
>>> [re.split(":? ", entry, 4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
 ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
 ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
 ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
Using regular expressions to process files
from urllib.request import urlopen

# Retrieve HTML string from the URL.
# The URL was cut off on the original slide; a later slide uses
# https://cse.sc.edu/~matthews/ and that is assumed here.
html = urlopen("https://cse.sc.edu/~matthews/")
print(html.read())

# Second version: iterate over the response line by line
html = urlopen("https://cse.sc.edu/~matthews/")
for line in html:
    line = line.strip()
    print(line)
from urllib.request import urlopen

# Retrieve HTML string from the URL.
# URL cut off on the original slide; assumed from a later slide.
html = urlopen("https://cse.sc.edu/~matthews/")
outf = open("outfile", "w")
for line in html:
    print(line)
    outf.write(line.decode("utf-8"))
outf.close()
html = urlopen("https://cse.sc.edu/~matthews/")
import re
from urllib.request import urlopen

html = urlopen("https://cse.sc.edu/~matthews/")
outf = open("outfile", "w")
regexp = re.compile(r'<a\s+href')
count = 0
for line in html:
    # print(line.decode("utf-8").strip())
    if regexp.search(line.decode("utf-8")) is not None:
        print(line.decode("utf-8").strip())
        # outf.write(line.decode("utf-8").strip())
        count = count + 1
print(count)
Findall on webpage
from urllib.request import urlopen
import re

# Retrieve HTML string from the URL
# (URL cut off on the original slide; assumed to be the same page)
html = urlopen("https://cse.sc.edu/~matthews/")
page = html.read()
matches = re.findall('href=".*"', page.decode())
print(matches)
Browser to Web Server: the HTTP GET request is wrapped in successive protocol layers:
HTTP(GET "…")
TCP[ HTTP(GET "…") ]
IP[ TCP[ HTTP(GET "…") ] ]
Ethernet[ IP[ TCP[ HTTP(GET "…") ] ] ]
… several Ethernet transfers carry the frames to the server.
#Chapter 2 - 1-selectByClass
from urllib.request import urlopen
from bs4 import BeautifulSoup

# URL cut off on the original slide; Chapter 2 of the textbook uses
# http://www.pythonscraping.com/pages/warandpeace.html
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())
#Chapter 2 - selectByAttribute
from urllib.request import urlopen
from bs4 import BeautifulSoup

# URL cut off on the original slide; assumed to be the textbook's
# Chapter 2 page, http://www.pythonscraping.com/pages/warandpeace.html
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
allText = bsObj.findAll(id="text")
print(allText[0].get_text())
BeautifulSoup Findall revisited
find_all(name, attrs, recursive, string, limit, **kwargs) The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
BeautifulSoup Findall name argument
find_all(name, attrs, recursive, string, limit, **kwargs)
Pass in a value for name and you'll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don't match.
soup.find_all("title")
# [<title>The Dormouse's story</title>]
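Besides a plain string, name can also be a list (or a regular expression, a function, or True). A minimal sketch using an inline document (the HTML below is a trimmed stand-in for the docs' "three sisters" page):

```python
from bs4 import BeautifulSoup

# Trimmed inline document, made up to stand in for the docs' example page
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# A string matches tags with exactly that name:
print(soup.find_all("title"))
# [<title>The Dormouse's story</title>]

# A list matches any tag whose name appears in the list:
print([t.name for t in soup.find_all(["title", "b"])])
# ['title', 'b']
```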
BeautifulSoup Findall attrs argument
find_all(name, attrs, recursive, string, limit, **kwargs)
Any argument that's not recognized will be turned into a filter on one of a tag's attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id' attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
If you pass in a value for href, Beautiful Soup will filter against each tag's 'href' attribute:
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Keyword Attribute value=True
You can filter an attribute based on a string, a regular expression, a list, a function, or the value True. This code finds all tags whose id attribute has a value, regardless of what the value is:
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Multiple Attributes to BS4-findall
You can filter multiple attributes at once by passing in more than one keyword argument:
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]
Some attributes, like the data-* attributes in HTML 5, have names that can't be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a dictionary and passing the dictionary as the attrs argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
Searching by CSS class
The name of the CSS attribute, "class", is a reserved word in Python. Search by CSS class using the keyword argument class_:
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
BS4 Findall The string argument
With string you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:
soup.find_all(string="Elsie")
# [u'Elsie']
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
BS4 Findall the limit argument
find_all() returns all the tags and strings that match your filters. But you can pass in a number for limit. This works just like the LIMIT keyword in SQL. There are three links in the "three sisters" document, but this code only finds the first two:
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
BS4 Findall the recursive argument
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False. See the difference here:
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []
The <title> tag sits below <head>, not directly below <html>:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...
BS4-Calling a tag is like calling find_all()
find_all() is the most popular method in the Beautiful Soup search API, so there is a shortcut: if you treat the BeautifulSoup object or a Tag object as though it were a function, it's the same as calling find_all() on that object. These two lines of code are equivalent:
soup.find_all("a")
soup("a")
These two lines are also equivalent:
soup.title.find_all(string=True)
soup.title(string=True)
Find revisited
find(name, attrs, recursive, string, **kwargs)
The find_all() method scans the entire document, but what if you know there is just one occurrence? find() returns the first match directly:
soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>
print(soup.find("nosuchtag"))
# None
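A minimal sketch of the difference in return types (the inline document is made up for illustration): find_all() yields an empty list when nothing matches, while find() yields None:

```python
from bs4 import BeautifulSoup

# Tiny inline document, made up for illustration
soup = BeautifulSoup("<html><title>Hi</title></html>", "html.parser")

print(soup.find_all("nosuchtag"))  # []   (empty list, safe to iterate)
print(soup.find("nosuchtag"))      # None (check before using the result)
print(soup.find("title"))          # <title>Hi</title>
```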