1
Data Cleanup
Methods using simple tools in advanced ways.
2
Start with what you have
Iterative changes
Use the tool that works
Have a plan to achieve the format you need
3
The Setonian Transforming ~5,000 news stories.
4
This is what we started with.
5
This is what we have to get to
6
Rows:
One story per row
All fields enclosed by quotation marks
All fields separated by commas
This is harder than you think.
7
Columns:
What we started with: id,title,text,status,date_added,date_start,author_id,school_id,category,tags,show_gallery,national,large_image_url,medium_image_url,small_image_url,score,subtitle,byline,external_id,main_homepage,homepage_thumbnail,main_category,category_thumbnail
What we have to get to: "post_type","post_title","post_content","post_status","post_date","post_author","post_category","post_tags","post_thumbnail","post_excerpt","Byline"
8
Some similarities, some differences
Similarities:
title = post_title
text = post_content
date_start = post_date
author_id = post_author
category = post_category
tags = post_tags
large_image_url = post_thumbnail
subtitle = post_excerpt
Differences:
id & external_id
date_added
school_id
show_gallery
national
medium_image_url & small_image_url ?
score
main_homepage & main_category
homepage_thumbnail & main_category_thumbnail
Byline
9
Start with some low-hanging fruit
What exactly is “&shy;”? It’s a “soft hyphen.”
Do a find-and-replace with any text editor to replace it with “” (nothing).
3,777 replacements made.
10
And more…
What exactly is “&nbsp;”? It’s a “non-breaking space.”
Do a find-and-replace with any text editor to replace it with “ ” (just a space).
28,177 replacements made.
11
And while we’re on the subject…
Do a find-and-replace with any text editor to replace a double space with a single space.
15,064 replacements made.
But try it again: 4,448 replacements. Then 1,043… Then 456… Then 76… Then 32… Then 16… Then 2… Finally… 0.
Each pass only halves a long run of spaces, so keep repeating until there is nothing left to replace.
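The same three clean-ups can be scripted. This is only a sketch in Python, not what was used on the slides (those used a text editor), and it assumes the characters may show up either as the entities `&shy;` and `&nbsp;` or as the already-decoded characters. The function name `clean_spacing` is made up for illustration. A pattern that matches two or more spaces at once also removes the need to repeat the double-space search.

```python
import re

SOFT_HYPHEN = "\u00ad"  # the character that &shy; decodes to
NBSP = "\u00a0"         # the character that &nbsp; decodes to

def clean_spacing(text: str) -> str:
    """Drop soft hyphens, turn non-breaking spaces into plain spaces,
    then collapse every run of spaces in a single pass."""
    text = text.replace("&shy;", "").replace(SOFT_HYPHEN, "")
    text = text.replace("&nbsp;", " ").replace(NBSP, " ")
    # " {2,}" matches a run of any length, so no repeated passes are needed.
    return re.sub(r" {2,}", " ", text)

sample = "Seton&shy;ian&nbsp;&nbsp;news     story"
print(clean_spacing(sample))
```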
12
So what are all these &…; things anyway?
HTML character entities
Mathematical symbols: θ written as &theta; or &#952;, ≈ written as &asymp; or &#8776;
Accented characters: ú written as &uacute; or &#250;
Special punctuation, curly quotes and apostrophes, dashes, actual ampersands: “ written as &ldquo; or &#8220;
Non-English characters: þ written as &thorn; or &#254;
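If you want to see what an entity stands for, Python's standard `html.unescape` decodes both the named and the numeric form. A small sketch; the list of pairs is just the set of examples from this slide.

```python
import html

# (named form, numeric form) for the examples on this slide
pairs = [
    ("&theta;", "&#952;"),
    ("&asymp;", "&#8776;"),
    ("&uacute;", "&#250;"),
    ("&ldquo;", "&#8220;"),
    ("&thorn;", "&#254;"),
]

decoded = []
for named, numeric in pairs:
    character = html.unescape(named)
    # Both spellings decode to the same character.
    assert character == html.unescape(numeric)
    decoded.append(character)
    print(named, numeric, character)
```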
13
While we’re cleaning house
Tabs
Maybe a search-and-replace
Maybe Word’s special expansion
Tabs are useless in HTML because they get compressed into spaces
No matter what kind, multiple whitespace characters are treated as one space.
Replace tabs with spaces … another 54,000 instances.
Then repeat our double- to single-space search: another 6,000+.
14
Now it gets more interesting
Matching something that’s not always the same.
15
What about these blocks?
Font tags are evil <font face=""Times"">She had a time of </font></div><div style=""margin: 0in 0in 0pt""> </div><div style=""margin: 0in 0in 0pt""><font face=""Times"">Freshman Eloisa Parades finished in 147th place with a time of </font></div><div style=""margin: 0in 0in 0pt""> </div><div style=""margin: 0in 0in 0pt""><font face=""Times"">Freshman Hughnique Rolle finished in 157th place with a time of </font></div><div style=""margin: 0in 0in 0pt""> </div><div style=""margin: 0in 0in 0pt""><font face=""Times"">Junior Madison Wrest finished in 161st place with a time of </font></div><div style=""margin: 0in 0in 0pt""> </div><div style=""margin: 0in 0in 0pt <font color=""#221e1f"" size=""7""><font color=""#221e1f"" size=""7""><span _fck_bookmark=""1"" style=""display: none""> </span></font></font> These don’t bring anything to the party. But every one is different! Fonts and faces, colors, sizes and more. Also – look at the <span> and <div> tags
16
Regular expressions
Essentially, pattern matching
Using a special set of meta-characters and wild cards
Found in Python, R, Perl, PHP and some common (and free) text editors, in broadly similar dialects (mostly Perl-style syntax rather than strict POSIX)
But not in Excel, sorry.
How do we find a <font…> tag when we don’t know what’s in it?
17
Basic matches and anchors
/any/ Matches “any,” “many,” “anymore”
/^any/ Matches “any,” “anymore”
/any$/ Matches “many,” “any”
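A quick way to check these for yourself, sketched in Python (the helper name `matching` is made up; `re.search` looks for the pattern anywhere in the word, which is how an editor's find box behaves):

```python
import re

words = ["any", "many", "anymore"]

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found somewhere."""
    return [word for word in candidates if re.search(pattern, word)]

print(matching(r"any", words))   # unanchored: found anywhere in the word
print(matching(r"^any", words))  # ^ pins the match to the start
print(matching(r"any$", words))  # $ pins the match to the end
```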
18
Optionals, multiples, wildcards
/an*y/ Matches “any,” “canny,” “cay,” “annnnnnny”
/an?y/ Matches “many,” “may”
/an+y/ Matches “any,” “anymore” but not “may”
/an{1,2}y/ Matches “many,” “any,” “canny” but not “day”
SPECIAL SYMBOLS:
? – zero or one instance (optional); placed after another quantifier, as in +? or *?, it makes that quantifier lazy
+ – one or more instances (greedy)
* – zero or more instances (greedy)
{0,2} – zero to two instances
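To see exactly which part of a word each quantifier grabs, here is a small Python sketch (the helper `first_match` is hypothetical). The last two lines show the greedy versus lazy difference on a tag-like string.

```python
import re

def first_match(pattern, word):
    """Return the text the pattern matched inside word, or None."""
    found = re.search(pattern, word)
    return found.group(0) if found else None

print(first_match(r"an*y", "cay"))        # zero n's is fine for *
print(first_match(r"an?y", "canny"))      # two n's is too many for ?
print(first_match(r"an+y", "may"))        # + needs at least one n
print(first_match(r"an{1,2}y", "canny"))  # one or two n's

# Greedy takes as much as it can; adding ? makes it take as little as it can.
print(first_match(r"<.+>", "<b>bold</b>"))
print(first_match(r"<.+?>", "<b>bold</b>"))
```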
19
Character sets
/\d/ Any digit
/\w/ Any word character, including digits and underscores
/\s/ Any whitespace character
/\D/ Any non-digit
/\W/ Any non-word character
/\S/ Any non-whitespace character
/\n/ New line
/[any]/ Matches either “a” or “n” or “y”
SPECIAL SYMBOLS:
. – Any character except a newline
\ – Used to escape special characters
\. – A period
\\ – A backslash
[ – Start of a character set
( – Start of a capturing group
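A few of these character sets in action, as a Python sketch. The sample strings are made up for illustration.

```python
import re

digits = re.findall(r"\d+", "Block 506 Lot 1")       # runs of digits
words = re.findall(r"\w+", "show_gallery = 1")       # letters, digits, underscores
pieces = re.split(r"\s+", "one \t two\nthree")       # split on any whitespace
non_word = re.findall(r"\W", "a-b c")                # everything \w does not match
from_set = re.findall(r"[any]", "canyon")            # a, n or y, one at a time
periods = re.findall(r"\.", "a.b.c")                 # an escaped, literal period

print(digits, words, pieces, non_word, from_set, periods)
```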
20
Using Sublime Text’s RegEx Search
21
Matching a font tag
<font color=""black"" face=""Tahoma"" size=""1""> is easy
But it only finds 34 matches
We have 680 font tags in this file
22
Build out a regular expression
Sublime Text shows you a preview as you add to it
<font -- 691 matches
<font\scolor= -- 395 matches
<font\s[a-z]+= -- 680 matches
But we can’t stop with that
<font\s[a-z]+=[a-z]+> -- nothing
<font\s[a-z]+=['"]+[a-z]+['"]+> -- 184 matches (['"] is different from [‘”])
<font\s[\w]+=['"]+[\w]+['"]+> -- 191 matches (\w expands the match to include digits and “_”)
23
Pressing onwards
There are multiple attributes to find
<font(\s[\w]+=['"]+[\w]+['"]+)+> -- 311 matches
What are we missing?
<font color=""#0066cc"" style=""list-style-type: none; list-style-position: initial; list-style-image: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "">
Pound signs, colons, semicolons and frankly who knows what else.
24
Sorry, this was just to show examples
There’s a far simpler way: the “negation” pattern.
[^"] -- Anything BUT a double quote
[^89] -- Anything BUT an 8 or 9
What we’ve been looking for is
<font[^>]+> -- But it only finds 680!?!
<font> -- 11 matches
<font[^>]*> -- Finds all 691
25
Replace them with “” (nothing)
And then, get rid of </font> Just check – yup, 691 of them.
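Here is the negation pattern as a Python sketch rather than an editor search. The function name `strip_font_tags` and the sample string are made up; `subn` returns the replacement count along with the cleaned text, which gives the same sanity check as comparing the two match counts.

```python
import re

FONT_OPEN = re.compile(r"<font[^>]*>")  # "<font", anything but ">", then ">"
FONT_CLOSE = re.compile(r"</font>")

def strip_font_tags(markup):
    """Remove opening and closing font tags; return text plus both counts."""
    without_open, opened = FONT_OPEN.subn("", markup)
    cleaned, closed = FONT_CLOSE.subn("", without_open)
    return cleaned, opened, closed

sample = (
    '<font face=""Times"">She had a time</font> '
    "<font>of</font> "
    '<font color=""#0066cc"" style=""margin-top: 0px; "">note</font>'
)
cleaned, opened, closed = strip_font_tags(sample)
print(cleaned, opened, closed)
```

If the two counts differ, some tag is unbalanced and worth a look before moving on.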
26
Get rid of other junk tags
But it’s not so easy
<span> isn’t necessarily a junk tag
<span -- 31,263 instances
<span class="something"> -- 4,470. Maybe keep those
<span[^>]+(class=['"]+[^'"]+['"]+)[^>]+> -- Replace with <span $1>?
<span id="something"> -- Fortunately doesn’t occur.
</span> -- Remove in some cases
27
Get rid of the junk attributes
Inline styles \sstyle=['"]+[^'"]+['"]+ -- 9,392 matches
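The same pattern shape works for any attribute you decide is junk, so it can be wrapped in a small helper. A Python sketch only: the function names are hypothetical, and the sample assumes the doubled quotation marks seen in this export.

```python
import re

def attribute_pattern(name):
    """Whitespace, the attribute name, '=', then a quoted value,
    following the shape of the pattern on this slide."""
    return re.compile(r"\s" + re.escape(name) + r"=['\"]+[^'\"]+['\"]+")

def strip_attributes(markup, names):
    """Remove every listed attribute, one pattern at a time."""
    for name in names:
        markup = attribute_pattern(name).sub("", markup)
    return markup

sample = (
    '<div style=""margin: 0in 0in 0pt"">'
    '<span data-scayt_word=""Parades"" data-scaytid=""3"">Parades</span></div>'
)
result = strip_attributes(sample, ["style", "data-scayt_word", "data-scaytid"])
print(result)
```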
28
Get rid of the junk attributes
Proprietary tags \sdata-scayt_word=['"]+[^'"]+['"] ,643 matches
29
And so on data-scaytid – 23,646 matches …
Come to think of it, let’s just get rid of all the <span> tags.
But still, get rid of whatever other junk you find
<o:p></o:p> -- 1,036 of these
<p>\s+</p> (empty paragraphs) -- 4,537 of these
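These last removals can be chained in one pass over the text. A Python sketch with a made-up function name and sample; the empty-paragraph step runs last on purpose, because removing the other junk can leave new empty paragraphs behind.

```python
import re

def drop_junk(markup):
    """Remove span tags, empty Word <o:p> pairs and whitespace-only paragraphs."""
    markup = re.sub(r"</?span\b[^>]*>", "", markup)  # opening and closing spans
    markup = markup.replace("<o:p></o:p>", "")
    markup = re.sub(r"<p>\s+</p>", "", markup)       # run last, see note above
    return markup

sample = '<p><span class=""x"">Kept text</span><o:p></o:p></p><p>  </p>'
print(drop_junk(sample))
```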
30
But just when you thought it was safe…
31
Dealing with line breaks
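The “\n” character is special
It matches the line break at the end of a line (the line feed), not a visible character
It is not the same as “\r” (the carriage return), though Windows-style files end each line with both: “\r\n”
You can search for it.
Every real row starts with a numeric ID, so a line break followed by anything other than a digit sits inside a story.
\n([^\d]) replaced with $1 yields 12,989 matches
But we want to run it multiple times… 10,000 more matches.
We’ve gone from 16.8 MB to 14.4 MB!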
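Why the search has to be repeated: when two line breaks sit side by side, the first match swallows both and puts one back, so a second pass is needed to remove it. A Python sketch of the loop (Python writes the replacement as `\1` rather than `$1`; the function name and the two-row sample are made up):

```python
import re

BROKEN_LINE = re.compile(r"\n([^\d])")  # a line break NOT followed by a digit

def join_broken_rows(text):
    """Repeat the replacement until it finds nothing; return text and pass count."""
    passes = 0
    while True:
        text, count = BROKEN_LINE.subn(r"\1", text)
        if count == 0:
            return text, passes
        passes += 1

sample = '101,"First story \n\ncontinues here"\n102,"Second story"\n'
joined, passes = join_broken_rows(sample)
print(repr(joined), passes)
```

One caveat of this approach: a continuation line that happens to start with a digit is left alone, so it still needs a manual check.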
32
Are we there yet? Let’s try opening it in Excel to see where we are.
34
How did we do? Sort the spreadsheet by the last column…
35
Only one wrong one – that’s not so bad.
36
Fortunately it’s just the one
We can delete all the junk Microsoft Word code Re-save it and try again If there were more, it would be easy enough to write regular expressions to track them down.
37
But it still didn’t work. Corrupted data
38
Here’s the line … prevention metho",,,,,,,,,,,,,,,,,,,,s," """"The""",1,9/17/2009 0:00 … Who knows where those commas came from. The most efficient way is to just edit it manually.
39
Different formats
The data export put the “tags” into curly braces:
"{tag1},{tag2}"
We need to get them out somehow.
Let’s make an assumption:
Replace ,"{ with ,"  -- 676 matches
Replace },{ with ,  -- 1,060 matches
Replace }", with ",  -- 676 matches!
That will work!
Opening into Excel, a quick Data::Sort shows good results.
40
Some other easy things The story ID won’t exist
Change a number at the beginning of a line to the word “post”: ^\d+
Change the column headers
Change the status from “1” to “publish”
Get rid of the doubled quotation marks now.
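The curly-brace clean-up and the ID swap are both plain replacements, so they script easily. A Python sketch; the function names, IDs and titles in the sample are invented for illustration. `re.MULTILINE` is what lets `^` match at the start of every line instead of only the start of the file.

```python
import re

def unwrap_tags(text):
    """Apply the three brace replacements in the order given on the slide."""
    return text.replace(',"{', ',"').replace("},{", ",").replace('}",', '",')

def relabel_ids(text):
    """Swap the numeric ID at the start of each line for the word 'post'."""
    return re.sub(r"^\d+", "post", text, flags=re.MULTILINE)

sample = (
    '4821,"Title one","{tag1},{tag2}","1"\n'
    '4822,"Title two","{tag3}","1"\n'
)
cleaned = relabel_ids(unwrap_tags(sample))
print(cleaned)
```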
41
So far so good. Now we can remove the columns we don’t need.
42
And clear out the first column too.
43
Some other harder things
Consolidate the thumbnail columns
Large – Medium – Small
You want the first one, the largest one, that has a value.
Excel is kind of great for this
Create a new column
Paste in =IF(ISNA(HLOOKUP("*",I3:K3,1,FALSE)),"",HLOOKUP("*",I3:K3,1,FALSE))
Drag it down to the bottom…
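For comparison, the same "first one that has a value" rule outside Excel. A Python sketch; the function name and the example URLs are placeholders, not real data from the export.

```python
def first_available(*urls):
    """Return the first non-blank value, checking in the order given."""
    for url in urls:
        if url and url.strip():
            return url
    return ""

# Pass the columns largest first: large, medium, small.
print(first_available("", "medium.jpg", "small.jpg"))
```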
44
Be sure to capture the values
If you try to delete the three columns you’ll get a reference error
Make a new column
Copy the previous column
Use the Paste::Values function
Then delete the small, medium and large columns, plus your formula.
There are a few more steps you could take, but not now…
45
Where Excel Falls Down
Would you be surprised to learn that Excel doesn’t quote every field when it saves a CSV file? It only adds quotation marks where a field needs them.
It gives you this:
post,HRL alters housing fee after University reactions,"<p> The Department of Housing and Residence Life …
When what you need is this:
"post","Poetry-in-the-Round features literary legend","<p>As a special presentation by Seton Hall …
OpenOffice Calc is the tool for this job. Open your saved file in it, and export it from there as a CSV.
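If a script is an option, Python's standard `csv` module can also force quotation marks around every field with `csv.QUOTE_ALL`. This is an alternative to the Calc export, not the method used here; the function name and the placeholder body text are made up.

```python
import csv
import io

def write_fully_quoted(rows):
    """Serialize rows as CSV with every field wrapped in quotation marks."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

rows = [
    ["post_type", "post_title", "post_content"],
    ["post", "HRL alters housing fee after University reactions",
     '<p>Placeholder body with a "quote", and a comma</p>'],
]
output = write_fully_quoted(rows)
print(output)
```

Quotation marks inside a field come out doubled, which is the convention the export already uses.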
46
Montrose Park Historic District
Turning a book into SQL queries
47
Starting point
48
What to capture?
Street as the category
Address as the “title”
Block #
Lot #
Outbuildings – numbers and types
Descriptions
Styles?
Images?
49
Some basic transformations
Preparing for automatic reading For this step I used to like the Windows Store app “Code Writer” It handled whitespaces, especially new lines, much better But it’s pretty unstable – save your work often! Newer versions of Sublime Text work well.
50
\s*Key\s*\n\s*Outbuildings:\s*(.*) -> \n$1\n
\s*Non-Contributing\s*\n\s*Outbuildings:\s*(.*) -> \n\n$1
\s*Contributing\s*\n\s*Outbuildings:\s*(.*) -> \n$1\n\n
\s*Block\s(\d*)\s*Lot\s(.*) -> \n$1\n$2
These change this:
470 Berkeley Avenue
Block 506 Lot 1
Key
Outbuildings: 1 stylistically similar detached carriage house (C)
Into this:
470 Berkeley Avenue
506
1
1 stylistically similar detached carriage house (C)
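The four rules can be applied in sequence from a script. A Python sketch, assuming the entry is laid out with “Key” on its own line above “Outbuildings:” (the patterns require that line break). Python writes `$1` as `\1`. The order matters: “Non-Contributing” has to run before “Contributing”, since the shorter word is contained in the longer one.

```python
import re

# (pattern, replacement) pairs in the order they appear on the slide
RULES = [
    (r"\s*Key\s*\n\s*Outbuildings:\s*(.*)", r"\n\1\n"),
    (r"\s*Non-Contributing\s*\n\s*Outbuildings:\s*(.*)", r"\n\n\1"),
    (r"\s*Contributing\s*\n\s*Outbuildings:\s*(.*)", r"\n\1\n\n"),
    (r"\s*Block\s(\d*)\s*Lot\s(.*)", r"\n\1\n\2"),
]

def apply_rules(text):
    """Run every rule over the text, one after another."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

entry = (
    "470 Berkeley Avenue\n"
    "Block 506 Lot 1\n"
    "Key\n"
    "Outbuildings: 1 stylistically similar detached carriage house (C)\n"
)
result = apply_rules(entry)
print(result)
```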
51
(\d*\s([A-Za-z]*\s[A-Za-z]*)\sis.*) => \n$1\n_ (watch that _!) Transforms:
470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced, residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring stone keystones and sills. The projecting enclosed portico features a segmentally arched brick surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and Berkeley Avenues, in an estate setting. To: Berkeley Avenue post
52
\n -> "\n"
Adds quotation marks to beginning and end of each line
\n -> ,
This takes out all the line feeds, and makes a confusing mess
,\"_\",\"\", -> ,""\n
Makes sense of it again
\"post\",\"\",\"\" -> "post",""
Clears an empty field at the end of the line
53
A useful comma-quote delimited entry
"","BERKELEY AVENUE","","470 Berkeley Avenue","506","1","1 stylistically similar detached carriage house (C)","","470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced, residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring stone keystones and sills. The projecting enclosed portico features a segmentally arched brick surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and Berkeley Avenues, in an estate setting. ","Berkeley Avenue","post",""
54
You could do something like this:
\"(\d*)\",\"Block\s(\d*)\nLot\s(\d*)\n([^:]*):\s*(.*) -> a ready-to-run MySQL query ->
INSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'Block','$2');\nINSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'Lot','$3');\nINSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'$4','$5');\n
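The same idea as a Python sketch, for anyone who would rather generate the statements from a script. The post ID 42 and the function names are made up. One thing a script can add that a plain find-and-replace cannot: doubling any apostrophe inside a value so it does not break the SQL string.

```python
import re

ENTRY = re.compile(r'"(\d*)","Block\s(\d*)\nLot\s(\d*)\n([^:]*):\s*(.*)')

def sql_quote(value):
    """Wrap a value in single quotes, doubling any apostrophes inside it."""
    return "'" + value.replace("'", "''") + "'"

def to_inserts(entry):
    """Turn one entry into three INSERT statements, or none if it does not fit."""
    found = ENTRY.match(entry)
    if found is None or not found.group(1):
        return []
    post_id, block, lot, key, value = found.groups()
    pairs = [("Block", block), ("Lot", lot), (key, value)]
    return [
        "INSERT INTO wp_postmeta (post_id,meta_key,meta_value) "
        f"VALUES ({int(post_id)},{sql_quote(k)},{sql_quote(v)});"
        for k, v in pairs
    ]

entry = (
    '"42","Block 506\nLot 1\n'
    "Outbuildings: 1 stylistically similar detached carriage house (C)"
)
statements = to_inserts(entry)
print("\n".join(statements))
```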
55
Extracting GeoCoordinates for Mapping
address Town query latitude longitude 264 Walton Ave. South Orange, NJ <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:21: ' attribution='Data © OpenStreetMap contributors, ODbL querystring='264 Walton Ave.,South Orange, NJ' polygon='false' exclude_place_ids=' ' more_url=' <place place_id=' ' place_rank='30' boundingbox=" , , , " lat=' ' lon=' ' display_name='264, Walton Avenue, Academy Heights, South Orange, Essex County, New Jersey, 07079, United States of America' class='place' type='house' importance='0.401'/></searchresults> 400 South Orange Ave. <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:09: ' attribution='Data © OpenStreetMap contributors, ODbL querystring='400 South Orange Ave.,South Orange, NJ' polygon='false' exclude_place_ids=' , , , ' more_url=' <place place_id=' ' osm_type='way' osm_id=' ' place_rank='26' boundingbox=" , , , " lat=' ' lon=' ' display_name='South Orange Avenue, Academy Heights, South Orange, Essex County, New Jersey, 07079, United States of America' class='highway' type='primary' importance='0.6'/></searchresults> 191 Parker Ave. Maplewood, NJ <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:09: ' attribution='Data © OpenStreetMap contributors, ODbL querystring='191 Parker Ave.,Maplewood, NJ' polygon='false' exclude_place_ids=' ' more_url=' <place place_id=' ' place_rank='30' boundingbox=" , , , " lat=' ' lon=' ' display_name='191, Parker Avenue, Maplewood, Essex County, New Jersey, 07040, United States of America' class='place' type='house' importance='0.311'/></searchresults> 6016 Morrow Dr. 
Brook Park, OH <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:09: ' attribution='Data © OpenStreetMap contributors, ODbL querystring='6016 Morrow Dr.,Brook Park, OH' polygon='false' exclude_place_ids=' ' more_url=' <place place_id=' ' place_rank='30' boundingbox=" , , , " lat=' ' lon=' ' display_name='6016, Morrow Drive, Brook Park, Cuyahoga County, Ohio, 44142, United States of America' class='place' type='house' importance='0.501'/></searchresults> Carrer del Duc, 4 2-1 Barcelona, Catalonia 08002 <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:12: ' attribution='Data © OpenStreetMap contributors, ODbL querystring='Carrer del Duc, 4 2-1,Barcelona, Catalonia 08002' polygon='false' exclude_place_ids=' ' more_url=' <place place_id=' ' osm_type='way' osm_id=' ' place_rank='26' boundingbox=" , , , " lat=' ' lon=' ' display_name='Carrer del Duc, el Gòtic, Ciutat Vella, Barcelona, BCN, CAT, 08002, España' class='highway' type='pedestrian' importance='0.51'/></searchresults> 2 Lafayette Street Fairhaven, MA 02719 <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:12: ' attribution='Data © OpenStreetMap contributors, ODbL querystring='2 Lafayette Street,Fairhaven, MA 02719' polygon='false' exclude_place_ids=' ' more_url=' <place place_id=' ' place_rank='30' boundingbox=" , , , " lat=' ' lon=' ' display_name='2, Lafayette Street, Fairhaven, Bristol County, Massachusetts, 02719, United States of America' class='place' type='house' importance='0.411'/></searchresults> Using the Excel WEBSERVICE and FILTERXML functions Column C: =WEBSERVICE(CONCATENATE(" B2)) Column D: Column E:
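Pulling the latitude and longitude out of a response like these takes only a few lines with Python's standard XML parser. A sketch under stated assumptions: the response is cut down to the parts that matter, and the coordinate values 40.0 and -74.0 are placeholders, not real results (the actual values did not survive in the table above). The function name is hypothetical, and no web request is made here.

```python
import xml.etree.ElementTree as ET

# Trimmed-down response in the same shape as the ones above.
# The lat/lon values are placeholders for illustration only.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<searchresults querystring='264 Walton Ave.,South Orange, NJ'>
  <place lat='40.0' lon='-74.0' class='place' type='house'/>
</searchresults>"""

def first_coordinates(xml_text):
    """Return (latitude, longitude) of the first <place>, or None if there is none."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    place = root.find("place")
    if place is None:
        return None
    return float(place.get("lat")), float(place.get("lon"))

print(first_coordinates(SAMPLE))
```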
56
For further study…
Matching addresses to images
Folder: GlensideDr
Files: Dykeman19Glenside.jpg Fenrich23Glenside.jpg Finlay8Glenside.jpg
[A-Za-z]+(\d+)([^.]+)\.jpg
Automatically generate an “image” field?
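One way the file names could be split apart, sketched in Python. The prefix is matched with letters only: a greedy `\w+` in front of `(\d+)` also matches digits, so it would swallow all but the last digit of the house number. Reading the three parts as a name, a house number and a street is an assumption based on the examples, and the function name is made up.

```python
import re

FILENAME = re.compile(r"([A-Za-z]+)(\d+)([^.]+)\.jpg")

def split_filename(filename):
    """Return (name, house number, street) or None if the name does not fit."""
    found = FILENAME.fullmatch(filename)
    return found.groups() if found else None

files = ["Dykeman19Glenside.jpg", "Fenrich23Glenside.jpg", "Finlay8Glenside.jpg"]
for filename in files:
    print(filename, split_filename(filename))

# The greedy \w+ version keeps only the final digit of the number:
greedy = re.match(r"\w+(\d+)([^.]+)\.jpg", "Dykeman19Glenside.jpg").group(1)
print(greedy)
```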
57
Questions?