PolyAnalyst Web Report Training

1 PolyAnalyst Web Report Training
Manipulating Text Data in PolyAnalyst: Text Extraction and Regular Expressions
Megaputer Intelligence
© 2014 Megaputer Intelligence Inc.

2 Agenda
Extract Terms node
Basics of Regular Expression
Example of Regex with PolyAnalyst

4 Extract Terms Node Extract text segments from a column using Regular Expressions

6 Extract Terms Node Select Text or String Columns

7 Extract Terms Node Add a new rule

8 Extract Terms Node
Simplest Regex Rule
Case Insensitive

9 Extract Terms Node

10 Agenda
Extract Terms node
Basics of Regular Expression
Example of Regex with PolyAnalyst

11 Basics of Regular Expression
The simplest regex is a literal string of characters: the Simplest Regex Rule.
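As a rough illustration outside PolyAnalyst, the same idea can be tried with Python's standard `re` module. The sample text and the word "parking" are assumptions chosen for the sketch, and `re.IGNORECASE` stands in for a case-insensitive rule.

```python
import re

# Hypothetical sample text; the word "parking" is only an illustration.
text = "Vehicle was last seen near the Parking structure."

# A plain string is already a valid regex: it matches itself, character by character.
# re.IGNORECASE makes the comparison ignore upper/lower case.
pattern = re.compile(r"parking", re.IGNORECASE)
match = pattern.search(text)

# Without the flag the same pattern finds nothing, because the text has a capital P.
case_sensitive_match = re.search(r"parking", text)

print(match.group(0))
print(case_sensitive_match)
```

The match keeps the casing found in the text, not the casing of the pattern.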

12 Basics of Regular Expression
If we expand it to:
Then it fails!

13 Basics of Regular Expression
\s represents a whitespace character, such as a space
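A minimal sketch of `\s` in Python's `re` module follows. The phrase "parking lot" is borrowed from the PDL slide as an assumed example; the original slide's pattern is not preserved in this transcript. In Python, `\s` matches a space, a tab, or a newline.

```python
import re

# Hypothetical text: one ordinary space and one tab between the two words.
text = "He left the car in the parking lot.\nThe parking\tlot was empty."

# \s matches exactly one whitespace character between "parking" and "lot".
pattern = re.compile(r"parking\slot", re.IGNORECASE)
matches = pattern.findall(text)

print(matches)
```

Both occurrences are found, because the space and the tab each count as one whitespace character.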

14 Basics of Regular Expression
PDL: Phrase(parking, lot)

15 Basics of Regular Expression
Vertical bar | represents "or"
Parentheses () represent grouping
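The interaction between `|` and `()` is easiest to see side by side. The sketch below uses Python's `re` module with made-up patterns: without parentheses, the alternation splits the whole pattern in two.

```python
import re

# Grouped: "parking" followed by whitespace, then either "lot" or "garage".
grouped = re.compile(r"parking\s(lot|garage)")

# Ungrouped: either the whole of "parking<space>lot", or just "garage".
ungrouped = re.compile(r"parking\slot|garage")

grouped_hit = grouped.search("parking garage")
grouped_alone = grouped.search("garage")      # no "parking" in front, so no match
ungrouped_alone = ungrouped.search("garage")  # matches the second alternative

print(grouped_hit.group(0))
print(grouped_alone)
print(ungrouped_alone.group(0))
```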

16 Basics of Regular Expression
\d matches any digit (0 to 9)
Plus sign + denotes one or more matches
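To show how `+` changes what `\d` picks up, here is a short Python sketch on an invented sentence: `\d` alone returns single digits, while `\d+` returns whole runs of digits.

```python
import re

# Hypothetical sentence containing several numbers.
text = "Suspect is 27 years old, 180 cm tall, badge 7."

# \d+ : one or more digits in a row, so each number comes back whole.
numbers = re.findall(r"\d+", text)

# \d : exactly one digit per match, so "180" is split into three matches.
single_digits = re.findall(r"\d", "180")

print(numbers)
print(single_digits)
```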

17 Basics of Regular Expression

18 Basics of Regular Expression
Question mark ? denotes zero or one match
Asterisk * denotes zero or more matches
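A small Python sketch of the two quantifiers, using assumed example words. `fullmatch` is used so that the whole word has to fit the pattern.

```python
import re

# ? : the preceding "u" may appear zero times or once.
optional_u = re.compile(r"colou?r")
color_results = {w: bool(optional_u.fullmatch(w)) for w in ["color", "colour", "colouur"]}

# * : the preceding "o" may appear any number of times, including none.
any_o = re.compile(r"go*gle")
google_results = {w: bool(any_o.fullmatch(w)) for w in ["ggle", "gogle", "google", "gooogle"]}

print(color_results)
print(google_results)
```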

19 Basics of Regular Expression

20 Other Useful Syntax
Dot . matches any character except newline
Caret ^ denotes the beginning of the string
Dollar sign $ denotes the end of the string
Curly brackets {} denote a number of repetitions: {n} means exactly n, {m,n} means between m and n. For example:
w{3} matches www
p{1,5} matches the run of p's in happy or happpppy
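The syntax on this slide can be checked with a few one-liners. This is a sketch in Python's `re` module with invented strings; PolyAnalyst's own regex engine may differ in details.

```python
import re

# Dot: any single character except a newline, so "c\nt" is skipped.
dot_hits = re.findall(r"c.t", "cat cot c\nt cut")

# Caret and dollar: anchor the pattern to the start or end of the string.
starts_with_report = bool(re.search(r"^Report", "Report filed"))
not_at_start = bool(re.search(r"^Report", "The Report"))
ends_with_filed = bool(re.search(r"filed$", "Report filed"))

# Curly brackets: a repetition count.
three_w = bool(re.fullmatch(r"w{3}", "www"))
short_run = re.search(r"p{1,5}", "happy").group(0)
long_run = re.search(r"p{1,5}", "happpppy").group(0)

print(dot_hits, starts_with_report, not_at_start, ends_with_filed)
print(three_w, short_run, long_run)
```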

21 Metacharacters
Some characters are reserved for use in regex notation. The metacharacters are: { } [ ] ( ) ^ $ . | * + ? \
To match one of them literally, put a backslash in front of it. For example: \$\d+\.\d+ matches $19.99
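The price pattern from the slide can be tried directly in Python. The sample sentence is an assumption. `re.escape` is a standard-library helper that adds the backslashes for you when the text to match is known in advance.

```python
import re

# \$ and \. are escaped so they mean a literal dollar sign and a literal dot.
price = re.compile(r"\$\d+\.\d+")

# Hypothetical sentence; the last amount has no dollar sign and is not matched.
text = "Lunch was $19.99 and coffee was $3.50, not 19.99 dollars."
prices = price.findall(text)

# re.escape builds a pattern that matches the given string literally.
literal = re.compile(re.escape("$19.99"))
literal_hit = literal.fullmatch("$19.99")

print(prices)
```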

22 More?
PolyAnalyst Help Manual
Online Resources
Test and see the highlights

23 Agenda
Extract Terms node
Basics of Regular Expression
Example of Regex with PolyAnalyst

24 Extract [Age] of Suspect
Besides grouping, parentheses () also store (capture) the text they match
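A minimal sketch of storing a value with parentheses, written in Python. The records and the age pattern are hypothetical; the rule used on the original slide is not preserved in this transcript.

```python
import re

# Hypothetical incident descriptions.
records = [
    "Suspect is a 34 year old male.",
    "Witness, a 19-year-old student, called 911.",
    "No age given.",
]

# (\d+) captures the digits; the rest of the pattern only provides context.
age_pattern = re.compile(r"(\d+)[\s-]years?[\s-]old", re.IGNORECASE)

ages = []
for record in records:
    match = age_pattern.search(record)
    # group(1) is the text stored by the first pair of parentheses.
    ages.append(int(match.group(1)) if match else None)

# Sorting works once the captured text has been converted to numbers.
sorted_ages = sorted(age for age in ages if age is not None)

print(ages)
print(sorted_ages)
```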

25 Extract and Sort [Age]

26 Clean Up Text / String Columns

27 Clean Up Text / String Columns
.* matches any number of characters except newline
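One way `.*` is typically used for clean-up is to strip a trailing note from each value. The sketch below uses Python's `re.sub` on invented rows. It also shows that `.*` is greedy: it takes as much text as it can, which matters when a delimiter appears more than once.

```python
import re

# Hypothetical column values with a parenthesised note at the end.
rows = ["Burglary on Main St (case closed 2013)", "Theft at the mall (pending)"]

# \s*\(.*\) : optional whitespace, an opening parenthesis, anything, a closing parenthesis.
cleaned = [re.sub(r"\s*\(.*\)", "", row) for row in rows]

# Greedy .* runs from the first "(" to the LAST ")".
greedy = re.sub(r"\(.*\)", "", "a (x) b (y)")

# Lazy .*? stops at the first ")" it can, so each note is removed separately.
lazy = re.sub(r"\(.*?\)", "", "a (x) b (y)")

print(cleaned)
print(repr(greedy), repr(lazy))
```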

28 Clean Up Text / String Columns

29 Clean Up Text / String Columns

30 Delimiter and Extraction

31 Delimiter and Extraction
\w matches any alphanumeric character or the underscore: [A-Za-z0-9_]
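As a sketch of `\w` around a delimiter, the Python snippet below pulls key/value pairs out of an invented record. The `re.ASCII` flag is an assumption added to restrict `\w` to `[A-Za-z0-9_]`; without it, Python's `\w` also matches non-ASCII letters.

```python
import re

# Hypothetical delimited record: pairs separated by ";", key and value by "=".
record = "name=John_Doe;age=34;city=Chicago"

# \w+ grabs a run of word characters on each side of the "=" delimiter.
pair_pattern = re.compile(r"(\w+)=(\w+)", re.ASCII)
pairs = pair_pattern.findall(record)

# \w stops at anything that is not a letter, digit, or underscore.
word_runs = re.findall(r"\w+", "first-name: John", re.ASCII)

print(pairs)
print(word_runs)
```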

32 Delimiter and Extraction

33 Delimiter and Extraction

34 Delimiter and Extraction

35 Delimiter and Extraction
Besides grouping, parentheses () also store (capture) the text they match

36 Delimiter and Extraction

37 Replace Terms Node Find and replace patterns of characters in one or more string or text columns.
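Pattern-based find and replace can be sketched with Python's `re.sub`; redaction is one common use. The text and the number format below are assumptions for illustration and do not reproduce the Replace Terms node's own settings.

```python
import re

# Hypothetical note containing a number in NNN-NN-NNNN format and a phone number.
text = "SSN 123-45-6789 on file; call 555-1234."

# Replace every NNN-NN-NNNN number with a fixed mask.
full_mask = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "XXX-XX-XXXX", text)

# Parentheses store the last four digits; \1 puts them back in the replacement.
partial_mask = re.sub(r"\b\d{3}-\d{2}-(\d{4})\b", r"XXX-XX-\1", text)

print(full_mask)
print(partial_mask)
```

The phone number is left alone because it does not fit the three-two-four digit layout.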

38 Data Redaction

39 Regex in Replace Terms Node

40 Data Redaction

41 Regex in Replace Terms Node

42 Regex in Replace Terms Node

43 Contacting Megaputer Questions?

44 Appendix: An Example of Regular Expression with a Web Scraping Project of Glassdoor Data

45 Polish the Information

46 Remove Unnecessary Info
(?s) lets the dot . match newlines as well, so the pattern treats the whole text as if it were on one line
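The effect of `(?s)` can be seen by running the same replacement with and without it. This is a Python sketch on an invented page fragment; the `<nav>` block is a hypothetical stand-in for unwanted page content.

```python
import re

# Hypothetical scraped text: a navigation block spread over several lines.
page = "Header\n<nav>\nHome\nJobs\n</nav>\nReview text"

# Without (?s) the dot cannot cross a newline, so nothing matches and nothing is removed.
without_flag = re.sub(r"<nav>.*?</nav>\n", "", page)

# With (?s) the dot also matches newlines, so the whole block is removed.
with_flag = re.sub(r"(?s)<nav>.*?</nav>\n", "", page)

print(repr(without_flag))
print(repr(with_flag))
```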

47 Find a Delimiter
For forums or blogs with multiple posts in one webpage
Find ways to identify common patterns
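Splitting a page into records on a recurring pattern might look like the Python sketch below. The "Posted by ... on ..." header is a made-up delimiter; a real page needs its own common pattern.

```python
import re

# Hypothetical page text with two posts, each starting with the same kind of header line.
page = (
    "Posted by alice on 2014-01-05\n"
    "Great place to work.\n"
    "Posted by bob on 2014-02-11\n"
    "Long hours.\n"
)

# (?m) lets ^ match at the start of every line.
# (?=...) is a lookahead: split just before each header without consuming it.
parts = re.split(r"(?m)^(?=Posted by \w+ on )", page)

# Drop the empty piece before the first header and trim trailing newlines.
records = [part.strip() for part in parts if part.strip()]

for record in records:
    print(repr(record))
```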

48 Separate Records of Info

49 Find a Delimiter

50 Find a Delimiter

51 Records Separated!

52 Different Ways to Extract Data
Right from the parsed text
Option to work on raw HTML codes

53 Data Extraction – Parsed Text
Title of Review
Location
Job Title
Date & Time

54 Data Extraction – Parsed Text

55 Data Extraction – Raw HTML Codes
Title of Review
Job Title
Location
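Working on raw HTML usually means using the tags themselves as anchors. The Python sketch below uses invented markup; the class names `title`, `job`, and `loc` are hypothetical and are not Glassdoor's actual HTML.

```python
import re

# Hypothetical HTML for one review; real pages use different tags and attributes.
html = (
    '<div class="review">'
    '<span class="title">Great culture</span>'
    '<span class="job">Analyst</span>'
    '<span class="loc">Chicago, IL</span>'
    "</div>"
)

# First group stores the field name from the class attribute,
# second group stores the text between the opening and closing tags.
field_pattern = re.compile(r'<span class="(\w+)">(.*?)</span>')
row = dict(field_pattern.findall(html))

print(row)
```

Each stored pair becomes one column of the resulting record.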

56 Resulting Dataset

57 Making Good Use of the Info

58 Contacting Megaputer Questions?

