Presentation on theme: "Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed Text Dr Dale Chant, Red Centre Software Pty Ltd ASC Conference:"— Presentation transcript:
Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed Text Dr Dale Chant, Red Centre Software Pty Ltd ASC Conference: Making Sense of New Research Technologies Critical Reflections on Methodology and Technology: Gamification, Text Analysis, and Data Visualisation Friday 6th and Saturday 7th September 2013, University of Winchester
The Problem Coding open-ended verbatims takes a long time. Inconsistent coding judgements can wreak havoc on small weekly samples. Some bodies of free text, such as Twitter feeds, are beyond human capacity to digest due to sheer volume. Machine coding by string matching assumes well-formed text; variant morphologies are difficult to accommodate.
Naïve Auto-Coding i) Read all source words (or complete strings) into an array ii) Sort alphabetically iii) Assign codes from 1 to N, where N is the number of unique words (or unique strings) iv) Write the assigned codes in the original word (or string) order
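The four steps above can be sketched in a few lines. This is a minimal illustration, not the software used in the talk: the function name `naive_auto_code` and the whitespace tokenisation are assumptions made for the example.

```python
def naive_auto_code(words):
    """Assign each unique word a code from 1..N by alphabetical rank,
    then rewrite the text as codes in the original word order."""
    # Sorting the unique words fixes the code frame: code 1 is the
    # alphabetically first word, code N the last.
    frame = {word: code for code, word in enumerate(sorted(set(words)), start=1)}
    coded = [frame[word] for word in words]
    return frame, coded


# Hypothetical sample text, split on whitespace (an assumption).
frame, coded = naive_auto_code("the cat sat on the mat".split())
print(frame)  # {'cat': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(coded)  # [5, 1, 4, 3, 5, 2]
```

Because the frame is built from the text itself, every word is guaranteed a code, which is what code-complete means for a bounded text.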
Naïve Auto-Coding Well-formed published text is code-complete. Example shown: the first line of Wuthering Heights. The complete code frame has 9,201 items.
Netting to a Theme With the code frame defined, themes can be netted from individual words. Decoded: abandonment = abandon/abandoned/abandonment/reject/rejected/rejecting. Coded: Theme(1) = Text(3/5,6530/6532).
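As a rough sketch of netting, a theme can be treated as the set of codes belonging to its variant words. The helper names `build_frame` and `net_theme` and the sample sentence are hypothetical; the variant list is the one shown on the slide.

```python
def build_frame(words):
    """Alphabetical code frame: one code per unique word, starting at 1."""
    return {word: code for code, word in enumerate(sorted(set(words)), start=1)}


def net_theme(frame, variants):
    """Collect the codes of every theme variant that occurs in the frame."""
    return {frame[word] for word in variants if word in frame}


# Hypothetical sample sentence, coded word by word.
words = "he felt abandoned and rejected by all".split()
frame = build_frame(words)
coded = [frame[word] for word in words]

abandonment = net_theme(
    frame,
    ["abandon", "abandoned", "abandonment", "reject", "rejected", "rejecting"],
)
hits = sum(1 for code in coded if code in abandonment)
print(sorted(abandonment), hits)  # [1, 7] 2
```

Variants that never occur in the text simply contribute no codes, so the same theme definition can be reused across texts.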
Code Incomplete Open-ended tracker Brand Awareness questions and time-dependent blog or social media exchanges can never be code-complete, because forthcoming data may throw up unanticipated variations, e.g. Dog/dogs/mongrel/mongrels/mutt/mutts/dingo/wolf/…
Damerau-Levenshtein One approach is Approximate String Matching. Match a source string to a target string by combinations of i) Insert ii) Delete iii) Replace iv) Transpose. The edit distance is the minimum number of transforms needed to get from the source to the target.
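For readers who want to experiment, here is a minimal sketch of the distance calculation. It implements the restricted "optimal string alignment" form of Damerau-Levenshtein, which allows the four transforms listed above but never edits the same substring twice; whether the talk's software uses this form or the unrestricted one is not stated, so treat the choice as an assumption. The function name `osa_distance` is illustrative.

```python
def osa_distance(source, target):
    """Edit distance counting insert, delete, replace and adjacent transpose
    (restricted 'optimal string alignment' Damerau-Levenshtein)."""
    rows, cols = len(source) + 1, len(target) + 1
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or keep)
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]


print(osa_distance("ox", "fox"))                     # 1
print(osa_distance("megalomania", "megalomaniacs"))  # 2
print(osa_distance("meglaomaniac", "megalomaniac"))  # 1
```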
The Algorithm in Action There is an interactive implementation of Damerau-Levenshtein at http://fuzzy-string.com/Compare/
Scaling the Algorithm (1) To be useful, the allowable distance for a positive match needs to scale against the length of the target strings. ox to fox has distance 1 (insert at head): this would be a false positive. megalomania to megalomaniacs has distance 2 (insert twice at tail): this is a good match.
Scaling the Algorithm (2) Short strings need a distance of zero. Intermediate strings need 1 or 2. Longer strings can bear 2 or 3 or more. The thresholds for short/intermediate/long and allowed distances for a positive match are here termed the fuzz parameters. Fuzz parameters are determined empirically, and will vary with the body of text being analysed.
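One possible way to express fuzz parameters is as a list of (maximum length, allowed distance) pairs plus a fallback for long strings. The function name and parameter layout below are assumptions; the default values are the ones this talk reports using for hashtags (0 up to 4 characters, 1 up to 9, otherwise 2) and would need re-tuning for other bodies of text.

```python
def allowed_distance(target, thresholds=((4, 0), (9, 1)), long_distance=2):
    """Return the largest edit distance still counted as a match,
    scaled by the length of the target string."""
    for max_length, distance in thresholds:
        if len(target) <= max_length:
            return distance
    return long_distance


print(allowed_distance("fox"))           # 0: short targets must match exactly
print(allowed_distance("refugees"))      # 1
print(allowed_distance("megalomaniac"))  # 2
```

With these settings ox is not accepted as fox, because a three-character target tolerates no edits at all.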
What is Gained The target string megalomaniac at an edit distance of 1 will match on: 12 * 26 in situ typos (negalomaniac) + 12 missing (megaomaniac) + 12*26 extraneous (megaloomaniac) + 11 transpositions (meglaomaniac) + 2*26 extra pre/post character (mmegalomaniac) = 699 possible variations
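The arithmetic on this slide can be reproduced for any target length. The sketch below follows the slide's own terms exactly (the function name is hypothetical); note that it tallies edit operations rather than distinct strings, so the total is best read as indicative.

```python
def distance_one_tally(target, alphabet_size=26):
    """Tally the distance-1 variations of a target using the slide's terms."""
    n = len(target)
    counts = {
        "in situ typos": n * alphabet_size,          # replace any character
        "missing": n,                                # delete any character
        "extraneous": n * alphabet_size,             # insert within the string
        "transpositions": n - 1,                     # swap adjacent characters
        "extra pre/post character": 2 * alphabet_size,
    }
    return counts, sum(counts.values())


counts, total = distance_one_tally("megalomaniac")
print(total)  # 699 = 312 + 12 + 312 + 11 + 52
```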
The Procedure i) Code the source text, one code per unique word ii) Run a sorted frequency count to expose recurrent themes and concepts iii) Review actual instances of these words in situ to determine appropriate fuzz parameters and the thematic and conceptual contexts iv) Devise a compact target code frame which maps the thematic and conceptual words of interest to synonym and variant lists v) Process the source text against the targets, to create a categorical variable which can be tabulated in the normal manner against any other variable
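The sorted frequency count in the second step might look like the following. This is only a sketch: the tokenising regular expression, the lower-casing and the sample sentence are assumptions, not details given in the talk.

```python
import re
from collections import Counter


def sorted_frequency_count(text):
    """Return (word, count) pairs, most frequent first."""
    # Assumed tokenisation: runs of letters and apostrophes, lower-cased.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


# Hypothetical sample sentence.
table = sorted_frequency_count("Love is blind, and love is rash; but love endures.")
print(table[:2])  # [('love', 3), ('is', 2)]
```

The head of such a table is where recurrent themes show up; in practice very common function words would be skipped over during review.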
Exposure and Quantification: Romeo and Juliet Since this text is bounded, code-complete and well-formed, the fuzz parameters can all be set to zero. The Exposure step reveals dominance for i) love and related ii) misery and despair iii) conflict and death
Love dominates, then diminishes Romeo Romeo, wherefore art thou Romeo?
Ill-formed Text Tweets on Australian Federal Politics, from 1 June 2013 to 31 July 2013. Search term: #auspol OR #auspoll OR #ausvotes OR #ozcot. 927,190 cases. Average of between 10,000 and 20,000 per day. Huge spike on 26 June.
Data Sources Two commercial data source providers were used: Gnip and ScraperWiki. The Gnip data was collected in a single 28-hour run conducted on 15 Aug 2013. ScraperWiki provides user-initiated searches for up to the prior seven days. Because ScraperWiki is near real time, accounts later banned or suspended by 15th August, and hence not in the Gnip data, remain present. The ScraperWiki data is used below only to demonstrate this point. http://gnip.com/ https://scraperwiki.com/
Australian Federal Politics since 2007 Timeline, 07 to 13: Howard (Conservative), Rudd (Labor), Gillard (Labor), Abbott (Conservative). Rudd defeats Howard at general election. Gillard challenges and defeats Rudd, calls election, hung parliament. Rudd challenges and defeats Gillard, calls election for 7 Sept.
The Grand Narrative Assange Senate Bid With the Pretender to the Throne (Gillard) summarily dispatched, the True and Rightful King (Rudd), triumphantly returned from (backbench) exile, now faces the Great Adversary (Abbott) in a battle to the (political) death for control of the realm. Warning: Aussie Vernacular Alert
HashTags Much more than just a metatag. Function as message tokens too: – Commentary on current affairs (#1000BoatDeaths, #20000JobCuts) – Calls to action (#2013electiondateplease, #AbolishParliament) – Political attack (#AbbotLies) – Take a position (#AgeOfEntitlement) – Make a joke or pun (#calmdownbirdie, #fraudband)
Hashtag Spawn Quantification should capture as many variants as possible.
Sort on 19 July, Zoom The PNG Solution is more punitive than anything the Conservatives have tried
But many instances missed Three dominant tags are clear, but the variants will be lost under a search on just asylum OR asylumseeker/s. Ditto Refugee/Refugees, etc.
Quantification Smoothed percentage chart of all instances exposes the narratives, but to quantify them accurately, we cannot forgo counting the variants. To get a more precise read, we apply Damerau-Levenshtein. Recalling the four transformation rules (insert, delete, replace, transpose), the following matches (among many others) will be made to the dominant forms at run time: battelrort -> battlerort (transpose once); calmdownbirdie -> calmdownbridie (transpose once); asylumseeke -> asylumseekers (insert twice); asylumseekeers -> asylumseekers (delete once); asylymseekers -> asylumseekers (replace once).
Prepare the synonym/variants lists for the dominant tags The procedure is: i) Code the hashtags, one code per unique tag ii) Generate a sorted frequency count table iii) Choose a cut-off point (I have used 30) iv) Review all items > 30, define and initialise a coded synonym/variants list with the dominant tags v) Sort the table alphabetically by label vi) Review label blocks for any variants which are too coarse for Damerau-Levenshtein, and add to the relevant synonym/variants target list
Define Coded Categories against Targets
Code | Category | Synonym/Variant Targets
1 | BattleRort | #BattleRort/#BattleRortAbbott/#BattleRortGate/#BattleRortMovies/#BattleRortSongs
2 | CalmDownBridie | #CalmDownBridie/#CalmDownTony/#CalmDownAbbott/#CalmDown
3 | Any Media | #qanda/#abcnews24/#abcnews/#abc24/#Insiders/#730report/#abc730/#730/#lateline/#thedrum/#pmlive/#pmagenda/#amagenda/#4corners/#contrarians/#abc/#MSMfail/#MSM/#contrarians/#theboltreport/#viewpoint/#media/#Murdoch/#ABC1/#mediawatch/#datelineSBS
4 | Any Border Protection | #asylumseekers/#refugees/#boatpeople/#asylum/#PNGSolution/#PNG/#Nauru/#ManusIsland/#Manus/#humanrights/#stoptheboats/#Indonesia/#immigration/#boats/#operationsovereignborders
5 | Islam | #islamophobe/#islamist/#islamlaw/#islamic/#Islam/#muslim
6 | NBN | #NBNCo/#NBN/#fraudband
7 | Any Environment | #climate/#coal/#fracking/#carbon/#energy/#CSG/#climatetax/#climatechange/#environment/#ETS/#carbonscam/#carbontaxscam/#climatescam/#climatecon/#green/#AGWHoax/#globalwarming/#greenarmy/#renewables/#votegreen/#climategate/#naturalcsg/#naturalgas
8 | pinkbatts | #pinkbatts
Confirm it Works Set fuzz parameters as distance=0 for strings of 4 characters or fewer, distance=1 for 5 to 9 characters, and distance=2 for 10 characters or more. Run the source hashtags against these targets to create a new variable comprising eight categorical codes. To confirm, run a table of the eight coded categories against the original raw hashtag text. The source strings battelrort and calmdownbirdie are both correctly captured and coded.
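The run described on this slide could be approximated as below. This is a hedged sketch, not the production code: it uses the restricted (optimal string alignment) form of Damerau-Levenshtein, strips the leading # and lower-cases before comparing (both assumptions), and works against an abridged excerpt of the target list. All function and variable names are hypothetical.

```python
def osa_distance(source, target):
    """Insert/delete/replace/adjacent-transpose distance (restricted form)."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def allowed_distance(target):
    """Fuzz parameters from the slide: 0 up to 4 chars, 1 up to 9, else 2."""
    if len(target) <= 4:
        return 0
    if len(target) <= 9:
        return 1
    return 2


# Abridged excerpt of the coded target list (three of the eight categories).
TARGETS = {
    1: ["#BattleRort", "#BattleRortGate"],
    2: ["#CalmDownBridie", "#CalmDown"],
    8: ["#pinkbatts"],
}


def normalise(tag):
    """Assumed normalisation: drop the leading # and ignore case."""
    return tag.lstrip("#").lower()


def code_hashtag(tag, targets=TARGETS):
    """Return the first category whose targets match within the fuzz, else None."""
    source = normalise(tag)
    for code, variants in targets.items():
        for variant in variants:
            target = normalise(variant)
            if osa_distance(source, target) <= allowed_distance(target):
                return code
    return None


for raw in ["#battelrort", "#calmdownbirdie", "#pinkbats", "#auspol"]:
    print(raw, code_hashtag(raw))
```

Tabulating the returned codes against the raw tags is then the confirmation step: the two transposed forms land in categories 1 and 2, while an unrelated tag such as #auspol stays uncoded.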
Doing Likewise for the Tweet Text
Code | Category | Synonym/Variant Targets
the government
1 | Rudd | Kevin Rudd/KevinRudd/Kevin13/Kevin747/Kevin/@KRuddMP/KRudd/Rudd/CrudDudd/KR/milky bar/messiah
2 | Albanese | Albanese/Albosleaze/Albo
3 | Gillard | Julia Gillard/JuliaGillard/Gillard/Juliar/Julia/Jules
4 | Shorten | Bill Shorten/Shorten
5 | Labor Party | Labor Party/Labor/#ALP/ALP/the government/the govt
6 | Greens | Greens/Milne/Bob Brown
7 | Unions | unions/faceless men/AWU/HSU/Bill Ludwig/Ludwig/Paul Howes/Piggy Howes
the opposition
8 | Abbott | @TonyAbbotMHR/Tony Abbott/TonyAbbott/TAbbott/Abbott/Tony/TA/budgie smuggler/mad monk
9 | Turnbull | Malcolm Turnbull/@TurnbullMalcolm/Turnbull
10 | Coalition | Coalition/liberal/the opposition/Libs/#LNP/LNP
etc. to code 46
Continued to Code 46
Code | Category | Synonym/Variant Targets
34 | Corruption | corruption/corrupt/fraud/sleaze/dishonest/stealing/steal/greed/shonky/crooks/criminals/thuggish/thugs/thieves/thief
35 | Treachery | treachery/treacherous/back stab/backstab/back stabbing/backstabbing/stabbed/stabbing/knifing/knifed/plot/betrayal/betrayed/betray/spill/leadership coup/ousting/oust
36 | Insanity | insanity/insane/nutter/crazy/lunatic/unstable/psychopath/psychotic/psycho/narcissist/delusional/delusion/egotist/egotistic/egotistical/egomaniac/ego/power mad/powermad/madness/deviant
37 | Stupidity | stupidity/stupid/wanker/numpty/imbecilic/imbecile/zombie/clueless/moron/retarded/retard/idiotic/idiot/bogan
38 | Incompetence | incompetent/mismanagement/dysfunctional/waste/inefficient/chaotic/chaos/destructive/inept
39 | Cowardice | cowardice/coward/gutless/ticker/frightened/scared
40 | Hypocrisy | hypocrisy/hypocrit/bigoted/bigot
41 | Arrogance | arrogance/arrogant/smart arse/smartarse/smart ass/self indulgent/smugness/smug
scandals
42 | Scandals | scandalous/scandal
43 | AWU Slush Fund | AWU slush fund/slush fund/Bruce Wilson
44 | ALP Scandals | Peter Slipper/Slipper/Craig Thompson/Thompson/Eddie Obeid/Obeid
45 | Heiner |
policy
46 | Policy | agenda/policies/policy
Share of Voice All synonym and variant matches for Rudd, Abbott, Gillard, as percentages of the sum of their total mentions per day:
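The share-of-voice calculation itself is simple once the coded mentions are counted per day. The sketch below assumes the counts are already available as a nested dictionary; the function name, the day label and the numbers are invented purely for illustration and are not taken from the tweet data.

```python
def share_of_voice(daily_counts):
    """Convert per-day mention counts into percentages of that day's total."""
    shares = {}
    for day, counts in daily_counts.items():
        total = sum(counts.values())
        shares[day] = {
            name: (100.0 * n / total if total else 0.0)
            for name, n in counts.items()
        }
    return shares


# Invented counts for a single hypothetical day.
example = {"day 1": {"Rudd": 50, "Abbott": 30, "Gillard": 20}}
print(share_of_voice(example))
# {'day 1': {'Rudd': 50.0, 'Abbott': 30.0, 'Gillard': 20.0}}
```

Each day's percentages sum to 100, so the chart shows relative attention among the three leaders rather than absolute volume.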
Compared to Topsy Sentiment Score Not much agreement here. Who is right? http://www.couriermail.com.au/news/special-features/ruddeffect-on-the-wane-as-abbott-retains-the-people8217s-trust/story-fnho52jo-1226683181964
Performance Machine: Standard business Dell laptop, dual core, 4 GB RAM, nothing fancy, no accelerations. The bottleneck is the Damerau-Levenshtein step on the tweet text, which for the above 46 categories over 113 MB of plain text takes about 15 hours. Performance is linear in the number of individual target synonyms/variants. Damerau-Levenshtein on the hashtags, a much smaller set of targets, completes in about 20 minutes. The major time commitment from a human is in devising the target synonym and variants lists, here several hours. For more routine applications of the technique, such as open-ended brand lists, preparing the target lists is trivial. [end of document]