Berendt: Advanced databases, first semester 2011. Advanced databases – Inferring new knowledge.


1 Berendt: Advanced databases, first semester 2011
Advanced databases – Inferring new knowledge from data(bases): Text Mining II
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
Last update: 28 December 2011

2 Agenda
- Some advanced forms of text mining (index7.ppt, pp )
- Recall: The importance of business and data understanding (BU & DU) for knowledge discovery
- The Twitter Study and its questions: BU & DU
- Relations between texts
- Content analysis as a method for generating ground-truth annotations


5 Motivation for association-rule learning/mining: store layout (Amazon, earlier: Wal-Mart, ...)
Where to put: spaghetti, butter?

6 What makes people happy?


8 News and social media, in particular tweets

9 Recall: CRISP-DM
CRISP-DM (CRoss Industry Standard Process for Data Mining): a data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

10 Business understanding

11 Data understanding


13 What are the relations between these text (parts)?

14 Or these?

15 A list of possible (and interesting) text relations in the News/Blogs/Tweets domain (relation tweet -> news article)
- Repetition (could be more interesting if the repetition is itself repeated, i.e. retweeted -> repetition weights?)
- Repetition of the headline?
- Pointing to interesting links (difficult to identify? Need to process the link; there might be redirection)
- Pointing to the article
- ... anything becomes more important if it's retweeted (endorsement?) ...
- ... that may depend on WHO (re)tweets it, measured e.g. by number of followers ...
- Comment
- Reference to an event or topic via a hashtag (Obama election) ... hashtags can be used to identify a topic that might also be present in news articles (NAs) (being-about-the-same-topic) -> learn from the words around the texts, and co-occurring hashtags
- Use SentiStrength to determine if a text has a positive or negative relationship with a tweet (endorsement; criticism)
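
The idea of learning from co-occurring hashtags can be sketched in a few lines. This is only an illustration, not the procedure used in the Twitter study: the function name `hashtag_cooccurrence`, the regular expression, and the lower-casing of tags are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_cooccurrence(tweets):
    """Count how often two hashtags appear together in the same tweet.

    Hypothetical helper: tags are lower-cased and each unordered pair
    is counted at most once per tweet.
    """
    counts = Counter()
    for tweet in tweets:
        tags = sorted({tag.lower() for tag in HASHTAG.findall(tweet)})
        for pair in combinations(tags, 2):
            counts[pair] += 1
    return counts
```

Pairs with high counts are candidates for "being about the same topic"; whether raw counts are a good enough signal would have to be checked against manually coded examples.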


17 What is Content Analysis?
- A form of textual analysis (usually)
- Categorizes chunks of text according to Code
- Blend of qualitative and quantitative
Schwandt, Thomas A. Dictionary of Qualitative Inquiry. 2nd ed. Sage Publications: Thousand Oaks, CA.
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6).

18 Rough History - 1
Classical Content Analysis
- Used as early as the 1930s in military intelligence
- Analyzed items such as communist propaganda and military speeches for themes
- Created matrices searching for the number of occurrences of particular words/phrases
Roberts, C.W. "Content Analysis." International Encyclopedia of the Social and Behavioral Sciences. Elsevier: Amsterdam.
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6).

19 Rough History - 2
(New) Content Analysis*
- Moved into social science research
- Used to study trends in media and politics, and provides a method for analyzing open-ended questions
- Can include visual documents as well as texts
- More of a focus on phrasal/categorical entities than simple word counting
*My own terminology, more generally referred to as simply "Content Analysis"
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6).

20 Procedure
1. Identify a corpus of texts and sample population
2. Determine unit of analysis
3. Find themes (inductive or deductive)
4. Build a codebook
5. Mark the texts
6. Analyze the code from texts quantitatively
Denzin, Norman K. Handbook of Qualitative Research. Sage Publications: Thousand Oaks, CA.
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6).

21 Coding
Analyzing the archived content. Includes:
1. Identifying units of analysis (e.g., individual user posts, game characters)
2. Creating a codebook
3. Creating coding sheets (may be electronic now)
4. Training, coding, intercoder reliability assessment, etc.
From Paul Skalski (n.d.) Content Analysis of Interactive Media. p. 10

22 Examples
To be shown in class:
- Overview of an example from the Social Web: see pp. 11ff.
Further resources include:
- A detailed example of a codebook for Content Analysis of Stories about Protest Events:
- More examples of codebooks and coding schemes:

23 Thus …

24 A first project plan (for HWs 4 and 6) – HW 4
PHASE Data understanding / initial data collection of the class attribute
In different teams:
1. Come up with different possible relations between texts
2. Find a small number of examples (NB: Sampling strategy?)
3. Develop a codebook and coding scheme
4. Have several coders code a larger number of examples (NB: Sampling strategy?)
5. Measure inter-rater agreement (http://en.wikipedia.org/wiki/Krippendorff%27s_Alpha)
PHASE Pause – revisit the literature and re-evaluate it (not really a CRISP-DM phase …)
1. Compare your results!
2. In the light of all this, revisit (as an example from the literature) the SentiStrength coding procedure and discuss it critically
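
For step 5, a minimal sketch of Krippendorff's alpha for nominal categories (such as relation labels R1, R2, ...) may help. It follows the standard coincidence-matrix formulation, alpha = 1 - D_o / D_e. The function name and the input layout (one list of labels per coded tweet, `None` for a missing coding) are assumptions made for this example, not part of the homework specification.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the
    coders assigned to one item (None marks a missing coding).
    Returns NaN when alpha is undefined (e.g. only one category used).
    """
    # Coincidence matrix: every ordered pair of labels from different
    # coders within one unit contributes 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # a unit coded by fewer than two coders is unpairable
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")
    return 1.0 - observed / expected
```

Called on something like `[["R1", "R1"], ["R2", "R5"], ["R9", None]]`, the third unit is skipped because only one coder labelled it. An alpha of 1 means perfect agreement; values near 0 mean agreement no better than chance.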

25 A first project plan (for HWs 4 and 6) – HW 6
PHASE Data preparation
1. You may skip most of this phase. Take the data as prepared by Ilija!
PHASE Modelling
1. Understand / develop [depending on time and interest] formal measures of such text relations
2. Calculate the measures for the corpora
3. Calculate the accuracy of classification
PHASE Evaluation
1. Do an error analysis. Be critical with yourself, the results, and their meaning for the initial question ;-)
PHASE Deployment
1. Produce final report
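
As one possible starting point for the modelling phase, the sketch below uses Jaccard word overlap as a formal measure for the relation "tweet repeats the headline" and computes classification accuracy against manually coded labels. The choice of Jaccard overlap, the tokenizer, the threshold of 0.5, and all function names are assumptions made for illustration; the course does not prescribe a particular measure.

```python
import re


def tokens(text):
    """Lower-cased set of alphanumeric word tokens (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(text_a, text_b):
    """Jaccard overlap of the two token sets: |A & B| / |A | B|."""
    a, b = tokens(text_a), tokens(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def repeats_headline(tweet, headline, threshold=0.5):
    """Hypothetical classifier: predict 'headline repetition' above a threshold."""
    return jaccard(tweet, headline) >= threshold


def accuracy(predicted, gold):
    """Share of items whose predicted label equals the manually coded label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```

Accuracy alone hides which relations get confused with which, so the error analysis in the evaluation phase would still need a per-category look at the misclassified tweets.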

26 First round of relations
R1: summary
R2: repetition of headline
R3: (the tweet is a) link (to the article)
R4: (the tweet is a) link to another article on the same topic
R5: comment on the article's content
R6: comment on a topic related to the article
R7: comment on the article (note: only if there is a link to the article!)
R8: hashtag-about-topic
R9: endorsement of the article
R10: endorsement of its content
R11: criticism of the article
R12: criticism of its content

27 Problems/observations
- Non-English tweets
  TODO: language classification or different story selection
- Overlapping categories: headline repetition + link to article (this is likely to happen on sites that have an automatic tweet generator)
  TODO: new category
- Hashtag "#" missing (sometimes; only in the Oil Spill data?)
- Comment on a tweet that commented on an article (in these tweets, there is a syntactic indicator of retweeting: RT) AND most retweets are comments on the retweeted text
  TODO: new category "indirect comment"
- Use the article as an argument
  TODO: new category "link works as repetition"; simplify to category "link to the article"
- (a suggestion to someone to read it) = answer; RT = spread in your own network; both may contain commenting (but answer presupposes the recipient knows what this is about) -> exclude answer tweets?!

28 Problems/observations (2)
- Overlapping (see above)
- Found an instance of only-link (not yet clear to what; some links don't work)
- Headline + sentence (as far as the 140 characters allow) + link
- No relation: topic too big (Iraq war vs. Iraqi economy)

29 Second round of relations: "the manually annotated tweet is a … of some news article / other text"
R1: Summary with link
R2: Headline with link
R3: Summary or headline without link
R4: Endorsement with link
R5: Endorsement without link
R6: Criticism with link
R7: Criticism without link
R8: Otherwise emotionally charged text
R9: Just a link
R10: Comment on another tweet [rule: always involves RT]
R11: Enriching another tweet [rule: always involves RT]
R12: OTHER
Rule: if there is a link, try to check it to see whether the tweet text repeats the headline
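
The two syntactic cues in this scheme (presence of a link, presence of RT) can be checked automatically to narrow down the categories a coder or classifier still has to choose between. The sketch below is an assumption-laden illustration: the regular expressions, the function names, and the treatment of a tweet consisting only of URLs as R9 are choices made for this example, not rules from the study.

```python
import re

# Assumptions: links start with http:// or https://, and a retweet is
# marked by the upper-case token "RT" at the start or after whitespace.
URL_PATTERN = re.compile(r"https?://\S+")
RT_PATTERN = re.compile(r"(?:^|\s)RT\b")


def surface_features(tweet):
    """Detect the two syntactic cues used by the second-round scheme."""
    return {
        "has_link": bool(URL_PATTERN.search(tweet)),
        "is_retweet": bool(RT_PATTERN.search(tweet)),
    }


def candidate_relations(tweet):
    """Return the relation codes that remain possible given the cues."""
    features = surface_features(tweet)
    text_without_links = URL_PATTERN.sub("", tweet).strip()
    if features["has_link"] and not text_without_links:
        return ["R9"]  # nothing but a link
    if features["has_link"]:
        candidates = ["R1", "R2", "R4", "R6"]  # the "with link" categories
    else:
        candidates = ["R3", "R5", "R7"]  # the "without link" categories
    candidates.append("R8")
    if features["is_retweet"]:
        candidates += ["R10", "R11"]  # both always involve RT
    candidates.append("R12")
    return candidates
```

The cues only exclude categories; telling a summary from an endorsement or a criticism still needs the content of the tweet and, where there is a link, the linked article.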

30 Outlook
- Some advanced forms of text mining (index7.ppt, pp )
- Recall: The importance of business and data understanding (BU & DU) for knowledge discovery
- The Twitter Study and its questions: BU & DU
- Relations between texts
- Content analysis as a method for generating ground-truth annotations
- Notes about language modelling and about Inference on/with/for the Semantic Web

31 References / background reading
- Stemler, Steve (2001). An overview of content analysis. Practical Assessment, Research & Evaluation, 7(17). This describes, among other things, the classic book in the field: Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Newbury Park, CA: Sage.
- The CRISP-DM manual can be found at
- "Our" Twitter study: Subašić, I. & Berendt, B. (2011). Peddling or Creating? Investigating the Role of Twitter in News Reporting. In Proceedings of ECIR 2011 ( ). Berlin etc.: Springer. LNCS endt_2011.pdf

