– Campione d'Italia, 2011, May the 30th MailOfMine – Analyzing Mail Messages for Mining Artful Collaborative Processes Claudio Di.


1 MailOfMine – Analyzing Mail Messages for Mining Artful Collaborative Processes
Claudio Di Ciccio ¹, Massimo Mecella ¹, Monica Scannapieco ², Diego Zardetto ², Tiziana Catarci ¹
(cdc@dis.uniroma1.it)
Campione d'Italia, 30 May 2011

2 2011/05/30 SIMPDA 2011 (Campione, Italy) P. 2 / Outline: An index of what follows
The context
– Artful processes and knowledge workers
– E-mail messages as process traces
– Declarative workflows
The MailOfMine approach
– Object Matching for clustering (with preliminary test results)
– Information Extraction for activity identification
– Pattern mining for constraints retrieval
Declarative workflow definition of artful processes
– The Process Describing Grammar (PDG)
Conclusions

3 Motivation (1): Artful processes and knowledge workers
Artful processes [HillEtAl06]
– informal processes typically carried out by those people whose work is mental rather than physical (managers, professors, researchers, engineers, etc.): "knowledge workers" [ACTIVE09]
Knowledge workers create artful processes "on the fly"
Though artful processes are frequently repeated, they are not exactly reproducible, even by their originators, nor can they be easily shared.

4 Motivation (2): E-mail conversations
In collaborative contexts, knowledge workers share their information and outcomes with other knowledge workers
– E.g., a software development manager
Typically, by means of several e-mail conversations
– E-mail conversations are actual traces of running processes that knowledge workers adhere to

5 Motivation (3): Processes from e-mail conversations
From a collection of e-mail messages, you can extract the processes that lie behind it
– Related e-mail conversations are traces of their runs
Valuable advantages for users
– Automated discovery of formal representations with no effort for knowledge workers
– Tidy organization of naïve best practices kept only in mind
– Opportunity to share and compare the knowledge on methodologies
– Automated discovery of bottlenecks, delays, and structural defects from the analysis of previous runs

6 Motivation (4): Some areas of applicability
Personal information management (PIM)
– how to organize one's own activities, contacts, etc. through the usage of software [CatarciEtAl07, ACTIVE09]
Information warfare
– in supporting anti-crime intelligence agencies
Enterprise engineering
– for knowledge-heavy industries, where preserving documents making up product data is not enough [SmVortex, Heutelbeck11]

7 The approach: How to represent and infer artful processes
Representation
– Declarative workflows [vanDerAalstEtAl09] for representing artful processes
– Regular grammars to express declarative workflow constraints
Mining
– Object Matching [ZardettoEtAl10] for
  clustering e-mail conversations
  finding the matching between activity and task instances
– Regular expression mining [GarofalakisEtAl99] for inferring constraints
– Supervised learning to group activities into processes
– Text mining information extraction to determine tasks out of e-mail messages [CohenEtAl04, SakuraiEtAl05]

8 A glossary: Some key words explained
MailOfMine: the system
Task: an elementary unit of work
– Leucippus' (and Democritus') concept of the atom, here for processes
Activity: a collection of tasks or activities
Process: a set of activities
Instance: a traced execution of a task, an activity, or a process
Indicium: any communication trace, or part of it, attesting the execution of a task, an activity, or a process instance
Key part: each unique piece of text belonging to the e-mail messages exchanged in a communication trace
– The text quoted from the previous e-mail in a reply is not a key part of the new e-mail message
– Your signature is not a key part of your e-mail messages
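
The glossary terms nest: tasks make up activities, activities can contain other activities, and activities make up processes. A minimal Python sketch of that nesting follows. The class layout, the `tasks()` helper, and all example names are assumptions made for illustration and are not taken from MailOfMine itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass(frozen=True)
class Task:
    """An elementary unit of work (the 'atom' of a process)."""
    name: str


@dataclass
class Activity:
    """A collection of tasks or (nested) activities."""
    name: str
    parts: list[Union[Task, "Activity"]] = field(default_factory=list)

    def tasks(self) -> list[Task]:
        """Flatten nested activities into their elementary tasks."""
        flat: list[Task] = []
        for part in self.parts:
            if isinstance(part, Task):
                flat.append(part)
            else:
                flat.extend(part.tasks())
        return flat


@dataclass
class Process:
    """A set of activities (stored as a list here, for simplicity)."""
    name: str
    activities: list[Activity] = field(default_factory=list)


# Hypothetical example: all names are illustrative only.
review = Activity("review", [Task("send draft"), Task("collect comments")])
publish = Activity("publish", [review, Task("send final version")])
paper_writing = Process("paper writing", [review, publish])

print([t.name for t in publish.tasks()])
```

Instances and indicia would sit on top of this model as recorded executions and as the e-mail evidence for them; they are left out to keep the sketch short.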

9 Algorithm (1): From the e-mail archive to key parts
Mail archive → Mail Database → Conversations → Key Parts
– Mail archive to Mail Database: plug-in based crawlers over multi-format mail storage
– Mail Database to Conversations: [ZardettoEtAl10]-based clustering algorithm
– Conversations to Key Parts: [CarvalhoEtAl04]-based filter
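
To make the last stage concrete, here is a deliberately naive Python sketch of a key-part filter. It is not the learned method of [CarvalhoEtAl04]: it only assumes two common plain-text conventions, namely that quoted reply lines start with `>` and that a signature follows a line consisting of `--` (optionally with trailing whitespace). The function name `key_part` and the sample message are hypothetical.

```python
def key_part(body: str) -> str:
    """Return the message body without quoted reply lines and signature."""
    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":
            # Conventional signature separator: drop everything after it.
            break
        if line.lstrip().startswith(">"):
            # Quoted text belongs to an earlier message, not to this one.
            continue
        kept.append(line)
    return "\n".join(kept).strip()


message = (
    "Fine with me, let's meet on Friday.\n"
    "\n"
    "> Shall we meet this week?\n"
    "> Let me know.\n"
    "\n"
    "-- \n"
    "Alice\n"
    "Dept. of Examples\n"
)
print(key_part(message))
```

Real mail clients quote and sign in many other ways, which is why a trained classifier is a more robust choice than fixed rules like these.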

10 Object Matching for Similarity Clustering (1): How to cluster e-mail messages into conversations
Object Matching (OM) is the problem of identifying pairs of data objects coming from different sources and representing the same real-world object.
– Input: data objects (DB rows, XML files, etc.)
  If data objects are records (e.g. rows in a database), the problem is known in the literature as Record Linkage
– Output: sets of mappings between homogeneous objects
We take advantage of it in collecting e-mail messages that are related to the same topic
Similarity Clustering (SC) is the problem of determining whether two objects belong to the same group.
OM and SC both:
– share the common objective of classifying data-object pairs according to a hidden (i.e. not self-evident) property
– rely on pairwise distance (or, equivalently, similarity) measures

11 Object Matching for Similarity Clustering (2): How to cluster e-mail messages into conversations
The technique used is based on [ZardettoEtAl10] for OM
Four-step procedure:
1. Representation format for e-mail messages as processable data objects
2. Definition of a specific distance metric to compare e-mail objects, based on:
   – Message-Id, Sender, Receivers, Subject header fields
   – Body
   – Names of attached files
3. Run of a decision algorithm
   The output is a set of e-mail pairs declared as "matches", i.e., in the present context, belonging to the same cluster
4. Application of a function performing a transitive closure among "match" pairs in order to build e-mail clusters
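
The four steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the [ZardettoEtAl10] technique: the `Email` fields, the unweighted average of per-field Jaccard distances, and the fixed `threshold` that stands in for the decision algorithm are all hypothetical choices. The Message-Id is used only as an identifier here. Only step 4, the transitive closure (done with a union-find structure), is independent of those choices.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Email:
    # Step 1: a processable representation of a message (illustrative fields).
    message_id: str
    sender: str
    receivers: frozenset
    subject: str
    body: str
    attachments: frozenset = frozenset()


def jaccard_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def tokens(text: str) -> set:
    return set(text.lower().split())


def distance(x: Email, y: Email) -> float:
    # Step 2: an assumed, unweighted average of per-field distances.
    parts = [
        jaccard_distance({x.sender} | x.receivers, {y.sender} | y.receivers),
        jaccard_distance(tokens(x.subject), tokens(y.subject)),
        jaccard_distance(tokens(x.body), tokens(y.body)),
        jaccard_distance(set(x.attachments), set(y.attachments)),
    ]
    return sum(parts) / len(parts)


def cluster(emails, threshold=0.5):
    # Step 3: a fixed threshold stands in for the decision algorithm.
    matches = [(i, j) for i, j in combinations(range(len(emails)), 2)
               if distance(emails[i], emails[j]) <= threshold]

    # Step 4: transitive closure of the match pairs via union-find.
    parent = list(range(len(emails)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in matches:
        parent[find(i)] = find(j)

    groups = {}
    for i, mail in enumerate(emails):
        groups.setdefault(find(i), []).append(mail.message_id)
    # Keep only non-trivial clusters (at least 2 messages).
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


a = Email("m1", "ann@example.org", frozenset({"bob@example.org"}),
          "Budget review", "please check the budget draft")
b = Email("m2", "bob@example.org", frozenset({"ann@example.org"}),
          "Re: Budget review", "the budget draft looks fine")
c = Email("m3", "carl@example.org", frozenset({"dana@example.org"}),
          "Lunch", "pizza today?")
print(cluster([a, b, c]))
```

Because of the closure in step 4, two messages can land in the same cluster without being declared a match directly, as long as a chain of matches connects them.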

12 Object Matching for Similarity Clustering (3): Preliminary test results
101 e-mail messages processed
– selected from a larger collection of about 10 GByte of e-mail messages
98 e-mail pairs declared to belong to the same group
– out of the 5050 pairs generated
21 non-overlapping and non-trivial (that is, containing at least 2 e-mail messages) clusters determined
– The output cluster sizes range from 2 to 11 e-mail messages
– 68 of the 101 e-mail messages are involved in the clusters
To assess the quality of the result, we manually analyzed the input e-mail messages and the output clusters
– only 1 false positive cluster
– no false negatives
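
The reported figures can be cross-checked with a few lines of Python: 5050 is the number of unordered pairs among 101 messages, and the other ratios follow from the counts above. The variable names are illustrative.

```python
from math import comb

n_messages = 101
n_pairs = comb(n_messages, 2)   # unordered pairs: 101 * 100 / 2
match_rate = 98 / n_pairs       # fraction of pairs declared matches
coverage = 68 / n_messages      # fraction of messages placed in a cluster

# A cluster of k messages needs at least k - 1 match pairs to be connected,
# so 21 clusters covering 68 messages need at least 68 - 21 pairs.
min_pairs = 68 - 21

print(n_pairs)                  # 5050
print(f"{match_rate:.2%}")      # 1.94%
print(f"{coverage:.1%}")        # 67.3%
print(min_pairs)                # 47
```

The 98 declared matches exceed the minimum of 47, so the reported counts are mutually consistent.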

13 Algorithm (2): From key parts to the activities
Stages: Key Parts, Activity indicium, Tasks, Activities
Components:
– Concatenation; [ZardettoEtAl10]-based
– [CohenEtAl04, SakuraiEtAl05]-based task extractor
– [GarofalakisEtAl99]-based pattern miner

14 Algorithm (3): From activities to the processes
Stages: Process indicium, Process
Components:
– [GarofalakisEtAl99]-based pattern miner
– Supervised learning

15 Representation of artful processes: An example of expected outcome
[Figure: example process model, with the labels "Tasks", "Existence constraints", and "Relation constraints"]
Notation based on [vanDerAalstEtAl06] (DecSerFlow)

16 On the representation of artful process schemata: Regular grammars expressing declarative workflows
Each constraint in the set which can be used to define a mined artful process is expressible through regular grammars, where:
– tasks are terminal characters, the building blocks of constraints on tasks.
– constraints are regular expressions, equivalent to regular grammars.
– activities are regular expressions which are equivalent to the intersection of the constraints' regular grammars.
– constraints can be formulated on top of activities, being regular expressions themselves.
– the process scheme is the intersection of constraints defined on top of activities.
The process scheme defines a Process Describing Grammar (PDG)
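
As a small illustration of "intersection of constraints", the Python sketch below encodes tasks as single characters and constraints as regular expressions, and accepts a trace only if it matches every constraint. The three constraints, their regular expressions, and the task alphabet are hypothetical examples, not constraints taken from the slides.

```python
import re

# Hypothetical task alphabet:
# a = "send draft", b = "collect comments", c = "send final version".
constraints = {
    "a occurs at least once": r"[^a]*a.*",
    "every a is eventually followed by b": r"[^a]*(a.*b)*[^a]*",
    "c occurs at most once": r"[^c]*(c[^c]*)?",
}


def violated(trace: str) -> list:
    """Names of the constraints that the whole trace does not match."""
    return [name for name, rx in constraints.items()
            if re.fullmatch(rx, trace) is None]


def accepts(trace: str) -> bool:
    """A trace is in the intersection iff it matches every constraint."""
    return not violated(trace)


for trace in ["abc", "abca", "bc", "acbc"]:
    print(trace, accepts(trace), violated(trace))
```

Checking each expression separately gives the same verdict as matching against a single automaton for the intersection, and it also reports which constraint a trace breaks.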

17 On the representation of constraints: Some examples

18 On the usage of regular grammars: The rationale (why not LTL for declarative workflows?)
Temporal logic is a formalism for describing sequences of transitions between states in a reactive system
Linear Temporal Logic (LTL, [Pnueli77]) describes events along a single computation path
LTL formulæ are verified over semi-infinite runs
– defined over Kripke structures
They are good for automatically checking the correct operation of circuits or server programs
– Not for human processes, which have both a starting point and an end
"In the long run, we are all dead" (John Maynard Keynes)
Regular grammars are verified by Finite State Automata
– working with less complex algorithms, in terms of computational effort
A PDG describes the language spoken by collaborative organisms in terms of activities
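
To show what verification by a finite state automaton looks like on a finite trace, here is a minimal Python sketch of a two-state automaton for one hypothetical constraint: every task `a` must eventually be followed by a task `b`. The constraint, the state numbering, and the function name are assumptions chosen for illustration.

```python
# States: 0 = no "a" is waiting for a "b" (accepting),
#         1 = some "a" still awaits its "b".
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 0,
}
ACCEPTING = {0}


def satisfies_response(trace: str) -> bool:
    """Run the automaton over a finite trace in a single left-to-right pass."""
    state = 0
    for task in trace:
        # A task without an explicit transition leaves the state unchanged.
        state = TRANSITIONS.get((state, task), state)
    return state in ACCEPTING


for trace in ["acb", "aba", "", "cc", "aab"]:
    print(repr(trace), satisfies_response(trace))
```

The verdict is taken when the trace ends, which suits processes that have both a starting point and an end, and the check costs one table lookup per task.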

19 Conclusions: Summing up what we have discussed so far and what is to be done next
Objective:
– Mining artful processes out of e-mail conversations
Approach:
– Multi-disciplinary, involving Object Matching, Text Mining, and Process (Pattern) Mining
Outcome:
– Declarative workflows expressed as regular grammars
Ongoing work:
– Extending the current testbed to larger e-mail archives
– Building an integrated prototype
– Defining a visual representation for the constraints in declarative workflows, understandable by users (knowledge workers themselves)

20 References: Cited articles and resources, in order of appearance
[HillEtAl06] Hill, C., Yates, R., Jones, C., Kogan, S.L.: Beyond predictable workflows: Enhancing productivity in artful business processes. IBM Systems Journal 45(4), 663–682 (2006)
[ACTIVE09] Warren, P., Kings, N., et al.: Improving knowledge worker productivity - the active integrated approach. BT Technology Journal 26(2), 165–176 (2009)
[CatarciEtAl07] Catarci, T., Dix, A., Katifori, A., Lepouras, G., Poggi, A.: Task-centred information management. Proc. DELOS Conference, LNCS 4877 (2007)
[SmVortex] Smart vortex – management and analysis of massive data streams to support large-scale collaborative engineering projects. FP7 IP Project: http://www.smartvortex.eu/
[Heutelbeck11] Heutelbeck, D.: Preservation of enterprise engineering processes by social collaboration software (2011), personal communication, to appear in Proc. COLLIN2011 - 2nd Symposium on Collective Intelligence
[vanDerAalstEtAl09] van der Aalst, W.M.P., Pesic, M., Schonenberg, H.: Declarative workflows: Balancing between flexibility and support. Computer Science - R&D 23(2), 99–113 (2009)
[ZardettoEtAl10] Zardetto, D., Scannapieco, M., Catarci, T.: Effective automated object matching. Proc. ICDE 2010
[CohenEtAl04] Cohen, W.W., Carvalho, V.R., Mitchell, T.M.: Learning to classify email into "speech acts". Proc. EMNLP 2004
[SakuraiEtAl05] Sakurai, S., Suyama, A.: An e-mail analysis method based on text mining techniques. Appl. Soft Comput. 6(1), 62–71 (2005)
[CarvalhoEtAl04] Carvalho, V.R., Cohen, W.W.: Learning to extract signature and reply lines from email. Proc. CEAS 2004
[GarofalakisEtAl99] Garofalakis, M.N., Rastogi, R., Shim, K.: Spirit: Sequential pattern mining with regular expression constraints. Proc. VLDB 1999
[vanDerAalstEtAl06] van der Aalst, W.M.P., Pesic, M.: Decserflow: Towards a truly declarative service flow language. Proc. WS-FM 2006
[Pnueli77] Pnueli, A.: The Temporal Logic of Programs. Proc. 18th Annual Symposium on Foundations of Computer Science, 1977

