:: DIAsDEM :: Seminar: Web Mining WS 2003/2004 Ingo Kampe Heiko Scharff.


1 :: DIAsDEM :: Seminar: Web Mining WS 2003/2004 Ingo Kampe Heiko Scharff

2 Content
–Introduction and data mining context
–DIAsDEM: functioning
–New extensions

3 Introduction :: problems ::

4 Introduction
Known: data in databases (DB2, Oracle, ...) is unproblematic to analyse, for example with SQL, home-grown programs or data mining tools.
But in enterprises, 80% of the data sits in text documents (MS Word, plain text files, text archives, ...): the knowledge is there, but "useless".

5 Introduction
Example (same meaning, different structure):
–Mr. Schröder earns EUR per month.
–Mister Schröder earns 20000,- €/month.
What does it mean? How to compare? How to analyse? Does this mean the same?

6 Introduction :: data mining context ::

7 Introduction
Necessary: make the knowledge analysable.
Desirable:
–semantically structured knowledge
–queryable knowledge
Possible solution: XML
–semantic tagging
–analysable (XPath, XQuery, Tamino, ...)
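To illustrate why semantically tagged XML is "analysable", here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The element and attribute names (`archive`, `sentence`, `Person`, `Salary`, `currency`, `period`) and the second sample sentence are assumptions made for this example; the slides do not show the actual DIAsDEM tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged archive; tag and attribute names are illustrative only.
TAGGED = """
<archive>
  <sentence><Person>Mr. Schröder</Person> earns <Salary currency="EUR" period="month">20000</Salary>.</sentence>
  <sentence><Person>Ms. Example</Person> earns <Salary currency="EUR" period="month">3500</Salary>.</sentence>
</archive>
"""

def salaries(xml_text):
    """Return (person, amount, currency) for every sentence with a monthly salary."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for sentence in root.findall("./sentence"):
        person = sentence.find("Person")
        # XPath-style predicate on an attribute value
        salary = sentence.find("Salary[@period='month']")
        if person is not None and salary is not None:
            result.append((person.text, int(salary.text), salary.get("currency")))
    return result

if __name__ == "__main__":
    for row in salaries(TAGGED):
        print(row)
```

Once the text carries tags like these, a question such as "who earns what per month" becomes a structured query instead of a free-text search.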

8 Introduction
For humans: Mr. Schröder earns EUR per month. = Mister Schröder earns 20000,- €/month.
For computational analysis the sentences are "useless"; only this information is useful:
–Mister Schröder
–20000 Euro
–month

9 Introduction
Need to
–"find" important information
–mark important information
Mr. Schröder earns EUR per month.
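A rough sketch of "finding" and "marking" information with regular expressions follows. It is not the DIAsDEM method, only a hand-written stand-in; the tag names `Person`, `Amount` and `Period` and the patterns themselves are assumptions. The amount in the first demo sentence is also assumed, since it is missing from the transcript.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical patterns for the three kinds of information named on the slides.
PATTERNS = [
    ("Person", re.compile(r"(?:Mr\.|Mister)\s+[A-ZÄÖÜ]\w+")),
    ("Amount", re.compile(r"\d+(?:,-)?\s*(?:EUR|€)")),
    ("Period", re.compile(r"(?:per\s+|/)month")),
]

def mark(sentence):
    """Wrap every match of the patterns above in an XML tag."""
    spans = []
    for tag, pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            spans.append((match.start(), match.end(), tag))
    spans.sort()

    out, pos = [], 0
    for start, end, tag in spans:
        if start < pos:
            continue  # skip a match that overlaps one already marked
        out.append(escape(sentence[pos:start]))
        out.append(f"<{tag}>{escape(sentence[start:end])}</{tag}>")
        pos = end
    out.append(escape(sentence[pos:]))
    return "".join(out)

if __name__ == "__main__":
    print(mark("Mr. Schröder earns 20000 EUR per month."))  # amount assumed
    print(mark("Mister Schröder earns 20000,- €/month."))
```

Both spellings end up with the same three tags, which is what makes them comparable afterwards. Hand-written patterns like these do not scale to a whole domain, which is why the following slides turn to thesauri, clustering and expert knowledge.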

10 DIAsDEM :: DIAsDEM ::

11 DIAsDEM
DIAsDEM: Datenintegration von Altlastdaten und semistrukturierten Dokumenten mit Mining-Verfahren (integration of legacy data and semi-structured documents with data mining techniques)
–project of the Deutsche Forschungsgemeinschaft (German Research Foundation)
–necessary: domain-specific knowledge (!!!)

12 DIAsDEM :: functioning ::

13 DIAsDEM
Two-phase model
1. Knowledge discovery
–iterative process (with expert knowledge)
–training phase with a training text archive
–finding segments (clusters) and semi-automatic annotation
–derivation of an unstructured XML DTD
2. Semantic tagging
–applying the discovered clusters to new archives
–"intelligent" tagging of new, unknown texts from the same domain

14 DIAsDEM
Fig.: Winkler 2003b, page 6

15 DIAsDEM
To achieve "good" semantic tagging, expert knowledge is necessary.
What is needed? Mr. Schröder or Mr. Schröder

16 DIAsDEM
Steps in DIAsDEM:
1. finding segments (for example sentences) in training texts, using thesauri and knowledge of named entities (persons, ...)
2. building an unstructured XML DTD
3. clustering similar text elements (cluster name = the descriptors dominating the cluster)
4. renaming of clusters by experts
5. annotation of training texts
6. building a final XML DTD (for querying, XML-based databases like Tamino, data mining tools, ...)
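The steps above can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the DIAsDEM Workbench: the thesaurus is invented, "clustering" is replaced by grouping segments under their dominating descriptors, and all function names are hypothetical. The naive sentence splitter would also break on abbreviations such as "Mr.", and segment text is not XML-escaped.

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus (surface word -> descriptor); a real one is domain-specific.
THESAURUS = {
    "earns": "Salary", "salary": "Salary",
    "appointed": "Appointment", "manager": "Appointment",
}

def segment(text):
    """Step 1 (simplified): split a text into sentence-like segments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cluster_name(seg, top=2):
    """Step 3 stand-in: name a segment's group after its dominating descriptors."""
    words = re.findall(r"\w+", seg.lower())
    counts = Counter(THESAURUS[w] for w in words if w in THESAURUS)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "_".join(name for name, _ in ranked[:top]) or None

def annotate(text, renames):
    """Steps 4-5: apply expert renames and wrap each classified segment in its tag."""
    out = []
    for seg in segment(text):
        name = cluster_name(seg)
        tag = renames.get(name, name)
        out.append(f"<{tag}>{seg}</{tag}>" if tag else seg)
    return out

def flat_dtd(tags, root="Archive"):
    """Steps 2/6 (simplified): a flat DTD allowing the given tags inside running text."""
    tags = sorted(set(tags))
    lines = [f"<!ELEMENT {root} (#PCDATA | {' | '.join(tags)})*>"]
    lines += [f"<!ELEMENT {tag} (#PCDATA)>" for tag in tags]
    return "\n".join(lines)

if __name__ == "__main__":
    sample = ("Schröder earns 20000 EUR per month. "
              "The board appointed a new manager. The weather was fine.")
    print("\n".join(annotate(sample, {"Salary": "Income"})))
    print(flat_dtd(["Income", "Appointment"]))
```

The `renames` mapping plays the role of the expert in step 4: automatically derived cluster names are replaced by names that make sense in the domain before the texts are annotated.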

17 Extensions :: new extensions ::

18 Extensions
Main goal:
–searching the internet for documents matching a user specification
–downloading hypertext documents
–extracting plain text from hypertext documents
–importing the plain text into the DIAsDEM collection

19 Extensions :: querying Google ::

20 Extensions - Google
1. declaration of search words by the user (panel)
2. querying Google through the Google API with the search words
3. result: list of URLs (currently only 10, limited by Google), automatically exported as a list into a text file
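The export step might look roughly like the sketch below. The slides do not show the Google API calls, so the search itself is passed in as a caller-supplied function (`search` is a hypothetical parameter, as are the function names and the one-URL-per-line file format).

```python
from pathlib import Path

MAX_RESULTS = 10  # the slide mentions that only 10 URLs are returned

def export_urls(search, words, path, limit=MAX_RESULTS):
    """Run the supplied search for the given words and write the URLs to a text file.

    `search` is any callable taking a query string and returning an iterable of URLs;
    it stands in for the real search API, which is not reproduced here.
    """
    query = " ".join(words)
    urls = list(search(query))[:limit]
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return urls

def read_urls(path):
    """Read the exported list back, one URL per line, ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Keeping the search behind a plain callable means the export and the later import step only share the text file, so the search backend can be swapped without touching the rest.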

21 Extensions :: processing and import ::

22 Extensions - Processing and Import
1. reading the URL list (exported text file)
2. downloading the hypertext files into a directory and renaming the files (enumeration)
3. detagging the files
–cleaning hypertext documents
–deleting comments and tags
–replacing special characters (not yet implemented)
4. importing the files into the DIAsDEM collection
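The detagging step can be sketched with Python's standard `html.parser`. This is an assumption about how such a step could work, not the extension's actual code. The class and function names are hypothetical, and the sketch also decodes character references such as `&ouml;`, which the slide lists as not yet implemented.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text content; tags and comments are dropped, script/style bodies skipped."""
    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns references like &ouml; into the character itself
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def detag(html_text):
    """Return the plain text of an HTML document with whitespace normalised."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Joining with a space keeps adjacent block elements apart; note that it also
    # separates text split by inline tags in the middle of a word.
    return " ".join(" ".join(parser.parts).split())

if __name__ == "__main__":
    page = ("<html><head><style>p{}</style></head><body><!-- note -->"
            "<p>Mr. Schr&ouml;der earns 20000 &euro; per month.</p>"
            "<script>var x = 1;</script></body></html>")
    print(detag(page))
```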

23 Questions?

24 Literature
Graubitz, H., Spiliopoulou, M. & Winkler, K. (2001). "The DIAsDEM Framework for Converting Domain-Specific Texts into XML Documents with Data Mining Techniques". In Proceedings of the First IEEE International Conference on Data Mining, pages , San Jose, CA, USA, November/December 2001. IEEE Computer Society, Los Alamitos.
Winkler, K. & Spiliopoulou, M. (2003a). "Text Mining in der Wettbewerberanalyse: Konvertierung von Textarchiven in XML-Dokumente". In Proceedings der 6. Konferenz der SAS Anwender in Forschung und Entwicklung, pages , Shaker Verlag, Aachen, Germany.
Winkler, K. (2003b). "Technical Report - Getting Started with DIAsDEM Workbench 2.1". A Case-Based Approach Technical Report, 121 pages. HHL - Leipzig Graduate School of Management.

