Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Mining CS 341, Spring 2007 Final Project: presentation & report & codes.

Similar presentations


Presentation on theme: "Data Mining CS 341, Spring 2007 Final Project: presentation & report & codes."— Presentation transcript:

1 Data Mining CS 341, Spring 2007 Final Project: presentation & report & codes

2 2 Judgment File for Testing Data is Ready to Download http://www.mtholyoke.edu/courses/xli/341/ project/test04-complete-rel.txt

3 3 What’s Uue on Monday (May 7 th ) Your output files for the 25 testing queries Your output files for the 25 testing queries –by 12:00am Monday (two electronic files via email) 15-minute PowerPoint presentation 15-minute PowerPoint presentation –In class A written (typed) report A written (typed) report –In class Codes (a hard copy of the programs you wrote for this project) Codes (a hard copy of the programs you wrote for this project) –In class

4 4 http://www.mtholyoke.edu/courses/xli/341/project/ sameple_output.txt 76 Q0 NYT19980722.0197-19 1 24.5831 1 76 Q0 NYT19991019.0443-14 2 23.4941 1 76 Q0 APW20000405.0059-15 3 21.891 1 76 Q0 APW20000405.0092-17 4 21.891 0 76 Q0 APW20000406.0056-26 5 21.891 0 76 Q0 NYT19980722.0197-22 6 21.7889 0... 100 Q0 APW20000115.0009-6 12 19.3636 1 100 Q0 APW20000114.0238-6 13 19.3636 1 100 Q0 APW20000114.0263-6 14 19.3636 1 100 Q0 NYT20000421.0179-10 15 19.2492 1 100 Q0 APW20000406.0056-6 16 18.963 0

5 5 A Written Report on the Final Project Problem Statement Problem Statement Problem Analysis Problem Analysis Techniques/Approaches (what, why and how) Techniques/Approaches (what, why and how) Implementation Implementation Experimental Results and Analysis Experimental Results and Analysis Lessons Learned Lessons Learned

6 6 Good Luck!

7 7

8 8 Final project Goal: Apply available data mining techniques to solve real world problem. Goal: Apply available data mining techniques to solve real world problem. Requirement: Requirement: –Apply two techniques/algorithm and implement at least one algorithm. (Find existing codes online or team up with your classmates) Problem: Relevant sentence retrieval Problem: Relevant sentence retrieval –Retrieve the set of relevant sentences given a query and a collection of documents

9 9 An Example Query: India and Pakistan Nuclear Tests India and Pakistan Nuclear Tests Description: Description: »On May 11 and 13, 1998 India conducted five nuclear tests; Pakistan responded by detonating six nuclear tests on May 28 and 30th. This nuclear testing was condemned by the international community. Narrative : Narrative : »Relevant documents mention the nuclear testing conducted in May 1998 by both India and Pakistan. Historical information about the antagonism and rivalry between the two countries is not relevant. Mention of the furor created around the world by these detonations is relevant.

10 10 An Example Document XIE19980529.0045-1 XIE19980529.0045-1 XIE19980529.0045 XIE19980529.0045 XIE19980529.0045-2 XIE19980529.0045-2 1998-05-29 1998-05-29 XIE19980529.0045-3 XIE19980529.0045-3 France, Canada Condemn Pakistani Nuclear Tests France, Canada Condemn Pakistani Nuclear Tests XIE19980529.0045-4 XIE19980529.0045-4 NEW YORK, May 28 (Xinhua) -- More countries have come out to condemn Pakistan's nuclear tests. NEW YORK, May 28 (Xinhua) -- More countries have come out to condemn Pakistan's nuclear tests.

11 11 An Example Document XIE19980529.0045-5 XIE19980529.0045-5 The French Foreign Ministry issued a communique Thursday to deplore and condemn the nuclear tests conducted by Pakistan on the same day. The French Foreign Ministry issued a communique Thursday to deplore and condemn the nuclear tests conducted by Pakistan on the same day. XIE19980529.0045-6 XIE19980529.0045-6 France calls on both India and Pakistan not to conduct any more nuclear tests but to sign the Comprehensive Test Ban Treaty and join talks on the banning of production of fissile materials that can be used to produce nuclear arms, said the communique. France calls on both India and Pakistan not to conduct any more nuclear tests but to sign the Comprehensive Test Ban Treaty and join talks on the banning of production of fissile materials that can be used to produce nuclear arms, said the communique.

12 12 An Example Document XIE19980529.0045-7 XIE19980529.0045-7 Canadian Foreign Affairs Minister Lloyd Axworthy said in a statement released Thursday, "We continue to urge Pakistan and India to renounce their nuclear weapons programs and to sign the Nuclear Non-Proliferation Treaty and the Comprehensive Test Ban Treaty." Canadian Foreign Affairs Minister Lloyd Axworthy said in a statement released Thursday, "We continue to urge Pakistan and India to renounce their nuclear weapons programs and to sign the Nuclear Non-Proliferation Treaty and the Comprehensive Test Ban Treaty." XIE19980529.0045-8 XIE19980529.0045-8 He also announced a series of sanctions against Pakistan, which he said are consistent with those imposed on India after its nuclear tests. He also announced a series of sanctions against Pakistan, which he said are consistent with those imposed on India after its nuclear tests.

13 13 Training and Testing Training set Training set –25 queries –Collections of documents –Relevance judgment of each sentence Testing set Testing set –Another 25 queries –Collections of documents

14 14 Questions to Be Considered: How do you represent queries and sentences? How do you represent queries and sentences? What features may be considered for the task? What features may be considered for the task? What data mining techniques can be used and how? What data mining techniques can be used and how? How do you evaluate the performance of your system? How do you evaluate the performance of your system?

15 15 Evaluation: Precision and Recall Precision and Recall The F measure The F measure –F = 2*Precision*Recall/(Precision + Recall) Average Precision for query Average Precision for query –Calculate by averaging precision as recall increases. Mean Average Precision Mean Average Precision

16 16 Data Mining Techniques Classification Classification Clustering Clustering Association Rules Association Rules

17 17 Fuzzy c-means clustering Each point has a degree of belonging to clusters, rather than belonging completely to just one cluster Each point has a degree of belonging to clusters, rather than belonging completely to just one cluster For each point x we have a coefficient giving the degree of being in the kth cluster uk(x) For each point x we have a coefficient giving the degree of being in the kth cluster uk(x)

18 18 Fuzzy c-means clustering The centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster: The centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster: The degree of belonging is related to the inverse of the distance to the cluster The degree of belonging is related to the inverse of the distance to the cluster The coefficients are normalized with a real parameter m > 1 so that their sum is 1. The coefficients are normalized with a real parameter m > 1 so that their sum is 1.

19 19 Fuzzy c-means clustering Choose a number of clusters. Choose a number of clusters. Assign randomly to each point coefficients for being in the clusters. Assign randomly to each point coefficients for being in the clusters. Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) : Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) : –Compute the centroid for each cluster, using the formula above. –For each point, compute its coefficients of being in the clusters, using the formula above.

20 20 Next Class (Wednesday April 11 th ) A 10-minute presentation on your project design, including A 10-minute presentation on your project design, including –(1) Problem statement(s) –(2) Your technical approaches (I & II) –(3) A work plan –(4) Anticipated results –(5) Evaluation metrics


Download ppt "Data Mining CS 341, Spring 2007 Final Project: presentation & report & codes."

Similar presentations


Ads by Google