Slide 1 EE3J2 Data Mining EE3J2 Data Mining - revision Martin Russell.

Slide 1: EE3J2 Data Mining - revision, Martin Russell

Slide 2: Course structure
• Two main topics
  – Information retrieval
  – Data mining (including statistical modelling)

Slide 3: Revision questions
• Last year’s course basically same as this year, except:
  – In 2004 did more on sequence analysis (‘best matching subsequences’, etc) (small part of 1 question in 2004)
  – In 2005 included Gaussian Mixture Models (not included in 2004)
• 2004 exam is a good source of revision questions
• See also 2004 revision questions (on website)

Slide 4: Information retrieval
• Basic structure of IR system
• Zipf’s law
  – ‘bundle of words’ approach
  – Significance of Zipf for IR
  – ‘Resolving power’ of words
• Stemming
  – What is it? How is it done?
• Stop words
  – What are they?

Slide 5: IR (Continued)
• Matching queries with documents
  – Similarity
  – Inverse Document Frequency (IDF)
  – Term frequency
  – Term frequency – IDF weight
  – Definitions
  – Intuitive understanding
  – Know how to calculate these quantities
  – Document length
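As a revision aid for "know how to calculate these quantities", here is a minimal sketch of term frequency, IDF and the TF-IDF weight. It assumes the common definitions IDF(t) = log(N / n_t), where N is the number of documents and n_t the number containing term t, and weight = term frequency × IDF. The exact definitions used in the course may differ (for example in the base of the logarithm), and the function names and toy documents are illustrative only.

```python
import math
from collections import Counter


def idf(term, docs):
    """Inverse document frequency, assuming IDF(t) = log(N / n_t).

    docs is a list of documents, each a list of (already stemmed) terms.
    Returns 0.0 for a term that occurs in no document.
    """
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t) if n_t else 0.0


def tfidf_weights(doc, docs):
    """TF-IDF weight of every term in doc: raw term count times IDF."""
    counts = Counter(doc)
    return {t: f * idf(t, docs) for t, f in counts.items()}


# Hypothetical three-document collection.
docs = [
    ["data", "mining", "data"],
    ["information", "retrieval"],
    ["data", "retrieval"],
]
weights = tfidf_weights(docs[0], docs)
```

In this toy collection "mining" occurs in one document and "data" in two, so "mining" has the higher IDF even though "data" has the higher term frequency in the first document.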

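Slide 6: TF-IDF Similarity
• Define the similarity between query q and document d as:

  $$\mathrm{Sim}(q, d) = \frac{\sum_{t \in q \cap d} w_{tq}\, w_{td}}{\lVert q \rVert \, \lVert d \rVert}$$

  – Numerator: sum over all terms t in both q and d, where w_{tq} and w_{td} are the TF-IDF weights of t in q and in d
  – \lVert d \rVert: ‘length’ of document d
  – \lVert q \rVert: ‘length’ of query q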
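The similarity on this slide can be sketched as follows. This assumes the query and the document are each given as a dictionary mapping terms to TF-IDF weights, and that "length" means the Euclidean norm of the weight vector. Both are assumptions, and the function name `similarity` is illustrative.

```python
import math


def similarity(q_weights, d_weights):
    """Sum of weight products over shared terms, divided by both lengths."""
    shared = set(q_weights) & set(d_weights)
    dot = sum(q_weights[t] * d_weights[t] for t in shared)
    # 'Length' is taken here to be the Euclidean norm of the weight vector.
    len_q = math.sqrt(sum(w * w for w in q_weights.values()))
    len_d = math.sqrt(sum(w * w for w in d_weights.values()))
    if len_q == 0 or len_d == 0:
        return 0.0
    return dot / (len_q * len_d)


# Hypothetical weights for a query and a document sharing one term.
q = {"data": 1.0, "mining": 1.0}
d = {"data": 2.0, "retrieval": 2.0}
score = similarity(q, d)
```

Dividing by the two lengths stops long documents from scoring highly merely because they contain more terms.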

Slide 7: IR (Continued)
• The Document Index
  – What is it?
  – Why is it important?
• Assessing IR
  – Recall
  – Precision
  – Know the definitions, understand intuitively
  – Know how to calculate these
  – Precision-Recall graph
  – ROC curves
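A minimal sketch of how recall and precision are calculated, using the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The document IDs below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the system returned.
    relevant:  IDs of the documents that are actually relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 relevant, 2 in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10])
```

Computing the pair at successive cut-offs of a ranked result list gives the points of a precision-recall graph.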

Slide 8: IR (Continued)
• Topic spotting
  – Usefulness
  – Salience
  – What’s the difference? Relationship between them?
• Query expansion
  – Synonyms, hyponyms, hypernyms, antonyms
  – Thesauri, WORDNET
  – How to incorporate Query Expansion into similarity calculation

Slide 9: IR (Continued)
• LSA (Latent Semantic Analysis)
  – Vector representation of texts
  – Cosine similarity between vectors
  – Word-document matrix
  – Know what LSA is
  – Know how to do LSA (Singular Value Decomposition)
  – Understand LSA intuitively (so that you can explain it in words)
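For reference, the standard form of the decomposition behind LSA can be written as below. The symbols are assumptions rather than the course's own notation: W is the word-document matrix, U and V have orthonormal columns, S is the diagonal matrix of singular values in decreasing order, and k is the number of singular values retained.

```latex
W = U S V^{\mathsf{T}},
\qquad
W_k = U_k S_k V_k^{\mathsf{T}},
\qquad
\cos(\mathbf{a}, \mathbf{b}) =
  \frac{\mathbf{a} \cdot \mathbf{b}}
       {\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
```

Keeping only the k largest singular values gives the best rank-k approximation W_k in the least-squares sense. Intuitively, words that occur in similar documents are mapped close together in the reduced space, so two texts can have a high cosine similarity even when they share few exact terms.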

Slide 10: Data Mining
• Basic data analysis
  – Sample (vector) mean and variance
  – Covariance matrix
  – Interpretation of covariance matrix
• Principal Components Analysis (PCA)
  – What is it?
  – What is it for?
  – How do you do it?
  – Interpretation
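A small sketch of the sample mean, the covariance matrix and the first principal component, in plain Python. Two assumptions to note: the covariance is normalised by 1/N (some texts use 1/(N-1)), and the leading eigenvector is found by power iteration rather than a full eigendecomposition, which is a substitution made here to keep the example dependency-free. All names and data are illustrative.

```python
import math


def mean_vector(data):
    """Sample mean of a list of equal-length vectors."""
    n, dim = len(data), len(data[0])
    return [sum(x[i] for x in data) / n for i in range(dim)]


def covariance_matrix(data):
    """Sample covariance matrix, normalised by 1/N (an assumption)."""
    n, mu = len(data), mean_vector(data)
    dim = len(mu)
    return [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in data) / n
             for j in range(dim)] for i in range(dim)]


def first_principal_component(cov, iterations=200):
    """Unit eigenvector of cov with the largest eigenvalue (power iteration).

    Starts from the all-ones vector, so it would fail in the unlikely case
    that this start is orthogonal to the leading eigenvector.
    """
    dim = len(cov)
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break
        v = [c / norm for c in w]
    return v


# Hypothetical 2-D data lying exactly on the line y = x / 2.
data = [(0, 0), (2, 1), (4, 2), (6, 3)]
cov = covariance_matrix(data)
pc1 = first_principal_component(cov)
```

Because the toy data lie on a straight line, the first principal component points along that line and carries all of the variance.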

Slide 11: Data Mining (Continued)
• Statistical modelling
  – Random variable
  – Discrete and continuous variables
• Gaussian PDFs
  – Definition, multivariate (vector) Gaussian PDF
• Maximum Likelihood (ML) parameter estimation
  – ML estimation of Gaussian PDF parameters
• Gaussian Mixture PDFs
  – Definition
  – The E-M algorithm
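A minimal sketch of ML estimation for a one-dimensional Gaussian PDF. The ML estimate of the mean is the sample average, and the ML estimate of the variance is the average squared deviation from that mean (normalised by N, not N-1). The function names and the sample values are illustrative.

```python
import math


def ml_gaussian(samples):
    """ML estimates (mean, variance) of a 1-D Gaussian from samples."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n  # ML estimate divides by N
    return mu, var


def gaussian_pdf(x, mu, var):
    """Value of the 1-D Gaussian PDF with mean mu and variance var at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


# Hypothetical data set.
mu, var = ml_gaussian([1, 2, 3, 4, 5])
peak = gaussian_pdf(mu, mu, var)
```

For a single Gaussian these estimates are available in closed form. A Gaussian mixture has no closed-form ML solution, which is why an iterative procedure such as E-M is used there.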

Slide 12: Data Mining (Continued)
• Clustering
  – Motivation
  – Metrics (Euclidean metric)
  – Distortion (definition, intuition)
  – Centroids
• Agglomerative Clustering
  – What is it?
  – Simple example.
  – Strengths and weaknesses
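A simple agglomerative ("bottom up") clustering example, sketched under stated assumptions: every point starts as its own cluster, and at each step the two clusters whose centroids are closest in Euclidean distance are merged, until k clusters remain. Merging by centroid distance is one of several possible criteria and may not be the one used in the course. The names and data are illustrative.

```python
import math


def centroid(cluster):
    """Mean of the points in a cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))


def agglomerate(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]  # start: one cluster per point
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical 2-D points forming two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerate(points, 3)
```

Each merge is the best choice at that step only, and merges are never undone, so the final clustering is not guaranteed to minimise overall distortion.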

Slide 13: Data Mining (Continued)
• Divisive Clustering
  – What is it?
  – Simple example.
  – Strengths and weaknesses
• Decision tree interpretation
  – ‘top down’ vs ‘bottom up’
  – Local optimality

Slide 14: Data Mining (Continued)
• K-Means clustering
  – What is it? What does it do?
  – Know and understand the algorithm
  – Know how to write down the algorithm (maths!)
  – Know how to apply it to actual numerical data
  – Local optimality
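A sketch of K-means applied to numerical data, assuming the usual two-step iteration with the Euclidean metric: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points, and stop when the centroids no longer change. The initial centroids are supplied by the caller, and the names and data are illustrative.

```python
import math


def kmeans(points, centroids, iterations=100):
    """Run K-means from the given initial centroids.

    Returns (centroids, clusters), where clusters[i] lists the points
    assigned to centroids[i].
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(
                    sum(m[i] for m in members) / len(members)
                    for i in range(dim)))
            else:
                new_centroids.append(c)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups and two initial centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
```

Each iteration can only reduce (or leave unchanged) the total distortion, but the result depends on the initial centroids, which is the sense in which K-means is only locally optimal.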

Slide 15: Data Mining (Continued)
• Sequence analysis
  – Why is it important (examples)
• Distance between sequences
  – Insertion, deletion, substitution
  – Alignment path
  – Accumulated distance along a path
  – Optimal path
  – Dynamic programming

Slide 16: Data Mining (Continued)
• Dynamic Programming (DP)
  – Know the algorithm
  – Know how to write down the algorithm (maths)
  – Know how to apply DP to example sequences
  – Know:
  – Path trellis
  – Distance matrix
  – Accumulated distance matrix
  – Path matrix
  – Edit distance (what is it, how do you calculate it?)
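A minimal sketch of edit distance computed by dynamic programming. It fills an accumulated distance matrix in which each cell holds the cost of the best alignment path reaching it. The unit costs for insertion, deletion and substitution are an assumption, since the course may have used different penalties, and the function name is illustrative.

```python
def edit_distance(a, b, ins=1, dele=1, sub=1):
    """Edit distance between sequences a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    # acc[i][j] = accumulated distance of the best path aligning
    # the first i symbols of a with the first j symbols of b.
    acc = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        acc[i][0] = acc[i - 1][0] + dele   # delete every symbol of a
    for j in range(1, cols):
        acc[0][j] = acc[0][j - 1] + ins    # insert every symbol of b
    for i in range(1, rows):
        for j in range(1, cols):
            local = 0 if a[i - 1] == b[j - 1] else sub
            acc[i][j] = min(acc[i - 1][j - 1] + local,  # match / substitution
                            acc[i - 1][j] + dele,       # deletion
                            acc[i][j - 1] + ins)        # insertion
    return acc[-1][-1]


distance = edit_distance("kitten", "sitting")
```

Recording which of the three predecessors gave the minimum at each cell would yield the path matrix, from which the optimal alignment path can be traced back.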

Slide 17: Data Mining (Continued)
• Hidden Markov Models (HMMs)
  – What is an HMM? Definition (maths)
  – State-symbol trellis (know how to draw it for a given HMM and sequence)
  – Alignment paths (state sequences)
  – Optimal state sequence
  – Viterbi decoder
  – Know definition
  – Know how to apply it to example data
  – HMM training, local optimality
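A sketch of Viterbi decoding for a discrete HMM, assuming the model is given as dictionaries of initial, transition and emission probabilities. It works column by column through the state-symbol trellis, keeping for each state the probability of the best path that ends there and a back-pointer to the previous state. The two-state model, its symbols and its probabilities are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (optimal state sequence, its probability)."""
    # best[t][s]: probability of the best path ending in state s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((r, best[-1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1])
            scores[s] = score * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical two-state HMM over the symbols x, y, z.
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.5, "y": 0.4, "z": 0.1},
          "B": {"x": 0.1, "y": 0.3, "z": 0.6}}
path, prob = viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p)
```

For long sequences the products become very small, so practical implementations usually add log probabilities instead of multiplying probabilities.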

