
Eco 6380 Predictive Analytics For Economists Spring 2016 Professor Tom Fomby Department of Economics SMU.


1 Eco 6380 Predictive Analytics For Economists Spring 2016 Professor Tom Fomby Department of Economics SMU

2 Presentation 18 Brief Introduction to Text Mining And The Singular Value Decomposition Research Assistant: Chad Baldwin

3 Full SVD for a Term-Document Frequency Matrix

4 The i-th column of the matrix A, say a_i, contains the counts of each of the m terms in the i-th document, i = 1, 2, …, n. The m left-singular vectors of A are the columns of the U matrix; equivalently, they are the unit-length eigenvectors of the AA^T matrix. The right-singular vectors of A are the rows of the V^T matrix; equivalently, they are the unit-length eigenvectors of the A^T A matrix. Finally, the singular values arrayed along the diagonal of the Σ matrix are the square roots of the eigenvalues of AA^T (or, equivalently, of A^T A; the two matrices share the same nonzero eigenvalues). The singular values are arranged in Σ from largest to smallest running down the diagonal.
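The facts stated on this slide can be checked numerically. The sketch below uses a small invented 4-term x 3-document count matrix (the numbers are illustrative, not from the presentation) and verifies that the singular values are sorted in decreasing order, that their squares equal the eigenvalues of A^T A, and that the columns of U have unit length.

```python
import numpy as np

# Hypothetical term-document count matrix: m = 4 terms, n = 3 documents.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [1., 0., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values come back sorted from highest to lowest.
assert np.all(np.diff(s) <= 0)

# The squared singular values equal the eigenvalues of A^T A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(eigvals, s**2)

# The left-singular vectors (columns of U) are unit length.
assert np.allclose(np.linalg.norm(U, axis=0), 1.0)
```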

5 What we want to do is approximate the A matrix using the first k (k < n) columns of U, giving an m x k matrix U_k; the first k singular values of Σ, giving a k x k matrix Σ_k; and the first k rows of V^T, giving a k x n matrix V_k^T. The k-dimensional SVD approximation of A is therefore A_k = U_k Σ_k V_k^T.
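The truncation described above can be sketched in a few lines of numpy (the count matrix below is invented for illustration). The check at the end uses the Eckart-Young result: the Frobenius error of the rank-k approximation equals the square root of the sum of the discarded squared singular values.

```python
import numpy as np

# Hypothetical m = 4 by n = 3 term-document count matrix.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [1., 0., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# A_k = U_k Sigma_k V_k^T: (m x k) @ (k x k) @ (k x n) -> m x n.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert A_k.shape == A.shape

# Frobenius error equals sqrt of the sum of discarded squared singular values.
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:]**2)))
```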

6 Approximating the Term-Document Frequency Matrix

7 Let a_i denote the i-th column of the Term-Document Frequency matrix. In our SVD analysis, we want to compress the m x 1 term-count vector a_i for the i-th document to a k x 1 vector, say â_i = U_k^T a_i, whose first element is u_1^T a_i, second element is u_2^T a_i, …, and k-th element is u_k^T a_i, where u_j denotes the j-th left-singular vector. This gives the k-dimensional representation of the i-th document as â_i.
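A minimal sketch of this compression step, assuming the same invented count matrix as before: one document column a_i is projected onto the first k left-singular vectors, and each element of the result is the projection of a_i on one of those vectors.

```python
import numpy as np

# Hypothetical m = 4 by n = 3 term-document count matrix.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [1., 0., 2.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

a_i = A[:, 0]                 # term-count vector of document 1 (m x 1)
score_i = U[:, :k].T @ a_i    # compressed k x 1 representation
assert score_i.shape == (k,)

# Its j-th element is the projection of a_i on the j-th left-singular vector.
assert np.isclose(score_i[0], U[:, 0] @ a_i)
```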

8 By this SVD decomposition we have reduced the m x n Term-Document Frequency matrix to the k x n matrix SVD_k = U_k^T A. The columns of SVD_k (one per document) then supply k explanatory variables for classification models (such as the logistic model) that help us discriminate between documents, whether the document labels are binary, nominal, or ordinal, e.g., (0 = no fraud, 1 = fraud document), (1 = sports article, 0 = non-sports article), or (0 = document containing no praise, all criticism, 1 = document containing mixed praise and criticism, and 2 = document containing all praise).
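A hedged sketch of the classification step: the k SVD scores of each document become the explanatory variables in a logistic regression. The counts and labels below are invented for illustration only; the presentation itself does not specify an implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical data: m = 50 terms, n = 20 documents of Poisson counts.
A = rng.poisson(1.0, size=(50, 20)).astype(float)
y = np.arange(20) % 2        # illustrative binary labels, e.g. 0 = no fraud, 1 = fraud

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
# SVD_k = U_k^T A is k x n; transpose so each document is a row of features.
X = (U[:, :k].T @ A).T

clf = LogisticRegression(max_iter=1000).fit(X, y)
pred = clf.predict(X)
assert pred.shape == (20,)
```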

9 Some Comments The SVD scores are often supplemented with the document frequency counts of a few specific ("pet") words. For example, the Wall Street Journal has a text-mining index made up of a frequency count of the word "recession" in a set of financial publications. If the frequency count of that word rises in these financial documents, there is possibly a greater chance of an actual recession developing in the future. As with any input variables, one can initially apply a variable-selection routine to determine which "pet" words to include along with the SVD scores.
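One way to sketch the supplementing step: append the per-document count of a "pet" word as an extra column alongside the SVD scores. The vocabulary and counts below are invented for illustration.

```python
import numpy as np

terms = ["recession", "growth", "rates"]     # hypothetical vocabulary
A = np.array([[3., 0., 2.],                  # counts of "recession" per document
              [1., 2., 0.],
              [0., 1., 1.]])                 # m = 3 terms, n = 3 documents

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
svd_scores = (U[:, :k].T @ A).T              # n x k matrix of SVD scores

# Pet-word counts for "recession", one per document, as an n x 1 column.
pet = A[terms.index("recession"), :].reshape(-1, 1)
X = np.hstack([svd_scores, pet])             # n x (k + 1) feature matrix
assert X.shape == (3, k + 1)
```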

10 An Example Application of the SVD to Scoring the Federalist Papers of Unknown Authorship: SVD Transformation, Logistic Regression

11 (image slide; no text in transcript)

12 (image slide; no text in transcript)

13 Primary References Russ Albright, "Taming Text with the SVD," SAS Institute Inc., Cary, NC, russell.albright@sas.com, January 2004. "Mining Textual Data Using SAS Text Miner for SAS 9 Course Notes," SAS Institute Inc., Cary, NC.

