
Machine Learning for Public Health in R

Presentation transcript:

1 Machine Learning for Public Health in R
Mike Dolan Fliss

2 Main tasks of Machine Learning
Clustering
Classification
Prediction / Regression – we do this sometimes, but it isn’t CI!
…But also: web scraping, text mining, data viz, etc.

3 Today: Methods sampler & applications
More unsupervised → more supervised:
K-means – we’ll do some!
Hierarchical clustering
Principal Component Analysis
Latent Class Analysis (^ not covering, but for categoricals)
K-nearest neighbors (KNN)
Support vector machines (SVM)
Recursive partitioning trees (RPT… and friends)

4 Many methods rely on distance. Distance how?
Continuous numeric? Euclidean (Manhattan, etc.) – dist()
Text? Levenshtein – adist(), etc.
Geometry? Spatial – st_distance(), or just x/y Euclidean in a pinch
Categorical? Binary (Jaccard distance), or dummify – library(dummies). Really! Basically indicator coding.
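
A quick sketch of these distance functions (the data here is made up for illustration):

x = data.frame(a = c(1, 4, 7), b = c(2, 2, 9))
dist(x)                        # Euclidean by default
dist(x, method = "manhattan")  # city-block distance
adist("Durham", "Duram")       # Levenshtein edit distance = 1 (one deletion)
f = factor(c("urban", "rural", "urban"))
m = model.matrix(~ f - 1)      # indicator ("dummy") coding in base R
dist(m, method = "binary")     # Jaccard-style binary distance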

5 Scaling
2 pounds vs. 2 feet vs. 2 inches
Standardization … but you may not want to artificially spread the data
Custom weighting
These decisions have consequences. Content knowledge helps!
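
A small sketch of these choices, on made-up height/weight data:

wt_lb = c(120, 180, 240)  # pounds
ht_in = c(60, 66, 72)     # inches
df = data.frame(wt_lb, ht_in)
scale(df)                        # standardization: mean 0, sd 1, so units are comparable
df_std = as.data.frame(scale(df))
df_std$ht_in = df_std$ht_in * 2  # custom weighting: height counts double (a content-knowledge call)
dist(df_std)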

6 Unsupervised Clustering

7 K-means

8 K-means (figure from Wikipedia, yay)

9 K-means

10 K-means Demo Find county clusters of exposure and outcome (two variables only). Assume 4 clusters.
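
A sketch of roughly what this demo runs, assuming a county_profile data frame with the pct_pnc5 (exposure) and pct_preterm (outcome) columns used on the hierarchical clustering slide below:

set.seed(1)  # k-means uses random starts
km4 = kmeans(county_profile[, c("pct_pnc5", "pct_preterm")], centers = 4)
county_profile$km_cluster = km4$cluster
ggplot(county_profile, aes(pct_pnc5, pct_preterm, color = as.factor(km_cluster))) +
  geom_point()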

11 K-means Demo
Let’s cluster on % of births, % preterm, % PNC.
Do you have an intuition about what’s going to happen?

12 K-means Demo
Let’s cluster on % of births, % preterm, % PNC.
Selecting the number of clusters can be done somewhat statistically… but also there’s the whole utility / communication thing.
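
One “somewhat statistical” option is an elbow plot of total within-cluster sum of squares (same assumed columns as above):

wss = sapply(1:10, function(k)
  kmeans(county_profile[, c("pct_pnc5", "pct_preterm")], centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k (clusters)", ylab = "Total within-cluster SS")
# look for the bend (the "elbow")... then weigh it against the utility / communication thing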

13 K-means Demo Have an intuition about what happened here?

14 Hierarchical Clustering
h_clust_data = county_profile[, 1:3]
row.names(h_clust_data) = county_profile$county_name
h_clusters = hclust(dist(h_clust_data))
county_profile$h_clusters_cut = cutree(h_clusters, k = 4)
# ^ k groups; h= for height cut
ggplot(county_profile, aes(pct_pnc5, pct_preterm,
                           color = as.factor(h_clusters_cut),
                           group = as.factor(h_clusters_cut))) +
  geom_point() + geom_density2d()
head(county_profile)
Agglomerative: each point starts as its own cluster, then nearest neighbors are merged step by step.

15 Principal Component Analysis
Reduce high dimensionality
Get linearly uncorrelated orthogonal vectors

county_pca = prcomp(county_profile[, 1:3], center = TRUE, scale. = TRUE)
plot(county_pca, type = "l")
summary(county_pca)
predict(county_pca)
library(ggfortify)
autoplot(county_pca, data = county_profile, label = TRUE, loadings = TRUE)

16 Support Vector Machines
Partition space using linear vectors
Can be single or “knotted” linear splines
May not be linearly separable with a hard margin; may need a “soft” margin
Least squares and equivalents can maximize the margin and minimize the error
Extendable as Support Vector Clustering (with kernels) to curved boundaries
(figures from Wikipedia)
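
Not shown in the deck, but a minimal SVM fit in R might use the e1071 package; this sketch assumes the births-level to_model frame built on a later slide:

library(e1071)
fit = svm(preterm_f ~ pnc5_f + smoker_f + raceeth_f + mage,
          data = to_model, kernel = "linear", cost = 1)  # lower cost = softer margin
table(predict(fit, to_model), to_model$preterm_f)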

17 Why “fish” for patterns? Why clustering?
Identification of hard-to-notice sub-groups for targeting interventions!
Imagine including median household income (MHHI) in the county model – which low-MHHI counties “perform well”? Which high-MHHI counties perform poorly, and why?
EMM and EMM-like questions
Collapse dimensions for story-telling!

19 Story-telling example: multi-drug ED visits

20 Story-telling example: multi-drug ED visits
fviz_cluster(rates_kmeans, data = rates_for_means) +  # fviz_cluster() is from factoextra
  geom_text(aes(label = rates$County, color = cluster))

21 Supervised Learning Examples

22 K Nearest Neighbors
Majority vote of your … K nearest neighbors!
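
A minimal sketch with class::knn(), assuming the 0–1 scaled to_model_01 frame built a few slides later (KNN wants numeric features on a common scale):

library(class)
X = to_model_01[, names(to_model_01) != "preterm_f"]
y = to_model$preterm_f
idx = sample(nrow(X), floor(0.7 * nrow(X)))  # simple train/test split
pred = knn(train = X[idx, ], test = X[-idx, ], cl = y[idx], k = 15)
table(pred, y[-idx])  # majority vote of the 15 nearest neighbors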

23 Supervised Learning / Prediction Examples

24 How does our CI model do in prediction?
Not well. Nearly a coin flip on preterm birth for many people. Why? Short discussion about it, leading into Sara’s talk next week…

25 Reasonable question: Can we predict preterm birth?
Not well with our CI GLM. Nearly a coin flip on preterm birth for many people. Why?
Could try…
Naïve k-means
RPT
More from Sara next week… no pressure. ;)

26
to_model = births[, c("pnc5_f", "preterm_f", "smoker_f", "raceeth_f", "cores", "mage")]
to_model = na.omit(to_model)
head(to_model)

scale_01 = function(x){
  x = as.numeric(x)
  x = (x - min(x, na.rm = T)) / (max(x, na.rm = T) - min(x, na.rm = T))
  return(x)
}
to_model_01 = data.frame(lapply(to_model, FUN = scale_01))
summary(to_model_01)

km = kmeans(to_model_01[, names(to_model_01) != "preterm_f"], 2)
to_model$cluster = km$cluster
table(to_model$cluster, to_model$preterm_f)
prop.table(table(to_model$cluster, to_model$preterm_f), margin = 2)
# ^ meh. cluster 1 is a little more preterm.

km2 = kmeans(to_model_01[, names(to_model_01) != "preterm_f"], 10)
to_model$cluster2 = km2$cluster
prop.table(table(to_model$cluster2, to_model$preterm_f), margin = 1) * 100
# meh. cluster 1 is a little more preterm.

to_model %>% group_by(cluster2) %>%
  summarise(n = n(), pct_preterm = sum(preterm_f == "preterm", na.rm = T) / n)

27
str(to_model)
tree1 = rpart(data = to_model,
              preterm_f ~ pnc5_f + smoker_f + raceeth_f + mage,
              method = "class",
              parms = list(split = "information"),
              control = rpart.control(minsplit = 2, minbucket = 1, cp = 0.0001))
# no interaction terms, though. dropped cores. need minbucket.
summary(tree1)
plot(tree1)
fancyRpartPlot(tree1)
to_model$rpart_pred = predict(tree1, to_model[, c("pnc5_f", "smoker_f", "raceeth_f", "mage")], "class")
to_model$rpart_pred
table(to_model$rpart_pred, to_model$preterm_f)
table(to_model$rpart_pred)  # DUH

28 Real-World ML Projects
Get more of a sense of ML in practice – you can do these!

29 Predicting Tobacco Retailer Characteristics
Web scraping – text mining – MTurk – …and tree-based classifiers

30 Background
Tobacco retailer licensing (TRL) is a main mechanism of control.
TRL enables area-based policies (e.g. no retailers within certain distances of schools, parks, or certain kinds of stores; maximum density; etc.) with evidence-based effects on health.
Unlike alcohol, not all states have a census of retail shops. Vaping is a new challenge here.
Building and maintaining custom lists of retailers is necessary.

31 Background
Tobacco retailer list generation and validation process: web-scraped retailers augment the dataset after machine-learning classification and imputation tasks, matching against known retailers, and validation of low-confidence data through Amazon MTurk.

32 Actual Data
Characteristics of the Counter Tools Tobacco Retailer Dataset: a 16,544-retailer subset of over 19,000 surveyed retailers with complete store type, store name, and tobacco-selling status at the time of analysis, representing 14 US states.

33 Web Scraping
Web-scraped stores vs. the Counter Tools (CT) dataset in Durham County: n=220 and n=218 respectively, with 52% of the scraped retailers linked to the CT dataset on individual hand review. Search results were generated centered on reverse-geocoded county subdivision centroids.

34 Text Mining
Term Frequency–Inverse Document Frequency (TF-IDF) is a measure of the statistical “unlikeliness” of words (or n-gram tokens) within a subset of a larger set, sometimes called the corpus.
Feature engineering is the process of combining data into the features fed to a model. For instance, in this case: how to combine TF-IDF scores for tokens into a score for the full store name.

35 Text Mining in R
#_____________________________________________
# Create n-grams ####
create_ngram_df = function(names_df, max_ngrams){
  results_df = data.frame(line = integer(0), token = character(0), n = integer(0), stringsAsFactors = F)
  for (n in 1:max_ngrams){
    df = names_df %>% unnest_tokens(token, name, token = "ngrams", n = n)  # note: expects tibble!
    df$n = n
    results_df = bind_rows(results_df, df)
  }
  return(results_df)
}
all_tokens = create_ngram_df(names_df, 5); head(all_tokens)
table(all_tokens$n)  # token counts
# Might be good to drop single-character words (N, E, I, 1, etc.)
# line is really id. might be nice to clarify that.
# Previous note: problem, unnests even characters within words for some reason (???)

# Create token count df ####
token_counts = count(all_tokens, token, sort = T) %>% arrange(desc(nn)) %>% mutate(rank = 1:n())
head(token_counts)  # count and sort

# Tokens > 100: Cloud...
wordcloud(token_counts$token, freq = token_counts$nn, min.freq = 300,
          colors = brewer.pal(9, "Blues")[5:9], scale = c(7, 1))

The tidytext package makes this very easy! Works in dplyr. Also see the tm and wordcloud packages.
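
The TF-IDF scores themselves can come from tidytext::bind_tf_idf(); a hedged sketch, assuming names_df carried a store-type column str_typ so each store type acts as a “document”:

library(tidytext); library(dplyr)
type_tokens = all_tokens %>% count(str_typ, token, name = "n_tok")
tfidf = type_tokens %>%
  bind_tf_idf(token, str_typ, n_tok) %>%
  arrange(desc(tf_idf))
head(tfidf)  # tokens most "unlikely" outside their own store type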

36 Classify
Many tools!
Multinomial logistic regression
Recursive partitioning trees
Random forest models
Smart adjustments to address overfitting and bad models:
Minimum leaf / branch size
Pruning branches by hand or code
Ensemble techniques (combining multiple weaker learners):
Bagging (Bootstrap AGGregation) – runs many small models on random samples with replacement and combines them
Boosting – upweighting what you get wrong
Random forests (see the sketch below)
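
For instance, a random forest (bagged trees plus random feature subsets) via the randomForest package; a sketch assuming the df_train / df_test split used on the next slide:

library(randomForest)
rf = randomForest(str_typ ~ ., data = df_train, ntree = 500, importance = TRUE)
table(predict(rf, df_test), df_test$str_typ)  # confusion matrix
varImpPlot(rf)  # which features matter most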

37 Classify
m1 = rpart(str_typ ~ ., data = df_train, method = 'class',
           control = rpart.control(minsplit = 2, cp = 0))
predictTest(m1, df_test)
conf_mat_rpart = predictTest(m1, df_test, confuMat = T)

Example: simplified decision tree for store type classification based on modified TF-IDF scores. Note that some tree-based classifiers may run multiple trees, or have too many branches and leaves to visualize effectively.

38 Classify in R
Multinomial & RPT store type classification models based on cleaned and tokenized store names. Includes sensitivity (Sn), specificity (Sp), and accuracy (Acc) metrics for store types and overall model sensitivity. Specificity and accuracy metrics were uninterpretable until the multi-category model and are omitted.

39 Linkage (sometimes called matching*)
Linkage can be considered a classification task: classifying pairs of observations as matches or not, using multiple distance measures as the features.
Can be done unsupervised; or, after an unsupervised pass is used to curate a human review set, it can be rerun as supervised with a true training and test set.
Smart techniques can decrease complexity / maximize efficiency.
Again, be careful with assumptions, feature engineering, and training.
More on this in the next project!

40 Validate / MTurk
Give small tasks to human reviewers.
Inter-Rater Reliability (IRR) and other tools are useful for assessing survey (and surveyor!) quality

41 Death by “Legal Intervention” / by Law Enforcement
Again, linkage as a classification problem

42 Research Intent
Public attention is focused on deaths by Legal Intervention* due to concerns over racial bias and unnecessary force.
AIM 1: describe the level of agreement between NVDRS ( ) and crowd-sourced data:
Mapping Police Violence (MPV)
The Guardian’s “The Counted”
Washington Post’s “Fatal Force”
AIM 2: describe demographics/circumstances where there is disagreement

43 Pilot First in NC-VDRS Data

Year   NC-VDRS   MPV
2013     31       34
2014     26       39

Small numbers, but sufficient to pilot our techniques
Awaiting updated 2014 NC-VDRS data
Used the model to inform suggestions for national datasets
…For more meaningful comparisons of difference, need cross-data linking

44 Linking
Required to assess characteristics of overlap and unknown / differently characterized relationships.
Building a linking model now on NC-VDRS, national Counted, and Mapping Police Violence data (with full names)
Using content-aware distance matrices
Feeding an eventual supervised machine-learning / tree model
Will then apply to national VDRS data (no name available for linkage)

45 Distance Matrices
For all elements of datasets A and B, fill an A×B matrix of the pairwise distance comparisons by some method. The minimum distances along a row or column suggest linking on that method.
A trivial example would be a matrix of 1s and 0s for exact string match, 0 being a perfect match (0 distance) and 1 representing not-a-match.
Efficiency note: for large datasets…

46 Distance Matrices
Simpler but non-trivial linking methods use approximate text distance (Levenshtein distance or others) on a fully concatenated string. Example:
A = “John Q Smith 2017/02/12 White Raleigh NC”
B = “John P Smith 2016/12/21 White Raleigh NC”
Number of substitutions, additions, and deletions to get from A to B = 5
But this doesn’t take advantage of the content (e.g. date distance or place distance).
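
A quick check of that example in R (adist() computes Levenshtein distance), plus a toy A×B matrix on hypothetical name vectors:

a = "John Q Smith 2017/02/12 White Raleigh NC"
b = "John P Smith 2016/12/21 White Raleigh NC"
adist(a, b)  # 5, as above

names_a = c("John Q Smith", "Jane Doe")
names_b = c("John P Smith", "Jane A Doe", "J Doe")
d = adist(names_a, names_b)  # 2 x 3 matrix of pairwise edit distances
apply(d, 1, which.min)       # best candidate in B for each record in A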

47 Distance Matrices
First build individual content-specific distance matrices for each element… (example distance range in parentheses)
Name: text distance (0–32)
Race-eth: binary (0–1)
Date: days different (0–707)
City: text distance (0–19), or geospatial distance
State: binary (0–1)

48 Distance Matrices
…Then collapse them (name, race-eth, date, city, state, etc.) into a single n-dimensional distance by any of a number of methods:
Unweighted sum
Scaled sum (0–1)
Normalized sum (~−3 to ~3)
These aggregate link indices perform well by themselves.
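
A sketch of one such collapse, with hypothetical element-wise matrices name_d, date_d, and city_d (each A×B): rescale each to 0–1, then sum into one aggregate link index.

scale01 = function(m) (m - min(m)) / (max(m) - min(m))
agg = scale01(name_d) + scale01(date_d) + scale01(city_d)  # the "scaled sum" option
best = apply(agg, 1, which.min)  # candidate link for each record in A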

49 Exploratory Tree Models
Predictive trees can also be used for categorization. They benefit from:
Maintaining separate distance-matrix information as decision nodes
Being able to “reuse” covariates if useful (natural interaction)
Not requiring a single a priori or parameterized generalized linear relationship

50 Exploratory Tree Models
With violent deaths / deaths by police being relatively rare at the local level, even without decedent name in the tree model, date distance (in days) and approximate text distance in the city name, by themselves, correctly categorize 99% of links between the Mapping Police Violence and The Counted datasets. Name is more useful when using the entire violent death dataset.
Treating dates numerically instead of as strings in a concatenated ID may have application to other death-linking projects.

51 Future Work
Expand from MPV: see how the model performs on The Counted and Fatal Force
Tweaks / stabilization for performance: if using a single aggregate index, consider stabilizing it (e.g. date distances over 30 days are capped at 30; long name distances are cut off)
Package helpers: built this “by hand”, but the RecordLinkage package in R uses regression trees. Easy to get started with! Good documentation and an associated journal paper. (A hedged sketch below.)
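
A hedged starter sketch with RecordLinkage (the data frames and field names here are assumptions):

library(RecordLinkage)
pairs = compare.linkage(df_a, df_b,
                        strcmp = TRUE,       # string-compare the text fields
                        blockfld = "state")  # blocking cuts the comparison space
pairs = epiWeights(pairs)                    # aggregate per-pair weights
summary(epiClassify(pairs, threshold.upper = 0.7))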

