
1 Random Forest
Dr John Mitchell (Chemistry, St Andrews, 2019)

2 Random Forest A Machine Learning Method

3 Random Forest A Machine Learning Method This is a decision tree.

4 Random Forest A decision tree is like a flow chart

5 Random Forest A Machine Learning Method
Let’s visualise the decision tree ...

6 Random Forest A Machine Learning Method ... as a flow chart.

7 Random Forest A Machine Learning Method
In detail, it looks like this.

8 Random Forest A Machine Learning Method I came across Random Forest in the context of its application to chemical problems, that is, chemoinformatics (or cheminformatics; the variant spellings are equivalent).

9 Encoding structure as features
Mapping features to property

10 I refer to the entities about which predictions are to be made as items. In the context of chemistry, they are usually molecules. Each row of this matrix represents an item.

11 Each item is actually encoded by its descriptors
The terms feature and descriptor are synonymous. Each column of the matrix contains the values of one descriptor for each of the different items.

12 And each row of the matrix contains all the descriptors for one item.

13 Mapping features to property
The thing being predicted for each item is the property (output property). In this picture, it’s aqueous solubility.
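To make the layout concrete, here is a minimal sketch in R of such a matrix of items and descriptors; the descriptor names echo those used later in the talk, but the molecules and values are purely illustrative.

# One row per item (molecule), one column per descriptor,
# plus a separate vector holding the property to be predicted.
X <- data.frame(
  MW     = c(180.2, 151.2, 206.3),   # molecular weight
  LogP   = c(1.2, 0.9, 3.5),         # lipophilicity
  nHbond = c(4, 3, 1)                # hydrogen-bond count
)
y <- c(-1.7, -1.0, -3.9)             # the property, e.g. aqueous solubility as logS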

14 Classification is when the possible outputs of the prediction are discrete classes, so we are trying to put items into the correct pigeon hole, like [TRUE or FALSE], or like [RED, GREEN, or BLUE].

15 Regression is when the possible outputs of the prediction are continuous numerical values, so we are trying to predict as accurately as possible. We normally measure this with the root mean squared error, and also look at the correlation coefficient.
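As a minimal sketch, the two regression metrics mentioned above can be computed in R as follows, assuming y_obs and y_pred are numeric vectors of observed and predicted values.

rmse <- sqrt(mean((y_obs - y_pred)^2))   # root mean squared error
r    <- cor(y_obs, y_pred)               # Pearson correlation coefficient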

16 Random Forest A Machine Learning Method
A single decision tree can indeed make a decent classifier, but there’s an easy way to improve upon this …

17 Video explanation of wisdom of crowds

18 Wisdom of Crowds Francis Galton (1907) described a competition at a country fair, where participants were asked to estimate the mass of a cow. Individual entries were not particularly reliable, but Galton realised that by combining these guesses a much more reliable estimate could be obtained.

19 Wisdom of Crowds Guess the mass of the cow: the median of the individual guesses is a good estimator. Francis Galton, Vox populi, Nature, 75 (1907).

20 Wisdom of Crowds This is an ensemble predictor which works by aggregating individual independent estimates, and generates a result that is more reliable than the individual guesses and more accurate than the large majority of them.

21 Random Forest A Machine Learning Method
Rather than having just one decision tree, we use lots of them to make a forest.

22 Random Forest Multiple trees are only useful if not identical!
So make them randomly different.

23 Random Forest So we randomly choose the data items for each tree.
Unlike a cup draw, it’s done with replacement: we choose N items out of N for each tree, but an item can be repeated. The resulting set of N non-unique items is known as a bootstrap sample.

24 Randomly choose data for each tree.

25 Random Forest This kind of with-replacement selection gives what’s known as a bootstrap sample. On average, a fraction 1/e, about 37%, of the items are not sampled by a given single tree. These form the “out of bag” set for that tree. The “out of bag” data are useful for validation.
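A minimal R sketch of drawing one bootstrap sample and finding its out-of-bag set, assuming N training items; the numbers here are illustrative.

set.seed(1)                                 # for reproducibility
N <- 1000                                   # number of items in the training data
in_bag <- sample(N, N, replace = TRUE)      # N draws with replacement; repeats allowed
oob    <- setdiff(seq_len(N), in_bag)       # items never drawn: the out-of-bag set
length(oob) / N                             # typically close to 1/e, i.e. about 0.37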

26 Random Forest We also randomly choose questions to ask of the data.
At each node, this is based on a fresh random sample of mtry of the descriptors. The descriptor used at each split is selected so as to optimise splitting or minimise training error, given that the split values (e.g. MW > 315) have already been optimised for each of the mtry available descriptors.
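Conceptually (this is not the package’s internal code), choosing one split might look like the following R sketch, where X is the descriptor matrix, y the property and mtry the subset size; the cut value here is a simple placeholder rather than a properly optimised split.

candidates <- sample(colnames(X), mtry)      # fresh random subset of descriptors at this node
best <- NULL
for (d in candidates) {
  cut <- median(X[[d]])                      # placeholder; real code optimises the cut value
  err <- sum((y - ave(y, X[[d]] > cut))^2)   # training error after splitting on d at cut
  if (is.null(best) || err < best$err) best <- list(descriptor = d, cut = cut, err = err)
}
best                                         # descriptor and threshold used at this node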

27 Random Forest Here’s an example of such a tree.
[Tree diagram: splits on the descriptors LogP, SMR, VSA, MW and nHbond (split values include 3.05, 255, 7.4, 3 and 315), leading to Nodes 1-6.]

28 Random Forest Here’s an example of such a tree.
[Same tree diagram.] Question at Node 1: Is the molecular weight > 315? If true go to Node 4; if false go to Node 3.

29 Random Forest The building of the decision trees is the training phase of the Random Forest algorithm. Once the trees are built, the query items are passed through each decision tree. Which node they end up at depends on their descriptor values. This node determines the tree’s individual prediction.

30 [Figure, panels A-E.] John Mitchell, Machine learning methods in chemoinformatics, WIREs Comput. Mol. Sci., 4 (2014)

31 Random Forest: Consensus
For a classification problem, the trees vote for the class to assign the object to.

32 Random Forest: Consensus
For a regression problem, the trees each predict a numerical value, and these are averaged.
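As a minimal sketch of the consensus step: tree_classes and tree_values below are hypothetical matrices of per-tree predictions (one row per query item, one column per tree); the randomForest package performs this aggregation internally.

class_consensus <- apply(tree_classes, 1,
                         function(v) names(which.max(table(v))))  # majority vote per item
value_consensus <- rowMeans(tree_values)                          # average prediction per item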

33 Random Forest So let’s summarise what we’ve said about Random Forest.

34 Random Forest Introduced by Leo Breiman and Adele Cutler (early 2000s)
A development of Decision Trees (Recursive Partitioning). Random Forest can be used for either classification or regression; in the latter case the trees are regression trees.
Leo Breiman, Random Forests, Machine Learning, 45, 5-32 (2001)
Vladimir Svetnik, Andy Liaw, et al., Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling, J Chem Inf Comput Sci, 43 (2003)

35 Random Forest Introduced by Breiman and Cutler (2001)
Development of Decision Trees (Recursive Partitioning):
The dataset is partitioned into consecutively smaller subsets (of similar property value).
Each partition is based upon the value of one descriptor.
The descriptor used at each split is selected so as to minimise the error.
The tree is not pruned.
Leo Breiman, Random Forests, Machine Learning, 45, 5-32 (2001)
Vladimir Svetnik, Andy Liaw, et al., Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling, J Chem Inf Comput Sci, 43 (2003)

36 Random Forest (Classification)
Coupled Ensemble of Decision Trees. Each tree is trained:
from a bootstrap sample of the data (in situ “out-of-bag” cross-validation);
without pruning back (for classification, typically nodesize=1);
from a subset of descriptors at each split (for classification, typically mtry = SQRT(no. of descriptors), ntree = 500).
Advantages: improved accuracy; a method for descriptor selection; no overfitting; easy to train; human interpretable, not a black box.
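A minimal example of training such a classifier with the randomForest R package, using the parameters quoted on this slide; X (the descriptor matrix) and class_labels (a factor of classes) are assumed to exist.

library(randomForest)
rf_class <- randomForest(x = X, y = class_labels,
                         ntree = 500,                   # number of trees
                         mtry  = floor(sqrt(ncol(X))),  # descriptors tried at each split
                         nodesize = 1)                  # classification trees grown out fully
rf_class                                                # printing shows the out-of-bag error estimate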

37 Random Forest (Regression)
[Example regression tree diagram, as on slide 27.] Coupled Ensemble of Regression Trees. Each tree is trained:
from a bootstrap sample of the data (in situ “out-of-bag” cross-validation);
without pruning back (for regression, typically nodesize=5);
from a subset of descriptors at each split (for regression, typically mtry = (no. of descriptors)/3, ntree = 500).
Advantages: improved accuracy; a method for descriptor selection; no overfitting; easy to train; human interpretable, not a black box.
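The corresponding regression call, again with the defaults quoted above; X, a numeric property vector y (e.g. logS) and a query descriptor matrix X_query are assumed.

library(randomForest)
rf_reg <- randomForest(x = X, y = y,
                       ntree = 500,
                       mtry  = max(1, floor(ncol(X) / 3)),  # ~ (no. of descriptors)/3 per split
                       nodesize = 5)                        # minimum terminal node size
pred <- predict(rf_reg, newdata = X_query)                  # each tree predicts; results are averaged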

38 Random Forest (Summary)
Random Forest is a collection of Decision Trees grown with the CART algorithm.
Standard Parameters:
Needs a moderately large number of trees; I’d suggest at least 100, and generally 500 trees is plenty.
No pruning back; minimum node size of 5 (for regression).
mtry descriptors tried at each split.
Can quantify descriptor importance, so incorporates descriptor selection.
Incorporates “out-of-bag” validation.

39 Random Forest (variants)
Bagging If we allow each split to use any of the available descriptors, rather than a randomly chosen subset, then Random Forest is equivalent to Bagging.
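In the randomForest package this amounts to setting mtry to the total number of descriptors, as in this sketch (X and y assumed as before).

library(randomForest)
bagging_model <- randomForest(x = X, y = y, ntree = 500,
                              mtry = ncol(X))   # every descriptor available at every split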

40 Random Forest (variants)
ExtraTrees The ExtraTrees variant (“extremely randomized trees”) uses all N items for each tree. It also chooses the possible split values for a given descriptor at a node randomly (RF, in contrast, makes them as good as possible); ExtraTrees does, however, then choose the best descriptor to carry out the split with.

41 Research Application: Computing Solubility

42 Which would you prefer ... or ?

43 Which would you prefer ... or ?
Solubility in water (and other biological fluids) is highly desirable for pharmaceuticals!

44 Solubility is an important issue in drug discovery and a major cause of failure of drug development projects.
Expensive for the pharma industry.
Patients suffer a lack of available treatments.
A good computational model for predicting the solubility of druglike molecules would be very valuable.

45 Solubility You might think that “How much solid compound dissolves in 1 litre of water” is a simple question to answer. However, experiments are prone to large errors. Solution takes time to reach equilibrium, and results depend on pH, temperature, ionic strength, solid form, impurities etc.

46 Humankind vs The Machines
Sam Boobier, Anne Osborn & John Mitchell, Can human experts predict solubility better than computers? J Cheminformatics, 9:63 (2017) Image: scmp.com

47 Humankind vs The Machines
Challenge is to predict solubilities of 25 molecules given 75 as training data.

48

49 Humankind vs The Machines
Sent 229 emailed invitations to subject experts and students. Obtained 22 anonymous responses; of those, 17 made full sets of predictions.

50 Humankind vs The Machines
10 machine learning algorithms were given the same training & test sets as the human panel.

51 [Chart: 0.99 vs 0.94; difference not significant.] Sam Boobier, Anne Osborn & John Mitchell, Can human experts predict solubility better than computers? J Cheminformatics, 9:63 (2017)

52 Machine Learning Algorithms Ranked
[Ranking chart: 1st and 2nd places marked.]

53 Another Layer of Wisdom of Crowds
We don’t know in advance which predictors will be good and which will be poor. However, we can make an algorithm that will allow us to generate a good (consensus) prediction without prior knowledge of results.

54 Wisdom of Crowds: Human Consensus Predictor
Guess for the solubility of the molecule: Median of all (between 17 & 21) individual human guesses of logS0 for a given compound.

55 Wisdom of Crowds: Machine Consensus Predictor
Guess for the solubility of the molecule: Median of all 10 individual machine guesses of logS for a given compound.
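A minimal sketch of both consensus predictors in R, assuming human_guesses and machine_guesses are matrices of logS predictions with one row per compound and one column per human or per algorithm (these matrix names are hypothetical).

human_consensus   <- apply(human_guesses,   1, median, na.rm = TRUE)  # 17-21 guesses per compound
machine_consensus <- apply(machine_guesses, 1, median)                # 10 algorithms per compound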

56 [Chart: 1.09.] Sam Boobier, Anne Osborn & John Mitchell, Can human experts predict solubility better than computers? J Cheminformatics, 9:63 (2017)

57 [Chart: 1.14 vs 1.09; difference not significant.] Sam Boobier, Anne Osborn & John Mitchell, Can human experts predict solubility better than computers? J Cheminformatics, 9:63 (2017)

58 Conclusions: Humans v ML
Best humans and best algorithms perform almost equally;
Consensus of humans and consensus of algorithms perform almost equally;
Less effective individual human predictors are notably weaker;
Both humans and ML are numerically clearly better than a physics-based first-principles theory approach.*
* On a similar but non-identical dataset; David Palmer, James McDonagh, John Mitchell, Tanja van Mourik & Maxim Fedorov, First-Principles Calculation of the Intrinsic Aqueous Solubility of Crystalline Druglike Molecules, J Chem Theory Comput, 8 (2012)

59 RF & other ML Methods for Solubility
Expt data: errors unknown (in logS0 units?) but they limit the possible accuracy of models;
Differences in dataset size and composition often hinder comparisons of methods;
ML is numerically better than first principles (but FP is not widely validated), at the cost of less insight.

60 Descriptor Importance
Replace each descriptor in turn with random noise. Measure how much worse randomising this descriptor makes the prediction error. The more damaging the loss of the descriptor’s information, the higher its importance. We can also measure the same effect by looking instead at node purity.
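The randomForest package reports both kinds of measure: a permutation-based importance (%IncMSE for regression) and the node-purity measure (IncNodePurity). A minimal sketch, assuming X and y as before:

library(randomForest)
rf_reg <- randomForest(x = X, y = y, ntree = 500, importance = TRUE)
importance(rf_reg)    # one row per descriptor, both importance measures
varImpPlot(rf_reg)    # quick visual ranking of the descriptors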

61 Platforms Photo by Richard Webb

62 Platforms I use the randomForest package in R.
Random Forest implementations in Python are also widely available. You’ll probably find available implementations for your own favourite language and platform.

63 Thanks Tanja van Mourik (St Andrews), Neetika Nath, James McDonagh (now IBM), Rachael Skyner (now Diamond, Oxford), Sam Boobier (now Leeds), Will Kew (now Edinburgh) Maxim Fedorov, Dave Palmer (Strathclyde) Laura Hughes (now Stanford), Toni Llinas (AZ), Anne Osbourn (JIC, Norwich)

