Sequence encoding, Cross Validation Morten Nielsen BioSys, DTU


1 Sequence encoding, Cross Validation Morten Nielsen BioSys, DTU

2 Outline
Sequence encoding: how to represent biological data
Overfitting and cross-validation
Method evaluation

3 Sequence encoding
Encoding of sequence data:
Sparse encoding
Blosum encoding
Sequence profile encoding
Reduced amino acid alphabets

4 Sparse encoding
[Table: sparse (one-hot) encoding. Each amino acid (A, R, N, D, C, Q, E, ...) is represented by 20 input neurons: a 1 in its own position and 0 in the other 19.]
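A minimal sketch of sparse encoding in Python; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

# The 20 standard amino acids in one-letter code; the ordering is a convention.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def sparse_encode(peptide):
    """One-hot encode a peptide: each residue becomes a 20-dimensional 0/1 vector."""
    enc = np.zeros((len(peptide), 20))
    for pos, aa in enumerate(peptide):
        enc[pos, AA_INDEX[aa]] = 1.0
    return enc

# A 9-mer peptide becomes a 9 x 20 matrix, i.e. 180 input values.
print(sparse_encode("ALAKAAAAM").shape)   # (9, 20)
```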

5 BLOSUM encoding (Blosum50 matrix)
[Table: the 20 x 20 Blosum50 substitution matrix, rows and columns ordered A R N D C Q E G H I L K M F P S T W Y V. Each amino acid is encoded by its row of 20 Blosum50 scores.]

6 Sequence encoding (continued)
Sparse encoding: V·L = 0 (unrelated; different amino acids share no dimensions)
Blosum encoding: V·L gives a high dot product (highly related), while V·R gives a much lower one (close to unrelated)
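A sketch of the two encodings and their dot products, assuming the Blosum50 matrix is loaded through Biopython's substitution_matrices module; that choice of library and the helper names are assumptions, not part of the slides:

```python
import numpy as np
from Bio.Align import substitution_matrices  # Biopython

ALPHABET = "ARNDCQEGHILKMFPSTWYV"
BLOSUM50 = substitution_matrices.load("BLOSUM50")

def sparse_encode_aa(aa):
    """Sparse (one-hot) encoding of a single amino acid."""
    return np.array([1.0 if aa == other else 0.0 for other in ALPHABET])

def blosum_encode_aa(aa):
    """Blosum encoding: the amino acid's row of 20 Blosum50 scores."""
    return np.array([float(BLOSUM50[aa, other]) for other in ALPHABET])

# Sparse encoding: V and L share no dimensions, so the dot product is 0 (unrelated).
print(np.dot(sparse_encode_aa("V"), sparse_encode_aa("L")))

# Blosum encoding: V.L is high (related residues), V.R is much lower (close to unrelated).
print(np.dot(blosum_encode_aa("V"), blosum_encode_aa("L")))
print(np.dot(blosum_encode_aa("V"), blosum_encode_aa("R")))
```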

7 Sequence encoding (continued)
Each amino acid is encoded by 20 variables
This may be highly inefficient
Can this number be reduced without losing predictive performance?
One option is a reduced amino acid alphabet based on charge, volume, hydrophobicity, or other chemical descriptors (a small example follows below)
Appealing, but in my experience it does not work :-)
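For concreteness, a hypothetical reduced-alphabet grouping might look like the sketch below; the grouping itself is illustrative (the slides do not define one), and as noted above it did not help in practice:

```python
# A hypothetical grouping of the 20 amino acids into 5 crude physico-chemical classes.
REDUCED_CLASSES = {
    "hydrophobic": "AVLIMFWC",
    "polar":       "STNQYG",
    "positive":    "KRH",
    "negative":    "DE",
    "proline":     "P",
}
AA_TO_CLASS = {aa: c for c, aas in REDUCED_CLASSES.items() for aa in aas}

def reduced_encode(peptide):
    """Map each residue to a class index: 5 variables per position instead of 20."""
    classes = list(REDUCED_CLASSES)
    return [classes.index(AA_TO_CLASS[aa]) for aa in peptide]

print(reduced_encode("ALAKAAAAM"))
```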

8 Evaluation of predictive performance
A prediction method contains a very large set of parameters
A matrix for predicting binding of 9-meric peptides has 9 x 20 = 180 weights
Overfitting is a problem
[Figure: temperature versus years, illustrating overfitting]

9 Evaluation of predictive performance
Train a PSSM on the raw data: no pseudo counts, no sequence weighting
Fit 9 x 20 = 180 parameters to 9 x 10 = 90 data points (the 10 binders, 9 positions each)
Evaluate on the training data: PCC = 0.97, AUC = 1.0
Close to a perfect prediction method
Binders: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Non-binders: MRSGRVHAV VRFNIDETP ANYIGQDGL AELCGDPGD QTRAVADGK GRPVPAAHP MTAQWWLDA FARGVVHVI LQRELTRLQ AVAEEMTKS
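A sketch of this experiment in Python, using the peptides listed on the slide; the PSSM construction details here (flat background frequencies, log2 scores, a small epsilon to avoid log(0)) and the use of scikit-learn for the AUC are assumptions, not necessarily the exact recipe behind the numbers on the slide:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

ALPHABET = "ARNDCQEGHILKMFPSTWYV"

binders = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
           "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]
non_binders = ["MRSGRVHAV", "VRFNIDETP", "ANYIGQDGL", "AELCGDPGD", "QTRAVADGK",
               "GRPVPAAHP", "MTAQWWLDA", "FARGVVHVI", "LQRELTRLQ", "AVAEEMTKS"]

def train_pssm(peptides, eps=1e-3):
    """Raw log-odds PSSM: no pseudo counts, no sequence weighting, flat background."""
    length = len(peptides[0])
    counts = np.full((length, 20), eps)          # eps only to avoid log(0)
    for pep in peptides:
        for pos, aa in enumerate(pep):
            counts[pos, ALPHABET.index(aa)] += 1.0
    freqs = counts / counts.sum(axis=1, keepdims=True)
    return np.log2(freqs / 0.05)                 # 0.05 = flat background frequency

def score(pssm, pep):
    """Score a peptide as the sum of its per-position PSSM weights."""
    return sum(pssm[pos, ALPHABET.index(aa)] for pos, aa in enumerate(pep))

pssm = train_pssm(binders)
scores = [score(pssm, p) for p in binders + non_binders]
labels = [1] * len(binders) + [0] * len(non_binders)
# Evaluating on the data the PSSM was trained on looks close to perfect.
print("training-data AUC:", roc_auc_score(labels, scores))
```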

10 Evaluation of predictive performance
Train a PSSM on permuted data: no pseudo counts, no sequence weighting
Fit 9 x 20 = 180 parameters to 9 x 10 = 90 data points
Evaluate on the training data: PCC = 0.97, AUC = 1.0
Close to a perfect prediction method, AND the same performance as on the original data
Binders (permuted): AAAMAAKLA AAKNLAAAA AKALAAAAR AAAAKLATA ALAKAVAAA IPELMRTNG FIMGVFTGL NVTKVVAWL LEPLNLVLK VAVIVSVPF
Non-binders: MRSGRVHAV VRFNIDETP ANYIGQDGL AELCGDPGD QTRAVADGK GRPVPAAHP MTAQWWLDA FARGVVHVI LQRELTRLQ AVAEEMTKS
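Continuing the sketch above (it reuses binders, non_binders, train_pssm, score, labels and roc_auc_score from there): shuffling the residues within each binder and retraining gives a model that still looks near-perfect on its own training data, which is exactly the overfitting the slide is warning about:

```python
import random

random.seed(1)
# Permute the residues within each binder, then retrain and re-evaluate on the training data.
permuted = ["".join(random.sample(p, len(p))) for p in binders]
pssm_perm = train_pssm(permuted)
scores_perm = [score(pssm_perm, p) for p in permuted + non_binders]
print("training-data AUC, permuted binders:", roc_auc_score(labels, scores_perm))
```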

11 Repeat on large training data (229 ligands)

12 Cross validation
Train on 4/5 of the data and test/evaluate on the remaining 1/5
=> This produces 5 different methods, each with a different prediction focus (see the sketch below)
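A minimal 5-fold cross-validation sketch using scikit-learn; the data here are synthetic placeholders and logistic regression is just an example model, neither comes from the slides:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 100 peptides, each already encoded as 9 x 20 = 180 features.
rng = np.random.default_rng(0)
X = rng.random((100, 180))
y = rng.integers(0, 2, 100)              # 1 = binder, 0 = non-binder

models, fold_predictions, fold_labels = [], [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on 4/5 of the data ...
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    models.append(model)                 # ... producing 5 different methods ...
    # ... and test/evaluate each on its held-out 1/5.
    fold_predictions.append(model.predict_proba(X[test_idx])[:, 1])
    fold_labels.append(y[test_idx])
```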

13 Method evaluation
Use cross-validation
Evaluate on the concatenated test data, not as an average over the per-fold cross-validated performances (see the sketch below)
And even better, use an external evaluation set that is not part of the training data
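Continuing the cross-validation sketch above (reusing fold_predictions and fold_labels), the two ways of summarising performance look like this, with AUC as an example measure:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Recommended: one performance value computed on the concatenated test partitions.
concat_auc = roc_auc_score(np.concatenate(fold_labels), np.concatenate(fold_predictions))

# Not recommended: the average of the 5 per-fold performances.
mean_auc = np.mean([roc_auc_score(l, p) for l, p in zip(fold_labels, fold_predictions)])

print("concatenated AUC:", concat_auc, "mean of per-fold AUCs:", mean_auc)
```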

14 Method evaluation
How is an external evaluation set evaluated against a 5-fold cross-validated training?
The cross-validation generates 5 individual methods
Predict the evaluation set with each of the 5 methods and take the average (sketched below)
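A continuation of the same sketch, with a synthetic external evaluation set (again a placeholder, not data from the slides); it reuses models and rng from the cross-validation block:

```python
# External evaluation set, encoded the same way as the training data.
X_eval = rng.random((25, 180))

# Predict with each of the 5 cross-validated models and average per peptide.
eval_pred = np.mean([m.predict_proba(X_eval)[:, 1] for m in models], axis=0)
print(eval_pred.shape)                   # one averaged prediction per external peptide
```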

15 Method evaluation

16 Method evaluation

17 5-fold training. Which PSSM to choose?

18 5-fold training. Use them all

19 Summary
Define which encoding scheme to use: sparse, Blosum, reduced alphabet, or combinations of these
Deal with overfitting
Evaluate the method using cross-validation
Evaluate external data using the method ensemble

