Sequence encoding, Cross Validation Morten Nielsen BioSys, DTU


1 Sequence encoding, Cross Validation Morten Nielsen BioSys, DTU

2 Outline
Sequence encoding: how to represent biological data
Overfitting and cross-validation
Method evaluation

3 Sequence encoding
Encoding of sequence data:
Sparse encoding
Blosum encoding
Sequence profile encoding
Reduced amino acid alphabets

4 Sparse encoding
[Table: sparse (one-hot) encoding. Each amino acid (A, R, N, D, C, Q, E, ...) is represented by 20 input neurons: a 1 in its own position and 0 in the other 19.]
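A minimal sketch of sparse encoding in Python; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

# The 20 standard amino acids in one-letter code; the ordering is a convention.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def sparse_encode(peptide):
    """One-hot encode a peptide: each residue becomes a 20-dimensional 0/1 vector."""
    enc = np.zeros((len(peptide), 20))
    for pos, aa in enumerate(peptide):
        enc[pos, AA_INDEX[aa]] = 1.0
    return enc

# A 9-mer peptide becomes a 9 x 20 matrix, i.e. 180 input values.
print(sparse_encode("ALAKAAAAM").shape)   # (9, 20)
```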

5 BLOSUM encoding (Blosum50 matrix)
[Table: the 20 x 20 Blosum50 substitution matrix, rows and columns ordered A R N D C Q E G H I L K M F P S T W Y V. Each amino acid is encoded by its row of 20 Blosum50 scores.]

6 Sequence encoding (continued)
Sparse encoding: V·L = 0 (unrelated; different amino acids share no dimensions)
Blosum encoding: V·L gives a high dot product (highly related), while V·R gives a much lower one (close to unrelated)
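A sketch of the two encodings and their dot products, assuming the Blosum50 matrix is loaded through Biopython's substitution_matrices module; that choice of library and the helper names are assumptions, not part of the slides:

```python
import numpy as np
from Bio.Align import substitution_matrices  # Biopython

ALPHABET = "ARNDCQEGHILKMFPSTWYV"
BLOSUM50 = substitution_matrices.load("BLOSUM50")

def sparse_encode_aa(aa):
    """Sparse (one-hot) encoding of a single amino acid."""
    return np.array([1.0 if aa == other else 0.0 for other in ALPHABET])

def blosum_encode_aa(aa):
    """Blosum encoding: the amino acid's row of 20 Blosum50 scores."""
    return np.array([float(BLOSUM50[aa, other]) for other in ALPHABET])

# Sparse encoding: V and L share no dimensions, so the dot product is 0 (unrelated).
print(np.dot(sparse_encode_aa("V"), sparse_encode_aa("L")))

# Blosum encoding: V.L is high (related residues), V.R is much lower (close to unrelated).
print(np.dot(blosum_encode_aa("V"), blosum_encode_aa("L")))
print(np.dot(blosum_encode_aa("V"), blosum_encode_aa("R")))
```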

7 Sequence encoding (continued)
Each amino acid is encoded by 20 variables
This may be highly inefficient
Can this number be reduced without losing predictive performance?
One option is a reduced amino acid alphabet based on charge, volume, hydrophobicity, or other chemical descriptors (a small example follows below)
Appealing, but in my experience it does not work :-)
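For concreteness, a hypothetical reduced-alphabet grouping might look like the sketch below; the grouping itself is illustrative (the slides do not define one), and as noted above it did not help in practice:

```python
# A hypothetical grouping of the 20 amino acids into 5 crude physico-chemical classes.
REDUCED_CLASSES = {
    "hydrophobic": "AVLIMFWC",
    "polar":       "STNQYG",
    "positive":    "KRH",
    "negative":    "DE",
    "proline":     "P",
}
AA_TO_CLASS = {aa: c for c, aas in REDUCED_CLASSES.items() for aa in aas}

def reduced_encode(peptide):
    """Map each residue to a class index: 5 variables per position instead of 20."""
    classes = list(REDUCED_CLASSES)
    return [classes.index(AA_TO_CLASS[aa]) for aa in peptide]

print(reduced_encode("ALAKAAAAM"))
```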

8 Evaluation of predictive performance
A prediction method contains a very large set of parameters
A matrix for predicting binding of 9-meric peptides has 9 x 20 = 180 weights
Overfitting is a problem
[Figure: temperature versus years, illustrating overfitting]

9 Evaluation of predictive performance
Train a PSSM on the raw data: no pseudo counts, no sequence weighting
Fit 9 x 20 = 180 parameters to 9 x 10 = 90 data points (the 10 binders, 9 positions each)
Evaluate on the training data: PCC = 0.97, AUC = 1.0
Close to a perfect prediction method
Binders: ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Non-binders: MRSGRVHAV VRFNIDETP ANYIGQDGL AELCGDPGD QTRAVADGK GRPVPAAHP MTAQWWLDA FARGVVHVI LQRELTRLQ AVAEEMTKS
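A sketch of this experiment in Python, using the peptides listed on the slide; the PSSM construction details here (flat background frequencies, log2 scores, a small epsilon to avoid log(0)) and the use of scikit-learn for the AUC are assumptions, not necessarily the exact recipe behind the numbers on the slide:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

ALPHABET = "ARNDCQEGHILKMFPSTWYV"

binders = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
           "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]
non_binders = ["MRSGRVHAV", "VRFNIDETP", "ANYIGQDGL", "AELCGDPGD", "QTRAVADGK",
               "GRPVPAAHP", "MTAQWWLDA", "FARGVVHVI", "LQRELTRLQ", "AVAEEMTKS"]

def train_pssm(peptides, eps=1e-3):
    """Raw log-odds PSSM: no pseudo counts, no sequence weighting, flat background."""
    length = len(peptides[0])
    counts = np.full((length, 20), eps)          # eps only to avoid log(0)
    for pep in peptides:
        for pos, aa in enumerate(pep):
            counts[pos, ALPHABET.index(aa)] += 1.0
    freqs = counts / counts.sum(axis=1, keepdims=True)
    return np.log2(freqs / 0.05)                 # 0.05 = flat background frequency

def score(pssm, pep):
    """Score a peptide as the sum of its per-position PSSM weights."""
    return sum(pssm[pos, ALPHABET.index(aa)] for pos, aa in enumerate(pep))

pssm = train_pssm(binders)
scores = [score(pssm, p) for p in binders + non_binders]
labels = [1] * len(binders) + [0] * len(non_binders)
# Evaluating on the data the PSSM was trained on looks close to perfect.
print("training-data AUC:", roc_auc_score(labels, scores))
```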

10 Evaluation of predictive performance
Train a PSSM on permuted data: no pseudo counts, no sequence weighting
Fit 9 x 20 = 180 parameters to 9 x 10 = 90 data points
Evaluate on the training data: PCC = 0.97, AUC = 1.0
Close to a perfect prediction method, AND the same performance as on the original data
Binders (permuted): AAAMAAKLA AAKNLAAAA AKALAAAAR AAAAKLATA ALAKAVAAA IPELMRTNG FIMGVFTGL NVTKVVAWL LEPLNLVLK VAVIVSVPF
Non-binders: MRSGRVHAV VRFNIDETP ANYIGQDGL AELCGDPGD QTRAVADGK GRPVPAAHP MTAQWWLDA FARGVVHVI LQRELTRLQ AVAEEMTKS
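Continuing the sketch above (it reuses binders, non_binders, train_pssm, score, labels and roc_auc_score from there): shuffling the residues within each binder and retraining gives a model that still looks near-perfect on its own training data, which is exactly the overfitting the slide is warning about:

```python
import random

random.seed(1)
# Permute the residues within each binder, then retrain and re-evaluate on the training data.
permuted = ["".join(random.sample(p, len(p))) for p in binders]
pssm_perm = train_pssm(permuted)
scores_perm = [score(pssm_perm, p) for p in permuted + non_binders]
print("training-data AUC, permuted binders:", roc_auc_score(labels, scores_perm))
```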

11 Repeat on large training data (229 ligands)

12 Cross validation
Train on 4/5 of the data and test/evaluate on the remaining 1/5
=> This produces 5 different methods, each with a different prediction focus (see the sketch below)
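A minimal 5-fold cross-validation sketch using scikit-learn; the data here are synthetic placeholders and logistic regression is just an example model, neither comes from the slides:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 100 peptides, each already encoded as 9 x 20 = 180 features.
rng = np.random.default_rng(0)
X = rng.random((100, 180))
y = rng.integers(0, 2, 100)              # 1 = binder, 0 = non-binder

models, fold_predictions, fold_labels = [], [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on 4/5 of the data ...
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    models.append(model)                 # ... producing 5 different methods ...
    # ... and test/evaluate each on its held-out 1/5.
    fold_predictions.append(model.predict_proba(X[test_idx])[:, 1])
    fold_labels.append(y[test_idx])
```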

13 Method evaluation
Use cross-validation
Evaluate on the concatenated test data, not as an average over the per-fold cross-validated performances (see the sketch below)
And even better, use an external evaluation set that is not part of the training data
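Continuing the cross-validation sketch above (reusing fold_predictions and fold_labels), the two ways of summarising performance look like this, with AUC as an example measure:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Recommended: one performance value computed on the concatenated test partitions.
concat_auc = roc_auc_score(np.concatenate(fold_labels), np.concatenate(fold_predictions))

# Not recommended: the average of the 5 per-fold performances.
mean_auc = np.mean([roc_auc_score(l, p) for l, p in zip(fold_labels, fold_predictions)])

print("concatenated AUC:", concat_auc, "mean of per-fold AUCs:", mean_auc)
```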

14 Method evaluation
How is an external evaluation set evaluated against a 5-fold cross-validated training?
The cross-validation generates 5 individual methods
Predict the evaluation set with each of the 5 methods and take the average (sketched below)
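A continuation of the same sketch, with a synthetic external evaluation set (again a placeholder, not data from the slides); it reuses models and rng from the cross-validation block:

```python
# External evaluation set, encoded the same way as the training data.
X_eval = rng.random((25, 180))

# Predict with each of the 5 cross-validated models and average per peptide.
eval_pred = np.mean([m.predict_proba(X_eval)[:, 1] for m in models], axis=0)
print(eval_pred.shape)                   # one averaged prediction per external peptide
```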

15 Method evaluation

16 Method evaluation

17 5-fold training. Which PSSM to choose?

18 5-fold training. Use them all

19 Summary
Define which encoding scheme to use: sparse, Blosum, reduced alphabet, or combinations of these
Deal with overfitting
Evaluate the method using cross-validation
Evaluate external data using the method ensemble

