Segmenting G-Protein Coupled Receptors using Language Models
Betty Yee Man Cheng, Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman, Jaime Carbonell

The Segmentation Problem
Segment a protein sequence according to its secondary structure.
Related to secondary structure prediction.
Often viewed as a classification problem.
Best performance so far is 78%.
A large portion of the problem lies with the boundary cases.

Limited Domain: GPCRs
G-Protein Coupled Receptors.
One of the largest superfamilies of proteins known.
2955 sequences and 1654 fragments found so far.
Transmembrane proteins.
Play a central role in many diseases.
Only 1 protein has been crystallized.

Distinguishing Characteristic of GPCRs
The order of the segments is known.
Segment types: N-terminus, helix, intracellular loop, extracellular loop, C-terminus.

Methodology: Topicality Measures
Based on “Statistical Models for Text Segmentation” by D. Beeferman, A. Berger, and J. Lafferty.
Topicality measures are log-ratios of 2 different models.
In topic segmentation of text: a short-range model versus a long-range model.
In proteins: models of different segments.
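
The log-ratio idea can be illustrated with a minimal Python sketch. The function names `topicality` and `topicality_profile` are hypothetical and not taken from the slides; the inputs are assumed to be the probabilities that two competing models assign to the same symbol at the same position.

```python
import math

def topicality(p_model_a: float, p_model_b: float) -> float:
    """Log-ratio of the probabilities two models assign to the same symbol.

    Positive values mean model A explains the symbol better than model B.
    """
    return math.log(p_model_a) - math.log(p_model_b)

def topicality_profile(probs_a, probs_b):
    """Position-by-position log-ratios for two equally long probability lists."""
    return [topicality(a, b) for a, b in zip(probs_a, probs_b)]

# Illustrative numbers only (not from the slides).
profile = topicality_profile([0.20, 0.10, 0.05], [0.10, 0.10, 0.10])
```

A sign change in such a profile is the kind of signal a segmenter could use as evidence that a different model has started to fit the sequence better.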

Short-Range Model vs. Long-Range Model

Problem: Not Enough Data!
Family name: number of proteins
Class A: 1081
Class B: 83
Class C: 28
Class D: 11
Class E: 4
Class F: 45
Drosophila Odorant Receptors: 31
Nematode Chemoreceptors: 1
Ocular Albinism Proteins: 2
Orphan A: 35
Orphan B: 2
Plant Mlo Receptors: 10
Total of 1333 proteins.
Over 90% are shorter than 750 amino acids.
Average sequence length is 441 amino acids.
Average segment length is 25 amino acids.

3 Topicality Models in GPCRs
Previous segmentation experiments with mutual information and Yule’s measures have shown similarity within each of three groups: all helices; all intracellular loops and the C-terminus; all extracellular loops and the N-terminus.
No two helices or loops occur consecutively.
3 models instead of 15, trained across all families of GPCRs.
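
The reduction from 15 segment models to 3 can be sketched as a lookup over the known segment order. The segment labels and the names `SEGMENT_ORDER` and `model_for` are assumptions made for illustration; the order itself assumes the standard seven-helix GPCR topology with an extracellular N-terminus and an intracellular C-terminus.

```python
# Assumed segment order for a seven-helix GPCR (labels are illustrative).
SEGMENT_ORDER = [
    "N-terminus",
    "Helix 1", "Intracellular loop 1",
    "Helix 2", "Extracellular loop 1",
    "Helix 3", "Intracellular loop 2",
    "Helix 4", "Extracellular loop 2",
    "Helix 5", "Intracellular loop 3",
    "Helix 6", "Extracellular loop 3",
    "Helix 7", "C-terminus",
]

def model_for(segment: str) -> str:
    """Map one of the 15 segments to one of the 3 shared models."""
    if segment.startswith("Helix"):
        return "helix"
    if segment.startswith("Intracellular") or segment == "C-terminus":
        return "intracellular"
    return "extracellular"  # extracellular loops and the N-terminus

MODEL_ORDER = [model_for(s) for s in SEGMENT_ORDER]
```

Because adjacent segments never share a model, every boundary corresponds to a switch between two different models.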

Model of a Segment
Each model is an interpolated model of 6 basic probability models:
Unigram model (20 amino acids)
Bi-gram model (20 amino acids)
Tri-gram model (20 amino acids)
3 tri-gram models on reduced alphabets of 11, 3, and 2 symbols:
11 symbols: LVIM, FY, KR, ED, AG, ST, NQ, W, C, H, P
3 symbols: LVIMFYAGCW, KREDH, STNQP
2 symbols: LVIMFYAGCW, KREDHSTNQP
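
As a rough illustration of the reduced alphabets and the interpolation, the sketch below maps a sequence onto class indices and mixes six model probabilities with fixed weights. The helper names (`class_map`, `reduce_sequence`, `interpolate`) and the example probabilities are hypothetical; only the alphabet groupings come from the slide.

```python
# Reduced alphabets as listed on the slide; each string is one class.
ALPHABET_11 = ["LVIM", "FY", "KR", "ED", "AG", "ST", "NQ", "W", "C", "H", "P"]
ALPHABET_3 = ["LVIMFYAGCW", "KREDH", "STNQP"]
ALPHABET_2 = ["LVIMFYAGCW", "KREDHSTNQP"]

def class_map(groups):
    """Map each amino acid letter to the index of its class."""
    return {aa: i for i, group in enumerate(groups) for aa in group}

def reduce_sequence(seq, groups):
    """Rewrite a sequence of amino acids as a list of class indices."""
    mapping = class_map(groups)
    return [mapping[aa] for aa in seq]

def interpolate(probs, weights):
    """Weighted sum of the probabilities from the basic models."""
    if len(probs) != len(weights):
        raise ValueError("one weight per model is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probs))

# Illustrative probabilities from the 6 models for one position, in the
# order unigram, bi-gram, tri-gram, then the 3 reduced-alphabet tri-grams.
example = interpolate([0.05, 0.1, 0.2, 0.3, 0.4, 0.6],
                      [0.1, 0.1, 0.2, 0.2, 0.2, 0.2])
```

Each reduced alphabet still covers all 20 amino acids, so any sequence over the standard alphabet can be rewritten in any of the three.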

Why Use Reduced Alphabets?
Figure 1. Snake-like diagram of the human β2 adrenergic receptor.

Interpolation Oddity
Weights were trained to maximize the sum of the probabilities assigned to the amino acid at each position in the training data.
First attempt: all of the weight went to the tri-gram model with the smallest reduced alphabet.
Reason: a smaller vocabulary causes the probability mass to be less spread out.

Interpolation Oddity, Take 2
Normalize the probabilities from the reduced-alphabet models by class size.
E.g. with the alphabet LVIM, FY, KR, ED, AG, ST, NQ, W, C, H, P: P(L | ·) / 4 for L in class LVIM, and P(F | ·) / 2 for F in class FY.
Result: all of the weight went to the tri-gram model with the normal 20 amino acid alphabet.
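
One reading of this normalization is that a class probability is shared equally among the amino acids in that class. The sketch below follows that reading; it is an assumption, and the function name `normalized_probability` and the class probabilities are made up for illustration.

```python
ALPHABET_11 = ["LVIM", "FY", "KR", "ED", "AG", "ST", "NQ", "W", "C", "H", "P"]

def normalized_probability(amino_acid, class_probs, groups=ALPHABET_11):
    """Probability of one amino acid, assuming its class probability is
    split evenly among the members of the class."""
    for group in groups:
        if amino_acid in group:
            return class_probs[group] / len(group)
    raise KeyError(f"unknown amino acid: {amino_acid}")

# Illustrative class probabilities that sum to 1 (not from the slides).
uniform = {group: 1.0 / len(ALPHABET_11) for group in ALPHABET_11}
p_L = normalized_probability("L", uniform)  # class LVIM, divided by 4
p_F = normalized_probability("F", uniform)  # class FY, divided by 2
```

After this step the per-amino-acid probabilities of a reduced-alphabet model sum to 1 over the 20 amino acids, which makes them comparable with the full-alphabet models.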

An Example: D3DR_RAT
Class A dopamine receptor.
Figure 3. Graph of the log probability of the amino acid at each position in the D3DR_RAT sequence from the 3 segment models.
The 3 segment models fluctuate frequently in their performance, making it difficult to detect which model is doing best and where the boundaries should be drawn.

D3DR_RAT @ Position 0-100
Figure 4. Enlargement of the graph in Figure 3 for amino acid positions 0-100. The true segment boundaries are marked with dotted vertical lines.
Segments shown: N-terminus, helix, intracellular loop, helix.

Running Averages & Look-Ahead
Figure 5. Graph of the running averages of the log probabilities of each amino acid between positions 0 and 100 in the D3DR_RAT sequence, with predicted and true boundaries marked.
Running averages were computed using a window-size of ±2, and boundaries were predicted using a look-ahead of 5.
The predicted boundaries are indicated by dotted vertical lines at positions 38, 53, 65 and 88, while the true boundaries are indicated by dashed vertical lines at positions 32, 55, 66 and 92.
Segments shown: N-terminus, helix, intracellular loop, helix.
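
The smoothing step can be sketched as a centered running average. The slides do not spell out the boundary decision rule, so `predict_boundaries` below uses an assumed rule for illustration only: a boundary is declared at the first position where the next expected model beats the current one at every one of the next `look_ahead` positions. All names here are hypothetical.

```python
def running_average(values, half_window=2):
    """Centered running average over positions i-half_window .. i+half_window.

    The window is truncated at the ends of the sequence.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def predict_boundaries(scores, order, look_ahead=5):
    """Assumed rule (not from the slides): switch to the next model in the
    known order once it scores higher than the current model at each of
    the next `look_ahead` positions.

    scores: dict mapping model name -> per-position (smoothed) log probability
    order:  model names in the known segment order
    """
    n = len(next(iter(scores.values())))
    boundaries = []
    segment = 0
    for pos in range(n - look_ahead + 1):
        if segment >= len(order) - 1:
            break
        current = scores[order[segment]]
        upcoming = scores[order[segment + 1]]
        if all(upcoming[j] > current[j] for j in range(pos, pos + look_ahead)):
            boundaries.append(pos)
            segment += 1
    return boundaries

# Toy example: model "a" fits positions 0-9, model "b" fits positions 10-19.
toy = {"a": [-1.0] * 10 + [-3.0] * 10, "b": [-3.0] * 10 + [-1.0] * 10}
toy_boundaries = predict_boundaries(toy, ["a", "b"], look_ahead=5)
```

Requiring the next model to win over the whole look-ahead interval is one way to avoid reacting to the single-position fluctuations visible in the unsmoothed curves.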

Predicted Boundaries for D3DR_RAT
Window-size of ±2 from the current amino acid; look-ahead interval of 5 amino acids.
Predicted boundaries: 38, 53, 65, 88, 107, 135, 150, 171, 188, 212, 374, 394, 413, 431
Offsets: 6, 2, 1, 4, 3, 9, 1, 1, 3, 3, 1, 3, 1, 3
Synthetic true boundaries: 32, 55, 66, 92, 104, 126, 149, 172, 185, 209, 375, 397, 412, 434

The Only Truth: OPSD_HUMAN
The only GPCR that has been crystallized so far.
Predicted boundaries: 37, 61, 72, 97, 113, 130, 153, 173, 201, 228, 250, 275, 283, 307
Offsets: 1, 0, 1, 1, 0, 3, 1, 3, 1, 2, 2, 1, 1, 2
True boundaries: 36, 61, 73, 98, 113, 133, 152, 176, 202, 230, 252, 276, 284, 309
Average offset for the protein is 1.357 amino acids.
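
The average offset quoted for OPSD_HUMAN can be checked directly from the two boundary lists on this slide. The function names below are illustrative.

```python
# Boundary positions for OPSD_HUMAN as listed on the slide.
predicted = [37, 61, 72, 97, 113, 130, 153, 173, 201, 228, 250, 275, 283, 307]
true = [36, 61, 73, 98, 113, 133, 152, 176, 202, 230, 252, 276, 284, 309]

def offsets(predicted, true):
    """Absolute difference between each predicted and true boundary."""
    return [abs(p - t) for p, t in zip(predicted, true)]

def average_offset(predicted, true):
    """Mean absolute offset over all boundaries of one protein."""
    diffs = offsets(predicted, true)
    return sum(diffs) / len(diffs)

print(round(average_offset(predicted, true), 3))  # prints 1.357
```

The 14 offsets sum to 19, and 19 / 14 rounds to the 1.357 stated on the slide.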

Evaluation Metrics
Accuracy: score 1 for a perfect match, 0.5 for an offset of ±1, 0.25 for an offset of ±2, and 0 otherwise.
Offset: absolute difference between the predicted and true boundary positions.
10-fold cross-validation.
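
The accuracy scoring can be written as a small lookup on the offset. Averaging the per-boundary scores over a protein, as `accuracy` does below, is an assumption about how the scores are aggregated; the slides give only the per-boundary scoring rule. Both function names are hypothetical.

```python
def boundary_score(predicted: int, true: int) -> float:
    """Score one boundary: 1 for an exact match, 0.5 for an offset of 1,
    0.25 for an offset of 2, and 0 otherwise."""
    offset = abs(predicted - true)
    return {0: 1.0, 1: 0.5, 2: 0.25}.get(offset, 0.0)

def accuracy(predicted, true) -> float:
    """Assumed aggregation: mean boundary score over one protein."""
    scores = [boundary_score(p, t) for p, t in zip(predicted, true)]
    return sum(scores) / len(scores)

# Illustrative boundaries: one exact match, one offset of 1, one offset of 3.
example_accuracy = accuracy([61, 37, 130], [61, 36, 133])
```

The score drops off steeply, so a protein with a few large offsets and a protein with many offsets of 3 can both receive an accuracy near 0 while having very different average offsets.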

Results: Trained Interpolated Models
Figure 6. Results of our approach using trained interpolation weights. Window-size: ±2. Look-ahead interval: 5.

Distribution of Offset between Predicted and Synthetic True Boundary

Removing 10% of the proteins with the worst average offset causes the average offset for the dataset to drop to 10.51.

Results: Using All Probability Models
Figure 7. Results of our approach using pre-set model weights in the interpolation: 0.1 each for the unigram and bi-gram models, 0.2 for each of the tri-gram models. Running averages were computed over a window-size of ±5, and a look-ahead interval of 4 was used.

Results: Using Only Tri-gram Models
Figure 8. Results of our approach using pre-set model weights in the interpolation: 0.25 for each of the tri-gram models. Window-size of ±4 and a look-ahead interval of 4.

Conclusions
Average accuracy of 0.241, which corresponds to an offset of about ±2 on average.
But average offsets are much higher.
Missing a boundary has detrimental effects on the prediction of the remaining boundaries in the sequence, especially with a small segment.
Large offsets occur in a small number of proteins.

Future Work
Cue words: unigrams, bi-grams, tri-grams, and 4-grams in a window of +/- 25 amino acids from the boundary.
Long-range contact: distribution tables of how likely 2 amino acids are to be in long-range contact with each other.
Evaluation: how much homology is needed between the training and testing data.

References
1. Doug Beeferman, Adam Berger, and John Lafferty. “Statistical Models for Text Segmentation.” Machine Learning, special issue on Natural Language Learning, C. Cardie and R. Mooney eds., 34(1-3), pp. 177-210, 1999. http://www-2.cs.cmu.edu/~lafferty/ps/ml-final.ps
2. F. Campagne, J.M. Bernassau, and B. Maigret. Viseur program (Release 2.35). Copyright 1994, 1995, 1996, Fabien Campagne, All Rights Reserved.
