Displaying associations, improving alignments and gene sets at UCSC Jim Kent and the UCSC Genome Bioinformatics Group.

Displaying associations, improving alignments and gene sets at UCSC Jim Kent and the UCSC Genome Bioinformatics Group

Wellcome Trust Case Control Consortium rheumatoid arthritis data

Sort Genes to see candidates

Case control consortium rheumatoid arthritis data, type1 diabetes and bipolar disorder. National Institute of Mental Health bipolar disorder in US and German populations (different scale).

In the long term we hope to import data from GAIN and dbGAP and other sources as well.

28-way multiple alignment Still based on Penn State/UCSC blastz/chain/net/multiz pipeline. Have added “syntenic” filtering for high coverage genomes and reciprocal-best filtering for 2x genomes to reduce artifacts from paralogs.

PhyloP vs. PhastCons Existing conservation track uses PhastCons algorithm, which computes probability that a region is conserved. As more species are added this converges to 0 or 1. PhyloP track instead shows degree of conservation of a base

UCSC Genes Goals Include noncoding as well as coding genes Increase sensitivity of gene set in general. Increase coverage of alternative splice forms (but not too much). Apply comparative genomics to protein (CDS) prediction. Create permanent accessions for transcripts. Include noncoding as well as coding genes Increase sensitivity of gene set in general. Increase coverage of alternative splice forms (but not too much). Apply comparative genomics to protein (CDS) prediction. Create permanent accessions for transcripts.

Make graph Snap soft ends to hard end within 6 bp

Extend soft ends to hard ends

Consensus of soft ends weighted 3/4 of way towards long

Weigh edges by number of transcripts that make them 3233 3 1 1 2 4 1

Make graphs from various other sources: 4354 5 3 2 5 6 3 3233 3 1 1 2 4 1 exoniphy ests Mouse splicing Merge in weights from other graphs:

Walk graph to get nonredundant transcripts, starting with first transcript and continuing until all edges in graph of weight above a threshold are emitted. 4354 2 5 6 B C D E A A 3 3 5 Initial transcripts (ordered by exon count)

Walk graph to get nonredundant transcripts, starting with first transcript and continuing until all edges in graph of weight above a threshold are emitted. 4354 2 5 6 B C D E A A 3 5 3

Walk graph to get nonredundant transcripts, starting with first transcript and continuing until all edges in graph of weighted above a threshold are emitted. 4354 2 5 6 B C D E A A 3 5 B >= 3 >= 2 3

Walk graph to get nonredundant transcripts, starting with first transcript and continuing until all edges in graph of weighted above a threshold are emitted. 43545 6 B C D E A A 3 5 B >= 3 >= 2 3 2 DONE

Evidence type and weights refSeq RNA100 Other Genbank RNA2 Genbank spliced EST graph edges from at least 2 ESTs 1 Orthologous splicing graph in mouse mapped to human 1 Exoniphy exon predictions1 Minimum total weight of 3 for spliced transcripts, 4 for unspliced.

Assigning Coding Regions Take top scoring ORF using a program, txCdsPredict, that considers: –Length of ORF –Kozak consensus sequence –Nonsense mediated decay –Upstream open reading frames –Length of orthologous ORF in other species. txCdsPredict agrees with RefSeq reviewed ~96% of the time. Take top scoring ORF using a program, txCdsPredict, that considers: –Length of ORF –Kozak consensus sequence –Nonsense mediated decay –Upstream open reading frames –Length of orthologous ORF in other species. txCdsPredict agrees with RefSeq reviewed ~96% of the time.

Gene Statistics classUCSCEnsemblRefSeq coding204332293418992 antisense64310919 noncoding52289034590 Transcript Statistics classUCSCEnsemblRefSeq coding454754356925187 nearCoding446911214 antisense73110919 noncoding60479045592

Non-coding Coding Near-coding 38% of UCSC noncoding genes are < 200 bp transcripts primarily of known types such as snoRNAs, piRNAs, miRNAs etc. 62% are long, with a size distribution much like coding. (For Ensemble only 21% of noncoding are long)

Long noncoding genes have lower expression levels Absolute expression values from Affymetrix human exon arrays Coding Non coding

Other characteristics of long noncoding Long noncoding have lower tissue specificity. Poor conservation. Average phastCons score is 0.09 for long noncoding vs 0.73 for coding. BLAST analysis suggests 20% of long noncoding may be transcribed pseudogenes. Conclusion - long noncoding but transcribed genes are slippery. Most are likely nonfunctional. –Xist is poorly conserved overall but has some peaks and is reasonably well expressed. Long noncoding have lower tissue specificity. Poor conservation. Average phastCons score is 0.09 for long noncoding vs 0.73 for coding. BLAST analysis suggests 20% of long noncoding may be transcribed pseudogenes. Conclusion - long noncoding but transcribed genes are slippery. Most are likely nonfunctional. –Xist is poorly conserved overall but has some peaks and is reasonably well expressed.

Acknowledgements Programming and analysis: –Galt Barber - Genome Graphs extensions –Webb Miller Lab - Alignments –Adam Seipel - Evolutionary analysis –Dorota Retelska - UCSC noncoding genes Data: –Sanger, Wash U, Broad, JGI, NCBI, EBI, Affy –Contributors to scientific databases worldwide Funding: –NHGRI, NCI, HHMI, State of California Programming and analysis: –Galt Barber - Genome Graphs extensions –Webb Miller Lab - Alignments –Adam Seipel - Evolutionary analysis –Dorota Retelska - UCSC noncoding genes Data: –Sanger, Wash U, Broad, JGI, NCBI, EBI, Affy –Contributors to scientific databases worldwide Funding: –NHGRI, NCI, HHMI, State of California

The End

UCSC Genes Overall Pipeline Start with genomic/RNA alignments Remove antibody fragments Clean alignments and project to genome Cluster into splicing graph Add EST, Exoniphy, OrthoSplice info. Walk unique well supported transcripts out of graph. Assign coding regions (CDS) to transcripts. Classify into coding, antisense, noncoding. Assign accessions. Start with genomic/RNA alignments Remove antibody fragments Clean alignments and project to genome Cluster into splicing graph Add EST, Exoniphy, OrthoSplice info. Walk unique well supported transcripts out of graph. Assign coding regions (CDS) to transcripts. Classify into coding, antisense, noncoding. Assign accessions.

Classifying transcripts Coding: CDS survives trimming stage Near-coding: overlap coding by at least 20 bases on same strand Near-coding junk: near-coding transcripts that show signs of incomplete splicing. These are removed. Antisense: overlap coding by at least 20 bases on opposite strand Noncoding: other transcripts Coding: CDS survives trimming stage Near-coding: overlap coding by at least 20 bases on same strand Near-coding junk: near-coding transcripts that show signs of incomplete splicing. These are removed. Antisense: overlap coding by at least 20 bases on opposite strand Noncoding: other transcripts

Displaying associations, improving alignments and gene sets at UCSC Jim Kent and the UCSC Genome Bioinformatics Group.

Similar presentations

Presentation on theme: "Displaying associations, improving alignments and gene sets at UCSC Jim Kent and the UCSC Genome Bioinformatics Group."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Displaying associations, improving alignments and gene sets at UCSC Jim Kent and the UCSC Genome Bioinformatics Group.

Similar presentations

Presentation on theme: "Displaying associations, improving alignments and gene sets at UCSC Jim Kent and the UCSC Genome Bioinformatics Group."— Presentation transcript:

Similar presentations

About project

Feedback