Homology Based Analysis of the Human/Mouse lncRNome
Cédric Notredame, Giovanni Bussotti
Comparative Bioinformatics lab, CRG


1 Homology Based Analysis of the Human/Mouse lncRNome
Cédric Notredame, Giovanni Bussotti
Comparative Bioinformatics lab, CRG

2 Part 1: GENCODE v10 lncRNA screening vs human and mouse genomes
Strategy: PipeR one2many homolog assignment
Template: PipeR
Parameters:
Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking
Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X (the span of the genomic size of the query on both sides)
genes: 10840
transcripts: 17547
exons: 58857
sum of mature transcript length (nt): 16,927,027
real coverage (nt): 13,083,478
non-overlapping loci: 7428
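The two Exonerate-side parameters (seed extension and the 70% coverage requirement) can be illustrated with a minimal sketch. This is not PipeR code: the function names, the 0-based half-open coordinates, and the reading of "2X" as padding each side of the Blast hit by twice the query's genomic span are all assumptions made for illustration.

```python
def extend_seed(hit_start, hit_end, query_genomic_span, chrom_length, factor=2):
    """Widen a Blast hit region before running the spliced aligner.

    Assumption: each side is padded by factor * query_genomic_span,
    and the result is clipped to the chromosome boundaries.
    """
    pad = factor * query_genomic_span
    start = max(0, hit_start - pad)
    end = min(chrom_length, hit_end + pad)
    return start, end


def passes_coverage(aligned_query_bases, query_length, min_coverage=0.70):
    """Keep an alignment only if it covers at least min_coverage of the query."""
    return aligned_query_bases / query_length >= min_coverage


# Toy values, not taken from the slides.
region = extend_seed(hit_start=50_000, hit_end=50_400,
                     query_genomic_span=12_000, chrom_length=70_000)
print(region)
print(passes_coverage(aligned_query_bases=700, query_length=1000))
```

Clipping matters for hits near chromosome ends, where the padded window would otherwise run past the sequence.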

3 PipeR: a pipeline for mapping lncRNAs
A blast-exonerate based framework to map lncRNAs against target genomes.
[Figure: algorithm used. Labels: lncRNA, chromosome 2, Blast hits, mapping, extension, Exonerate, spliced transcript]

4 GENCODE lncRNAs vs Complete Genomes
PipeR: lncRNA Homology Mapping
1. Anchor points: ENCODE vs Mouse with tuned Blast
2. Extension: Exonerate
3. Filtering: Id and Coverage
4. Validation of the GFF annotation
- Overlap with Annotation
- Overlap with Cufflinks Models
- RPKM on target genome
5. Further Mapping
- Parameter Space Exploration using Experimental Evidence
GFF File
Notredame, Bussotti
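The "RPKM on target genome" validation step relies on the standard reads-per-kilobase-per-million definition. The sketch below shows only that arithmetic; the function name and the toy numbers are illustrative assumptions, and how PipeR itself counts reads is not stated in the slides.

```python
def rpkm(read_count, feature_length_nt, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads.

    RPKM = read_count * 1e9 / (feature_length_nt * total_mapped_reads)
    """
    return read_count * 1e9 / (feature_length_nt * total_mapped_reads)


# Toy example: 500 reads on a 2 kb homolog, 10 million mapped reads in total.
print(rpkm(read_count=500, feature_length_nt=2000, total_mapped_reads=10_000_000))
```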

5 Mapping overview
[Figure: Transcripts 1 to 3 of Gene A and Gene B in the query species, mapped to Homologs 1 to 4 in the target species. Annotated outcomes: Blast/Exonerate failed, Multiple Homologues, Conserved exon number, High repeat coverage, Overlap with protein, Best reciprocal]

6 GENCODEv10 vs human genome
- mapped transcripts out of
- many lncRNAs found in multiple copies (lncRNA families)
- found homologs corresponding to exons
- Annotations of discovered homologs are readily available

7 Homolog repeat coverage
About 10% of all our homolog predictions are fully covered by repeats.

8 Homolog repeat coverage
We could sub-group the homologs into 3 sets according to their repeat coverage: <= 20%, <= 80%, <= 100%
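A minimal sketch of how such a grouping could be computed, assuming that repeat coverage means the percentage of a homolog's exonic bases overlapped by annotated repeats, and that the three sets are cumulative (a homolog at 10% belongs to all three). Both points are assumptions, as are the function names; the slides do not spell out the definition.

```python
THRESHOLDS = (20, 80, 100)  # percent, as listed on the slide


def merge_intervals(intervals):
    """Merge overlapping (start, end) half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def repeat_coverage_percent(exons, repeats):
    """Percent of exonic bases that fall inside any repeat interval."""
    exons = merge_intervals(exons)
    repeats = merge_intervals(repeats)
    total = sum(end - start for start, end in exons)
    covered = sum(
        max(0, min(e_end, r_end) - max(e_start, r_start))
        for e_start, e_end in exons
        for r_start, r_end in repeats
    )
    return 100.0 * covered / total


def coverage_sets(percent):
    """Return every threshold set the homolog belongs to (cumulative sets)."""
    return [t for t in THRESHOLDS if percent <= t]


# Toy homolog with two exons and two overlapping repeat annotations.
pct = repeat_coverage_percent(exons=[(100, 200), (300, 400)],
                              repeats=[(150, 200), (180, 320)])
print(pct, coverage_sets(pct))
```

Merging the repeat intervals first avoids counting a base twice when repeat annotations overlap each other.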

9 HUMAN Mapping statistics
Table columns (repeat coverage): <= 20%, <= 80%, <= 100%
Table rows: genV10 mapped genes; genV10 mapped transcripts; Total homologs; Homologs whose exons overlap protein coding exons (same strand)

10 GENCODEv10 vs mouse genome
- mapped 3190 transcripts out of representing 2249 human genes
- many lncRNAs found in multiple copies (lncRNA families)
- found homologs corresponding to exons
- Annotations of discovered homologs are readily available

11 Human/Mouse Exon Number Conservation
Difference between the number of exons in the human transcripts and in the mouse homologs
- “0” means that the exon number is the same
- Negative bins indicate mouse homologs having more exons than the human query
- 1160 GENCODE v10 transcripts find at least 1 homolog in mouse with the same exon number
[Figure: histogram of exon number differences. Axis labels: human > mouse, human < mouse]
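The histogram described on this slide can be sketched as follows. The difference is taken as query exon count minus homolog exon count, so negative values mean the homolog has more exons, as stated above. The input layout (a dict from transcript id to its exon count and the exon counts of its homologs) and the transcript ids are hypothetical.

```python
from collections import Counter


def exon_number_differences(query_exons, homolog_exon_counts):
    """Query exon count minus each homolog's exon count."""
    return [query_exons - h for h in homolog_exon_counts]


def summarize(mappings):
    """Build the difference histogram and collect transcripts with
    at least one homolog that has the same exon number."""
    histogram = Counter()
    conserved = set()
    for transcript_id, (query_exons, homolog_counts) in mappings.items():
        diffs = exon_number_differences(query_exons, homolog_counts)
        histogram.update(diffs)
        if 0 in diffs:
            conserved.add(transcript_id)
    return histogram, conserved


# Toy input: T1 has two homologs, T2 has one, T3 has none.
mappings = {"T1": (3, [3, 5]), "T2": (4, [2]), "T3": (2, [])}
histogram, conserved = summarize(mappings)
print(sorted(histogram.items()), conserved)
```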

12 Homolog repeat coverage
We could sub-group the homologs into 3 sets according to their repeat coverage: <= 20%, <= 80%, <= 100%

13 MOUSE Mapping statistics
Table columns: <= 20%, <= 80%, <= 100% (repeat coverage), Reciprocal homologs
Table rows: genV10 mapped genes; genV10 mapped transcripts; Total homologs; Homologs whose exons overlap protein coding exons (same strand); Homologs with conserved number of exons
Best Candidates: There are 148 transcripts that have < 20% repeat coverage and a conserved exon structure, do not overlap protein coding exons, and are best reciprocal homologs of the human queries.
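The "Best Candidates" selection combines four criteria, which can be expressed as a single predicate. The record layout and field names below are hypothetical, and "conserved exon structure" is approximated here as an identical exon count, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Homolog:
    transcript_id: str           # id of the human query transcript
    repeat_coverage: float       # percent of the homolog covered by repeats
    query_exons: int             # exon count of the human query
    homolog_exons: int           # exon count of the mouse homolog
    overlaps_coding_exon: bool   # same-strand overlap with a protein coding exon
    best_reciprocal: bool        # best reciprocal homolog of the query


def is_best_candidate(h):
    """All four criteria from the slide must hold."""
    return (
        h.repeat_coverage < 20
        and h.query_exons == h.homolog_exons
        and not h.overlaps_coding_exon
        and h.best_reciprocal
    )


# Toy records: only the first one satisfies every criterion.
homologs = [
    Homolog("T1", 5.0, 3, 3, False, True),
    Homolog("T2", 45.0, 3, 3, False, True),
    Homolog("T3", 5.0, 3, 4, False, True),
    Homolog("T4", 5.0, 2, 2, True, True),
]
best = [h.transcript_id for h in homologs if is_best_candidate(h)]
print(best)
```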

14 GENCODE lncRNAs vs Complete Genomes
PipeR: lncRNA Homology Mapping
1. Anchor points: ENCODE vs Mouse with tuned Blast
2. Extension: Exonerate
3. Filtering: Id and Coverage
4. Validation of the GFF annotation
- Overlap with Annotation
- Overlap with Cufflinks Models
- RPKM on target genome
5. Further Mapping
- Parameter Space Exploration using Experimental Evidence
GFF File
Notredame, Bussotti

15 BlastR vs The World

16

17 Figure 2: Exon read support.
a) Venn diagram indicating the number of exons detected by the different methods (numbers in parentheses) and their intersection (transcripts annotated identically by the three methods): blastn (8749), blastr (12093), blastnOpt (12487), all (7492)
b) Average number of reads per exon
c) Percent of reads covered by at least one exon

18 Part 2: Ensembl.v65 lncRNA screening vs human and mouse genomes
Strategy: PipeR one2many homolog assignment
Template: PipeR
Parameters:
Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking
Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X (the span of the genomic size of the query on both sides)
genes: 3845
transcripts: 5669
exons: 18353
sum of mature transcript length (nt):
real coverage (nt):
non-overlapping loci: 2790

19 Ensembl.v65 vs human genome
- mapped 1187 transcripts out of 5669
- many lncRNAs found in multiple copies (lncRNA families)
- found homologs corresponding to exons
- Annotations of discovered homologs are readily available

20 Ensembl.v65 vs mouse genome
- mapped 5622 transcripts out of 5669
- many lncRNAs found in multiple copies (lncRNA families)
- found homologs corresponding to exons
- Annotations of discovered homologs are readily available

21 Mouse/Human Exon Number Conservation
Difference between the number of exons in the mouse transcripts and in the human homologs
- “0” means that the exon number is the same
- Negative bins indicate human homologs having more exons than the mouse query
- 481 Ensembl.v65 transcripts find at least 1 homolog in human with the same exon number
[Figure: histogram of exon number differences. Axis labels: mouse > human, mouse < human]

22 Homolog repeat coverage
No peak of homolog predictions fully covered by repeats was observed.

23 Ensembl.v65 and GENCODEv10 repeat coverage
Input lncRNA datasets have similar repeat distributions.

24 Mapping statistics
MOUSE
- ensV65 mapped genes: 3815
- ensV65 mapped transcripts: 5622
- Total homologs:
- Homologs whose exons overlap protein coding exons (same strand):
HUMAN
- ensV65 mapped genes: 879
- ensV65 mapped transcripts: 1187
- Total homologs:
- Homologs whose exons overlap protein coding exons (same strand): 3642
- Homologs whose exons do not overlap any gencode v10 element (same strand): 6085
- Homologs with conserved number of exons: 4925

25 Part 3: GENCODE v10 lncRNA coding potential check
Strategies:
1) GeneID ORF score comparison between mRNAs and lncRNAs
2) BlastX against human proteins (Ensembl 65)
3) Overlap with protein coding gene exon annotations (GENCODE v10)
4) PipeR filtering routines

26 1) ORF scores as returned by GeneID
2) blastX against human proteins indicates that 1202 GENCODE v10 lncRNAs match proteins
Parameters: seg low complexity filtering, repeat filtering, evalue 10e-10, search just the plus strand.
Database: Human Ensembl 65 protein set

27 3) Overlap with protein coding exons
- Checked the overlap between GENCODE v10 lncRNA exons and GENCODE v10 protein coding exons.
- Found 846 lncRNAs having at least one exon overlapping a protein coding gene exon
[Figure: Example 1, Example 2]
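The exon overlap check on this slide amounts to an interval intersection test per chromosome. The sketch below is a brute-force illustration with a hypothetical tuple layout and 0-based half-open coordinates. Whether this particular check required matching strands is not stated on the slide, so `same_strand` is exposed as a parameter, following the "(same strand)" wording used in the mapping statistics.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def lncrnas_overlapping_coding(lnc_exons, coding_exons, same_strand=True):
    """Return ids of lncRNAs with at least one exon overlapping a coding exon.

    lnc_exons:    (transcript_id, chrom, strand, start, end) tuples
    coding_exons: (chrom, strand, start, end) tuples
    """
    hits = set()
    for tid, chrom, strand, start, end in lnc_exons:
        for c_chrom, c_strand, c_start, c_end in coding_exons:
            if chrom != c_chrom:
                continue
            if same_strand and strand != c_strand:
                continue
            if intervals_overlap(start, end, c_start, c_end):
                hits.add(tid)
                break
    return hits


# Toy annotation, not real GENCODE coordinates.
lnc_exons = [
    ("lnc1", "chr1", "+", 100, 200),
    ("lnc1", "chr1", "+", 500, 600),
    ("lnc2", "chr1", "-", 150, 250),
    ("lnc3", "chr2", "+", 100, 200),
]
coding_exons = [("chr1", "+", 180, 300)]
print(lncrnas_overlapping_coding(lnc_exons, coding_exons))
```

For genome-scale annotations an interval tree or a sorted sweep would replace the nested loop; the quadratic version is kept here for readability.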

28 4) Extensive filtering
7813 GENCODE v10 transcripts passed *ALL* PipeR filtering routines
Filtering rules:
- overlap with protein coding exons
- geneID ORF score similar to that of mRNAs
- blastX to uniprot database (50% redundancy)
- blastX to nr database
- rpsBlast to pfam domain families
- blast against Rfam

