Blast2GO StatSeq COST workshop Friday 25th January 2013, Royal Melbourne Hospital 21st-23rd April 2013, Helsinki, Finland.


1 Blast2GO StatSeq COST workshop Friday 25th January 2013, Royal Melbourne Hospital 21st-23rd April 2013, Helsinki, Finland

2 Why Blast2GO Functional characterization of novel sequence data Adapted to the high-throughput needs of biological laboratories Extracting knowledge about the functioning of genomes

3 Blast2GO Impact

4 Outline Concepts on Functional Annotation The Blast2GO annotation framework Visualization of functional data Pathway analysis with Blast2GO

5 Concepts of Functional Annotation What is functional annotation? How to annotate a large dataset?

6 The Gene Ontology Three branches: Biological Process, Molecular Function, Cellular Component. Annotations are given to the most specific (low) level. True path rule: annotation at a given term implies annotation to all its parent terms. Annotation is given with an Evidence Code: o IDA: inferred from direct assay o TAS: traceable author statement o ISS: inferred from sequence similarity o IEA: inferred from electronic annotation o … More general More specific
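The true path rule on this slide can be illustrated with a small sketch. The graph is a hypothetical four-term fragment (`go_root`, `go_a`, `go_b` and `go_leaf` are not real GO identifiers), and the function names are assumptions chosen for illustration.

```python
# Hypothetical GO fragment: each term maps to its direct parents.
# A term may have more than one parent, because GO is a DAG, not a tree.
PARENTS = {
    "go_root": [],
    "go_a": ["go_root"],
    "go_b": ["go_root"],
    "go_leaf": ["go_a", "go_b"],
}


def ancestors(term, parents):
    """Collect every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def apply_true_path_rule(direct_terms, parents):
    """Return the direct annotations plus all annotations they imply."""
    implied = set(direct_terms)
    for term in direct_terms:
        implied |= ancestors(term, parents)
    return implied


print(sorted(apply_true_path_rule({"go_leaf"}, PARENTS)))
```

A sequence annotated only at `go_leaf` is therefore also counted at `go_a`, `go_b` and `go_root`.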

7 Functional assignment Annotation Empirical Transference Molecular interactions Gene/protein expression Biochemical assay Structure Comparison Sequence analysis Identification of folds Motif identification Phylogeny Literature reference Sequence homology

8 Annotation by similarity: concerns Level of homology (~ from 40-60% is possible) The overlap between hit and query, association function and structure The paralog problem: genes with similar sequences might have different functional specifications The evidence for the original annotation Balance between quality and quantity: depends on the use GO 1, GO 2, GO 3, GO 4 QUERY HIT

9 The Blast2GO annotation framework

10 cellular component biological process Fasta Application scheme

11 cellular component biological process Fasta

12 cellular component biological process Application scheme

13 Basic annotation procedure Blast: each query sequence (Sq1-Sq4) retrieves a list of hits (Hit1, Hit2, Hit3, Hit4, ...). Mapping: each hit contributes its GO terms (e.g. go1, go2, go3) as candidates. Annotation: GO terms are selected for each sequence from the candidates of its hits.

14 Annotation Rule
- Let GO_1…n be candidate annotations for sequence S_1, obtained from hits H_i…k
- We compute an annotation score AS for each GO_i that depends on:
  - The similarity between sequence S_1 and H_j
  - The evidence code of GO_i
  - The existence of other neighboring GO candidates
  - The structure of the Gene Ontology
- We define an arbitrary annotation threshold (AT)
- S_1 is annotated with GO_i if its AS_GOi > AT

15 Annotation Rule Annotation Score Quality of source annotation: IEA = 0.7, IDA = 1, NR = 0.0, ... Similarity Requirement GO4 GO2 GO1 GO3 Possibility of abstraction True-Path-Rule selectivity vs. specificity Cut-Off Value new annotation

16 Blast2GO annotation rule
- When I have a GO with ECw = 1 and I do not allow abstraction (GOw = 0), then the Annotation Score = %similarity
- If ECw < 1, my similarity requirement is higher to obtain the same Annotation Score
- If I allow abstraction (GOw > 0), then with less similarity I can obtain the required Annotation Score at a parent node
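The three statements above are consistent with a score made of a direct term plus an abstraction term. The sketch below assumes the form AS = max(similarity × ECw) + (number of unified GO candidates − 1) × GOw. The slides do not spell the formula out, so both the formula and the function names are assumptions, and the threshold and weights used in the example are illustrative values only.

```python
def annotation_score(hits, n_unified_terms=1, go_weight=0.0):
    """Assumed annotation score for one candidate GO term.

    hits: list of (similarity_percent, ec_weight) pairs, one per hit that
          supports this GO term.
    n_unified_terms: number of candidate GO terms unified at this node
          (1 means the term itself only, so no abstraction bonus).
    go_weight: GOw, the weight given to abstraction.
    """
    if not hits:
        raise ValueError("at least one supporting hit is required")
    # Direct term: best hit similarity, weighted by its evidence code.
    direct_term = max(similarity * ec_weight for similarity, ec_weight in hits)
    # Abstraction term: bonus when several candidates meet at a parent node.
    abstraction_term = (n_unified_terms - 1) * go_weight
    return direct_term + abstraction_term


def is_annotated(score, threshold):
    """S_1 is annotated with GO_i if its score exceeds the threshold AT."""
    return score > threshold


# ECw = 1 and GOw = 0: the score equals the similarity.
print(annotation_score([(80.0, 1.0)]))
# A parent node unifying three candidates with GOw = 5 gains a bonus of 10.
print(is_annotated(annotation_score([(50.0, 1.0)], 3, 5.0), threshold=55.0))
```

With ECw below 1 the same hit yields a lower score, which is why a higher similarity is then needed to pass the same threshold.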

17

18 Start Blast2GO

19 Blast2GO Application Main Sequence Table (1) Blast (2) Mapping (3) Annotation Graph visualisation Application messages Blast results Application statistics Any operation will only affect the selected sequences!

20 Load sequences

21 Input data (in FASTA format, AA or nt)
>my_favourite_species_seq1 | still unknown
gtgatggaaaagaaaagttttgttatcgtcgacgcatatgggtttctttttcgcgcgtattatgcgctgcctggattaagcacctcatacaattttcctgtaggaggtgtatatggtttt
ataaacatacttttgaaacatctctctttccacgatgcagattatttagttgtggtatttgattcggggtcgaaaaattttcgtcacactatgtattccgaatacaaaactaatcgccct
aaagcaccagaggatctgtcactacaatgtgctccgctacgtgaggctgttgaagcgtttaatattgtaagtgaagaagtgcttaactacgaagcagacgacgtaatagcta
cactctgtacaaaatatgcatctagtaatgttggagtgagaatactgtcagcagataaggatttactacaactcctaaatgataatgttcaagtttacgaccctataaaaagca
gatacctcaccaatgaatacgttttagaaaaatttggtgtttcatcagataagttgcatattgatacggttgcatcgagttataatgagaaaattattctcagctaagctgtacacc
gtttattacacactcgaaaggccgttag
>my_favourite_species_seq2 | no clue
ttgttagctaaaaaggaagactttcacacctttggtaatggtgttggctctgctggaacaggtggagttgtagtttctgcatccatgttgtctgcggatttttcaaatcttagagaaga
gatagcagcggttagtacggctggtgcagattggttacacattgatgtgatggatgggtgcttcgtccccagtttgactatgggtcctgtggtgatttccggcattaggaaatgta
caaatatgtttcttgatgtgcatttgatgattaatcgcccaggcgatcatctgaagagtgtggtagatgctggagctgataagatagagcacattcgcaagatgatagaggaa
agctcatcaaccgcgaaaatcgctgttgatggtggtgtttcaacggataatgcccgggctgttatcgaggcaggtgcgaatatactcgttgttggaacggcgctgtttgctgctg
acgatatgagtaaagttgtaagaactttaaaatcattttaa
>my_favourite_species_seq3 | just sequenced
gtgggactgctcatccctgtaggcagggtggctattttttgtgtaaaggcagtctttcatagtcttgtaccgccatactatctatggataactacaaagcagttttttgaggtgtggttt
ttctctcttcctatagtagcagttacatctttgtttacgggaggcgcgttagcccttcaggataccctcgtgggaagcgctaaagtatcagggtaatggagtttttactcctgcaag
atgtaatagagggtctggtaaaagctgtatcgtttgggctggtaatttcgctagttgggtgttacaacgggtatcactgtgagataggcgcaaggggtgtaggaacagcgaca
acaaaaacttcggtagcagcttctatgctcataattttgttaaactatataattactgttttttacgcgta
>my_favourite_species_seq4 | we will see soon...
atgtacgctgtatctctttcaaatttgcatgtctctttcaacaacaaggaggttttgaaaggtgttgacttggacatagcatggggggattccctggttatactgggagaatctggta
gtggaaagtctgtactaacaaaggttgtattgggtctaatagtgccccaagagggaagtgttactgtagatggcaccaatattcttgagaataggcagggcatcaagaatttt
agtgttttgtttcaaaactgtgcgttatttgacagtcttacgatttgggaaaatgtagtattcaatttccgtaggaggcttcgtttagataaggataatgccaaggctttggctttacgg
ggattggagcttgtgggattggacgccagtgtaatgaacgtgtatcctgtggagctatcaggcgggatgaaaaagcgcgtagctttggcaagagctattataggtagtccca
aaattctaattttggatgagccaacttcgggattggatcctataatgtcttcagtggt
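Before loading a file it can help to check that it really is well-formed FASTA. The following is a minimal sketch of a FASTA reader. It is not part of Blast2GO, the function names are assumptions, and the example sequences are truncated copies of the first two records above.

```python
EXAMPLE = """>my_favourite_species_seq1 | still unknown
gtgatggaaaag
aaaagttttgtt
>my_favourite_species_seq2 | no clue
ttgttagctaaa
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_like_nucleotide(sequence):
    """Rough heuristic: only a, c, g, t, u or n occur in the sequence."""
    return bool(sequence) and set(sequence.lower()) <= set("acgtun")


for header, sequence in parse_fasta(EXAMPLE):
    print(header, len(sequence), looks_like_nucleotide(sequence))
```

The nucleotide check is only a heuristic: a short protein made solely of A, C, G, T and N residues would be misclassified.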

22 Your address BLAST program (normally blastx) Number of HITs (use <= 20) Human readable seq. Descriptions via BDA Recommended to save as XML BLAST database (many options) E-Value (depends on the DB) BLAST

23 Use your own server Set word size and filter Filter by description Parsing options for own databases Minimum HSP length Additional BLAST params
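The BLAST settings on the two slides above (E-value, number of hits, minimum HSP length) can be mimicked on a saved XML result. The sketch below reads NCBI BLAST XML with the Python standard library and applies the three filters. The sample document is a hand-written, heavily reduced fragment, and the default cut-offs other than the 20-hit limit are assumed illustrative values, not Blast2GO settings.

```python
import xml.etree.ElementTree as ET

# Reduced, hand-written BLAST XML fragment (real files contain many more fields).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>my_favourite_species_seq1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>hypothetical protein A</Hit_def>
          <Hit_hsps><Hsp>
            <Hsp_evalue>1e-30</Hsp_evalue>
            <Hsp_align-len>120</Hsp_align-len>
          </Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>hypothetical protein B</Hit_def>
          <Hit_hsps><Hsp>
            <Hsp_evalue>0.5</Hsp_evalue>
            <Hsp_align-len>200</Hsp_align-len>
          </Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>hypothetical protein C</Hit_def>
          <Hit_hsps><Hsp>
            <Hsp_evalue>1e-10</Hsp_evalue>
            <Hsp_align-len>20</Hsp_align-len>
          </Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def filter_hits(xml_text, max_evalue=1e-3, min_hsp_length=33, max_hits=20):
    """Map each query to its hits that pass the E-value and HSP-length filters."""
    root = ET.fromstring(xml_text)
    kept = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        hits = []
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")  # first HSP of the hit
            if hsp is None:
                continue
            evalue = float(hsp.findtext("Hsp_evalue"))
            length = int(hsp.findtext("Hsp_align-len"))
            if evalue <= max_evalue and length >= min_hsp_length:
                hits.append((hit.findtext("Hit_def"), evalue, length))
        kept[query] = hits[:max_hits]  # keep at most max_hits per query
    return kept


print(filter_hits(SAMPLE_XML))
```

Only the first HSP of each hit is inspected here, which is a simplification.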

24 BLAST Results RED

25 Blast Distribution Charts Evaluate the similarity of your sequences with public DBs

26 Single Sequence Menu

27 Mapping Results GREEN

28 Annotation Menu BLAST based annotation Validation and Annex Other Annotation modes

29 Annotation Allows setting a minimum percentage of the HIT sequence that should be spanned by the QUERY sequence. This helps to avoid the problem of cis-annotation
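The coverage requirement can be read as a simple ratio: the aligned stretch of the hit divided by the full hit length. The sketch below assumes 1-based, inclusive alignment coordinates on the hit. The function names and the 50% default are hypothetical.

```python
def hit_coverage_percent(hit_from, hit_to, hit_length):
    """Percentage of the hit sequence covered by the alignment with the query."""
    aligned = abs(hit_to - hit_from) + 1  # inclusive coordinates
    return 100.0 * aligned / hit_length


def passes_coverage_filter(hit_from, hit_to, hit_length, min_coverage=50.0):
    """True if the alignment spans at least `min_coverage` percent of the hit."""
    return hit_coverage_percent(hit_from, hit_to, hit_length) >= min_coverage


# An alignment over residues 1-50 of a 200-residue hit covers a quarter of it.
print(hit_coverage_percent(1, 50, 200))
print(passes_coverage_filter(1, 50, 200))
```

A short local match to one domain of a long multi-domain hit fails the filter, so functions belonging to the unmatched part of the hit are not transferred.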

30 Annotation Result BLUE

31 Annotation Charts

32 Commonly, level 5 is the most abundant specificity level in the Gene Ontology

33 Additional Annotation: ANNEX Recovers implicit biological process and cellular component GO terms based on molecular function annotations. Molecular Function: is involved in Biological Process, acts in Cellular Component. Myhre et al., Bioinformatics 2006

34 Additional Annotation: InterProScan Runs InterProScan searches at the EBI through Blast2GO. Results are stored on your computer as XML files. You can upload them later. Once you have completed your InterPro annotation, results can be transformed to GO terms and merged with the Blast annotation

35 InterProScan Results Column with InterProScan results

36 Additional Annotation: GOSlim GOSlim is a reduction of the Gene Ontology to a smaller vocabulary → Helps to summarize information. After GOSlim transformation sequences turn YELLOW. Different GOSlims are available in Blast2GO
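One simple way to think about a GOSlim transformation is to replace each term by its nearest ancestors that belong to the slim vocabulary. The sketch below does this on a hypothetical graph. It is an illustration of the idea, not the procedure Blast2GO uses, and all term names are made up.

```python
# Hypothetical fragment: each term maps to its direct parents.
PARENTS = {
    "term_root": [],
    "term_metabolism": ["term_root"],
    "term_transport": ["term_root"],
    "term_biosynthesis": ["term_metabolism"],
    "term_specific_biosynthesis": ["term_biosynthesis"],
}

# Hypothetical slim vocabulary: a small subset of the full graph.
SLIM_TERMS = {"term_metabolism", "term_transport"}


def map_to_slim(term, parents, slim_terms):
    """Return the nearest slim terms at or above `term` (empty set if none)."""
    if term in slim_terms:
        return {term}
    mapped = set()
    for parent in parents.get(term, []):
        mapped |= map_to_slim(parent, parents, slim_terms)
    return mapped


def slim_annotation(terms, parents, slim_terms):
    """Summarize a set of annotations with the slim vocabulary."""
    slimmed = set()
    for term in terms:
        slimmed |= map_to_slim(term, parents, slim_terms)
    return slimmed


print(slim_annotation({"term_specific_biosynthesis"}, PARENTS, SLIM_TERMS))
```

Several detailed annotations can collapse onto one slim term, which is what makes the summary charts easier to read.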

37 Enzyme annotation and KEGG Maps GO → Enzyme Codes → KEGG maps

38 Manual Curation You can manually modify the annotation of particular sequences. If you tick this box, curated sequences turn purple

39 Export Results Saves the complete B2G project (heavy) Export annotation results in different formats

40 Export formats By Seq, GeneSpring Format, GoStat, .annot (also for import!)

41 More export formats Export Sequence Table Export BestHit Data

42 Sequence Selection Sequence Selection tool to obtain a selection based on annotation status

43 Sequence Selection By Name/Description By Function

44 View Menu Functions to switch between displaying IDs or descriptions for GO annotation or InterPro results

45 Hands-on I Annotate 10 seqs with Blast2GO

46 Visualization How to understand the functional context of an annotated dataset

47 Combined Graph Each term has a number of sequences associated. Nodes can be coloured to indicate relevance. Each term is displayed within its biological context. Node shape differentiates between direct and indirect annotation

48 Combined Graph Different GO branches. Reduces nodes by number of annotated sequences. Criterion for highlighting and filtering nodes. Node data to be displayed

49 Accumulated by GO term (Sequence Count) Incoming information (Node Score) Node information content: $\mathrm{score}(g') = \sum_{g \in \mathrm{desc}(g')} \mathrm{seq}(g) \cdot \alpha^{\mathrm{dist}(g, g')}$
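The node score sums, over the descendants g of a node g', the number of sequences annotated at g, damped by α raised to the distance between g and g'. The sketch below computes it on a hypothetical graph. It assumes that dist is the shortest number of edges and that the node itself counts as a descendant at distance 0, neither of which the slide states.

```python
from collections import deque

# Hypothetical graph: each term maps to its direct children.
CHILDREN = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": [],
}

# Hypothetical number of sequences annotated directly at each term.
SEQ_COUNT = {"root": 0, "a": 4, "b": 0, "c": 8}


def descendant_distances(term, children):
    """Breadth-first search: shortest edge distance from `term` to each descendant."""
    distances = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in distances:
                distances[child] = distances[current] + 1
                queue.append(child)
    return distances


def node_score(term, children, seq_count, alpha):
    """Sum of seq(g) * alpha ** dist(g, term) over term and its descendants."""
    return sum(
        seq_count.get(g, 0) * alpha ** d
        for g, d in descendant_distances(term, children).items()
    )


print(node_score("root", CHILDREN, SEQ_COUNT, alpha=0.5))
print(node_score("a", CHILDREN, SEQ_COUNT, alpha=0.5))
```

With α below 1, sequences annotated far below a node contribute less to its score than sequences annotated at or near it, whereas the plain sequence count treats them all equally.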

50 Compacting Graphs by GO-Slim

51 Saving Options Save as picture and as txt

52 Graph Charts

53 Sequence Distribution/GO as Multilevel-Pie (#score or #seq cutoff) Sequence Distribution/GO as Bar-Chart Sequence Distribution/GO as Level-Pie (level selection)

54 Analysis of a specific function How many sequences are annotated to the function “photosynthesis”?  Option 1: Find in the GO graph -> direct & indirect annotation  Option 2: Find through the Select function. Two sub-options  Option 2.1. Direct annotation (use GO-ID or description)  Option 2.2. Direct & indirect (use GO-ID and “include GO parents”)

55 Find a function on the graph search export Analysis of a specific function

56 Analysis of a specific function Exporting the sequence table shows the sequences annotated to the function

57 Select all sequences annotated to this function and its descendants Analysis of a specific function

58 Locate these sequences Analysis of a specific function

59 Hands-on II Summary statistics Visualize & Search

60 Pathway analysis with Blast2GO Which cellular functions are important in my experiment?

61 Functional Enrichment Analysis One Gene List (Responsive genes) The other list (Non responsive genes) Biosynthesis 54% Biosynthesis 18% Sporulation 18% Are these two groups of genes carrying out different biological roles? Are pathway frequencies different?

62 Fisher's Exact Test One Gene List (Responsive genes) The other list (Non responsive genes) Biosynthesis 54% Biosynthesis 18% Sporulation 18% Genes in group A are not significantly associated with biosynthesis or sporulation. Contingency table: 95 No biosynthesis 26 Biosynthesis B A p-value for Biosynthesis =
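Fisher's exact test on a 2×2 contingency table can be written with the standard library alone. The sketch below computes the two-tailed p-value by summing the hypergeometric probabilities of all tables that are no more likely than the observed one. The counts in the example are hypothetical and are not the ones from the slide, whose table did not survive transcription.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the table [[a, b], [c, d]].

    Rows could be "annotated to the term" / "not annotated";
    columns could be "test set" / "reference set".
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denominator = comb(total, col1)

    def probability(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is as likely or less likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        probability(x)
        for x in range(lowest, highest + 1)
        if probability(x) <= observed * (1 + 1e-7)
    )


# Hypothetical counts: 27 of 50 test genes vs 18 of 100 reference genes.
p_value = fisher_exact_two_tailed(27, 18, 23, 82)
print(p_value < 0.05)
```

Because the test is two-tailed, a small p-value can indicate either over- or under-representation of the term in the test set.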

63 Multiple testing correction We do this for all GO terms in our dataset! Many tests => many false positives => we need correction! FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses, FDR controls the expected proportion of incorrectly rejected null hypotheses. FWER control: the familywise error rate is the probability of making one or more false discoveries among all the hypotheses when performing multiple pairwise tests (more conservative).
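The two kinds of correction can be compared on a short list of p-values. The sketch below uses Bonferroni as an example of FWER control and Benjamini-Hochberg as an example of FDR control. The slide does not name specific procedures, so these two are assumptions picked because they are common, and the p-values are made up.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """FDR control: Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-term p-values
print(bonferroni(pvalues))
print(benjamini_hochberg(pvalues))
```

For the same input the Bonferroni values are never smaller than the Benjamini-Hochberg ones, which is the sense in which FWER control is more conservative.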

64 Different types of comparisons Compare two equivalent conditions (root vs leaves): remove common IDs; Test-Set and Ref-Set are interchangeable. Compare a subset against the total: common IDs are removed from the reference; Test-Set and Ref-Set are NOT interchangeable.
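Both comparison types come down to set operations on sequence IDs. The sketch below assumes that "remove common IDs" in the two-condition case means dropping shared IDs from both sets. The function names and the IDs are hypothetical.

```python
def two_condition_sets(set_1, set_2):
    """Two equivalent conditions: drop the shared IDs from both sets."""
    set_1, set_2 = set(set_1), set(set_2)
    common = set_1 & set_2
    return set_1 - common, set_2 - common


def subset_vs_total_sets(test_set, total_set):
    """Subset against the total: drop the test IDs from the reference only."""
    test_set, total_set = set(test_set), set(total_set)
    return test_set, total_set - test_set


root_ids = {"seq1", "seq2", "seq3"}
leaf_ids = {"seq3", "seq4"}
print(two_condition_sets(root_ids, leaf_ids))

all_ids = {"seq1", "seq2", "seq3", "seq4"}
print(subset_vs_total_sets({"seq1", "seq2"}, all_ids))
```

In the first case swapping the arguments just swaps the results, while in the second case the test set and the reference play different roles.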

65 FET in Blast2GO The two-tailed test identifies not only over- but also under-represented functions. If no Ref-Set is chosen, all annotations are used as the reference

66 FatiGO Results Result table with link out to sequence lists

67 Most specific terms Retains only the lowest, most specific enriched term per GO branch
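One way to read this filter is to keep an enriched term only if none of its descendants is also enriched. The sketch below implements that reading on a hypothetical graph. The term names and the function names are assumptions, and Blast2GO's own implementation may differ.

```python
# Hypothetical fragment: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "biosynthesis": ["metabolism"],
    "sporulation": ["root"],
}


def ancestors(term, parents):
    """Collect every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def most_specific_terms(enriched_terms, parents):
    """Drop every enriched term that is an ancestor of another enriched term."""
    covered = set()
    for term in enriched_terms:
        covered |= ancestors(term, parents)
    return {term for term in enriched_terms if term not in covered}


enriched = {"metabolism", "biosynthesis", "sporulation"}
print(sorted(most_specific_terms(enriched, PARENTS)))
```

Here `metabolism` is removed because its child `biosynthesis` is enriched too, while `sporulation` stays because it sits on a separate path.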

68 Enriched Graph View enriched-term data as DAG graphs! To draw all nodes, set the reduce filter to 1

69 Hands-on III Enrichment Analysis

70 Concluding Remarks Blast2GO is a versatile tool for the annotation of sequence data. Blast2GO uses controlled vocabularies and an elaborate annotation rule to generate GO labels. Visualization and data mining functions help to understand the functional content of your dataset

