Published by Eugenia Fowler Modified over 8 years ago
1
Web Technologies in Bioinformatics T.J. Esposito April 28, 2005 Advanced Bioinformatics Computing
2
Project Goal To make the normalized Frisina data easy and convenient to work with To avoid having to work with enormous text files of seemingly meaningless numbers
3
Project Goals This will be accomplished by: - Putting the data into a database - Making the database easy to interact with as well - Making the database available to whoever needs it - Giving the data some sort of context
4
Methods One of the most convenient ways of doing this is to: - Use a relational database to store the data - Give the database a web interface, which is convenient to use and readily available - Link that data to other available data from Affymetrix and other sources
5
Methods These goals will be reached using current database and web technology. For the back-end database, MySQL will be used. For the web interface, JSP (JavaServer Pages) will be used.
6
Reasons for MySQL MySQL will be used due to its speed. Competing systems, like Postgres, were considered; however, more fully featured (yet slower) systems were not necessary. - The data will be manipulated using only SELECTs - MySQL has fewer features than other systems, which makes it faster and thus better suited for use in web applications
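A minimal sketch of the read-only access pattern described above. The table name `expression` and its columns are hypothetical, and Python's built-in sqlite3 module stands in for MySQL so the example runs without a server; the sample values are taken from the example expression data in this deck.

```python
import sqlite3

# Hypothetical schema: one row per (probe set, sample) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression (probe_id TEXT, sample TEXT, value REAL, "
    "PRIMARY KEY (probe_id, sample))"
)

# Load the data once; after this the application only reads.
rows = [
    ("1415672_at", "X16", 14.2636987581270),
    ("1415672_at", "X17", 14.8166925938434),
    ("1415673_at", "X16", 10.6382802704383),
    ("1415673_at", "X17", 10.8947849214261),
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)


def mean_expression(connection, probe_id):
    """Read-only query: average value of one probe set across samples."""
    cur = connection.execute(
        "SELECT AVG(value) FROM expression WHERE probe_id = ?", (probe_id,)
    )
    return cur.fetchone()[0]


print(mean_expression(conn, "1415672_at"))
```

Because the web interface would only issue SELECTs like this one, the database needs no transaction or write-concurrency features, which is the argument the slide makes for a lighter system.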
7
Reasons for JSP JSP has well known advantages; it is: - Efficient - Convenient - Powerful - Inexpensive - Portable - Secure - Java based
8
JSP Perl and CGI were considered, but JSP was chosen because: - It is a current web technology utilized by many major corporations - It seems more convenient and full-featured than a Perl/CGI approach - It fits current multi-tier database architectures better than CGI, since the Java API and JSP were developed with them in mind - I will be working with JSP on co-op, so I wanted to brush up on it (or rather, learn it) before then
9
Data Expansion Once the data has been entered into a MySQL database and given a moderately flexible web interface, it will also be linked to other sources - Affymetrix data from their site - Other sites like NCBI or GenBank? - Linking data to new sources as needed should be fairly easy
10
Finally… In the end, an expandable system will have been created that hopefully can be used in a real world application. Even if it isn’t, at least I will have gotten the experience in developing such a system with a new technology (JSP), and continued in the Java nature of the course.
11
Questions Any questions?
12
Visualization of Frisina’s Research Data Using University of Maryland’s Treemap 4.1 John Boutell and Tom Maxon
13
Procedure Transform Frisina flat files into Treemap flat files or Excel files Determine relationships Determine organization / visualization preferences
14
File Transformation Treemap file considerations – The file begins with a line listing the variables to be considered. The next line gives the definitions of those variables. The subsequent lines consist of data, with the relationships for each record following its list of data values.
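A minimal sketch of a writer for the layout described above: a line of variable names, a line of variable definitions, then one data line per record with its hierarchy appended. The tab delimiter, the type keywords `STRING` and `FLOAT`, and the function name are assumptions for illustration and should be checked against the Treemap documentation.

```python
import io


def write_treemap_file(out, names, types, records):
    """Write a flat file: names line, types line, then one line per record.

    Each record is (values, hierarchy): the attribute values come first and
    the hierarchy path (outermost layer first) follows them.
    """
    out.write("\t".join(names) + "\n")
    out.write("\t".join(types) + "\n")
    for values, hierarchy in records:
        fields = [str(v) for v in values] + list(hierarchy)
        out.write("\t".join(fields) + "\n")


buf = io.StringIO()
write_treemap_file(
    buf,
    names=["probe_id", "expression"],
    types=["STRING", "FLOAT"],  # assumed type keywords
    records=[(["1415672_at", 14.26], ["Young", "X16"])],
)
print(buf.getvalue())
```

Here the hierarchy `["Young", "X16"]` puts the age group as the outer layer and the sample as the inner one; which layers to use is the decision the next slide discusses.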
15
Determining Relationships A maximum of four layers can be used, so we’ll need to determine what the four layers should be. Example: Middle-aged vs. Young vs. Old could be one layer.
16
Organization and Visualization Determination This step will consist of ordering data and arranging coloration and spacing to ensure that the visualization is easily understood.
17
Obtaining Information Regarding Mouse Array Genes Chris Parkin April 28, 2005
18
Overview: Research involves expression data from Affymetrix mouse chip 430a Thousands of genes found on this gene chip, any of which could be of importance
19
Overview: Each gene in the expression data is given an accession number
Example Expression Data:
            X16_Frisina_S2_M430A.CEL  X17_Frisina_S2_M430A.CEL  X25_Frisina_S2_M430A.CEL  X36_b_Frisina_S2_M430A.CEL
1415672_at  14.2636987581270          14.8166925938434          14.7202558244306          14.7153893835085
1415673_at  10.6382802704383          10.8947849214261          9.7992056002344           10.0489561960792
1415675_at  12.6363495581221          12.310695824458           11.7665991587842          11.7192886280750
1415677_at  11.9224599733792          11.6230373622742          11.0882276072649          11.1584524620751
1415678_at  14.3403000148085          14.3258513901380          14.2753594390197          14.3758483552046
1415679_at  15.0959031716503          14.8066829033559          14.6876918364335          14.5911816158065
1415680_at  11.4203757035264          11.4120007012393          11.2384462748424          11.3684779023244
1415681_at  12.3004566771331          11.7383490484824          11.4995261583693          11.3078357750632
20
Overview: Gene information based on accession # is available at the Affymetrix website, but retrieving it is a tedious process Some of the information may not be that useful for this particular research
21
Project Goal: Develop a useful online tool for obtaining information about genes on the mouse chip Two powerful tools to be used in developing this: Perl & NCBI
22
Information to Include: Nucleotide sequence & amino acid translation NCBI Definition: what metabolic role this sequence plays a part in Any available links to PubMed articles Homology groups (using NCBI’s “HomoloGene”) Any available information in NCBI’s “Gene” database (descriptions, lineage, ontology…)
23
Questions?
24
Gene Group Correlation Presented by –Andrew Darling
25
Outline of Presentation Problem Statement Gene Group Correlation Methods Results Discussion Conclusion
26
Problem Statement Using ~20,000 expression levels taken from ~40 mice of various ages, find the genes responsible for progressive age related hearing loss in mice.
27
Gene Group Correlation Search for genes with expression levels –Grouping similarly to the 4 mouse test groups –Corresponding to the severity of the hearing impairment –Excluding genes involved in functions unrelated to hearing impairment
28
Methods For each “gene” –Gather expression levels for each mouse –Segregate each expression level by mouse group –Apply mean and deviation calculations for each group –Calculate metric for quality of segregation Do expression levels segregate by mouse group Repeat for each gene Sort for highly segregated (by group) expression values
29
Methods – examples 1 & 2 Gene 1 –Young mice levels = 1, 1, 1, 1, 1, 1, 1, 1 –Middle mice levels = 3, 3, 3, 3, 3, 3, 3, 3 –Old mice levels = 6, 6, 6, 6, 6, 6, 6, 6 –Severe mice levels = 9, 9, 9, 9, 9, 9, 9, 9 –Conclusion – highly segregated by group in order of severity Gene 2 –Young mice levels = 1, 1, 2, 2, 3, 3, 4, 4 –Middle mice levels = 3, 3, 4, 4, 5, 5, 6, 6 –Old mice levels = 5, 5, 6, 6, 7, 7, 8, 8 –Severe mice levels = 6, 6, 7, 7, 8, 8, 9, 9 –Conclusion – mostly segregated by group in order of severity
30
Methods – examples 3 & 4 Gene 3 –Young mice levels = 1, 2, 3, 4, 5, 6, 7, 8 –Middle mice levels = 1, 2, 3, 4, 5, 6, 7, 8 –Old mice levels = 1, 2, 3, 4, 5, 6, 7, 8 –Severe mice levels = 1, 2, 3, 4, 5, 6, 7, 8 –Conclusion – not segregated by group Gene 4 –Young mice levels = 1, 1, 1, 1, 2, 2, 2, 2 –Middle mice levels = 7, 7, 7, 7, 8, 8, 8, 8 –Old mice levels = 5, 5, 5, 5, 6, 6, 6, 6 –Severe mice levels = 3, 3, 3, 3, 4, 4, 4, 4 –Conclusion – mostly segregated by group not in order of severity
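The four example genes can be scored with a short sketch. The metric here is an assumption, since the slides say the real parameters were still being worked out: sort the groups by mean, divide each gap between neighbouring means by the sum of the two groups' deviations plus 1 (to avoid dividing by zero), and take the smallest ratio. The function names are hypothetical.

```python
from statistics import mean, pstdev

GROUPS = ["young", "middle", "old", "severe"]  # ordered by severity

GENES = {
    "Gene 1": {"young": [1] * 8, "middle": [3] * 8,
               "old": [6] * 8, "severe": [9] * 8},
    "Gene 2": {"young": [1, 1, 2, 2, 3, 3, 4, 4],
               "middle": [3, 3, 4, 4, 5, 5, 6, 6],
               "old": [5, 5, 6, 6, 7, 7, 8, 8],
               "severe": [6, 6, 7, 7, 8, 8, 9, 9]},
    "Gene 3": {g: [1, 2, 3, 4, 5, 6, 7, 8] for g in GROUPS},
    "Gene 4": {"young": [1, 1, 1, 1, 2, 2, 2, 2],
               "middle": [7, 7, 7, 7, 8, 8, 8, 8],
               "old": [5, 5, 5, 5, 6, 6, 6, 6],
               "severe": [3, 3, 3, 3, 4, 4, 4, 4]},
}


def segregation_score(levels):
    """Smallest gap between neighbouring group means, scaled by spread.

    Assumed metric, not the presenter's: larger means cleaner separation,
    0 means at least two groups share the same mean.
    """
    stats = sorted((mean(levels[g]), pstdev(levels[g])) for g in GROUPS)
    return min(
        (m2 - m1) / (s1 + s2 + 1.0)
        for (m1, s1), (m2, s2) in zip(stats, stats[1:])
    )


def in_severity_order(levels):
    """True if group means strictly rise or strictly fall with severity."""
    means = [mean(levels[g]) for g in GROUPS]
    pairs = list(zip(means, means[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


# Sort genes from most to least segregated and report the ordering check.
for name in sorted(GENES, key=lambda n: segregation_score(GENES[n]), reverse=True):
    print(name, round(segregation_score(GENES[name]), 3),
          in_severity_order(GENES[name]))
```

Under this assumed metric Gene 1 scores highest, Gene 3 scores zero, and Gene 4 separates cleanly but fails the severity-order check, which matches the conclusions on the example slides.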
31
Results Coding still in process Working out a few parameters –Whether to sort by Distance of group means from each other Size of sigma for each group Mutually exclusive grouping Ordering of group means by severity
32
Discussion Quality of prediction of related genes based on quality of correlation theory –Presumes related gene expression is progressive and consistent –Presumes a quality of gene expression level measurement Further validation possible by sorting for redundant hits –Sequences referenced by several probes on the chip –Several similar probes each correlating highly
33
Conclusion If this works, it’s a freaking miracle
34
Gene Selection What level Of what gene Does what?
35
Clustering Radial Basis Neural Network Develop clustering using 2 “old” data sets Test with all 4 data sets to verify that it clusters correctly Generates weights to form the clusters
36
Anfis Tool to extract the neural network “rules” Gives a formula based on all the inputs to show given any set of input what value it will generate It is possible to extract the exact impact of each input from this formula.
37
Anfis Cont’ However Computationally very expensive Training time for this type of network increases by a factor of 3 for each added line of input. Time to train would be in the order of –10 * 3^22680 seconds (3^24 secs ≈ 10000 yrs)
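The scale of that estimate can be checked with a few lines of arithmetic. This sketch assumes a year of 365.25 days and works with logarithms for the large exponent, because 3^22680 is far too big for a floating-point number; the function names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years(seconds):
    """Convert a duration in seconds to years."""
    return seconds / SECONDS_PER_YEAR


def log10_seconds(added_inputs, base_seconds=1.0):
    """log10 of the training time if each added input line triples it."""
    return math.log10(base_seconds) + added_inputs * math.log10(3)


# 3^24 seconds expressed in years (the slide rounds this to 10000).
print(round(years(3 ** 24)))

# Order of magnitude, in seconds, of 10 * 3^22680.
print(math.floor(log10_seconds(22680, base_seconds=10)))
```

3^24 seconds comes to roughly 8,950 years, the same order as the slide's figure, so 22680 inputs is hopelessly out of reach for this kind of training.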
38
Weights Data values influence the weights To eliminate those influences the values must be converted to binary values. A set of threshold values is needed
39
Input For each variable these thresholds are used
–Median  Mean
–25/75   75/25
–10/90   90/10
–0/100   100/0
Each of those data sets is combined into one large training set.
40
Where I’m going with this What the network will learn is to classify the data by each of those sets –Does this already except for the all or nothing case
41
Where I’m going with this Analyze the weights –By distance between weights of opposite categories
42
What does a large differentiation mean Should point at –The gene of importance –The level of expression where the change occurs
43
Data Set Each of those data sets are combined into one large training set.
44
Identify Classifying Genes of Presbycusis Alex Haugh
45
Project Outline Step 1 – Calculate the mean of each of the datasets (Young, Midage, Mild, Severe). Step 2 – Find a set of genes that have unique expressions for each type. Step 3 – Test the ability of these genes to classify each type from training sets. Step 4 – Plot the expression levels of these genes throughout the mouse life cycle.
46
Step 1: Getting the Mean
1. Parse the files given to us by Tex.
2. Take those values and get a ‘Pre’ average.
3. Calculate the standard deviation.
4. Remove any values that are not contained within 95%.
5. Calculate the ‘Post’ average with those expression levels removed.
6. Record them in a new condensed file format:
Gene     Expression
at17186  10.56574
at17187  8.96768
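A minimal sketch of steps 2 to 5. It assumes that "within 95%" means within 1.96 standard deviations of the ‘Pre’ average, which covers about 95% of a normal distribution; the slides do not say how the cut-off was actually defined, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def trimmed_mean(values, z=1.96):
    """'Post' average after dropping values far from the 'Pre' average.

    Assumption: a value is kept if it lies within z standard deviations
    of the mean of all values.
    """
    pre = mean(values)
    sd = pstdev(values)
    kept = [v for v in values if abs(v - pre) <= z * sd]
    return mean(kept)


# Nine consistent readings and one outlier for a single gene.
levels = [10.0] * 9 + [20.0]
print(mean(levels), trimmed_mean(levels))
```

In this example the ‘Pre’ average is pulled up to 11.0 by the outlier, and the ‘Post’ average returns to 10.0 once it is removed.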
47
Step 2: Calculating Classifying Genes
1. Read in each of the newly condensed files.
2. Place all of the values into a data structure.
3. Compare all of the values of a gene against all other types and record those genes which are greater than or less than a given threshold value.
4. Narrow down the genes to a much smaller set.
5. Record the genes in a file for use later:
-------- HIGHER --------    -------- LOWER --------
at17186  10.56574           at15686  5.68869
at17187  8.96768            at17122  7.76859
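Step 3 of this list can be sketched as follows. The reading of "greater than or less than a given threshold" as "differs from every other type by more than the threshold" is an assumption, as are the function name, the threshold of 1.0, and every value below except the two Young means taken from the example file.

```python
def classifying_genes(means, target, threshold):
    """Genes whose mean in `target` stands apart from every other type.

    `means` maps type name -> {gene: mean expression}. A gene is HIGHER if
    it exceeds all other types by more than `threshold`, and LOWER if it
    falls below all of them by more than `threshold`.
    """
    others = [t for t in means if t != target]
    higher, lower = {}, {}
    for gene, value in means[target].items():
        diffs = [value - means[t][gene] for t in others]
        if all(d > threshold for d in diffs):
            higher[gene] = value
        elif all(d < -threshold for d in diffs):
            lower[gene] = value
    return higher, lower


# Made-up means for the other types; at00001 is a made-up gene.
means = {
    "Young":  {"at17186": 10.56574, "at15686": 5.68869, "at00001": 8.0},
    "Midage": {"at17186": 9.0, "at15686": 7.0, "at00001": 8.1},
    "Mild":   {"at17186": 8.5, "at15686": 7.2, "at00001": 7.9},
    "Severe": {"at17186": 8.0, "at15686": 7.5, "at00001": 8.2},
}
higher, lower = classifying_genes(means, "Young", threshold=1.0)
print("HIGHER", higher)
print("LOWER", lower)
```

Raising the threshold is one way to carry out step 4, since fewer genes clear a stricter cut-off.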
48
Step 3: Testing Classifying Genes 1.Read in the classifying genes for each type 2.Read in the unknown dataset 3.Subtract the unknown expression value from the classifying gene’s value and take the absolute value. 4.If the difference is less than the threshold value, record a plus one for that type. 5.Report the type with the most genes within the threshold. Note: Given 100 classifying genes per type and a threshold value of 0.35, there is a very high rate of accuracy.
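The voting procedure in these steps can be sketched directly. The function name and the Severe values are made up for illustration; the Young values and the 0.35 threshold come from the slides.

```python
def classify(unknown, classifiers, threshold=0.35):
    """Return (best type, votes per type) for an unknown dataset.

    `classifiers` maps type name -> {gene: expected expression} and
    `unknown` maps gene -> measured expression. A type earns one vote for
    each of its genes whose absolute difference is below the threshold.
    """
    votes = {}
    for type_name, genes in classifiers.items():
        votes[type_name] = sum(
            1
            for gene, expected in genes.items()
            if gene in unknown and abs(expected - unknown[gene]) < threshold
        )
    best = max(votes, key=votes.get)  # ties go to the first type listed
    return best, votes


classifiers = {
    "Young": {"at17186": 10.56574, "at17187": 8.96768},
    "Severe": {"at17186": 12.0, "at17187": 7.0},  # made-up values
}
unknown = {"at17186": 10.4, "at17187": 9.1}
print(classify(unknown, classifiers))
```

With 100 classifying genes per type, as in the note above, a handful of noisy genes would rarely change which type collects the most votes.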
49
Step 4: Tracking Levels 1.After testing the classifying genes from each type empirically, record these (hopefully about 20) 2.Record the average value for the gene from all types. 3.Graph the values 4.Observe and record the trends in each gene. 5.Report any genes that don’t follow the given trends.
50
Expectations I expect to find about 20 genes per type that classify ‘unknown’ datasets very well. I expect those genes to generally follow similar trends. I expect to have a program that can read in datasets and produce reliable results that can assist research by quickly identifying those genes which are outliers and unique.
51
ArrayView Coherent visualization of clustered microarray data. Madhu and Julia
52
Eisen Lab Software Cluster –Treeview –MapleTree FuzzyK –FuzzyExplorer –MapleTree
53
ArrayView Input Output from Cluster, FuzzyK –Convert to ArrayView datafile (XML) Attribute MySQL database –Gene title –Gene symbol –Public DB identifiers –Protein families, domains –Gene Ontology –Metabolic pathways
54
ArrayView Output Hierarchical –Tree filter Possible layouts: –BalloonTree, RadialTree, SquarifiedTreeMapLayout, TopDownTreeLayout, VerticalTreeLayout k-means –Graph filter Possible layouts: –ForceDirected, Random
55
Controls Change focus Rotate display Tool tips Zoom Filter Color code
56
Experimental data Cluster Frisina’s data –Cluster –FuzzyK View clustered Frisina data in ArrayView
57
Questions
58
Advanced Bioinformatics Computing Project Kyle Shenk & Laura Grell
59
Overview TIGR MultiExperiment Viewer (MeV) is a powerful analysis tool for microarray data. –Clustering –Classification –Visualization –Statistical Analysis We hope to use some of these tools to perform some analysis on the Frisina data
60
The Input File MeV requires an Affymetrix.txt file for input –Columns represent each individual sample – so in this case each mouse/experiment –Rows represent the individual genes –Data points are the normalized expression values
GeneName    Sample1   Sample2  Sample3     Sample4
MouseType   young     young    old_severe  old_mild
1415670_at  10.47015  13.195   9.620273    11.5090
61
Problem Dr. Frisina has provided us with four files –each representative of a different age group of mice The Affymetrix.txt file contains expression data from all samples We have to convert these four files into one large file that MeV can read and recognize
62
Solution Perl is an ideal language for editing/parsing text and generating files The program we developed reads in all four files and creates one large Affymetrix.txt file Basically the program consists of reading each file line by line and concatenating the line from one file onto the next
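The line-by-line concatenation can be sketched as follows. The presenters wrote their program in Perl; this is a Python illustration, not their code. It assumes each input is tab-separated with a header line of sample names, then one line per gene starting with the gene ID, and that genes appear in the same order in every file. The function name is hypothetical.

```python
import io


def merge_expression_files(files, out):
    """Join files column-wise: output line i combines line i of every file.

    `files` is a list of (mouse_type, file object) pairs.
    """
    readers = [(mouse_type, iter(f)) for mouse_type, f in files]
    headers = [(t, next(it).rstrip("\n").split("\t")[1:]) for t, it in readers]
    samples = [s for _, names in headers for s in names]
    types = [t for t, names in headers for _ in names]
    out.write("\t".join(["GeneName"] + samples) + "\n")
    out.write("\t".join(["MouseType"] + types) + "\n")
    for rows in zip(*(it for _, it in readers)):
        fields = [r.rstrip("\n").split("\t") for r in rows]
        gene = fields[0][0]
        if any(f[0] != gene for f in fields):
            raise ValueError("gene order differs between files at " + gene)
        values = [v for f in fields for v in f[1:]]
        out.write("\t".join([gene] + values) + "\n")


# In-memory stand-ins for two of the four input files.
young = io.StringIO("GeneName\tSample1\tSample2\n1415670_at\t10.47015\t13.195\n")
old = io.StringIO("GeneName\tSample3\n1415670_at\t9.620273\n")
merged = io.StringIO()
merge_expression_files([("young", young), ("old_severe", old)], merged)
print(merged.getvalue())
```

Checking the gene ID on every line guards against the main risk of this approach: if one file is sorted differently, plain concatenation would silently put values under the wrong gene.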
63
Kyle’s Solution page + page + page + page = BIG PAGE!!
64
TIGR MeV The next step is to utilize the TIGR MeV tools and analyze the results. –Expression Viewer –Expression Graphs http://www.tm4.org/mev.html
65
PRINCIPAL COMPONENT ANALYSIS OF THE FRISINA MICROARRAY DATA Presented by Lee Edsall April 28, 2005
66
OUTLINE What is Principal Component Analysis? Method Goals
67
WHAT IS PRINCIPAL COMPONENT ANALYSIS? Also referred to as “ PCA ” Analysis of the variation in the data to find a new set of variables to describe the data Goal is to decrease the number of variables required
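To make the idea concrete, here is a small sketch that finds the first new variable (the direction of greatest variance) by power iteration on the covariance matrix. It is for illustration only: the project itself planned to use Minitab, the function name is hypothetical, and a pure-Python loop like this would not scale to thousands of genes.

```python
def first_principal_component(rows, iterations=200):
    """Direction of greatest variance in `rows` (a list of equal-length lists).

    Uses power iteration on the sample covariance matrix. The start vector
    of all ones is assumed not to be orthogonal to the answer.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break  # no variance at all
        v = [x / norm for x in w]
    return v


# Two variables that move together: the second is always twice the first.
data = [[0, 0], [1, 2], [2, 4], [3, 6]]
print(first_principal_component(data))
```

In this example one new variable, pointing along the direction (1, 2), describes the data completely, so the two original variables reduce to one.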
68
METHOD Library research and literature review to understand method and determine appropriate parameters Use Minitab to determine the new variables for: Young data Middle age data Old with mild hearing loss data Old with severe hearing loss data Compare the four sets of variables to see if any of them are specific to a set of data
69
GOALS Determine if any genes uniquely identify a set of data Provide a much smaller number of genes to be used in future analysis
70
Comparing and analyzing the tools for the microarray data Shruti Sharma/Jennifer D’Souza
71
GEPAS - "Gene Expression Pattern Analysis Suite" Normalization Preprocessing Viewers Clustering Differential Expression Supervised Classification Data Mining & Analysis
72
MIAME – “Minimum Information About a Microarray Experiment” Interpret the results Reproduce the experiment.
73
EPCLUST- Expression Profile data CLUSTering and analysis Tool for Clustering Visualization Analysis for gene expression data as well as sequence data.
74
Cluster Performs cluster analysis – Hierarchical clustering – Self-organizing maps (SOMs) – k-means clustering – Principal component analysis Processes large microarray datasets
75
Links to the tools http://ep.ebi.ac.uk/EP/EPCLUST/ http://www.mged.org/Workgroups/MIAME/miame.html http://gepas.bioinfo.cnio.es/tools.html