Gene Family Size Distributions Brought to You By Your Neighborhood Durand Lab Narayanan Raghupathy Nan Song Rose Hoberman.


1 Gene Family Size Distributions Brought to You By Your Neighborhood Durand Lab Narayanan Raghupathy Nan Song Rose Hoberman

2 Why Are We Interested in Gene Family Size Distributions?
- Want to find homologous chromosomal regions
  - Genes as markers
  - Matches between genes indicate possible regional homology
- Cluster statistics depend on
  - The total number of matches
  - The distribution of matches

3 What do we mean by a “Family”?
- Ideally: a group of sequences that have arisen from a common ancestor
- In practice: families are most often defined based on
  - Similar structure
  - Similar sequence

4 Families can be defined at many levels
- Either domains or whole proteins can be grouped
- Source: Orengo CA, Thornton JM. Protein families and their evolution--a structural perspective.

5 Why are other people interested in gene family sizes?
- To understand protein family evolution
  - Fit birth/death model to the data
- To predict how many more genes there are in certain families

6 How Can Genes Be Grouped Into Families?
- Construct and analyze gene trees
  - Slow, requires manual supervision
  - Tree construction is error-prone
- Group based on structural similarity
  - Structure may be similar even if not homologous
  - Structure is generally not known
- Cluster genes based on sequence similarity
  - Heuristic
  - Fast and comprehensive, even for large datasets

7 Clustering
- Group together genes with similar E-values (or other sequence-based score)
- Many heuristics have been proposed

8 Why bother with clustering heuristics?
- May not find true “gene families”
- May be throwing away true matches
- May be including extra noise
- However, may still be preferable to allowing only 1-to-1 matches

9 [Figure; labels: Chromosome 5, Chromosome 3]

10 Existing Gene Family Data
- Data for individual species
  - Recent data is only for bacteria
- Data from multiple species
  - Large sets of species: eukaryotes + prokaryotes

11 The properties of protein family space depend on experimental design (Kunin et al., Bioinformatics 2005)

12 Our Questions
- What does the gene family size distribution (GFSD) look like?
- How much does the clustering method affect the GFSD?
- How much does the cluster E-value threshold affect the GFSD?
- How much does the GFSD vary across species?
- Can we fit the GFSD to a particular function?
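As a rough illustration of the first and last questions, the sketch below tallies a GFSD from a gene-to-family assignment and fits a straight line in log-log space. The power-law form is an assumption chosen for illustration (the slides do not say which function was tried), and the names `family_size_distribution`, `fit_power_law`, and the toy data are hypothetical. A least-squares fit on log-log counts is only a quick diagnostic, not a rigorous test of the functional form.

```python
import math
from collections import Counter


def family_size_distribution(gene_to_family):
    """Return {family size: number of families with that size}."""
    sizes = Counter(gene_to_family.values())   # family id -> number of genes
    return dict(Counter(sizes.values()))       # family size -> number of families


def fit_power_law(distribution):
    """Least-squares line through (log size, log count).

    Returns (slope, intercept); under the ASSUMED model
    count ~ exp(intercept) * size**slope.
    """
    xs = [math.log(size) for size in distribution]
    ys = [math.log(count) for count in distribution.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy assignment: four singletons, two pairs, one family of four.
toy = {}
for family, size in [("A", 1), ("B", 1), ("C", 1), ("D", 1),
                     ("E", 2), ("F", 2), ("G", 4)]:
    for i in range(size):
        toy[f"{family}{i}"] = family

gfsd = family_size_distribution(toy)
slope, intercept = fit_power_law(gfsd)
```

Running the same two functions on the output of each clustering method and threshold would give one way to compare how the GFSD shifts between them.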

13 Our Analysis
- Species
  - Yeast vs Yeast (5131 Genes)
  - Mouse vs Mouse (7343 Genes)
  - Human vs Human (10610 Genes) IN PROGRESS
- Clustering Methods
  - Hierarchical Clustering
    - Multiple variants
    - 5 E-value thresholds
  - TribeMCL
    - 5 inflation parameters

14 Hierarchical Clustering
- Method
- Threshold
- Complete linkage
- Average linkage
- Single linkage
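The three linkage rules differ only in how the distance between two clusters is derived from the pairwise gene distances. The following is a minimal, deliberately naive sketch of threshold-based agglomerative clustering. It assumes a precomputed dissimilarity for every gene pair; how the analysis in these slides turned E-values into distances is not stated, so `dist`, `cluster`, and the toy numbers are hypothetical.

```python
from itertools import combinations

# How to combine the pairwise distances between two clusters.
LINKAGES = {
    "single": min,                               # closest pair decides
    "complete": max,                             # farthest pair decides
    "average": lambda ds: sum(ds) / len(ds),     # mean over all pairs
}


def cluster(genes, dist, threshold, linkage="single"):
    """Merge the closest clusters until none are within `threshold`."""
    link = LINKAGES[linkage]
    clusters = [frozenset([g]) for g in genes]

    def cluster_distance(a, b):
        return link([dist[frozenset((x, y))] for x in a for y in b])

    while len(clusters) > 1:
        (a, b), d = min(
            (((a, b), cluster_distance(a, b))
             for a, b in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy dissimilarities between four genes.
genes = ["a", "b", "c", "d"]
dist = {
    frozenset(("a", "b")): 1, frozenset(("b", "c")): 2,
    frozenset(("a", "c")): 5, frozenset(("a", "d")): 9,
    frozenset(("b", "d")): 9, frozenset(("c", "d")): 9,
}

single = cluster(genes, dist, threshold=4, linkage="single")
average = cluster(genes, dist, threshold=4, linkage="average")
complete = cluster(genes, dist, threshold=4, linkage="complete")
```

On this toy input, single and average linkage pull gene `c` into the `a`/`b` family, while complete linkage leaves it out because `a` and `c` are farther apart than the threshold. This is the sense in which the choice of linkage can change family sizes at a fixed threshold.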

15 TribeMCL
- Inflation parameter (difficult to interpret)
  - 4-5: small clusters
  - 1.1-3: larger clusters
- However, clusters do not strictly increase in size when the inflation value is reduced, i.e., clusters are not hierarchical
- http://micans.org/mcl
- Markov clustering
  - More flow across higher-weight edges
  - How much total flow between each pair of genes?
- Handles multi-domain proteins?
- Very efficient
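To make the role of the inflation parameter more concrete, here is a small dense-matrix sketch of the Markov clustering iteration: expansion (squaring the column-stochastic matrix) alternates with inflation (raising entries to a power and renormalizing columns). It is a toy written for illustration, not the `mcl` program from micans.org; the function names, the fixed iteration count, the `eps` cutoff, and the example graph are all assumptions.

```python
def normalize_columns(m):
    """Rescale every column so that it sums to 1."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: entrywise power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, iterations=50, eps=1e-6):
    """Toy MCL on a symmetric, non-negative weight matrix."""
    n = len(weights)
    # Self-loops keep every column non-empty (assumed weight 1.0).
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that retain flow are attractors; their non-zero columns
    # are the members of that attractor's cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical graph: two triangles joined by one weak edge (2-3).
w = [[0.0] * 6 for _ in range(6)]
for i, j, weight in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                     (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                     (2, 3, 0.1)]:
    w[i][j] = w[j][i] = weight

families = mcl(w, inflation=2.0)
```

Inflation strengthens already-strong flows within each column and suppresses weak ones, so the weak edge between the two triangles carries less and less flow with each round. Because clusters come from where the flow settles rather than from successive merges, results at different inflation values need not nest inside one another.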

16 Mouse Complete-Linkage 10^-10 [Plot; axis labels: Log (gene family size), Gene family size]

17 Yeast Complete Linkage

