Published by Elijah Simon. Modified over 5 years ago.
1
CAFE: a computational tool for the study of gene family evolution
Li Wang
3
What is a gene family?
1. A set of several similar genes.
2. Formed by duplication of a single original gene.
3. Generally with similar biochemical functions.
4
An ancestral gene duplication produces two paralogs (histone H1.1 and 1.2). A speciation event produces orthologs in the two daughter species (human and chimpanzee). Bottom: in a separate species, a gene has a similar function but a separate evolutionary origin, and so is an analog.
5
Summary
A gene family is a group of genes with similar sequence, structure and function, derived from the same ancestral gene. Members of one family are often scattered across different positions on the same chromosome, or across different chromosomes, and each gene has its own mode of expression regulation.
6
Phylogenetic tree of the Mup gene family
7
Classification
∙ Function: genes with similar functions are clustered into a family.
∙ Sequence similarity: genes that are homologous are generally considered to be a family.
∙ If the genes of a gene family encode proteins, the term protein family is often used in an analogous manner to gene family.
Gene families can therefore be classified by function and by sequence.
8
What can we do (flowchart)
First, obtain the genes and sequences of the gene family you want to predict. Download the hidden Markov model (HMM) of the known conserved protein domain from Pfam, then use hmmsearch to search the protein sequences with that domain model and make an initial screen of candidate family genes. Finally, extract the sequences of the screened candidates with bedtools getfasta.
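The initial screening step can be sketched in Python. This is only an illustration, assuming hmmsearch was run with its --tblout option and that the fifth whitespace-separated field of each row is the full-sequence E-value. The function name, the 1e-5 cutoff and the example rows are made up for this sketch.

```python
def filter_tblout(lines, max_evalue=1e-5):
    """Return target names whose full-sequence E-value is <= max_evalue.

    Assumes hmmsearch --tblout layout: field 0 is the target name and
    field 4 is the full-sequence E-value. The cutoff is an arbitrary choice.
    """
    hits = []
    for line in lines:
        # Skip blank lines and the comment/header lines that start with "#".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue and target not in hits:
            hits.append(target)
    return hits


# Made-up rows in tblout-like layout (truncated to the first seven fields).
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "geneA.p1  -  NB-ARC  PF00931  1.2e-40  135.0  0.1",
    "geneB.p1  -  NB-ARC  PF00931  0.5  10.2  0.0",
]
print(filter_tblout(example))
```

The surviving names are the candidates whose sequences would then be pulled out in the getfasta step.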
9
Biological significance
1. Gene family analysis can identify family genes unique to a species, which may be related to that species' own functions.
2. Single-copy gene families can be used to estimate divergence times.
3. Whether a family has expanded or contracted significantly may be related to the function of its biological molecules.
10
The evolution of gene family
Concerted evolution, nonfunctionalization, neofunctionalization
Concerted evolution: the members of a gene family within one species evolve together rather than independently (for example through gene conversion or unequal crossing-over), so paralogs stay more similar to each other than to their orthologs in other species.
Nonfunctionalization (degeneration): harmful mutations accumulate (non-synonymous mutations, splice-site mutations) and the gene loses its function, as in some pseudogenes.
Neofunctionalization: after duplication, one copy acquires mutations that give it a new function.
11
When examining gene families, there are several basic approaches to seeing how the family has evolved:
∙ the presence of related genes in other species
∙ the degree of conservation of sequence or domain structure among the family members
∙ the dispersion of the family on the chromosomes
12
Common examples
1. NBS-LRR: one of the largest families of disease-resistance genes in plants.
2. MADS-box: important transcription factors in plants that regulate growth, development and reproduction.
3. HSP70: a highly conserved molecular chaperone that assists in the proper folding of proteins in cells. Widely distributed.
13
Expansion and contraction
The expansion or contraction of gene families along a specific lineage can be due to chance, or can be the result of natural selection. Distinguishing between these two cases is often difficult in practice. Recent work uses a combination of statistical models and algorithmic techniques to detect gene families that are under the effect of natural selection.
14
Expansion and contraction
Comparing the number of genes in a gene family between two species: 5 > 2
Is the difference significant?
↓
CAFE: birth and death parameter
15
CAFE
1. A tool for the statistical analysis of the evolution of the size of gene families.
2. It uses a random birth and death process to model the evolution of gene family sizes over a phylogeny.
3. The main function of CAFE is to estimate one or more birth-death parameters.
16
Commands about CAFE
17
Download and Install CAFE
1. conda create -n orthofinder orthofinder=2.2.7
2. conda install cafe (CAFE v4.2.1)
Other tools: MCL, MMseqs2, r8s
18
Troublesome method
MCL: wget
MMseqs2: wget /releases/download/3be8f6/MMseqs2-Linux-AVX2.tar.gz
CAFE: git
Package file (compressed archive)
19
Preparing the input
1. Downloading the data (12 species)
2. Identifying gene families
3. Estimating a species tree
These 12 species are included in the software as example data.
20
Identifying gene families Identifying gene families within and among species requires a few steps.
Retain one representative transcript per gene, removing alternative splice isoforms and redundant genes
↓
Create a BLAST database and use blastp for an all-by-all comparison
↓
Cluster the blastp results using MCL
↓
Parse the output of MCL and use it as the input of CAFE
21
1. Moving all longest isoforms into a single file
To keep only the longest isoform of each gene, and to place the sequences from all species into a single .fa file for the next step, run the following commands in your shell:
# python python_scripts/cafetutorial_longest_iso.py -d twelve_spp_proteins/
# cat twelve_spp_proteins/longest_*.fa | seqkit rmdup - > makeblastdb_input.fa
The Python script extracts the longest isoforms, cat concatenates the per-species files, and seqkit rmdup removes duplicated records.
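The idea of keeping one longest isoform per gene can be sketched as follows. This is not the tutorial script itself. It assumes each protein record has already been reduced to a (gene ID, transcript ID, sequence) tuple, and the function name and example IDs are hypothetical.

```python
def longest_isoforms(records):
    """Keep the longest sequence seen for each gene.

    records: iterable of (gene_id, transcript_id, sequence) tuples.
    Returns {gene_id: (transcript_id, sequence)}; ties keep the first seen.
    """
    best = {}
    for gene_id, transcript_id, seq in records:
        current = best.get(gene_id)
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (transcript_id, seq)
    return best


# Made-up records: geneA has two isoforms, geneB has one.
records = [
    ("geneA", "geneA.t1", "MKT"),
    ("geneA", "geneA.t2", "MKTLLV"),
    ("geneB", "geneB.t1", "MSS"),
]
for gene_id, (transcript_id, seq) in longest_isoforms(records).items():
    # Print in FASTA-like form, one record per gene.
    print(f">{transcript_id} gene:{gene_id}")
    print(seq)
```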
22
2. All-by-all BLAST
1. First, build a BLAST database with makeblastdb:
# makeblastdb -in makeblastdb_input.fa -dbtype prot -out blastdb
2. Then use blastp to search every sequence against the database, so that similar sequences are found for each sequence:
# blastp -num_threads 20 -db blastdb -query makeblastdb_input.fa -outfmt 7 -seg yes > blast_output.txt &
23
3. Clustering sequences with mcl
Now we must use the output of BLAST to find clusters of similar sequences. These clusters will essentially be the gene families we will analyse with CAFE. Clustering is done with a program called mcl, and it takes a few commands. The first one is:
grep -v "#" blast_output.txt | cut -f 1,2,11 > blast_output.abc
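What the grep and cut pipeline does can be written out in Python for readers less used to shell one-liners. It is a sketch that assumes the default tabular BLAST columns, in which fields 1, 2 and 11 are the query, the subject and the E-value. The function name and the example row are made up.

```python
def blast_to_abc(lines):
    """Mimic: grep -v "#" blast_output.txt | cut -f 1,2,11

    Drops every line containing "#" (the comment lines of -outfmt 7) and
    keeps the tab-separated fields query, subject and E-value.
    """
    rows = []
    for line in lines:
        if "#" in line or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        rows.append("\t".join((fields[0], fields[1], fields[10])))
    return rows


# One comment line and one made-up hit with the 12 default columns.
example = [
    "# comment line written by blastp\n",
    "p1\tp2\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t400\n",
]
for row in blast_to_abc(example):
    print(row)
```

Each output row is one edge of the similarity network: two sequence labels and a weight.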
24
In the command above, we convert the blast output into ABC format, which mcl understands. Then with the following commands, we have mcl create a network and a dictionary file (.mci and .tab, respectively)
25
Create network files (.mci) and dictionary files (.tab)
# mcxload -abc blast_output.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o blast_output.mci --write-tab blast_output.tab &
26
Clustering based on mci files
# mcl blast_output.mci -I 3
↓
The main parameter to adjust is -I (inflation)
↓
It determines the granularity of the clustering: larger values give more, smaller and tighter clusters, while smaller values give fewer, larger clusters.
27
4. Final parsing of mcl output
The file obtained in the last section is still not ready to be read by CAFE: we need to parse it and filter it. Run the following command:
# python python_scripts/cafetutorial_mcl2rawcafe.py -i dump.blast_output.mci.I30 -o unfiltered_cafe_input.txt -sp "ENSG00 ENSPTR ENSPPY ENSPAN ENSNLE ENSMMU ENSCJA ENSRNO ENSMUS ENSFCA ENSECA ENSBTA"
Tip: "ENSG00" is the prefix that identifies the species in the Ensembl IDs.
29
Then there is one final filtering step we must perform. Gene families that have large gene copy number variance can cause parameter estimates to be non-informative. You can remove gene families with large variance from your dataset, but we found that putting aside the gene families in which one or more species have ≥ 100 gene copies does the trick.
# python python_scripts/cafetutorial_clade_and_size_filter.py -i unfiltered_cafe_input.txt -o filtered_cafe_input.txt -s
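The "≥ 100 gene copies" rule can be illustrated with a few lines of Python. This is a sketch of the rule only, not of the tutorial script. It assumes the counts have already been read into one dictionary per family, and the function name and counts are made up.

```python
def split_by_size(families, max_copies=100):
    """Separate families in which any species has >= max_copies genes.

    families: list of (family_id, {species: gene_count}) pairs.
    Returns (kept_ids, set_aside_ids).
    """
    kept, set_aside = [], []
    for family_id, counts in families:
        if any(n >= max_copies for n in counts.values()):
            set_aside.append(family_id)
        else:
            kept.append(family_id)
    return kept, set_aside


# Made-up counts for three families in two species.
families = [
    ("1", {"human": 5, "chimp": 2}),
    ("2", {"human": 120, "chimp": 3}),
    ("3", {"human": 99, "chimp": 100}),
]
kept, set_aside = split_by_size(families)
print("kept:", kept)
print("set aside:", set_aside)
```

The large families are put aside rather than discarded, so they can still be examined separately.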
30
Replace the original identifier prefixes with meaningful species names
# sed -i -e 's/ENSPAN/baboon/' -e 's/ENSFCA/cat/' -e 's/ENSBTA/cow/' -e 's/ENSNLE/gibbon/' -e 's/ENSECA/horse/' -e 's/ENSG00/human/' -e 's/ENSMMU/macaque/' -e 's/ENSCJA/marmoset/' -e 's/ENSMUS/mouse/' -e 's/ENSPPY/orang/' -e 's/ENSRNO/rat/' -e 's/ENSPTR/chimp/' filtered_cafe_input.txt
31
Estimating a species tree
1. The species tree is supplied in Newick format
↓
This software also uses species trees
32
2. We must make this tree ultrametric, which can be done using the program r8s.
↓
An ultrametric tree is also called a time tree: the branch lengths of the phylogenetic tree are rescaled to time, so the distance from the root to every species is the same.
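Whether a tree is ultrametric can be checked by summing branch lengths from the root to each tip. The sketch below uses a tiny hand-written Newick reader that only handles plain names and branch lengths. It is an illustration, not a replacement for r8s or a real tree library, and the example tree and its numbers are made up.

```python
def tip_depths(newick):
    """Return {tip_name: root-to-tip distance} for a simple Newick string."""
    text = newick.strip().rstrip(";")
    pos = 0

    def read_label_and_length():
        # Read "name:length" (either part may be missing) up to the next , ( or ).
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        name, _, length = text[start:pos].partition(":")
        return name.strip(), float(length) if length else 0.0

    def parse_subtree():
        # Return distances from the parent of this subtree down to each tip.
        nonlocal pos
        below = {}
        if text[pos] == "(":
            pos += 1  # consume "("
            while True:
                below.update(parse_subtree())
                if text[pos] == ",":
                    pos += 1  # another child follows
                    continue
                pos += 1  # consume ")"
                break
            _, length = read_label_and_length()
        else:
            name, length = read_label_and_length()
            below[name] = 0.0
        return {tip: d + length for tip, d in below.items()}

    return parse_subtree()


def is_ultrametric(newick, tol=1e-6):
    """True if all root-to-tip distances agree within tol."""
    depths = tip_depths(newick).values()
    return max(depths) - min(depths) <= tol


tree = "((human:6,chimp:6):87,mouse:93);"  # made-up branch lengths
print(tip_depths(tree))
print(is_ultrametric(tree))
```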
33
Running CAFE Estimating the birth-death parameter λ
The main goal of CAFE is to estimate one or more birth-death (λ) parameters for the provided tree and gene family counts. The λ parameter describes the probability that any gene will be gained or lost.
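To get a feeling for what a birth-death parameter means, the process can be simulated on a single branch. This is a toy sketch, not how CAFE computes likelihoods. It assumes gains and losses share one rate λ per gene per unit time, and the function name and the numbers used are made up.

```python
import random


def simulate_family_size(n0, lam, t, rng):
    """Simulate the size of one gene family after time t on a single branch.

    n0: starting number of genes; lam: gain rate and loss rate per gene
    per unit time (assumed equal); rng: a random.Random instance.
    """
    n, elapsed = n0, 0.0
    while n > 0 and lam > 0:
        # Each of the n genes can be gained or lost, so events occur at rate 2*lam*n.
        elapsed += rng.expovariate(2 * lam * n)
        if elapsed > t:
            break
        # A gain and a loss are equally likely because both rates are lam.
        n += 1 if rng.random() < 0.5 else -1
    return n


rng = random.Random(1)
sizes = [simulate_family_size(5, 0.002, 100, rng) for _ in range(10)]
print(sizes)
```

A family that reaches size 0 stays at 0 in this model, since there is no gene left to duplicate.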
34
Estimating a single λ for the whole tree
Estimating λ can be achieved if one types the following commands on CAFE’s shell
35
Understanding the output
First four lines
1st: the tree
2nd: Lambda = the estimated value for the whole tree
3rd: Lambda = branch reliability
4th: IDs of nodes = the numbers assigned to the different nodes. Here cat is 0, horse is 2, and the node joining cat and horse is 1.
Last four columns (the result for each gene family, taking the first row as an example)
1st: the ID of the input gene family
2nd: the Newick tree. In cat_59, the 59 means that this gene family has 59 genes in cat.
3rd: Family-wide P-value = indicates whether the gene family has expanded or contracted significantly. Here the value is 0.438, which means the change is not significant.
4th: when the p-value in the third column is less than 0.01, the fourth column indicates on which branches the gene family has changed. In the figure above only the family with ID 11 shows a change, but branches 0, 1, 2 and 4 did not change.
36
Venn diagram
For most of us here, this picture should be from our first lesson in high school mathematics: sets.
The red part is unique to A.
37
Definitions
∙ In the branch of mathematics called set theory, a sketch used to represent sets (or classes) in a less strict sense.
∙ A Venn diagram can be used to represent relationships between multiple data sets, and can also be used to perform set operations.
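The regions of a two-set Venn diagram correspond directly to set operations, which can be sketched with Python sets. The function name and the family IDs are made up for illustration.

```python
def venn_regions(a, b):
    """Return the three regions of a two-set Venn diagram."""
    a, b = set(a), set(b)
    return {
        "only_a": a - b,  # unique to A
        "both": a & b,    # shared by A and B
        "only_b": b - a,  # unique to B
    }


# Made-up gene family IDs found in two species.
species_a = ["fam1", "fam2", "fam3"]
species_b = ["fam2", "fam3", "fam4"]
regions = venn_regions(species_a, species_b)
for name, members in regions.items():
    print(name, sorted(members))
```

The sizes of these regions are the numbers that a Venn diagram tool prints inside each area.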
38
How to do ? Please remember two websites
39
1.https://bioinfogp.cnb.csic.es/tools/venny/
The page is displayed here. Enter your species name. The box below can be used to add data.
41
2.
42
The two websites are equally effective, but the second one can use more data.
The second website also lets you adjust the size and font of the figure, and gives a picture that shows the specific information.
43
The latest data analysis methods Sharing some experience during the test Detailed display
44
Thanks for listening