Gene Regulatory Evolution
Many phenotypes that vary widely across mammals, including vocal learning, longevity, and brain size, have evolved through changes in gene expression, meaning that their differences across species are caused by differences in DNA sequence at cis-regulatory elements (CREs). Depending on whether they are accessible or inaccessible to DNA-regulatory proteins, CREs control how genes are expressed in tissues: the timing, the levels, and the location. While some of the genes involved in these phenotypes have been identified, the CREs responsible, and how sequence differences in those CREs have led to differences in gene expression, remain unknown. The 200 Mammals Project genomes and multi-species alignments are enabling us, for the first time, to compare CREs across hundreds of mammals. However, because many CREs are tissue-specific, characterizing them experimentally would require assays in many tissues from many species, which is not feasible. We therefore developed a machine learning model that takes what we know about tissue-specific CREs from well-studied mammalian species and predicts, using conservation, which CREs from other mammalian species are likely to show similar tissue-specific activity.
We can apply this model to find brain-specific CREs that may be responsible for differences in brain size, development, and activity across mammals. We trained our models on brain ATAC-seq data; ATAC-seq is a method that locates accessible DNA, or open chromatin, in the genome. We generated these data from mouse, rat, and rhesus macaque and gathered publicly available brain ATAC-seq data from humans. The model achieved high accuracy in classifying brain CREs on the validation set (AUPRC = 0.88). We used our models to predict brain CRE activity across hundreds of mammals from the 200 Mammals Project. We demonstrated that similarity in predictions between species is strongly anti-correlated with evolutionary distance: the more distantly related two species are, the less similar their predictions. We then used our predictions to identify clade-specific CREs, meaning CREs that are accessible in most species within a clade but in few species outside it, and showed that these predictions were consistent with our data. Our approach to predicting conserved CRE activity can be applied to any tissue or cell type for which open chromatin data are available from multiple species. In addition, our predictions can be connected to mammalian phenotypes that have evolved through changes in gene expression. We anticipate that the 200 Mammals Project genomes and multi-species alignment will enable researchers to extend our work to identify CREs that are likely to be involved in the most variable mammalian phenotypes.
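To make the last two analyses concrete, here is a minimal Python sketch of how one might compare prediction similarity with evolutionary distance and flag clade-specific CREs. It is an illustration under stated assumptions, not the pipeline used in this work: the function names, the toy numbers, the use of Pearson correlation as the similarity measure, and the thresholds (0.5 for "predicted accessible", 80% of clade species, 20% of other species) are all hypothetical choices.

```python
import numpy as np

def similarity_vs_distance(preds, dist):
    """Correlate pairwise prediction similarity with evolutionary distance.

    preds: (n_species, n_cres) predicted probabilities of open chromatin.
    dist:  (n_species, n_species) symmetric evolutionary distance matrix.
    Similarity is the Pearson correlation between two species' prediction
    vectors (an assumption made for this sketch).
    """
    sim = np.corrcoef(preds)                # species-by-species similarity
    pairs = np.triu_indices_from(sim, k=1)  # each unordered pair once
    return np.corrcoef(sim[pairs], dist[pairs])[0, 1]

def clade_specific_cres(preds, in_clade, active=0.5, min_in=0.8, max_out=0.2):
    """Indices of CREs predicted accessible in most clade species, few others.

    The three thresholds are illustrative assumptions.
    """
    in_clade = np.asarray(in_clade, dtype=bool)
    is_active = preds >= active
    frac_in = is_active[in_clade].mean(axis=0)    # fraction active in clade
    frac_out = is_active[~in_clade].mean(axis=0)  # fraction active elsewhere
    return np.flatnonzero((frac_in >= min_in) & (frac_out <= max_out))

# Toy data: 4 hypothetical species (rows) x 4 hypothetical CREs (columns).
preds = np.array([
    [0.9, 0.8, 0.1, 0.9],  # species A (in clade)
    [0.8, 0.9, 0.2, 0.8],  # species B (in clade)
    [0.9, 0.1, 0.1, 0.2],  # species C
    [0.8, 0.2, 0.9, 0.1],  # species D
])
in_clade = [True, True, False, False]

# Made-up evolutionary distances; A and B are the closest pair.
dist = np.array([
    [0.0, 0.1, 0.5, 0.6],
    [0.1, 0.0, 0.5, 0.6],
    [0.5, 0.5, 0.0, 0.4],
    [0.6, 0.6, 0.4, 0.0],
])

r = similarity_vs_distance(preds, dist)
specific = clade_specific_cres(preds, in_clade)
print(f"similarity vs. distance correlation: {r:.2f}")
print("clade-specific CRE indices:", specific.tolist())
```

In this toy example the correlation is negative, mirroring the anti-correlation described above, and the CREs at indices 1 and 3 are flagged because they are predicted accessible in both clade species and in neither of the others. The CRE at index 0 is excluded because it is predicted accessible in every species, so it is conserved rather than clade-specific.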
Words & Story by Irene Kaplow
Collaborators: Morgan Wirthlin, Alyssa Lawler, Xiaoyu Zhang, Ashley Brown, and Andreas R. Pfenning