PgmNr D197: Highly accurate prediction of early anterior-posterior enhancer sequences from ChIP-seq data.

Authors:
H. Arbel 1; P. Bickel 1; S. Celniker 2; M. Biggin 2; J. B. Brown 1,2


Institutes:
1) University of California, Berkeley, Berkeley, CA; 2) Lawrence Berkeley National Laboratory, Berkeley, CA.


Keyword: enhancers

Abstract:

In animals, definitive epigenetic signatures of enhancer elements have been challenging to identify: the best prediction tools offer weak positive predictive power at genome scale and are not accurate enough to support in silico enhancer annotation. In the early Drosophila embryo, a small cohort of ~40 transcription factors drives body patterning. Hence, fly development offers a simplified model system in which to study the relationship between transcription factor binding and tissue-specific enhancer activity. We studied the DNA binding patterns of 22 of these factors, as well as chromatin marks, during embryonic stages 4 and 5. We applied supervised machine learning to identify enhancer sequences active during embryogenesis using a test set of over 7000 whole-embryo in situ imaging experiments. We achieve nearly perfect predictive accuracy for early anterior-posterior (AP) enhancer sequences (>97%, suitable for whole-genome scans), while dorsal-ventral (DV) and other classes of enhancers are far more challenging to predict. Further examination of “false positives” identified by our methods reveals manual annotation errors in the labeling of in situ experiments, so our actual correct classification rate is likely higher than reported here. Well-predicted enhancers exhibit a unique epigenetic signature involving interactions between AP and DV transcription factors. Our model identifies nonlinear interactions between cohorts of transcription factors, suggesting the presence of combinatorial activation rules. Using a whole-genome scan, we predict that around 1.6% of 1 kb windows in the genome are likely AP enhancers (recovering 98% of known enhancers from our training and test data). For the first time, we demonstrate an enhancer prediction algorithm accurate enough to enable the near-comprehensive discovery of a subclass of enhancer sequences.
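
The link between classifier accuracy and suitability for whole-genome scans can be illustrated with a short calculation of positive predictive value (PPV) from Bayes' rule. This is a minimal sketch, not the authors' method. It assumes the ~1.6% of 1 kb windows predicted as AP enhancers can stand in for the true base rate, and that the 98% recovery of known enhancers can stand in for sensitivity. The specificity values are hypothetical and do not come from the abstract.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return the fraction of predicted positives that are true positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Assumption: the predicted fraction of AP enhancer windows (~1.6%)
# is used here as a stand-in for the true prevalence.
PREVALENCE = 0.016
# Assumption: the 98% recovery of known enhancers is treated as sensitivity.
SENSITIVITY = 0.98

# Hypothetical specificities, chosen only to show the trend.
for specificity in (0.90, 0.99, 0.999):
    ppv = positive_predictive_value(SENSITIVITY, specificity, PREVALENCE)
    print(f"specificity={specificity:.3f} -> PPV={ppv:.3f}")
```

Under these assumptions, a classifier with 90% specificity yields a PPV below 20%, because true enhancers are rare among genomic windows. This is why only very high per-window accuracy makes a genome-wide scan practical.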