Virtual ChIP-seq: Predicting transcription factor binding by learning from the transcriptome

Karimzadeh M. and Hoffman MM. Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome. Genome Biology 23, 126 (2022). doi: https://doi.org/10.1186/s13059-022-02690-2.

Virtual ChIP-seq predicts transcription factor binding in any cell type from RNA-seq and ATAC-seq (or DNase-seq).

The Virtual ChIP-seq track hub contains genome-wide predictions for binding of 36 TFs in 33 different human tissues.

Predicting transcription factor binding

Virtual ChIP-seq uses a multi-layer perceptron to predict binding of individual transcription factors (TFs). It uses data on chromatin accessibility, genomic conservation, and binding characteristics of TFs from previous experiments in other cell types. It also learns from the association of gene expression and TF binding at different genomic regions. Because it incorporates existing ChIP-seq data, there is no longer a need to represent TF sequence preferences in the form of position weight matrices. For a new cell type with data on chromatin accessibility and gene expression, Virtual ChIP-seq predicts indirect TF binding, as well as binding of TFs without known sequence preference.
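To make the idea of an association between gene expression and TF binding concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the Virtual ChIP-seq implementation: it assumes the association is measured as a Pearson correlation, across cell types, between a gene's expression and the presence of a ChIP-seq peak in one genomic bin. The gene names, values, and the `pearson` helper are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression values (e.g. TPM) of three genes in four training cell types.
expression = {
    "GENE_A": [1.0, 2.0, 8.0, 9.0],
    "GENE_B": [9.0, 8.0, 2.0, 1.0],
    "GENE_C": [5.0, 5.0, 5.0, 5.0],
}

# Hypothetical ChIP-seq status of one genomic bin in the same four cell types
# (1 = TF bound, 0 = not bound).
bound = [0, 0, 1, 1]

# Association of each gene's expression with binding at this bin.
assoc = {gene: pearson(values, bound) for gene, values in expression.items()}

for gene, r in sorted(assoc.items()):
    print(f"{gene}\t{r:+.3f}")
```

In this toy example, GENE_A is positively associated with binding at the bin, GENE_B negatively, and GENE_C (constant expression) carries no information. In a new cell type, expression of such genes could then hint at whether the bin is bound.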

Accuracy of predictions

To build a generalizable classifier that performs well on new cell types with only transcriptome and chromatin accessibility data, we train the multi-layer perceptron on training cell types (A549, GM12878, HepG2, HeLa-S3, HCT-116, BJ, Jurkat, NHEK, Raji, Ishikawa, LNCaP, and T47D). We assess the performance of the model in validation cell types (K562, PANC-1, IMR-90, MCF-7, H1-hESC, and liver). For each TF, we use the posterior probability cutoff that maximizes the Matthews correlation coefficient (MCC) in H1-hESC. If we don't have ChIP-seq data for the TF in H1-hESC, we use the mode of the optimal cutoffs of the other TFs (0.4). Below, we report the median ± standard deviation of performance among validation cell types. Column N is the number of validation cell types for each TF.
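The cutoff selection described above can be sketched as a scan over candidate thresholds that keeps the one with the highest MCC. This is a minimal illustration with toy data, not the authors' code: the function names, the candidate grid, and the tie-breaking rule (lowest cutoff wins) are assumptions.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def best_cutoff(probs, labels, cutoffs):
    """Return (cutoff, mcc) for the cutoff with the highest MCC.

    A genomic bin is called bound when its posterior probability is >= cutoff.
    Ties are resolved in favor of the first (lowest) cutoff in `cutoffs`.
    """
    best_c, best_score = None, -2.0  # MCC is always >= -1
    for c in cutoffs:
        tp = fp = tn = fn = 0
        for p, y in zip(probs, labels):
            predicted = p >= c
            if predicted and y:
                tp += 1
            elif predicted and not y:
                fp += 1
            elif not predicted and y:
                fn += 1
            else:
                tn += 1
        score = mcc(tp, fp, tn, fn)
        if score > best_score:
            best_c, best_score = c, score
    return best_c, best_score

# Toy posterior probabilities and ChIP-seq labels for six genomic bins.
probs = [0.05, 0.10, 0.35, 0.45, 0.60, 0.90]
labels = [0, 0, 0, 0, 1, 1]
cutoffs = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

cutoff, score = best_cutoff(probs, labels, cutoffs)
print(cutoff, round(score, 3))
```

MCC is a reasonable criterion here because bound bins are a small minority of the genome, which is also why accuracy is close to 1 for every TF in the table while F₁ and MCC are much lower.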

Datasets and software on Zenodo

TF	F₁	Accuracy	MCC	auROC	auPR	N
ATF2	0.270±0.002	0.990±0.001	0.314±0.008	0.917±0.026	0.443±0.022	1
BHLHE40	0.334±0.021	0.997±0.000	0.356±0.010	0.974±0.002	0.382±0.010	1
CEBPB	0.510±0.091	0.992±0.002	0.515±0.072	0.964±0.017	0.534±0.073	3
CHD2	0.270±0.051	0.996±0.000	0.332±0.040	0.950±0.012	0.386±0.046	1
CREB1	0.362±0.131	0.997±0.002	0.371±0.121	0.868±0.135	0.335±0.174	2
CTCF	0.667±0.143	0.995±0.004	0.686±0.107	0.988±0.055	0.849±0.121	4
E2F1	0.256±0.097	0.998±0.002	0.314±0.078	0.978±0.019	0.291±0.105	2
ELF1	0.431±0.047	0.997±0.001	0.456±0.038	0.949±0.042	0.493±0.066	2
ELK1	0.430±0.069	1.000±0.000	0.465±0.054	0.991±0.009	0.420±0.054	2
ESR1	0.270±0.024	0.988±0.003	0.380±0.018	0.846±0.012	0.476±0.010	1
FOS	0.333±0.027	0.997±0.001	0.393±0.020	0.861±0.004	0.394±0.008	1
FOSL1	0.319±0.006	0.994±0.001	0.316±0.006	0.929±0.006	0.272±0.012	1
FOXA1	0.407±0.045	0.994±0.005	0.444±0.061	0.961±0.022	0.467±0.131	2
GABPA	0.298±0.049	0.994±0.002	0.393±0.036	0.986±0.012	0.496±0.036	3
GTF2F1	0.235±0.120	0.996±0.001	0.312±0.070	0.985±0.015	0.191±0.081	2
HCFC1	0.459±0.021	0.999±0.000	0.487±0.024	0.990±0.005	0.515±0.044	2
HDAC2	0.303±0.033	0.986±0.005	0.370±0.018	0.948±0.051	0.281±0.040	2
HSF1	0.350±0.149	1.000±0.000	0.378±0.145	0.999±0.012	0.309±0.240	1
JUN	0.218±0.127	0.998±0.001	0.311±0.153	0.983±0.009	0.456±0.257	2
JUND	0.363±0.080	0.994±0.002	0.399±0.053	0.971±0.020	0.370±0.078	3
MAFK	0.354±0.041	0.997±0.001	0.423±0.028	0.989±0.005	0.513±0.103	3
MAX	0.400±0.045	0.996±0.002	0.444±0.059	0.961±0.012	0.491±0.111	3
MAZ	0.370±0.025	0.997±0.001	0.422±0.019	0.987±0.005	0.493±0.070	2
MXI1	0.394±0.018	0.999±0.000	0.402±0.017	0.993±0.004	0.381±0.025	1
NRF1	0.668±0.051	1.000±0.000	0.680±0.046	0.996±0.018	0.725±0.062	2
RAD21	0.593±0.062	0.996±0.002	0.626±0.056	0.983±0.033	0.740±0.095	3
REST	0.482±0.120	0.999±0.001	0.493±0.091	0.985±0.008	0.567±0.095	3
SIN3A	0.389±0.048	0.998±0.002	0.394±0.029	0.966±0.004	0.411±0.037	3
SMC3	0.733±0.016	0.999±0.000	0.734±0.016	0.998±0.001	0.792±0.018	1
SRF	0.353±0.060	0.998±0.001	0.364±0.070	0.982±0.008	0.365±0.115	2
TAF1	0.378±0.073	0.999±0.001	0.437±0.097	0.987±0.009	0.490±0.168	3
TEAD4	0.344±0.061	0.990±0.002	0.385±0.020	0.967±0.023	0.343±0.019	2
TP53	0.275±0.103	1.000±0.000	0.382±0.086	1.000±0.008	0.660±0.222	1
USF1	0.353±0.047	0.993±0.001	0.382±0.040	0.891±0.012	0.372±0.046	1
USF2	0.410±0.040	0.999±0.000	0.427±0.028	0.982±0.007	0.437±0.032	1
YY1	0.397±0.049	0.996±0.001	0.408±0.058	0.945±0.043	0.417±0.104	2

Virtual ChIP-seq accepts chromatin accessibility data in narrowPeak format and RNA-seq data as a matrix in which rows are human gene symbols and columns are cell types (at least one column, for your cell type of interest). The RNA-seq values must be normalized for length and library size (RPKM, FPKM, and TPM are accepted; raw read counts are not). Generating the input tables for your TF of interest takes an average of 6 CPU hours (depending on the TF) and at least 8 GB of RAM. Applying the trained model takes less than 20 minutes for most TFs and datasets.
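If you only have raw read counts, they need to be normalized before use. The sketch below converts counts to TPM using the standard definition and lays the result out as a gene-by-cell-type matrix. It is an illustration only: the function names, the tab delimiter, the `gene` header label, and the toy counts and lengths are assumptions, so check the Virtual ChIP-seq documentation for the exact file layout it expects.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM for one sample.

    counts:     dict of gene symbol -> raw read count
    lengths_bp: dict of gene symbol -> gene (or effective transcript) length in bp
    """
    rates = {}
    for gene, count in counts.items():
        length = lengths_bp[gene]
        if length <= 0:
            raise ValueError(f"non-positive length for {gene}")
        rates[gene] = count / (length / 1000)  # reads per kilobase
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in rates}
    # Scale so that the values of one sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

def format_matrix(samples):
    """Render {cell_type: {gene: value}} as tab-separated text.

    Rows are gene symbols and columns are cell types. The delimiter and the
    'gene' header label are assumptions made for this sketch.
    """
    cell_types = list(samples)
    genes = sorted(set().union(*(samples[c].keys() for c in cell_types)))
    lines = ["\t".join(["gene"] + cell_types)]
    for gene in genes:
        values = [f"{samples[c].get(gene, 0.0):.2f}" for c in cell_types]
        lines.append("\t".join([gene] + values))
    return "\n".join(lines)

# Toy raw counts and gene lengths (made-up numbers) for one cell type.
counts = {"GAPDH": 1000, "ACTB": 500, "TP53": 0}
lengths_bp = {"GAPDH": 1000, "ACTB": 2000, "TP53": 2500}

tpm = counts_to_tpm(counts, lengths_bp)
matrix_text = format_matrix({"MyCellType": tpm})
print(matrix_text)
```

TPM divides by length before scaling to the library, so values are comparable between genes within a sample, which raw counts are not.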

Track hub, file access, and software

UCSC Genome Browser

View the Virtual ChIP-seq track hub in the UCSC genome browser.

There are 36 supertracks, one for each transcription factor. Each supertrack contains a bigBed9 track for Cistrome and ENCODE ChIP-seq data, and one bigWig file for the predicted binding of the TF in each of the Roadmap consortium datasets.

Using the track hub

There are 36 supertracks, one for each transcription factor. Each supertrack contains two bigBed9 files, one showing genomic bins with TF binding in Cistrome DB datasets, and one showing Virtual ChIP-seq predictions in the Roadmap consortium datasets.

View the Virtual ChIP-seq track hub in the UCSC genome browser.

List of Roadmap consortium tissue types with Virtual ChIP-seq predictions

Tissue	Age	ENCODE accession
adrenal gland	108day	ENCFF551HRI
B cell	37year	ENCFF444ZRC
CD14-positive monocyte	37year	ENCFF007TSW
CD4-positive helper T cell	21year	ENCFF276EBZ
CD8-positive-alpha-beta T cell	21year	ENCFF614QQR
fibroblast of skin of abdomen	97day	ENCFF696SPY
forelimb muscle	108day	ENCFF060JZA
heart	120day	ENCFF203FLV
hindlimb muscle	120day	ENCFF856UQI
kidney	108day	ENCFF577ZMC
large intestine	120day	ENCFF250JHL
left kidney	96day	ENCFF456NFP
left lung	108day	ENCFF610OWH
left renal cortex interstitium	120day	ENCFF602DIZ
left renal pelvis	120day	ENCFF714RWU
muscle of arm	127day	ENCFF517DTZ
muscle of back	127day	ENCFF066LTB
muscle of leg	127day	ENCFF207RZS
muscle of trunk	120day	ENCFF979SJD
ovary	NA	ENCFF916EFR
renal cortex interstitium	120day	ENCFF330NPA
renal pelvis	105day	ENCFF155DZV
right lung	105day	ENCFF828HED
right renal cortex interstitium	120day	ENCFF198WIN
right renal pelvis	120day	ENCFF832UZR
skin fibroblast	97day	ENCFF969YOA
small intestine	108day	ENCFF227RVA
spinal cord	113day	ENCFF412SKC
spleen	112day	ENCFF180AEX
stomach	127day	ENCFF803IZB
T-cell	37year	ENCFF410MHQ
testis	NA	ENCFF518XTM
thymus	127day	ENCFF178AYH

Software and documentation

Read the documentation for Virtual ChIP-seq software, which begins with a quick start.

Support

Please ask questions about Virtual ChIP-seq on our mailing list. If you want to report a bug or request a feature, use the Virtual ChIP-seq issue tracker. We welcome all comments on the package, including on the ease of installation and the documentation.

Source code

Bitbucket repository

Credits

Virtual ChIP-seq was developed by Mehran Karimzadeh during his PhD in the Michael Hoffman Lab.