Protein-DNA binding: data, tools & models
Below is an annotated list of databases containing position-specific weight matrices or measured binding energies, and of third-party software that converts weight matrices into thermodynamic parameters to be used as input for the calculations. Our own software is described on a separate page. Note that the resources below cover TF-DNA binding only. See also the lists of online tools for nucleosome positioning and epigenetic modifications. Please feel free to contact me with suggestions or corrections.
*Entries are added in "newest first" order; there is no ranking.
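As a rough illustration of what converting a weight matrix into thermodynamic parameters can look like, here is a minimal sketch. It assumes a Berg-von Hippel style relation in which the mismatch energy of a base (in units of kT) is the log ratio of the most frequent base's count to that base's count, and that positions contribute independently. The function names, the pseudocount and the toy matrix are hypothetical and are not taken from any of the tools or databases listed here.

```python
import math

BASES = "ACGT"

def counts_to_energies(counts, pseudocount=1.0):
    """Convert a position count matrix into mismatch energies in units of kT.

    counts: list of dicts, one per motif position, mapping base -> count.
    The most frequent base at each position gets energy 0; every other
    base gets ln(n_best / n_base) > 0 (an assumed, simplified relation).
    """
    energies = []
    for column in counts:
        padded = {b: column.get(b, 0) + pseudocount for b in BASES}
        best = max(padded.values())
        energies.append({b: math.log(best / padded[b]) for b in BASES})
    return energies

def site_energy(energies, site):
    """Sum per-position energies, assuming positions contribute independently."""
    if len(site) != len(energies):
        raise ValueError("site length must match matrix width")
    return sum(col[base] for col, base in zip(energies, site))

# Toy 3-position count matrix (made-up numbers, not from any database).
toy_counts = [
    {"A": 18, "C": 1, "G": 0, "T": 1},
    {"A": 2, "C": 2, "G": 14, "T": 2},
    {"A": 0, "C": 0, "G": 1, "T": 19},
]
toy_energies = counts_to_energies(toy_counts)
print(site_energy(toy_energies, "AGT"))  # consensus site
print(site_energy(toy_energies, "CGT"))  # one mismatch at position 1
```

The pseudocount keeps zero counts from producing infinite energies. Real tools typically also account for background base frequencies and rescale the energies, which this sketch leaves out.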
The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE.
CollecTF compiles data on experimentally validated, naturally occurring TF-binding sites across the Bacteria domain. CollecTF entries are periodically submitted to NCBI for integration into RefSeq complete genome records as link-out features.
footprintDB is a database with 2422 unique DNA-binding proteins (mostly transcription factors, TFs), 3662 Position Weight Matrices (PWMs) and 10112 DNA Binding Sites extracted from the literature and other repositories. The binding interfaces of (most) proteins in the database are inferred from the collection of protein-DNA complexes described in 3D-footprint.
AthaMap provides a genome-wide map of potential transcription factor and small RNA binding sites in Arabidopsis thaliana.
This is a web portal for exploring ChIP-seq and DNase-seq data. It currently contains human and mouse datasets.
A database of CTCF-binding sites, CTCFBSDB, now contains almost 15 million CTCF-binding sequences in 10 species. It includes integrated CTCF-binding sites with genomic topological domains defined using Hi-C data. Additionally, the updated database includes new features enabled by new CTCF-binding site data, including binding site occupancy and the ability to visualize overlapping CTCF-binding sites determined in separate experiments.
HOCOMOCO contains 426 non-redundant curated binding models for 401 human TFs. DNA sequences of TF binding regions obtained by both pregenomic and high-throughput methods were collected from existing databases and other public data. The ChIPMunk software was used to construct positional weight matrices. Four motif discovery strategies were tested based on different motif shape priors including flat and periodic priors associated with DNA helix pitch. A quality rating was manually assigned to each model based on known binding preferences. An appropriate TFBS model was selected for each TF, with similar models selected for related TFs.
Factorbook is described in a recent publication: Wang et al. (2012). Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22: 1798–1812.
TFinDit is a relational database and a web search tool for studying transcription factor-DNA interactions. The database contains annotated transcription factor-DNA complex structures and related data, such as unbound protein structures, thermodynamic data, and binding sequences for the corresponding transcription factors in the complex structures. TFinDit also provides a user-friendly interface and allows users to either query individual entries or generate datasets through culling the database based on one or more search criteria.
A comprehensive database of 1226 motifs from 11 different sources. The site allows users to search the database with a regulatory site or matrix to identify the TFs most likely to bind the input sequence.
to be checked later
FlyTF currently contains 129 proteins for which PWMs are available.
TRANSFAC consists of free and paid sections. The binding sites it provides are experimentally verified. Human TF weight matrices can be viewed through the web interface of the UCSC Genome Browser.
The JASPAR CORE database contains a curated, non-redundant set of profiles, derived from published collections of experimentally defined transcription factor binding sites for eukaryotes. The prime difference from TRANSFAC is the open access to the data.
KDBI is a collection of experimentally determined kinetic data of protein-protein, protein-RNA, protein-DNA, protein-ligand, RNA-ligand, DNA-ligand binding events described in the literature.
ProNIT currently contains more than 4900 entries. Each entry has the protein and nucleic acid information, experimental conditions and the following binding thermodynamic data: dissociation constant Kd, energies, stoichiometry of binding and activity (Km and kcat).
UniPROBE contains data on the preferences of proteins for all possible sequence variants ('words') of length k ('k-mers'), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In total, the database currently hosts DNA binding data for 391 nonredundant proteins (individual proteins or in some cases heterodimers) from a diverse collection of organisms.
This is a personal collection. It currently contains ~50 matrices (Last checked: 06.10.2010).
- BindingDB - a public database of measured protein-small ligand binding affinities.
- DPInteract: DNA-protein interactions for E.coli. (Last updated in 1998).
The DeepBind algorithm is based on convolutional neural networks and can discover new patterns even when the locations of patterns within sequences are unknown. For training, DeepBind uses a set of sequences and, for each sequence, an experimentally determined binding score. Sequences can have varying lengths, and binding scores can be real-valued measurements or binary class labels. The authors (Alipanahi et al., 2015) claimed that this algorithm outperforms all 26 existing methods for protein-DNA specificity prediction previously compared by Weirauch et al., 2013. This is a standalone application, available for Windows and Linux.
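To illustrate why a convolutional detector does not need to know where a pattern sits in a sequence, here is a minimal sketch of the first stages of such a model: one-hot encoding, sliding a single motif detector along the sequence, rectifying, and max-pooling. It is not DeepBind's code. The function names and the hand-set kernel are hypothetical, and in a real model the weights would be learned from the binding scores.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv_max_score(seq, kernel, bias=0.0):
    """Slide one motif detector along seq, rectify, and max-pool.

    kernel: list of 4-element weight rows, one per detector position.
    Returns the largest rectified response over all offsets, so the
    result does not depend on where the pattern occurs.
    """
    x = one_hot(seq)
    width = len(kernel)
    if len(x) < width:
        raise ValueError("sequence shorter than kernel")
    best = 0.0
    for start in range(len(x) - width + 1):
        response = bias
        for j in range(width):
            response += sum(w * v for w, v in zip(kernel[j], x[start + j]))
        best = max(best, max(response, 0.0))  # rectify, then keep the maximum
    return best

# Hypothetical hand-set detector that rewards "TGA" (real weights are learned).
toy_kernel = [
    [-1.0, -1.0, -1.0, 1.0],  # favours T
    [-1.0, -1.0, 1.0, -1.0],  # favours G
    [1.0, -1.0, -1.0, -1.0],  # favours A
]
print(conv_max_score("CCTGACC", toy_kernel))  # pattern present
print(conv_max_score("CCCCCCC", toy_kernel))  # pattern absent
```

Because pooling takes the maximum over all offsets, sequences of different lengths yield a single score each, which is what allows training on variable-length input.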
BayesPI-BAR (Bayesian method for Protein-DNA Interaction with Binding Affinity Ranking) uses biophysical modeling of protein-DNA interaction to predict single nucleotide polymorphisms (SNPs) that cause significant changes in the binding affinity of a regulatory region for transcription factors (TFs). It takes TF chemical potentials or protein concentrations, and direct TF binding targets, as input. The authors claimed that the method compares favorably to existing programs such as sTRAP and is-rSNP when evaluated on the same SNPs. The method is described here.
- PhysBinder: improving the prediction of transcription factor binding sites by flexible inclusion of biophysical properties
A web tool that implements a flexible and extensible algorithm for predicting TFBS. The algorithm makes use of both direct readout (the sequence) and several indirect readout features of protein-DNA complexes (biophysical properties such as bendability or the solvent-excluded surface of the DNA). According to the authors, this algorithm significantly outperforms state-of-the-art approaches for in silico identification of TFBS. Users can submit FASTA sequences for analysis.
TRAP calculates binding affinity based on the matrix description of a given TF and a set of DNA sequences to be annotated (input). It requires the specification of two biophysically-motivated parameters. The freely available program code is written in C. Further details are available in the paper by Roider et al., 2007.
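The kind of calculation described above can be sketched as follows. This is a simplified, single-strand illustration of an occupancy-based affinity, not the TRAP C code. The parameter names `r0` (lumping TF concentration and consensus-site affinity) and `lam` (a scale converting matrix mismatch scores into units of kT) are assumptions standing in for the two parameters, and the toy energy matrix is made up.

```python
import math

def expected_occupancy(seq, energies, r0, lam):
    """Sum site-occupancy probabilities over all windows of one strand.

    energies: list of dicts base -> mismatch energy (0 for the best base).
    r0:  assumed parameter lumping TF concentration and consensus affinity.
    lam: assumed scale that converts mismatch scores into units of kT.
    """
    width = len(energies)
    total = 0.0
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        e = sum(col[base] for col, base in zip(energies, window))
        weight = r0 * math.exp(-e / lam)
        total += weight / (1.0 + weight)  # probability that this site is bound
    return total

# Toy 2-position energy matrix whose consensus is "AG" (illustrative values).
toy_energies = [
    {"A": 0.0, "C": 2.0, "G": 2.0, "T": 2.0},
    {"A": 2.0, "C": 2.0, "G": 0.0, "T": 2.0},
]
print(expected_occupancy("AG", toy_energies, r0=1.0, lam=1.0))
print(expected_occupancy("CCAGCC", toy_energies, r0=1.0, lam=1.0))
```

Unlike a score cutoff, this sum lets many weak sites contribute to the total affinity of a sequence. A fuller treatment would also scan the reverse strand.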
STAP uses a biophysical model to analyze transcription factor (TF)-DNA binding data, such as ChIP-chip or ChIP-seq data. The program assumes that the measured affinity of a sequence to a TF (TF_exp) in some ChIP-chip or ChIP-seq experiment is determined by: 1) the number and strength of binding sites of TF_exp in this sequence; 2) the presence of other sites that may interact cooperatively with the sites of TF_exp in the neighborhood. Specifically, it takes as input a set of DNA sequences, their binding affinities to some TF as measured by experiments (TF_exp), and the position weight matrices (PWMs) of a set of TFs, including TF_exp. It will learn the relevant parameters of the biophysical model of TF-DNA interaction, including those of TF-DNA interaction and those of TF-TF cooperative interactions.
- MatrixREDUCE - Predicting TF binding through alignment-free and affinity-based analysis of orthologous promoter sequences
The input to MatrixREDUCE is a sequence file in FASTA format and an expression data file in tab-delimited text format (missing values are allowed). Output data include PSAMs in numeric and graphical format, parameters of the fitted model, and an HTML summary page.
- BayesPI - estimation of TF binding energy matrices, binding affinity and chemical potential from ChIP-Chip experiments
BayesPI integrates Bayesian model regularization with biophysical modeling of protein-DNA interactions and nucleosome positioning to study protein-DNA interactions, using a high-throughput dataset.
- Creating PWMs of transcription factors using 3D structure-based computation of protein-DNA free binding energies
The scoring function calibrated against crystallographic data on protein-DNA contacts can recover PWMs, sometimes outperforming experimental PWMs.
ChIP-seq TF binding analysis (*for histone ChIP-seq, see here)
- PscanChIP: finding over-represented transcription factor-binding site motifs and their correlations in sequences from ChIP-Seq experiments.
PscanChIP is a web application that, given a set of genomic regions derived from a genome-wide ChIP-Seq experiment, scans them and looks for over-represented sequence motifs, according to motif descriptors from the TRANSFAC and JASPAR databases or uploaded by users. The over-represented motifs thus correspond to transcription factor binding sites found to be enriched in the regions themselves. The general idea is to assess which motif most likely represents the binding specificity of the TF investigated, but also to identify "secondary" motifs which might correspond to other TFs interacting with the one for which the ChIP experiment was performed.
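One simple way to quantify motif over-representation of the kind described above is sketched below. It assumes a z-score comparing the mean best-match PWM score in the ChIP regions with the same quantity in background sequences. This is an illustration only: the statistics PscanChIP actually uses may differ, and all names and the toy data are hypothetical.

```python
import math
import statistics

def best_match_score(seq, pwm):
    """Highest PWM score over all windows of seq (one strand only)."""
    width = len(pwm)
    return max(
        sum(col[base] for col, base in zip(pwm, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    )

def enrichment_z(regions, background, pwm):
    """z-score of the mean best-match score in regions versus background."""
    fg = [best_match_score(s, pwm) for s in regions]
    bg = [best_match_score(s, pwm) for s in background]
    bg_sd = statistics.pstdev(bg)
    if bg_sd == 0:
        raise ValueError("background scores have zero variance")
    standard_error = bg_sd / math.sqrt(len(fg))
    return (statistics.mean(fg) - statistics.mean(bg)) / standard_error

# Toy 2-position scoring matrix favouring "AG" (made-up values).
toy_pwm = [
    {"A": 1, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 1, "T": -1},
]
toy_regions = ["CCAGCC", "TAGTTT"]
toy_background = ["CCCCCC", "CAGCCC", "TTTTTT", "CCACCC"]
print(enrichment_z(toy_regions, toy_background, toy_pwm))
```

Ranking all motifs in a collection by such a score would put the motif of the immunoprecipitated TF near the top, with candidate "secondary" motifs following it.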
Whole-Genome rVISTA enables users to query databases containing pre-computed genome coordinates of evolutionarily conserved transcription factor binding sites in the proximal promoters (from 100 bp up to 5kb upstream) of human, mouse and Drosophila genomes. TF binding sites are based on position weight matrices from the TRANSFAC Professional database. Results are exported in a .bed format for rapid visualization in the UCSC genome browser. Flat files of mapped conserved sites and their genomic coordinates are also available for analysis with stand-alone software.