a ChIP-Seq case study - Genomatix

PPARgamma in adipocyte differentiation - a ChIP-Seq case study 

Example analysis using Genomatix technologies to study ChIP-Seq data on PPARgamma. 

Intention and extent 

This case study shows an example of an analysis workflow suitable for ChIP-Seq data. It is intended to show 

options and approaches. This study will cover topics such as: 

• peak finding and analysis for known transcription factor binding sites, 

• definition of de novo binding site matrices from cluster sequences, 

• identification and analysis of potential target genes including associated pathways, 

• promoter analysis and identification of a common regulatory framework in a gene subset and subsequent scan of all annotated promoters for matches to this framework, 

• positional correlations for different data sets, 

• data visualization. 

Data source 

This study is based on data from a publication studying PPARgamma, a key regulator in adipocyte differentiation. Using ChIP-Seq, Nielsen et al. (Genes Dev. 2008; 22(21): 2953–2967, PMID: 18981474) followed the changes in the genome-wide profile of PPARgamma, RXR and PolII binding sites during adipocyte differentiation over 6 days. 

For demonstration we will focus on the changes in PPARgamma binding sites between day 0 and day 6, analyze these and extract associated genes and pathways. For both time points, 3 replicates of the PPARgamma ChIP are available. For correlations, data sets for RXR and PolII from the same publication will be included. 

Workflow overview 

Figure 1: Workflow for this case study 

© Genomatix 2012

Mapping 

The first step in NGS data analysis is the alignment (also called "mapping") of the raw sequences against reference sequences such as genomes or transcriptomes. The mapping on the Genomatix Mining Station (GMS) is performed in two steps: first, all potential mapping positions for the reads are identified through short unique sequence stretches (anchors); then a whole-read alignment is carried out to find the best match. 

Sequence type detection and nucleotide statistics calculation are automatically performed on a GMS during 

data upload and quality control. Statistics include number of reads, GC content and nucleotide distribution 

over read length. 
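
The statistics named above can be illustrated with a minimal Python sketch. It is not the GMS implementation; the function name `read_statistics` and the toy reads are assumptions made only for illustration.

```python
from collections import Counter


def read_statistics(reads):
    """Return read count, overall GC fraction and per-position base counts.

    Hypothetical helper for illustration; not part of any Genomatix tool.
    """
    n_reads = len(reads)
    total_bases = sum(len(r) for r in reads)
    gc_bases = sum(r.upper().count("G") + r.upper().count("C") for r in reads)
    gc_fraction = gc_bases / total_bases if total_bases else 0.0

    # One Counter per read position gives the nucleotide distribution
    # over read length.
    max_len = max((len(r) for r in reads), default=0)
    per_position = [Counter() for _ in range(max_len)]
    for read in reads:
        for pos, base in enumerate(read.upper()):
            per_position[pos][base] += 1
    return n_reads, gc_fraction, per_position


# Toy reads (invented), much shorter than the 32 nt reads of the study.
reads = ["ACGT", "GGCC", "ATAT"]
n_reads, gc_fraction, per_position = read_statistics(reads)
print(n_reads, gc_fraction, dict(per_position[0]))
```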

Using the graphical user interface (GUI) on a GMS, several mappings can be started at the same time. 

Figure 2 shows the setup screen for the PPARgamma samples from day 0. The 32 nt raw sequences were 

mapped against the mouse genome library (NCBI_build37) allowing one point mutation in the first mapping 

step (deep) and requiring at least 92% alignment quality for the whole read. The alignment results are 

reported for uniquely mapping reads but also for reads with up to 50 hits (multiple hits) in bigBED and BAM 

file format. These files can be converted to BED and SAM format during result export. 

Figure 2: Settings for genomic mapping of day 0 PPARgamma-ChIP data. 

After completion of the mapping, the results can be accessed from the interface and the mapping statistics are shown. In total, 7 and 6 million reads were mapped uniquely for day 0 and day 6, respectively (Figure 3). Only these were used for further analysis on the Genomatix Genome Analyzer (GGA). 
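
The hit categories reported in the mapping statistics (unique, multiple, ambiguous; see Figure 3) depend only on how often a read aligns. The following Python sketch shows that classification and the restriction to unique hits. The function name and the example hit counts are hypothetical; the cutoff of 50 is taken from the text.

```python
from collections import Counter


def classify_by_hit_count(hit_count, max_multiple=50):
    """Assign a mapped read to a hit category based on its number of hits."""
    if hit_count < 1:
        raise ValueError("a mapped read needs at least one hit")
    if hit_count == 1:
        return "unique"
    if hit_count <= max_multiple:
        return "multiple"
    return "ambiguous"


# Invented hit counts for six reads.
hit_counts = [1, 1, 3, 50, 51, 200]
summary = Counter(classify_by_hit_count(n) for n in hit_counts)

# Only uniquely mapping reads are carried into the downstream analysis.
unique_only = [n for n in hit_counts if classify_by_hit_count(n) == "unique"]
print(summary, len(unique_only))
```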


Figure 3: Mapping statistics for PPARg day0 (sample 2): Unique hits - reads mapping only once in the genome; multiple hits - reads mapping between 2 and 50 times in the genome; ambiguous hits - reads mapping more than 50 times in the genome; insufficient quality hits - reads which could not be mapped fulfilling the alignment quality; ignored hits - reads where no anchor seed could be found. 

Downstream analysis 

The downstream analysis was performed on the Genomatix Genome Analyzer (GGA), which provides a user-friendly interface to the whole Genomatix Software Suite and the NGS data analysis module. Data generated on the GMS are directly accessible from the GGA. 

Data import 

The data were imported via the file upload page, which can be accessed from all tasks (use the ‘Add BED files ...’ button) and allows direct upload from the GMS, mounted storage devices or local computers. All BED or bigBED files uploaded for the active project are then displayed in the project management and are available for further analysis. 

ChIP-Seq workflow 

To obtain a first overview of the data we recommend the use of the ChIP-Seq workflow which can be found 

in the ‘NGS Analysis’ menu of the navigation bar on top of the page. The workflow comprises the following 

steps: 

• peak finding (clustering) using three algorithms (NGSAnalyzer, MACS, SICER) for samples with and 

without replicates and controls and a subsequent evaluation using DESeq, edgeR or the Audic & Claverie 

approach. 

• read and cluster classification for overlap with genomic features such as exons, introns, promoters and 

intergenic regions. 

• analysis of TF binding sites for overrepresentation in the peak sequences 

• extraction of sequences underlying the peaks (from reference genome) 

• de novo motif definition for generation of a new or confirmation of a known site. 

All these tasks can be set up in one go (Figures 5 to 7): 

For this example, the replicates for PPARgamma day 6 were selected as experiment and the replicates from day 0 as control. PPARgamma should not be expressed at day 0, so these samples can be considered as background. 


Figure 5: ChIP-Seq workflow setup: All BED files uploaded within the active project are available for analysis and can be selected as 

treatment or control samples. 

For clustering, default settings (NGSAnalyzer with 100 bp window size and automatic read density threshold calculation based on Poisson distribution) were used. 
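
To illustrate the idea of a Poisson-based read density threshold, the sketch below computes the smallest read count per window that would be unlikely under a uniform background. This is a generic illustration and not the NGSAnalyzer algorithm: the significance level `alpha`, the genome size and the function names are assumptions.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf


def read_count_threshold(n_reads, genome_size, window=100, alpha=0.05):
    """Smallest read count per window with P(X >= count) <= alpha.

    Assumes reads are spread uniformly over the genome, so the expected
    number of reads per window is n_reads * window / genome_size.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    lam = n_reads * window / genome_size
    k = 1
    while poisson_sf(k, lam) > alpha:
        k += 1
    return k


# Example: 6 million reads, 100 bp windows, and an assumed genome size
# of 2.7e9 bp (a rough placeholder, not a value from the case study).
print(read_count_threshold(6_000_000, 2_700_000_000, window=100, alpha=1e-5))
```

A stricter `alpha` raises the threshold; a real peak caller would typically also account for mappability and local background.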

Only clusters which were present in at least 2 replicates (65%) with an overlap of 100 bp were considered. 
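
The replicate criterion can be sketched as an interval comparison. The following Python example keeps a cluster if clusters from at least two replicates (counting its own) overlap it by 100 bp or more. It is a simplified illustration, not the workflow's actual merging logic; the function names and coordinates are invented.

```python
def overlap_bp(a, b):
    """Overlap in bp of two (chrom, start, end) half-open intervals."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def supported_clusters(replicates, min_replicates=2, min_overlap=100):
    """Return (replicate_index, cluster) pairs with enough replicate support."""
    kept = []
    for i, clusters in enumerate(replicates):
        for cluster in clusters:
            support = 1  # the replicate the cluster was called in
            for j, other in enumerate(replicates):
                if j != i and any(
                    overlap_bp(cluster, o) >= min_overlap for o in other
                ):
                    support += 1
            if support >= min_replicates:
                kept.append((i, cluster))
    return kept


# Three hypothetical replicates with invented coordinates.
rep1 = [("chr1", 100, 400), ("chr1", 1000, 1200)]
rep2 = [("chr1", 250, 500)]
rep3 = [("chr2", 100, 400)]
kept = supported_clusters([rep1, rep2, rep3])
print(kept)
```

Note that this sketch reports the supported clusters per replicate; merging them into one consensus region per site would be a further step.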

For statistical evaluation of the remaining clusters edgeR was used (default). 

Further options, like ‘Cluster Classification and Statistics’, ‘Extraction of Sequences for all Clusters’, 

‘Transcription Factor Binding Site Overrepresentation’, and ‘Definition of new Binding Sites in Clusters’ are 

selected by default. 

Figure 6: ChIP-Seq workflow setup: Selection of peak finding algorithm and parameter setup for replicate treatment and statistical 

analysis. 

As a last step, the analysis was named and submitted. 


Figure 7: Naming and submitting the analysis. 

After completion of the analysis, the result can be accessed through the link provided in the notification email 

or via the ‘Project Management’ under ‘Project & Accounts’ in the navigation bar. 

The result page lists the parameters and programs used and the results of the subtasks selected. All results 

can be downloaded or saved in the ‘Project Management’. 

The clustering results 

In this example, more than 10,000 clusters were called in the single samples, but only 8,291 were detected in at least two PPARgamma-day6-ChIP replicates. Of these, 7,747 clusters show a statistically significant enrichment compared to the day0 controls. This number is comparable to the results from Nielsen et al., who report about 7,000 PPARgamma enriched regions. 

11.6% of these are located in promoter regions, which corresponds to a 4.5-fold enrichment. 
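
A fold enrichment of this kind is usually the observed fraction divided by the fraction expected by chance. Assuming that definition (the text does not state how the background is computed), the two reported numbers imply that roughly 2.6% of regions would be expected in promoters by chance:

```python
def fold_enrichment(observed_fraction, expected_fraction):
    """Ratio of observed to expected fraction."""
    return observed_fraction / expected_fraction


observed = 0.116  # 11.6% of enriched regions lie in promoters (from the text)
reported_fold = 4.5  # reported fold enrichment (from the text)

# Expected promoter fraction implied by the two reported values.
implied_expected = observed / reported_fold
print(round(implied_expected, 4))
```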

All BED files containing the positional information for the different cluster categories can be downloaded or saved in the ‘Project Management’ for further (more detailed) analyses. For this example it is sufficient to save the BED file for the significantly enriched regions in the ‘Project Management’ (PPARg_day6_vs_day0_enriched_regions.bed). 

Figure 8: ChIP-Seq workflow results: Clustering result overview shows that 8,291 PPARgamma peaks are found in at least 2 samples in day 6 but not in day 0. All detailed results can be downloaded. 


Transcription Factor Binding Site Overrepresentation in clusters 

The analysis of predicted transcription factor binding sites in the cluster regions shows a clear enrichment for 

the V$PERO binding site family, which comprises the PPAR/RXR heterodimer binding sites (DR1 elements). 

TF-binding site families combine binding sites from transcription factors with similar matrix and biology and thereby avoid unnecessarily large and confusing outputs. The top ranking of V$PERO shows that the ChIP enrichment was successful (Figure 9). 

Also among the top scoring families is V$RXRF, which contains binding sites for other RXR heterodimers. 

Figure 9: ChIP-Seq workflow results: Overrepresentation analysis for transcription factor binding sites. Top ranking family V$PERO contains the PPARgamma/RXR heterodimer binding sites (DR1 elements). The links underlying the family abbreviations provide comprehensive information on members and the generation of the matrix family. 

Finding new binding sites in clusters: de novo motif definition 

The last part of the workflow, the de novo binding site definition, yields the IUPAC consensus motif NNAGSNSAGNN with S standing for C or G. The workflow uses fixed parameters and is optimized for compact binding sites, thus it picks up only one conserved half site of the PPARg/RXR binding site. To improve the results, the analysis can be rerun with refined parameters using the task ‘CoreSearch’ (see below), accessible under ‘Pattern Definition’ in the navigation bar. For this purpose, it is recommended to save the sequences of the top 1,000 regions and/or all clusters. 
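
An IUPAC consensus such as NNAGSNSAGNN can be turned into a regular expression to locate candidate occurrences in saved sequences. The sketch below uses the standard IUPAC nucleotide codes and scans the forward strand only. It is a plain string search, not the matrix-based scoring used by the Genomatix tools, and the test sequences are invented.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[c] for c in consensus.upper())


def find_matches(consensus, sequence):
    """Return 0-based start positions of all (overlapping) matches."""
    # The lookahead makes overlapping occurrences visible to finditer.
    pattern = re.compile("(?=(" + iupac_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


print(find_matches("NNAGSNSAGNN", "TTAGGTCAGTT"))
print(find_matches("NNAGSNSAGNN", "TTAGATCAGTT"))
```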

Extended TF-binding site analysis 

Overrepresentation of TF families has been covered as part of the workflow. The same analysis can be performed for individual matrices or TF-modules with one fixed partner using the ‘Overrepresented TF binding sites’ task under ‘NGS Analysis’. For this analysis the previously saved BED file (PPARg_day6_vs_day0_enriched_regions.bed) containing the positions of the significant regions can be used. 

The top scoring individual matrix is V$PPAR_RXR.0.1, which describes the PPAR/RXR heterodimer binding 

sites (DR1), with matches in more than 50% of the input sequences (Figure 10). 

Figure 10: Overrepresentation analysis for individual matrices within the enriched peak regions yields V$PPAR_RXR binding sites as top scoring. 


The ‘Module overrepresentation’ subtask, which searches for combinations of other binding sites with V$PERO (i.e. potential interaction partners) within 50 bp distance, returns frequent combinations of V$PERO with V$NF1F, V$NR2F, the well-known partner V$RXRF, but also with V$CEBP. These results are in line with the original publication, where the authors report a high overlap between PPARg, RXR and C/EBP binding sites. 

Figure 11: Analysis of transcription factor combinations with V$PERO at distances between 10 and 50 bp shows an overrepresentation of V$NF1F binding sites. The underlying distances are displayed in a graph behind the ‘list’ link (see Figure 12, left). The distance score can be used as an indicator for a preferential distance between two transcription factor binding sites. 

Support for a functional interaction between the PPAR/RXR site binding protein and one or more V$NF1F family members comes from the distance relation of the binding sites (Figure 12, left). A quick check for literature cocitations in GePS revealed that PPARgamma can inhibit NF-I binding (Figure 12, right). 

Figure 12: left: display of observed distances between the V$PERO and the V$NF1F sites shows a preference at about 15 bp, hinting at a functional interaction. right: Cocitation analysis for PPARgamma and RXRalpha with members of the V$NF1F binding site family (human). 

Refined de novo motif definition 

With the background knowledge that PPARgamma binds the direct repeat of AGGTCA, the motif definition task can be rerun for the sequences of the top 1,000 clusters with a 9 bp alignment core (instead of the 7 bp used in the workflow) and a reduced sequence constraint (at least 50% of sequences must contain the motif instead of 75%). Using these parameters, the program returns a matrix with the consensus “N NGGNCA G AGGNN”, which resembles the DR1 element and the matrix presented in the publication. Figure 13 shows the nucleotide distribution matrix and the sequence logo. 
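
A nucleotide distribution matrix is essentially a table of base counts per aligned position, from which a consensus can be read off. The following sketch builds such a matrix from already aligned, equal-length sites and derives a simple consensus. It does not reproduce the CoreSearch algorithm; the 60% cutoff, the function names and the example sites are assumptions for illustration.

```python
def count_matrix(aligned_sites):
    """Count A, C, G and T at each position of equal-length aligned sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    matrix = [{base: 0 for base in "ACGT"} for _ in range(length)]
    for site in aligned_sites:
        for pos, base in enumerate(site.upper()):
            if base in matrix[pos]:  # ignore ambiguous characters
                matrix[pos][base] += 1
    return matrix


def simple_consensus(matrix, min_fraction=0.6):
    """Most frequent base per position, or N if it is below min_fraction."""
    consensus = []
    for column in matrix:
        total = sum(column.values())
        base, count = max(column.items(), key=lambda item: item[1])
        if total and count / total >= min_fraction:
            consensus.append(base)
        else:
            consensus.append("N")
    return "".join(consensus)


# Invented example sites resembling a DR1 half site.
sites = ["AGGTCA", "AGGACA", "TGGTCA", "AGGCCA"]
matrix = count_matrix(sites)
print(simple_consensus(matrix))
```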


Figure 13: Nucleotide distribution matrix and sequence logo for de novo binding site generated from the top 1,000 cluster sequences. 

Biological classification of neighboring genes 

The aim of most ChIP-Seq experiments is to identify potential target genes which can then be associated 

with pathways to explore the underlying mechanisms. Although long distance regulation occurs, proximal 

effects play an important role in gene regulation. Genes located in proximity of the binding sites can be 

identified by either correlation of primary transcripts with enriched regions (using GenomeInspector) or by 

annotation of these regions for overlap with promoters or nearby genes (using ‘Annotation and statistics‘ 

under ‘NGS Analysis‘, Figure 14). 

Figure 14: Setup screen for ‘General annotation and statistics’ used to identify regions overlapping with various genomic features including genes and promoters but also for identification of genes located up- and downstream of the enriched regions. 


After submission, the regions will be annotated for overlap with loci, exons, introns, promoters, transcription 

start sites, intergenic regions, microRNAs and repeats but also for the next neighboring genes up- and 

downstream from the region for both sense and anti-sense strand. A statistic will be displayed and the results 

can be downloaded completely or filtered for one or more of the categories. The results can be browsed 

(Figure 15) and GeneIDs of all genes overlapping with the input region or with their promoter can be 

extracted (Figure 16). 
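
The promoter-overlap part of this annotation can be sketched as a simple interval test. The example below collects the IDs of genes whose promoter overlaps at least one enriched region. It is an illustration only: the data layout, function names, gene IDs and coordinates are hypothetical, and a real annotation would use indexed lookups rather than a nested loop.

```python
def overlaps(region, feature):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return (
        region[0] == feature[0]
        and region[1] < feature[2]
        and feature[1] < region[2]
    )


def genes_with_bound_promoter(regions, promoters):
    """Return sorted gene IDs whose promoter overlaps any region.

    promoters: list of (chrom, start, end, gene_id) tuples.
    """
    hits = set()
    for region in regions:
        for chrom, start, end, gene_id in promoters:
            if overlaps(region, (chrom, start, end)):
                hits.add(gene_id)
    return sorted(hits)


# Invented coordinates and placeholder gene IDs.
regions = [("chr1", 950, 1150), ("chr2", 5000, 5200)]
promoters = [
    ("chr1", 1000, 1600, "geneA"),
    ("chr1", 9000, 9600, "geneB"),
    ("chr2", 5100, 5700, "geneC"),
]
bound = genes_with_bound_promoter(regions, promoters)
print(bound)
```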

Figure 15: ‘Annotation and Statistics‘ result page: neighboring genes and overlapping features are listed for each region, links to further 

gene information and the GenomeBrowser for visualization are provided. 

Figure 16: ‘Annotation and Statistics‘ result page: regions can be filtered by overlap and geneIDs of nearby genes can be extracted. 

For this example, the geneIDs of genes whose promoters overlapped with PPARgamma enriched regions were downloaded as a text file. To analyze the corresponding genes, the gene IDs can then be transferred to the Genomatix Pathway System by simple copy and paste or by upload of the saved file. 


Pathway analysis with GePS 

The Genomatix Pathway System uses information from public sources combined with proprietary databases 

to characterize gene lists based on statistical analysis of literature, pathways and GO- and MeSH-terms. 

Pathways and networks can be generated and superimposed with user data. GePS can be accessed from 

the navigation bar under ‘Genomes & Data’. 

Figure 17: Genomatix Pathway System (GePS) overview screen showing the different entry options. 

To analyze the genes with PPARgamma binding sites in the promoter region, the file containing the geneIDs 

was uploaded and the organism was selected (Figure 18). Alternatively, the geneIDs could have been pasted 

into the setup screen. 

Figure 18: Genomatix Pathway System setup screen. GeneIDs or symbols can be entered via copy and paste or file upload. Available 

annotation types are listed. These will be used for classification and can be used as data filter for the analyzed genes. 

The first result GePS delivers is a characterization of the gene list based on pathways, Gene Ontology, MeSH-terms and Genomatix proprietary annotation. Overrepresentation of biological terms associated with genes from the input list is calculated, and the terms are listed in the left panel together with the respective p-value. 
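
The text does not specify which statistic GePS uses for these p-values. A common choice for term overrepresentation is the one-sided hypergeometric test, sketched below as an assumption; the parameter names are chosen here for illustration.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) under the hypergeometric distribution.

    k: genes in the input list annotated with the term
    n: size of the input list
    K: genes in the background annotated with the term
    N: size of the background
    """
    upper = min(n, K)
    numerator = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return numerator / comb(N, n)


# Toy numbers: 5 of 5 listed genes carry a term that 5 of 10
# background genes carry.
print(hypergeom_pvalue(5, 5, 5, 10))
```

A small p-value indicates that the term occurs in the list more often than expected from the background; with many terms tested, a multiple-testing correction would normally follow.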


Canonical pathways are only available for human, but for other organisms genes can be mapped to the human orthologs before the analysis. Here, literature-based pathways (from Genomatix Literature Mining) were considered; they show PPARgamma and PPARalpha pathways as top scorers. The top ranking processes and diseases are related to metabolism. The tissue filter shows peroxisomes and adipocytes and even the cell line used in the experiment (3T3-L1). Reassuringly, PPARgamma is the most cocited transcription factor for the genes analyzed, indicating an enrichment for potential PPARgamma targets. The results fit well with PPARgamma being a key player in lipid metabolism. 

The results can be used as filters for networks or to construct new ones. The network below was generated by clicking on the top ranking pathway ‘Peroxisome proliferative activated …’. It shows PPARgamma as the central transcription factor and known target genes such as Lpl. Dotted connection lines indicate automatically retrieved literature cocitations, while solid lines indicate expert-curated annotation. The latter show, for example, that Lpl and Scd1 are activated and Adipoq is inhibited by PPARgamma. Ucp2 and Rxra are greyed out since these two genes do not pass the additional filter ‘lipid metabolic process’ applied under ‘Biological Processes’ (Figure 20). 

Comprehensive information about genes and connections can be retrieved by double-clicking on the gene symbol and the line, respectively (Figure 21). 

Figure 19: Gene classification results for genes with PPARgamma binding in the promoter based on Genomatix Literature Mining, GO- and MeSH-terms. 


Figure 20: Network generated for genes assigned to the literature pathway ‘lipid Peroxisome proliferative activated receptor alpha’ and filtered for additional assignment to the biological process GO-term ‘lipid metabolic process’ based on literature cocitations. Genes in yellow boxes fulfill both criteria, genes in grey boxes are not assigned to the GO-term ‘lipid metabolic process’. Solid and dotted lines represent expert curated and literature retrieved interactions, respectively. Arrows indicate direct activation, diamonds indicate modulation, and line/circle symbols indicate inhibition. 

Figure 21: Additional information that can be browsed in the Genomatix Pathway System upon double click on the gene or connection of 

interest. 


Identification of common regulatory elements in promoters 

Transcription factors often act synergistically to achieve and coordinate cell type specific gene expression. 

These functional combinations are often conserved in terms of organization, distance, and orientation of the 

individual elements forming so-called modules or frameworks. 

The GePS network (Figure 20) shows that PPARgamma activates Lpl (lipoprotein lipase), Ucp2 (uncoupling 

protein 2) and Scd1 (stearoyl-CoA desaturase 1), all expressed in adipocytes. To investigate whether these 

three genes share regulatory elements their promoters were extracted and searched for common 

frameworks. 

Promoter sequence extraction 

The promoters for all alternative transcripts were extracted from the Eldorado database using 

‘Gene2Promoter’ under ‘Genomes & Data’ (Figure 22). Mus musculus was selected as organism and the 

three gene symbols were entered into the keyword search section. 

Figure 22: Gene2Promoter input page. 

The summary on top of the result page lists a total of 36 transcripts and 14 promoters for the three input genes, which are shown in the table below (Figure 23). 


Figure 23: Interactive Gene2Promoter result page listing all alternative transcripts and promoters for selected genes. Additional 

information such as conservation and CAGE tag support are provided together with links for more comprehensive information and 

visualization. 

10 of the 14 promoters belong to relevant transcripts (2 each for Lpl and Scd1, 6 for Ucp2). Only these were selected for further analysis with FrameWorker. 

Figure 24: Interactive Gene2Promoter result page: Promoters can be selected and tested for presence of transcription factor binding sites, corresponding sequences can be extracted and directly analyzed in several subtasks. 

Identification of common regulatory elements 

The low number of sequences allowed an exhaustive analysis in FrameWorker, meaning that all promoter combinations for the three genes were tested separately, resulting in 24 combinations (2 × 2 × 6). The analysis was run with default parameters except that the maximum distance variance was increased to 20. One of the 24 combinations returned a framework consisting of three transcription factor binding sites: V$RXRF, V$KLFS and V$EGRF, with distances of roughly 80 and 100 bp between the individual sites (Figure 25). The model does not contain a PPARgamma site, but members of the three families, while not directly linked to adipocytes, are associated with lipid homeostasis, glucose transport, and response to glucose and insulin stimulus, respectively. 
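
The number of promoter combinations follows from picking one promoter per gene: 2 for Lpl, 2 for Scd1 and 6 for Ucp2. The short sketch below enumerates them; the promoter labels are placeholders, not identifiers from the Eldorado database.

```python
from itertools import product

# Placeholder labels for the selected promoters of each gene.
promoters = {
    "Lpl": ["Lpl_p1", "Lpl_p2"],
    "Scd1": ["Scd1_p1", "Scd1_p2"],
    "Ucp2": ["Ucp2_p%d" % i for i in range(1, 7)],
}

# One promoter per gene in every combination.
combinations = list(product(*promoters.values()))
print(len(combinations))
print(combinations[0])
```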

The model was saved and subsequently used for a ModelInspector analysis. 

Figure 25: FrameWorker result: Transcription factor combination (framework) common to promoters from the three input genes (Lpl, Ucp2 and Scd1) consisting of three transcription factor binding site families with defined distance and orientation. The framework was saved and all mouse promoters were subsequently scanned for matches. 

Identification of genes sharing the identified model and overlay with meta-data 

ModelInspector is a program that performs a sequence scan for presence of predefined TF-combinations, 

called frameworks or modules. For this example, all mouse promoters of annotated genes were scanned for 

the presence of the V$RXRF-V$KLFS-V$EGRF-framework returning 271 matches in promoters of 199 

genes. The included GO-term analysis showed ‘metabolic process’ as top category with 115 associated 

genes and a very low p-value, indicating that the module can enrich for genes associated with metabolism. 

The 199 geneIDs were extracted and imported into GePS. Figure 26 shows the network which was generated by starting with PPARgamma and using the option to extend networks by frequently cocited genes. The dots on both sides of the gene boxes visualize the ChIP-Seq enrichment (in promoter regions), which was imported as metadata. Absence of PolII clusters in promoters can indicate reduced gene transcription but can also indicate a very short initiation time that does not lead to enrichment. 

Figure 26: Network generated from genes fulfilling two criteria: a) being identified in the ModelInspector run as harboring the V$RXRF-V$KLFS-V$EGRF framework in at least one promoter and b) being cocited with PPARgamma in PubMed abstracts. The dots beside the gene boxes indicate the presence of PPARgamma, RXR or PolII clusters called in the data from Nielsen et al. (2008). 


Correlation between different data sets 

PPARgamma binds to peroxisome proliferator response elements as a heterodimer with retinoid X receptor (RXR), and RXR binding sites were found to be overrepresented in the TF analysis (see above). It is therefore interesting to analyze the overlap between PPARgamma and RXR binding sites. The RXR-ChIP data are derived from the same publication and were processed similarly to the PPARgamma set. 

Positional correlations between genomic elements and/or user data can be performed in the task 

‘GenomeInspector’ which can be accessed from ‘NGS Analysis’ in the navigation bar. Using the 

PPARgamma set as an anchor and calculating the distance distribution profile for the RXR data set results in 

the curve shown in Figure 27. Regions contributing to the correlation can be extracted from both sets and 

used for further analysis (e.g. annotation and pathway analysis or framework analysis). 
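
A distance distribution profile of this kind can be approximated by binning the distances between region midpoints. The sketch below is a generic illustration, not the GenomeInspector implementation; the bin size, the maximum distance, the function names and the coordinates are assumptions.

```python
from collections import Counter


def midpoint(region):
    """Middle position of a (chrom, start, end) region."""
    return (region[1] + region[2]) // 2


def distance_profile(anchors, targets, max_distance=1000, bin_size=100):
    """Count target midpoints per distance bin relative to anchor midpoints.

    Keys are the lower bounds of the distance bins; negative values mean
    the target lies at a lower coordinate than the anchor.
    """
    profile = Counter()
    for anchor in anchors:
        for target in targets:
            if anchor[0] != target[0]:
                continue
            distance = midpoint(target) - midpoint(anchor)
            if abs(distance) <= max_distance:
                profile[(distance // bin_size) * bin_size] += 1
    return dict(sorted(profile.items()))


# Invented regions: one anchor and four candidate targets.
anchors = [("chr1", 1000, 1200)]
targets = [
    ("chr1", 1050, 1250),
    ("chr1", 900, 1100),
    ("chr1", 5000, 5200),
    ("chr2", 1000, 1200),
]
profile = distance_profile(anchors, targets)
print(profile)
```

A peak around distance 0 in such a profile corresponds to overlapping regions in the two data sets.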

Figure 27: Positional correlation of PPARgamma enriched regions (aligned with their middle at 0) with the RXR enriched regions 

generated in GenomeInspector. The graph shows a clear overlap between the two data sets. Regions contributing to the correlation can 

be extracted. 

Data visualization 

In the genome browser the data can be visualized in the genomic context, overlaid with general annotation, proprietary data from Genomatix or other ChIP-Seq or RNA-Seq data sets. This allows an integration of different datasets and a quick assessment of the state at the locus of interest. Figure 28 shows the Scd1 locus (located on the antisense strand) with PPARgamma, RXR and PolII raw reads and the positions of the called clusters. The graph shows only background for the PPARgamma data at day 0 but a strong enrichment at day 6 at the 5' promoter and several upstream and downstream regions, indicating potential enhancer regions. The RXR data show a similar picture. At day 0, PolII is found at the potential enhancer regions and the promoter. After adipocyte differentiation at day 6, PolII is no longer enriched at the promoter and enhancers but spreads over the whole gene body, reflecting the PPARgamma expression. 


Figure 28: Visualization of the Scd1 locus in the genome browser. Alternative transcripts are shown in black. Single reads are shown for day 0 and day 6 for PPARgamma (blue), RXR (red) and PolII (green). For day 6 these are overlaid with the called clusters in the same but lighter color. 

Summary 

Based on the data published by Nielsen et al. (2008), we showed a comprehensive ChIP-Seq analysis pipeline from mapping down to pathway analysis. 

The raw reads were mapped to the mouse genome and unique alignments were clustered to identify regions of enriched read density indicating PPARgamma, RXR or PolII binding, respectively. The 7,747 regions identified in the PPARgamma data set showed a strong overrepresentation of in silico predicted PPARgamma binding sites, indicating a successful ChIP experiment. Further analysis showed frequent co-occurrence of V$NF1F binding sites at a distance of about 15 bp and of CEBP binding sites, the latter being in agreement with the publication. De novo motif definition extracted the “N NGGNCA G AGGNN” consensus sequence, which resembles parts of the DR1 element, the known PPARgamma/RXR heterodimer binding site. 

To identify potential PPARgamma targets, genes up- and downstream of the enriched regions were determined. Genes with PPARgamma binding within their promoter were extracted and analyzed with the Genomatix Pathway System. Overrepresented pathways, GO- and MeSH-terms indicated PPAR pathways and general metabolic processes. The TF most frequently cocited with these genes is PPARgamma, again confirming the experiment. In the network generated from the top scoring pathway ‘Peroxisome proliferative activator …’, expert curated annotation shows direct activation of the three genes (Lpl, Scd1, Ucp2) by PPARgamma. The 10 relevant promoters from the three genes were exhaustively analyzed for common regulatory motifs. A V$RXRF-V$KLFS-V$EGRF framework was detected and used to scan all mouse promoters. This scan yielded 271 matches in promoters of 199 genes. GO-term analysis for these genes revealed an association with ‘metabolic processes’. Furthermore, the overlap between the PPARgamma and RXR enrichment was determined. Finally, the data sets were visualized in the genomic context. 


For more information on Genomatix solutions and services, please visit: 

http://www.genomatix.com 

Visit 

http://www.youtube.com/user/GenomatixWebcasts 

for tutorials and demo videos. 

Find us on facebook at: 

http://www.facebook.com/genomatix 


Contact Germany 

Genomatix Software GmbH 

Bayerstr. 85a 

80335 Munich 

Germany 

phone +49 89 599766 0 

email info@genomatix.de 

Contact USA 

Genomatix Software Inc. 

3025 Boardwalk, Suite 160 

Ann Arbor, MI 48108 

USA 

phone +1 877 436 6628 

email sales-us@genomatix.com
