Last updated: 2019-10-30

Knit directory: fibroblast-clonality/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The checks reported below describe the reproducibility checks that were applied when the results were created. The table of past versions further down lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .vscode/
    Ignored:    code/.DS_Store
    Ignored:    code/selection/.DS_Store
    Ignored:    code/selection/.Rhistory
    Ignored:    code/selection/figures/
    Ignored:    data/.DS_Store
    Ignored:    logs/
    Ignored:    src/.DS_Store
    Ignored:    src/Rmd/.Rhistory

Untracked files:
    Untracked:  .dockerignore
    Untracked:  .dropbox
    Untracked:  .snakemake/
    Untracked:  Rplots.pdf
    Untracked:  Snakefile_clonality
    Untracked:  Snakefile_somatic_calling
    Untracked:  analysis/.ipynb_checkpoints/
    Untracked:  analysis/assess_mutect2_fibro-ipsc_variant_calls.ipynb
    Untracked:  analysis/cardelino_fig1b.R
    Untracked:  analysis/cardelino_fig2b.R
    Untracked:  code/analysis_for_garx.Rmd
    Untracked:  code/selection/data/
    Untracked:  code/selection/fit-dist.nb
    Untracked:  code/selection/result-figure.R
    Untracked:  code/yuanhua/
    Untracked:  data/Melanoma-RegevGarraway-DFCI-scRNA-Seq/
    Untracked:  data/PRJNA485423/
    Untracked:  data/canopy/
    Untracked:  data/cell_assignment/
    Untracked:  data/cnv/
    Untracked:  data/de_analysis_FTv62/
    Untracked:  data/donor_info_070818.txt
    Untracked:  data/donor_info_core.csv
    Untracked:  data/donor_neutrality.tsv
    Untracked:  data/exome-point-mutations/
    Untracked:  data/fdr10.annot.txt.gz
    Untracked:  data/human_H_v5p2.rdata
    Untracked:  data/human_c2_v5p2.rdata
    Untracked:  data/human_c6_v5p2.rdata
    Untracked:  data/neg-bin-rsquared-petr.csv
    Untracked:  data/neutralitytestr-petr.tsv
    Untracked:  data/raw/
    Untracked:  data/sce_merged_donors_cardelino_donorid_all_qc_filt.rds
    Untracked:  data/sce_merged_donors_cardelino_donorid_all_with_qc_labels.rds
    Untracked:  data/sce_merged_donors_cardelino_donorid_unstim_qc_filt.rds
    Untracked:  data/sces/
    Untracked:  data/selection/
    Untracked:  data/simulations/
    Untracked:  data/variance_components/
    Untracked:  figures/
    Untracked:  output/differential_expression/
    Untracked:  output/differential_expression_cardelino-relax/
    Untracked:  output/donor_specific/
    Untracked:  output/nvars_by_category_by_donor.tsv
    Untracked:  output/nvars_by_category_by_line.tsv
    Untracked:  output/variance_components/
    Untracked:  qolg_BIC.pdf
    Untracked:  references/
    Untracked:  reports/
    Untracked:  src/Rmd/DE_pathways_FTv62_callset_clones_pairwise_vs_base.unst_cells.carderelax.Rmd
    Untracked:  src/Rmd/Rplots.pdf
    Untracked:  src/Rmd/cell_assignment_cardelino-relax_template.Rmd
    Untracked:  tree.txt

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
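As a rough illustration of the check described above, the following Python sketch lists untracked and uncommitted files before results are generated. It is not part of workflowr; the helper names (`parse_porcelain`, `uncommitted_files`) and the sample path `analysis/index.Rmd` are assumptions made for this example. It relies only on the short `XY path` line format printed by `git status --porcelain`.

```python
import subprocess


def parse_porcelain(output):
    """Split `git status --porcelain` output into untracked and modified paths."""
    untracked, modified = [], []
    for line in output.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        # Each line is two status characters, a space, then the path.
        code, path = line[:2], line[3:]
        if code == "??":
            untracked.append(path)
        elif code != "!!":  # "!!" marks ignored files, which we skip
            modified.append(path)
    return untracked, modified


def uncommitted_files(repo_dir="."):
    """Run git in repo_dir and return (untracked, modified) path lists."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_porcelain(result.stdout)


# Illustrative input only; these lines are not taken from a real run.
sample = "?? Snakefile_clonality\n M analysis/index.Rmd\n?? data/canopy/\n"
print(parse_porcelain(sample))
```

If either list contains a script or data file that the analysis depends on, commit it (for example with wflow_publish or wflow_git_commit) before rebuilding the results.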


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File  Version  Author          Date        Message
Rmd   35c3269  Davis McCarthy  2019-10-30  Updating index to get accurate update time
html  35c3269  Davis McCarthy  2019-10-30  Updating index to get accurate update time
Rmd   550176f  Davis McCarthy  2019-10-30  Updating analysis to reflect accepted ms
html  8729e02  davismcc        2018-11-09  Build site.
Rmd   218d792  John Blischak   2018-09-11  Fix some links on homepage.
html  0540cdb  davismcc        2018-09-02  Build site.
html  f0ed980  davismcc        2018-08-31  Build site.
html  ca3438f  davismcc        2018-08-29  Build site.
html  e573f2f  davismcc        2018-08-27  Build site.
html  9ec2a59  davismcc        2018-08-26  Build site.
Rmd   cae617f  davismcc        2018-08-26  Updating simulation analyses
html  36acf15  davismcc        2018-08-25  Build site.
Rmd   56d90a6  davismcc        2018-08-25  Completing index with descriptions of data availability and new analyses.
Rmd   d618fe5  davismcc        2018-08-25  Updating analyses
html  090c1b9  davismcc        2018-08-24  Build site.
html  02a8343  davismcc        2018-08-24  Build site.
Rmd   97e062e  davismcc        2018-08-24  Updating Rmd’s
Rmd   43f15d6  davismcc        2018-08-24  Adding data pre-processing workflow and updating analyses.
html  d2e8b31  davismcc        2018-08-19  Build site.
html  1489d32  davismcc        2018-08-17  Add html files
Rmd   6b5f8c7  davismcc        2018-08-17  Updating organisational pages.
Rmd   1cbadbd  davismcc        2018-08-10  Updating analyses.
html  2531565  davismcc        2018-08-08  Tweaking clone prevalences
Rmd   7397e00  davismcc        2018-08-08  Updating stylez and tweaking Rmds
html  9856275  davismcc        2018-08-07  Build site.
Rmd   5fc189d  davismcc        2018-08-07  Start workflowr project.

Project overview

This project investigates clonality in human dermal fibroblast cell populations in 32 cell lines from distinct donors, using bulk whole-exome sequencing and single-cell RNA-sequencing data.

Key findings:

For a richer overview, see the About page.

Manuscript

A pre-print describing the work in detail is available:

Data pre-processing

Pre-processing the raw data for this project (see Data availability below) is complicated and computationally expensive, so this repository does not reproduce it in an automated way. However, we provide the source code for the Snakemake pre-processing workflow in this repository. Docker images providing the computing environment and software used are publicly available, split into one image for command-line bioinformatics tools and one for an R installation with the necessary packages.

If you would like to pre-process the data from raw reads to results as we have, please consult our description of how to run the workflow.

Analyses

Here we present the reproducible results of our analyses. They were generated by rendering the R Markdown documents into the webpages available at the links below.

The results presented in the paper were produced with these analyses.

  1. Simulation results.

  2. Overview of lines.

  3. Selection models.

  4. Analysis of clonal prevalences.

  5. Analysis for the example cell line joxm.

  6. Variance components analysis.

  7. Differential expression analysis.

  8. Analysis of effects of somatic variants on cis gene expression.

Data availability

This is a complicated project, and reproducing all of the results presented, especially from raw data, is highly non-trivial. Nevertheless, we have made all data available so that the results can be reproduced in full.

Single-cell RNA-seq data have been deposited in the ArrayExpress database at EMBL-EBI under accession number E-MTAB-7167. Whole-exome sequencing data are available through the HipSci portal. Processed data and large results files are available from Zenodo with DOI 10.5281/zenodo.1403510.

To set up the project to reproduce our analyses, first clone the source code repository from GitHub. Next, download all of the reference, metadata and results files and add them to the (cloned) project folder with the following structure:

.
├── data
│   ├── canopy
│   │   ├── canopy_results.*.rds
│   ├── cell_assignment
│   │   ├── cardelino_results.*.rds
│   ├── de_analysis_FTv62
│   │   ├── cellcycle_analyses
│   │   │   ├── filt_lenient.all_filt_sites.de_results_unstimulated_cells.cc.rds
│   │   │   ├── filt_lenient.cell_coverage_sites.de_results_unstimulated_cells.cc.rds
│   │   │   ├── filt_strict.all_filt_sites.de_results_unstimulated_cells.cc.rds
│   │   │   └── filt_strict.cell_coverage_sites.de_results_unstimulated_cells.cc.rds
│   │   ├── filt_lenient.all_filt_sites.de_results_unstimulated_cells.rds
│   │   ├── filt_lenient.cell_coverage_sites.de_results_unstimulated_cells.rds
│   │   ├── filt_strict.all_filt_sites.de_results_unstimulated_cells.rds
│   │   └── filt_strict.cell_coverage_sites.de_results_unstimulated_cells.rds
│   ├── donor_info_070818.txt
│   ├── donor_info_core.csv
│   ├── donor_neutrality.tsv
│   ├── exome-point-mutations
│   │   ├── high-vs-low-exomes.v62.ft.alldonors-filt_lenient.all_filt_sites.vep_most_severe_csq.txt
│   │   └── high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz
│   ├── human_H_v5p2.rdata
│   ├── human_c2_v5p2.rdata
│   ├── human_c6_v5p2.rdata
│   ├── neg-bin-rsquared-petr.csv
│   ├── neutralitytestr-petr.tsv
│   ├── sces
│   │   ├── sce_*.rds
│   ├── selection
│   │   ├── neg-bin-params-fit.csv
│   │   ├── neg-bin-rsquared-fit.csv
│   ├── simulations
│   │   ├── *.filt_lenient.cell_coverage_sites.mult.rds
│   │   ├── *.simulate.rds
│   └── variance_components
│       ├── covar_all.csv
│       ├── donorVar
│       │   ├── *.var_part.var1.csv
│       ├── fit_all_gene_highVar.csv
│       ├── fit_per_gene_highVar.csv
│       ├── gene_info_all.csv
│       └── logcnt_all.csv
├── metadata
│   ├── cell_metadata.csv
│   └── data_processing_metadata.tsv
├── references
│   ├── 1000G_phase1.indels.hg19.sites.vcf.gz
│   ├── GRCh37.p13.genome.ERCC92.fa
│   ├── Homo_sapiens.GRCh37.rel75.cdna.all.ERCC92.fa.gz
│   ├── Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
│   ├── dbsnp_138.hg19.biallelicSNPs.HumanCoreExome12.Top1000ExpressedIpsGenes.Maf0.01.HWE0.0001.HipSci.vcf.gz
│   ├── dbsnp_138.hg19.vcf.gz
│   ├── gencode.v19.annotation_ERCC.gtf
│   ├── hipsci.wec.gtarray.HumanCoreExome.imputed_phased.20170327.genotypes.allchr.fibro_samples_v2_filt_vars_sorted_oa.vcf.gz
│   ├── hipsci.wec.gtarray.HumanCoreExome.imputed_phased.20170327.genotypes.allchr.fibro_samples_v2_filt_vars_sorted_oa.vcf.gz.csi
│   └── knownIndels.intervals

For simplicity, we ignore all the directories and files present in the source code repository (which you should have cloned) and focus just on where you should add the files downloaded from Zenodo. Yes, it’s still complicated, but such is life.

There is a large number of canopy_results.*.rds files: these should be stored in the data/canopy directory. Similarly, all of the cardelino_results.*.rds files should be stored in data/cell_assignment. All of the SingleCellExperiment object files (sce_*.rds) should be stored in data/sces. Simulation results files (*.mult.rds; *.simulate.rds) should be stored in data/simulations. Variance components results should be stored in data/variance_components as shown above.
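The placement rules above lend themselves to a small helper. The Python sketch below is one possible way to move downloaded files into place. It assumes all downloaded files sit in a single flat folder, and the function names (`destination_for`, `sort_downloads`) are hypothetical. The filename patterns and destination directories are taken from the paragraph above; the variance components files are not covered and still need to be placed by hand.

```python
from fnmatch import fnmatchcase
from pathlib import Path
import shutil

# Filename patterns and their destination directories, as described above.
ROUTES = [
    ("canopy_results.*.rds", "data/canopy"),
    ("cardelino_results.*.rds", "data/cell_assignment"),
    ("sce_*.rds", "data/sces"),
    ("*.mult.rds", "data/simulations"),
    ("*.simulate.rds", "data/simulations"),
]


def destination_for(filename):
    """Return the project subdirectory for a filename, or None if no rule matches."""
    for pattern, dest in ROUTES:
        if fnmatchcase(filename, pattern):
            return dest
    return None


def sort_downloads(download_dir, project_dir):
    """Move matching files from download_dir into the project layout."""
    moved = []
    for path in sorted(Path(download_dir).iterdir()):
        dest = destination_for(path.name) if path.is_file() else None
        if dest is None:
            continue  # leave directories and unrecognised files where they are
        target_dir = Path(project_dir) / dest
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        moved.append((path.name, dest))
    return moved
```

Calling `sort_downloads("downloads", ".")` from the cloned project folder (where `downloads` is a placeholder for wherever the Zenodo files were saved) returns the list of files moved, which can be checked against the tree above.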

Differential expression results belong in data/de_analysis_FTv62.

Metadata files belong in metadata. Reference files belong in references.

With the data downloaded and organised as above, you will be able to reproduce the analyses presented in the R Markdown files linked to above and, if desired, even run the whole analysis pipeline from raw reads to results following these instructions.
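Before rendering any of the analyses, it may help to confirm that the layout is in place. The Python sketch below checks a subset of the entries from the tree shown earlier. The `REQUIRED` list is an illustrative selection rather than a complete manifest, and `missing_entries` is a hypothetical helper name.

```python
from pathlib import Path

# A selection of paths and glob patterns from the directory tree shown earlier.
# Extend this list to cover every file your analysis of interest needs.
REQUIRED = [
    "data/canopy/canopy_results.*.rds",
    "data/cell_assignment/cardelino_results.*.rds",
    "data/sces/sce_*.rds",
    "data/simulations/*.simulate.rds",
    "data/donor_info_core.csv",
    "data/variance_components/covar_all.csv",
    "metadata/cell_metadata.csv",
    "references/knownIndels.intervals",
]


def missing_entries(project_dir, required=REQUIRED):
    """Return the patterns from `required` that match nothing under project_dir."""
    root = Path(project_dir)
    return [pattern for pattern in required if not any(root.glob(pattern))]
```

An empty list from `missing_entries(".")`, run from the project folder, means every listed entry was found. Any patterns returned point to files that still need to be downloaded or moved.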


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.