Last updated: 2019-10-30

Checks: 7 0

Knit directory: fibroblast-clonality/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20180807) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .vscode/
    Ignored:    code/.DS_Store
    Ignored:    code/selection/.DS_Store
    Ignored:    code/selection/.Rhistory
    Ignored:    code/selection/figures/
    Ignored:    data/.DS_Store
    Ignored:    logs/
    Ignored:    src/.DS_Store
    Ignored:    src/Rmd/.Rhistory

Untracked files:
    Untracked:  .dockerignore
    Untracked:  .dropbox
    Untracked:  .snakemake/
    Untracked:  Rplots.pdf
    Untracked:  Snakefile_clonality
    Untracked:  Snakefile_somatic_calling
    Untracked:  analysis/.ipynb_checkpoints/
    Untracked:  analysis/assess_mutect2_fibro-ipsc_variant_calls.ipynb
    Untracked:  analysis/cardelino_fig1b.R
    Untracked:  analysis/cardelino_fig2b.R
    Untracked:  code/analysis_for_garx.Rmd
    Untracked:  code/selection/data/
    Untracked:  code/selection/fit-dist.nb
    Untracked:  code/selection/result-figure.R
    Untracked:  code/yuanhua/
    Untracked:  data/Melanoma-RegevGarraway-DFCI-scRNA-Seq/
    Untracked:  data/PRJNA485423/
    Untracked:  data/canopy/
    Untracked:  data/cell_assignment/
    Untracked:  data/cnv/
    Untracked:  data/de_analysis_FTv62/
    Untracked:  data/donor_info_070818.txt
    Untracked:  data/donor_info_core.csv
    Untracked:  data/donor_neutrality.tsv
    Untracked:  data/exome-point-mutations/
    Untracked:  data/fdr10.annot.txt.gz
    Untracked:  data/human_H_v5p2.rdata
    Untracked:  data/human_c2_v5p2.rdata
    Untracked:  data/human_c6_v5p2.rdata
    Untracked:  data/neg-bin-rsquared-petr.csv
    Untracked:  data/neutralitytestr-petr.tsv
    Untracked:  data/raw/
    Untracked:  data/sce_merged_donors_cardelino_donorid_all_qc_filt.rds
    Untracked:  data/sce_merged_donors_cardelino_donorid_all_with_qc_labels.rds
    Untracked:  data/sce_merged_donors_cardelino_donorid_unstim_qc_filt.rds
    Untracked:  data/sces/
    Untracked:  data/selection/
    Untracked:  data/simulations/
    Untracked:  data/variance_components/
    Untracked:  docs/figure/analysis_for_joxm_cardelino-relax.Rmd/
    Untracked:  docs/figure/clone_prevalences_cardelino-relax.Rmd/
    Untracked:  docs/figure/differential_expression_cardelino-relax.Rmd/
    Untracked:  figures/
    Untracked:  output/differential_expression/
    Untracked:  output/differential_expression_cardelino-relax/
    Untracked:  output/donor_specific/
    Untracked:  output/line_info.tsv
    Untracked:  output/nvars_by_category_by_donor.tsv
    Untracked:  output/nvars_by_category_by_line.tsv
    Untracked:  output/variance_components/
    Untracked:  qolg_BIC.pdf
    Untracked:  references/
    Untracked:  reports/
    Untracked:  src/Rmd/DE_pathways_FTv62_callset_clones_pairwise_vs_base.unst_cells.carderelax.Rmd
    Untracked:  src/Rmd/Rplots.pdf
    Untracked:  src/Rmd/cell_assignment_cardelino-relax_template.Rmd
    Untracked:  tree.txt

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd f3a24d1 Davis McCarthy 2019-10-30 Fixing some nameclash bugs
Rmd 550176f Davis McCarthy 2019-10-30 Updating analysis to reflect accepted ms
html 8729e02 davismcc 2018-11-09 Build site.
html 0540cdb davismcc 2018-09-02 Build site.
html f0ed980 davismcc 2018-08-31 Build site.
html ca3438f davismcc 2018-08-29 Build site.
html e573f2f davismcc 2018-08-27 Build site.
html 9ec2a59 davismcc 2018-08-26 Build site.
html 36acf15 davismcc 2018-08-25 Build site.
Rmd d618fe5 davismcc 2018-08-25 Updating analyses
html 090c1b9 davismcc 2018-08-24 Build site.
html d2e8b31 davismcc 2018-08-19 Build site.
html 1489d32 davismcc 2018-08-17 Add html files
Rmd d5e785d davismcc 2018-08-17 Updating to use “line” instead of “donor”
Rmd bcee02b davismcc 2018-08-17 Updating donors to lines in text.
Rmd 0862ede davismcc 2018-08-10 Fixing some typos
Rmd a27cace davismcc 2018-08-10 Updating analysis
Rmd 0c8f70d davismcc 2018-08-10 Tidying up mutated genes analysis
Rmd 1cbadbd davismcc 2018-08-10 Updating analyses.
Rmd de1daa9 davismcc 2018-08-09 Adding analysis for DE between mutated and unmutated clones

Introduction

In this analysis, we look at the direct impact of somatic variants on the genes in which they are located. We combine information from the clonal trees about which variants tag which clones and VEP functional annotations about the expected consequence of each variant. For each gene with a somatic variant (and sufficient gene expression), we test for differential expression between cells that have the mutation (based on clone assignment) and those that do not. We can then compare these “mutated gene” differential expression results for variants with different categories of functional annotation.

Clonal trees were inferred with Canopy using variant allele frequency information from a set of strictly filtered somatic variants called from whole-exome sequencing data. We filtered somatic variants to only include variants that had read coverage in at least one single-cell for the relevant line. Assignment of cells to clones was done with cardelino.

Load libraries and data

Load VEP consequence information.

There are 8503 variants, aggregated across all lines, and we can check the number that fall into the different VEP annotation categories.


               3_prime_UTR_variant                5_prime_UTR_variant 
                               181                                121 
           downstream_gene_variant                 intergenic_variant 
                                13                                 13 
                    intron_variant               mature_miRNA_variant 
                              1539                                  3 
                  missense_variant non_coding_transcript_exon_variant 
                              3648                                428 
         regulatory_region_variant            splice_acceptor_variant 
                                 1                                 41 
              splice_donor_variant              splice_region_variant 
                                24                                291 
                        start_lost                        stop_gained 
                                 9                                227 
                         stop_lost              stop_retained_variant 
                                 2                                  5 
                synonymous_variant              upstream_gene_variant 
                              1923                                 34 

Load exome sites, that is, the somatic SNVs identified with Petr Danecek’s somatic variant calling approach.

Add variant consequences annotations to exome sites.

# A tibble: 8,503 x 3
   var_id           Location          Consequence       
   <chr>            <chr>             <chr>             
 1 chr1:874501_C_T  1:874501-874501   missense_variant  
 2 chr1:915025_G_A  1:915025-915025   missense_variant  
 3 chr1:916622_C_T  1:916622-916622   intron_variant    
 4 chr1:985731_G_A  1:985731-985731   intron_variant    
 5 chr1:985732_G_A  1:985732-985732   intron_variant    
 6 chr1:1019508_G_A 1:1019508-1019508 missense_variant  
 7 chr1:1019509_G_A 1:1019509-1019509 synonymous_variant
 8 chr1:1116160_G_A 1:1116160-1116160 synonymous_variant
 9 chr1:1158664_G_A 1:1158664-1158664 synonymous_variant
10 chr1:1178463_T_A 1:1178463-1178463 missense_variant  
# … with 8,493 more rows

Load SingleCellExperiment objects with single-cell gene expression data.

reading euts :    79 unstimulated cells.
reading fawm :    53 unstimulated cells.
reading feec :    75 unstimulated cells.
reading fikt :    39 unstimulated cells.
reading garx :    70 unstimulated cells.
reading gesg :    105 unstimulated cells.
reading heja :    50 unstimulated cells.
reading hipn :    62 unstimulated cells.
reading ieki :    58 unstimulated cells.
reading joxm :    79 unstimulated cells.
reading kuco :    48 unstimulated cells.
reading laey :    55 unstimulated cells.
reading lexy :    63 unstimulated cells.
reading naju :    44 unstimulated cells.
reading nusw :    60 unstimulated cells.
reading oaaz :    38 unstimulated cells.
reading oilg :    90 unstimulated cells.
reading pipw :    107 unstimulated cells.
reading puie :    41 unstimulated cells.
reading qayj :    97 unstimulated cells.
reading qolg :    36 unstimulated cells.
reading qonc :    58 unstimulated cells.
reading rozh :    91 unstimulated cells.
reading sehl :    30 unstimulated cells.
reading ualf :    89 unstimulated cells.
reading vass :    37 unstimulated cells.
reading vils :    37 unstimulated cells.
reading vuna :    71 unstimulated cells.
reading wahn :    82 unstimulated cells.
reading wetu :    77 unstimulated cells.
reading xugn :    35 unstimulated cells.
reading zoxy :    88 unstimulated cells.

We have SCE objects (expression data and cell assignments) for 32 lines, with 88% of cells confidently assigned to a clone.

Finally, we load transcriptome-wide differential expression results computed using the quasi-likelihood F test method in edgeR.

Annotating variants with clone presence by line

Next, we need to annotate the variants with information about which clones they are expected to be present in (obtained from the Canopy tree inference).

reading euts 
reading fawm 
reading feec 
reading fikt 
reading garx 
reading gesg 
reading heja 
reading hipn 
reading ieki 
reading joxm 
reading kuco 
reading laey 
reading lexy 
reading naju 
reading nusw 
reading oaaz 
reading oilg 
reading pipw 
reading puie 
reading qayj 
reading qolg 
reading qonc 
reading rozh 
reading sehl 
reading ualf 
reading vass 
reading vils 
reading vuna 
reading wahn 
reading wetu 
reading xugn 
reading zoxy 

Differential expression comparing mutated clone to all other clones

Get DE results comparing mutated clone to all unmutated clones, but we first have to organise the data further.

Turn dataframes into GRanges objects to enable fast and robust overlaps.

Run DE testing between mutated and un-mutated clones for all affected genes for each line.

mut_genes_df_allcells_list <- list()
for (don in names(de_res[["sce_list_unst"]])) {
    cat("working on ", don, "\n")
    sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
    ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
    sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
    sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
    sites_tmp$gene <- rownames(sce_tmp)
    dge_tmp <- de_res[["dge_list"]][[don]]
    dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
    base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
    de_tbl_tmp <- data.frame(line = don,
                             gene = sites_tmp$gene, 
                             hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
                             ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
                             var_id = sites_tmp$var_id,
                             location = sites_tmp$Location,
                             consequence = sites_tmp$Consequence,
                             clone_presence = sites_tmp$clone_presence,
                             logFC = NA, logCPM = NA, F = NA, PValue = NA,
                             comment = "", stringsAsFactors = FALSE)
    for (i in seq_len(length(sites_tmp))) {
        clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
        mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
        dsgn_tmp <- cbind(base_design, data.frame(mutatedclone))
        if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)) {
            qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
            de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
            de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
            de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
            de_tbl_tmp$F[i] <- de_tmp$table$F
            de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
        }
        if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
            de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
        if (!is.fullrank(dsgn_tmp))
            de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
    }
    mut_genes_df_allcells_list[[don]] <- de_tbl_tmp
}
working on  euts 
working on  fawm 
working on  feec 
working on  fikt 
working on  garx 
working on  gesg 
working on  heja 
working on  hipn 
working on  ieki 
working on  joxm 
working on  kuco 
working on  laey 
working on  lexy 
working on  naju 
working on  nusw 
working on  oaaz 
working on  oilg 
working on  pipw 
working on  puie 
working on  qayj 
working on  qolg 
working on  qonc 
working on  rozh 
working on  sehl 
working on  ualf 
working on  vass 
working on  vuna 
working on  wahn 
working on  wetu 
working on  xugn 
working on  zoxy 
[1] 1140   13
  line                      gene hgnc_symbol ensembl_gene_id
1 euts ENSG00000006327_TNFRSF12A   TNFRSF12A ENSG00000006327
2 euts      ENSG00000028528_SNX1        SNX1 ENSG00000028528
3 euts      ENSG00000060642_PIGV        PIGV ENSG00000060642
4 euts     ENSG00000083168_KAT6A       KAT6A ENSG00000083168
5 euts      ENSG00000083520_DIS3        DIS3 ENSG00000083520
6 euts    ENSG00000090060_PAPOLA      PAPOLA ENSG00000090060
              var_id             location consequence clone_presence
1  chr16:3071779_A_T   16:3071779-3071779 stop_gained         clone2
2 chr15:64388302_G_A 15:64388302-64388302  synonymous         clone3
3  chr1:27121236_G_A  1:27121236-27121236  synonymous         clone3
4  chr8:41798596_G_A  8:41798596-41798596    missense         clone2
5 chr13:73335940_G_C 13:73335940-73335940  synonymous  clone2&clone3
6 chr14:97000841_C_T 14:97000841-97000841    missense         clone3
       logFC   logCPM          F     PValue                      comment
1  0.0378859 9.221413 0.03458196 0.85301221                             
2 -0.1549842 6.353333 0.08000619 0.77812377                             
3 -2.8765803 4.367360 7.03122364 0.00990008                             
4         NA       NA         NA         NA gene did not pass DE filters
5 -0.7635865 4.366939 0.96241345 0.32995811                             
6  0.0493242 7.067499 0.01151527 0.91485064                             

With this analysis, 1067 genes/variants could be tested for DE between mutated and unmutated clones.

We can recompute false discovery rates with IHW, simplify the VEP annotation categories (assigning all nonsense categories to “nonsense” and all splicing categories to “splicing”) and inspect the results.


               3_prime_UTR                5_prime_UTR 
                        27                         12 
                    intron                   missense 
                        20                        650 
non_coding_transcript_exon                   nonsense 
                        20                         39 
                  splicing                 synonymous 
                        21                        351 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.7450  0.8918  0.7912  0.9408  1.0000      73 

There are few variants with a significant difference in expression between clones with and without the mutation (FDR < 10%).

Plot these results.

Version Author Date
36acf15 davismcc 2018-08-25

The picture can look a little clearer with simplified annotation categories.

Version Author Date
36acf15 davismcc 2018-08-25
d2e8b31 davismcc 2018-08-19

Let us check the median logFC values for each simplified VEP annotation category and add boxplots for logFC to the volcano plots.

# A tibble: 8 x 3
  consequence_simplified          med nvars
  <chr>                         <dbl> <int>
1 non_coding_transcript_exon -0.385      20
2 nonsense                   -0.306      39
3 intron                     -0.176      20
4 3_prime_UTR                -0.0998     27
5 synonymous                 -0.0745    351
6 missense                   -0.00875   650
7 splicing                    0.0130     21
8 5_prime_UTR                 0.609      12

Version Author Date
36acf15 davismcc 2018-08-25
d2e8b31 davismcc 2018-08-19

Version Author Date
36acf15 davismcc 2018-08-25
d2e8b31 davismcc 2018-08-19

We can also just look specifically at the boxplots of the logFC values across categories, first looking at the median logFC values and results from a t-test testing if the mean logFC value is different from zero. Positive logFC values indicate higher expression in the mutated clone, and negative logFC values lower expression in the mutated clone.

# A tibble: 8 x 6
  consequence_simplified          med nvars  t_coef t_stat t_pval
  <chr>                         <dbl> <int>   <dbl>  <dbl>  <dbl>
1 3_prime_UTR                -0.0998     27  0.0935  0.325 0.748 
2 5_prime_UTR                 0.609      12  0.732   2.71  0.0221
3 intron                     -0.176      20 -0.285  -1.72  0.102 
4 missense                   -0.00875   650 -0.0623 -1.30  0.194 
5 non_coding_transcript_exon -0.385      20 -0.197  -1.46  0.159 
6 nonsense                   -0.306      39 -0.189  -1.68  0.101 
7 splicing                    0.0130     21 -0.223  -1.06  0.306 
8 synonymous                 -0.0745    351 -0.115  -1.67  0.0962

Version Author Date
36acf15 davismcc 2018-08-25
d2e8b31 davismcc 2018-08-19

DE for mutated clone: stricter filtering

In the analysis above, the minimum allowed clone size (i.e. group for DE analysis) was 3 cells. Such potentially small group sizes could lead to noisy estimates of the logFC values that we looked at above. Let us repeat the above analysis but increasing the minimum number of cells for a group to 10 to try to obtain more accurate (on average) logFC estimates from the DE model.

Run DE testing between mutated and un-mutated clones for a more strictly filtered set of affected genes for each line.

mut_genes_df_allcells_filt_list <- list()
for (don in names(de_res[["sce_list_unst"]])) {
  cat("working on ", don, "\n")
  sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
  ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
  sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
  sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
  sites_tmp$gene <- rownames(sce_tmp)
  dge_tmp <- de_res[["dge_list"]][[don]]
  dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
  base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
  de_tbl_tmp <- data.frame(line = don,
                           gene = sites_tmp$gene, 
                           hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
                           ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
                           var_id = sites_tmp$var_id,
                           location = sites_tmp$Location,
                           consequence = sites_tmp$Consequence,
                           clone_presence = sites_tmp$clone_presence,
                           logFC = NA, logCPM = NA, F = NA, PValue = NA,
                           comment = "", stringsAsFactors = FALSE)
  for (i in seq_len(length(sites_tmp))) {
    clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
    mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
    dsgn_tmp <- cbind(base_design, data.frame(mutatedclone))
    if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)
        && (sum(mutatedclone) > 9.5) && (sum(!mutatedclone) > 9.5)) {
      qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
      de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
      de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
      de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
      de_tbl_tmp$F[i] <- de_tmp$table$F
      de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
    }
    if ((sum(mutatedclone) < 9.5) || (sum(!mutatedclone) < 9.5))
      de_tbl_tmp$comment[i] <- "minimum group size < 10 cells"
    if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
      de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
    if (!is.fullrank(dsgn_tmp))
      de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
  }
  mut_genes_df_allcells_filt_list[[don]] <- de_tbl_tmp
}
working on  euts 
working on  fawm 
working on  feec 
working on  fikt 
working on  garx 
working on  gesg 
working on  heja 
working on  hipn 
working on  ieki 
working on  joxm 
working on  kuco 
working on  laey 
working on  lexy 
working on  naju 
working on  nusw 
working on  oaaz 
working on  oilg 
working on  pipw 
working on  puie 
working on  qayj 
working on  qolg 
working on  qonc 
working on  rozh 
working on  sehl 
working on  ualf 
working on  vass 
working on  vuna 
working on  wahn 
working on  wetu 
working on  xugn 
working on  zoxy 

With this analysis, 830 could be tested for DE between mutated and unmutated clones (237 fewer than with the lenient filtering above).

We can recompute false discovery rates with IHW, simplify the VEP annotation categories (assigning all nonsense categories to “nonsense” and all splicing categories to “splicing”) and inspect the results.


               3_prime_UTR                5_prime_UTR 
                        27                         12 
                    intron                   missense 
                        20                        650 
non_coding_transcript_exon                   nonsense 
                        20                         39 
                  splicing                 synonymous 
                        21                        351 

Again, we’ll look specifically at the boxplots of the logFC values across categories. Positive logFC values indicate higher expression in the mutated clone, and negative logFC values lower expression in the mutated clone.

# A tibble: 8 x 3
  consequence_simplified          med nvars
  <chr>                         <dbl> <int>
1 non_coding_transcript_exon -0.383      20
2 nonsense                   -0.280      39
3 3_prime_UTR                -0.136      27
4 synonymous                 -0.0649    351
5 intron                     -0.0162     20
6 missense                   -0.00201   650
7 splicing                    0.0324     21
8 5_prime_UTR                 0.609      12

Version Author Date
d2e8b31 davismcc 2018-08-19

On the evidence of the plot above it does not look like stricter filtering of variants fundmentally changes the results. It still does not look possible to discern differences in the logFC distributions of mutated genes between different annotation categories for variants.

DE for mutated clone: fitting PCs

We can also get DE results comparing mutated clone to all unmutated clones, fitting the first couple of PCs in the model to see if this accounts for unwanted variation and increases power to detect differences in expression between mutated and unwanted clones for mutated genes.

## run DE for mutated cells vs unmutated cells using existing DE results
## fitting PCs as covariates in DE model
mut_genes_df_allcells_list_PCreg <- list()
for (don in names(de_res[["sce_list_unst"]])) {
    cat("working on ", don, "\n")
    sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
    ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
    sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
    sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
    sites_tmp$gene <- rownames(sce_tmp)
    dge_tmp <- de_res[["dge_list"]][[don]]
    dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
    base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
    de_tbl_tmp <- data.frame(line = don,
                             gene = sites_tmp$gene, 
                             hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
                             ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
                             var_id = sites_tmp$var_id,
                             location = sites_tmp$Location,
                             consequence = sites_tmp$Consequence,
                             clone_presence = sites_tmp$clone_presence,
                             logFC = NA, logCPM = NA, F = NA, PValue = NA,
                             comment = "", stringsAsFactors = FALSE)
    pca_tmp <- reducedDim(runPCA(
        de_res[["sce_list_unst"]][[don]], ntop = 500, ncomponents = 5), "PCA")
    for (i in seq_len(length(sites_tmp))) {
        clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
        mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
        dsgn_tmp <- cbind(base_design, pca_tmp[, 1], data.frame(mutatedclone))
        if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)) {
            qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
            de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
            de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
            de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
            de_tbl_tmp$F[i] <- de_tmp$table$F
            de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
        }
        if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
            de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
        if (!is.fullrank(dsgn_tmp))
            de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
    }
    mut_genes_df_allcells_list_PCreg[[don]] <- de_tbl_tmp
}
working on  euts 
working on  fawm 
working on  feec 
working on  fikt 
working on  garx 
working on  gesg 
working on  heja 
working on  hipn 
working on  ieki 
working on  joxm 
working on  kuco 
working on  laey 
working on  lexy 
working on  naju 
working on  nusw 
working on  oaaz 
working on  oilg 
working on  pipw 
working on  puie 
working on  qayj 
working on  qolg 
working on  qonc 
working on  rozh 
working on  sehl 
working on  ualf 
working on  vass 
working on  vuna 
working on  wahn 
working on  wetu 
working on  xugn 
working on  zoxy 

               3_prime_UTR                5_prime_UTR 
                        27                         12 
                    intron                   missense 
                        20                        650 
non_coding_transcript_exon                   nonsense 
                        20                         39 
                  splicing                 synonymous 
                        21                        351 

Version Author Date
d2e8b31 davismcc 2018-08-19

There is thus very little change in DE results when regressing out 5 PCs in these DE models.

DE for mutated clones: fitting cell cycle scores

We can also check to see if accounting for cell cycle scores in the DE model increases power to detect differential expression between mutated and unmutated clones for mutated genes.

## run DE for mutated cells vs unmutated cells using existing DE results
## fitting PCs as covariates in DE model
mut_genes_df_allcells_list_CCreg <- list()
for (don in names(de_res[["sce_list_unst"]])) {
    cat("working on ", don, "\n")
    sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
    ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
    sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
    sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
    sites_tmp$gene <- rownames(sce_tmp)
    dge_tmp <- de_res[["dge_list"]][[don]]
    dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
    base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
    de_tbl_tmp <- data.frame(line = don,
                             gene = sites_tmp$gene, 
                             hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
                             ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
                             var_id = sites_tmp$var_id,
                             location = sites_tmp$Location,
                             consequence = sites_tmp$Consequence,
                             clone_presence = sites_tmp$clone_presence,
                             logFC = NA, logCPM = NA, F = NA, PValue = NA,
                             comment = "", stringsAsFactors = FALSE)
    G1 <- de_res[["sce_list_unst"]][[don]]$G1
    S <- de_res[["sce_list_unst"]][[don]]$S
    G2M <- de_res[["sce_list_unst"]][[don]]$G2M
    for (i in seq_len(length(sites_tmp))) {
        clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
        mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
        dsgn_tmp <- cbind(base_design, G1, S, G2M, data.frame(mutatedclone))
        if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)) {
            qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
            de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
            de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
            de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
            de_tbl_tmp$F[i] <- de_tmp$table$F
            de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
        }
        if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
            de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
        if (!is.fullrank(dsgn_tmp))
            de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
    }
    mut_genes_df_allcells_list_CCreg[[don]] <- de_tbl_tmp
}
working on  euts 
working on  fawm 
working on  feec 
working on  fikt 
working on  garx 
working on  gesg 
working on  heja 
working on  hipn 
working on  ieki 
working on  joxm 
working on  kuco 
working on  laey 
working on  lexy 
working on  naju 
working on  nusw 
working on  oaaz 
working on  oilg 
working on  pipw 
working on  puie 
working on  qayj 
working on  qolg 
working on  qonc 
working on  rozh 
working on  sehl 
working on  ualf 
working on  vass 
working on  vuna 
working on  wahn 
working on  wetu 
working on  xugn 
working on  zoxy 

Version Author Date
d2e8b31 davismcc 2018-08-19

Accounting for cell cycle scores (inferred with cyclone) in the DE model does not drastically change DE results, but perhaps does reduce power to find differences in expression between mutated and unmutated clones.

Characterising mutations and mutated genes

First, we can look at the number of variants annotated to the different VEP categories and assigned to combinations of clones, aggregated across all lines.

Version Author Date
36acf15 davismcc 2018-08-25
d2e8b31 davismcc 2018-08-19

We can also produce a similar plot, but showing the total number variants in each annotation category for each line separately.

Version Author Date
36acf15 davismcc 2018-08-25
d2e8b31 davismcc 2018-08-19

Genes with more than one mutation

It’s also of interest to know if there are genes with multiple mutations in a line.

# A tibble: 4 x 4
# Groups:   line [2]
  line  hgnc_symbol n_variants_in_gene max_dist_btw_vars
  <chr> <chr>                    <int>             <dbl>
1 joxm  COL1A1                       2              1815
2 joxm  TBC1D9B                      3             29054
3 qonc  MAPK12                       3                27
4 qonc  WIPI2                        2              1187

There are only 4 genes, across all lines, that have more than one (non-dinucleotide) somatic variant.

In the case of the COL1A1 gene in joxm, there are two splicing variants (one a splice-acceptor variant and one a more general splice-region variant), and cells in the mutated clones have significantly lower expression than the unmutated clones for this gene. For the remaining genes with more than one somatic variant, there is no significant difference in gene expression between the mutated and unmutated clones.

Mutations in cell cycle genes

Given the results of the transcriptome-wide differential expression analyses, it would be interesting to know if cell cycle genes specifically harbour mutations. To explore this, we first obtain the set of genes belonging to cell cycle gene sets in the MSigDB c2 gene set collection.

This yields a list of 1243 genes annotated to be relevant to the cell cycle.

Let us look at the number of variants used for clonal reconstruction (a strictly filtered set) across VEP annotation categories that lie in cell cycle genes.

Version Author Date
36acf15 davismcc 2018-08-25

Mutations in proliferation genes

We can also look at the load of mutations in genes annotated to MSigDB c2 proliferation gene sets.

This yields a list of 1219 genes annotated to be relevant to the cell cycle.

Version Author Date
36acf15 davismcc 2018-08-25

Finally, create a data.frame that contains the counts of variants annotated to different classes of genes and VEP classes, and save to file for use in other analyses.

We can also look at the total mutational load for each of the lines.

# A tibble: 31 x 2
   line  nvars
   <chr> <dbl>
 1 euts     49
 2 fawm     20
 3 feec     24
 4 fikt     27
 5 garx     95
 6 gesg     45
 7 heja     32
 8 hipn     11
 9 ieki     15
10 joxm    116
11 kuco      9
12 laey     63
13 lexy     11
14 naju     15
15 nusw     13
16 oaaz     24
17 oilg     38
18 pipw     56
19 puie     18
20 qayj     16
21 qolg     28
22 qonc     20
23 rozh     16
24 sehl     29
25 ualf     48
26 vass    101
27 vuna     35
28 wahn    115
29 wetu     19
30 xugn     18
31 zoxy     14

Conclusions

We find a small number of somatic variants to associate with differential expression between mutated and unmutated cells/clones. Unfortunately, in this set of variants and lines (with relatively small sample size in terms of numbers of cells), we cannot make any claims about different effects on expression of somatic variants with different functional annotations.


─ Session info ──────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 os       Ubuntu 18.04.3 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_AU.UTF-8                 
 ctype    en_AU.UTF-8                 
 tz       Australia/Melbourne         
 date     2019-10-30                  

─ Packages ──────────────────────────────────────────────────────────────
 package              * version   date       lib source                
 AnnotationDbi        * 1.46.1    2019-08-20 [1] Bioconductor          
 assertthat             0.2.1     2019-03-21 [1] CRAN (R 3.6.0)        
 backports              1.1.4     2019-04-10 [1] CRAN (R 3.6.0)        
 beeswarm               0.2.3     2016-04-25 [1] CRAN (R 3.6.0)        
 Biobase              * 2.44.0    2019-05-02 [1] Bioconductor          
 BiocGenerics         * 0.30.0    2019-05-02 [1] Bioconductor          
 BiocNeighbors          1.2.0     2019-05-02 [1] Bioconductor          
 BiocParallel         * 1.18.1    2019-08-06 [1] Bioconductor          
 BiocSingular           1.0.0     2019-05-02 [1] Bioconductor          
 bit                    1.1-14    2018-05-29 [1] CRAN (R 3.6.0)        
 bit64                  0.9-7     2017-05-08 [1] CRAN (R 3.6.0)        
 bitops                 1.0-6     2013-08-17 [1] CRAN (R 3.6.0)        
 blob                   1.2.0     2019-07-09 [1] CRAN (R 3.6.0)        
 broom                  0.5.2     2019-04-07 [1] CRAN (R 3.6.0)        
 callr                  3.3.2     2019-09-22 [1] CRAN (R 3.6.0)        
 cellranger             1.1.0     2016-07-27 [1] CRAN (R 3.6.0)        
 cli                    1.1.0     2019-03-19 [1] CRAN (R 3.6.0)        
 colorspace             1.4-1     2019-03-18 [1] CRAN (R 3.6.0)        
 cowplot              * 1.0.0     2019-07-11 [1] CRAN (R 3.6.0)        
 crayon                 1.3.4     2017-09-16 [1] CRAN (R 3.6.0)        
 crosstalk              1.0.0     2016-12-21 [1] CRAN (R 3.6.0)        
 DBI                    1.0.0     2018-05-02 [1] CRAN (R 3.6.0)        
 DelayedArray         * 0.10.0    2019-05-02 [1] Bioconductor          
 DelayedMatrixStats     1.6.1     2019-09-08 [1] Bioconductor          
 desc                   1.2.0     2018-05-01 [1] CRAN (R 3.6.0)        
 devtools               2.2.1     2019-09-24 [1] CRAN (R 3.6.0)        
 digest                 0.6.21    2019-09-20 [1] CRAN (R 3.6.0)        
 dplyr                * 0.8.3     2019-07-04 [1] CRAN (R 3.6.0)        
 DT                     0.9       2019-09-17 [1] CRAN (R 3.6.0)        
 edgeR                * 3.26.8    2019-09-01 [1] Bioconductor          
 ellipsis               0.3.0     2019-09-20 [1] CRAN (R 3.6.0)        
 evaluate               0.14      2019-05-28 [1] CRAN (R 3.6.0)        
 fansi                  0.4.0     2018-10-05 [1] CRAN (R 3.6.0)        
 farver                 1.1.0     2018-11-20 [1] CRAN (R 3.6.0)        
 fdrtool                1.2.15    2015-07-08 [1] CRAN (R 3.6.0)        
 forcats              * 0.4.0     2019-02-17 [1] CRAN (R 3.6.0)        
 fs                     1.3.1     2019-05-06 [1] CRAN (R 3.6.0)        
 gdtools              * 0.2.0     2019-09-03 [1] CRAN (R 3.6.0)        
 generics               0.0.2     2018-11-29 [1] CRAN (R 3.6.0)        
 GenomeInfoDb         * 1.20.0    2019-05-02 [1] Bioconductor          
 GenomeInfoDbData       1.2.1     2019-04-30 [1] Bioconductor          
 GenomicRanges        * 1.36.1    2019-09-06 [1] Bioconductor          
 ggbeeswarm             0.6.0     2017-08-07 [1] CRAN (R 3.6.0)        
 ggExtra                0.9       2019-08-27 [1] CRAN (R 3.6.0)        
 ggforce              * 0.3.1     2019-08-20 [1] CRAN (R 3.6.0)        
 ggplot2              * 3.2.1     2019-08-10 [1] CRAN (R 3.6.0)        
 ggrepel              * 0.8.1     2019-05-07 [1] CRAN (R 3.6.0)        
 ggridges             * 0.5.1     2018-09-27 [1] CRAN (R 3.6.0)        
 git2r                  0.26.1    2019-06-29 [1] CRAN (R 3.6.0)        
 glue                   1.3.1     2019-03-12 [1] CRAN (R 3.6.0)        
 gridExtra              2.3       2017-09-09 [1] CRAN (R 3.6.0)        
 gtable                 0.3.0     2019-03-25 [1] CRAN (R 3.6.0)        
 haven                  2.1.1     2019-07-04 [1] CRAN (R 3.6.0)        
 hms                    0.5.1     2019-08-23 [1] CRAN (R 3.6.0)        
 htmltools              0.3.6     2017-04-28 [1] CRAN (R 3.6.0)        
 htmlwidgets            1.3       2018-09-30 [1] CRAN (R 3.6.0)        
 httpuv                 1.5.2     2019-09-11 [1] CRAN (R 3.6.0)        
 httr                   1.4.1     2019-08-05 [1] CRAN (R 3.6.0)        
 IHW                  * 1.12.0    2019-05-02 [1] Bioconductor          
 IRanges              * 2.18.3    2019-09-24 [1] Bioconductor          
 irlba                  2.3.3     2019-02-05 [1] CRAN (R 3.6.0)        
 jsonlite               1.6       2018-12-07 [1] CRAN (R 3.6.0)        
 knitr                  1.25      2019-09-18 [1] CRAN (R 3.6.0)        
 labeling               0.3       2014-08-23 [1] CRAN (R 3.6.0)        
 later                  0.8.0     2019-02-11 [1] CRAN (R 3.6.0)        
 lattice                0.20-38   2018-11-04 [4] CRAN (R 3.5.1)        
 lazyeval               0.2.2     2019-03-15 [1] CRAN (R 3.6.0)        
 lifecycle              0.1.0     2019-08-01 [1] CRAN (R 3.6.0)        
 limma                * 3.40.6    2019-07-26 [1] Bioconductor          
 locfit                 1.5-9.1   2013-04-20 [1] CRAN (R 3.6.0)        
 lpsymphony             1.12.0    2019-05-02 [1] Bioconductor (R 3.6.0)
 lubridate              1.7.4     2018-04-11 [1] CRAN (R 3.6.0)        
 magrittr               1.5       2014-11-22 [1] CRAN (R 3.6.0)        
 MASS                   7.3-51.1  2018-11-01 [4] CRAN (R 3.5.1)        
 Matrix                 1.2-17    2019-03-22 [1] CRAN (R 3.6.0)        
 matrixStats          * 0.55.0    2019-09-07 [1] CRAN (R 3.6.0)        
 memoise                1.1.0     2017-04-21 [1] CRAN (R 3.6.0)        
 mgcv                   1.8-28    2019-03-21 [4] CRAN (R 3.5.3)        
 mime                   0.7       2019-06-11 [1] CRAN (R 3.6.0)        
 miniUI                 0.1.1.1   2018-05-18 [1] CRAN (R 3.6.0)        
 modelr                 0.1.5     2019-08-08 [1] CRAN (R 3.6.0)        
 munsell                0.5.0     2018-06-12 [1] CRAN (R 3.6.0)        
 nlme                   3.1-139   2019-04-09 [4] CRAN (R 3.5.3)        
 org.Hs.eg.db         * 3.8.2     2019-05-01 [1] Bioconductor          
 pillar                 1.4.2     2019-06-29 [1] CRAN (R 3.6.0)        
 pkgbuild               1.0.5     2019-08-26 [1] CRAN (R 3.6.0)        
 pkgconfig              2.0.3     2019-09-22 [1] CRAN (R 3.6.0)        
 pkgload                1.0.2     2018-10-29 [1] CRAN (R 3.6.0)        
 plyr                   1.8.4     2016-06-08 [1] CRAN (R 3.6.0)        
 polyclip               1.10-0    2019-03-14 [1] CRAN (R 3.6.0)        
 prettyunits            1.0.2     2015-07-13 [1] CRAN (R 3.6.0)        
 processx               3.4.1     2019-07-18 [1] CRAN (R 3.6.0)        
 promises               1.0.1     2018-04-13 [1] CRAN (R 3.6.0)        
 ps                     1.3.0     2018-12-21 [1] CRAN (R 3.6.0)        
 purrr                * 0.3.2     2019-03-15 [1] CRAN (R 3.6.0)        
 R6                     2.4.0     2019-02-14 [1] CRAN (R 3.6.0)        
 RColorBrewer         * 1.1-2     2014-12-07 [1] CRAN (R 3.6.0)        
 Rcpp                   1.0.2     2019-07-25 [1] CRAN (R 3.6.0)        
 RCurl                  1.95-4.12 2019-03-04 [1] CRAN (R 3.6.0)        
 readr                * 1.3.1     2018-12-21 [1] CRAN (R 3.6.0)        
 readxl                 1.3.1     2019-03-13 [1] CRAN (R 3.6.0)        
 remotes                2.1.0     2019-06-24 [1] CRAN (R 3.6.0)        
 rlang                * 0.4.0     2019-06-25 [1] CRAN (R 3.6.0)        
 rmarkdown              1.15      2019-08-21 [1] CRAN (R 3.6.0)        
 rprojroot              1.3-2     2018-01-03 [1] CRAN (R 3.6.0)        
 RSQLite                2.1.2     2019-07-24 [1] CRAN (R 3.6.0)        
 rstudioapi             0.10      2019-03-19 [1] CRAN (R 3.6.0)        
 rsvd                   1.0.2     2019-07-29 [1] CRAN (R 3.6.0)        
 rvest                  0.3.4     2019-05-15 [1] CRAN (R 3.6.0)        
 S4Vectors            * 0.22.1    2019-09-09 [1] Bioconductor          
 scales                 1.0.0     2018-08-09 [1] CRAN (R 3.6.0)        
 scater               * 1.12.2    2019-05-24 [1] Bioconductor          
 sessioninfo            1.1.1     2018-11-05 [1] CRAN (R 3.6.0)        
 shiny                  1.3.2     2019-04-22 [1] CRAN (R 3.6.0)        
 SingleCellExperiment * 1.6.0     2019-05-02 [1] Bioconductor          
 slam                   0.1-45    2019-02-26 [1] CRAN (R 3.6.0)        
 stringi                1.4.3     2019-03-12 [1] CRAN (R 3.6.0)        
 stringr              * 1.4.0     2019-02-10 [1] CRAN (R 3.6.0)        
 SummarizedExperiment * 1.14.1    2019-07-31 [1] Bioconductor          
 superheat            * 0.1.0     2017-02-04 [1] CRAN (R 3.6.0)        
 svglite                1.2.2     2019-05-17 [1] CRAN (R 3.6.0)        
 systemfonts            0.1.1     2019-07-01 [1] CRAN (R 3.6.0)        
 testthat               2.2.1     2019-07-25 [1] CRAN (R 3.6.0)        
 tibble               * 2.1.3     2019-06-06 [1] CRAN (R 3.6.0)        
 tidyr                * 1.0.0     2019-09-11 [1] CRAN (R 3.6.0)        
 tidyselect             0.2.5     2018-10-11 [1] CRAN (R 3.6.0)        
 tidyverse            * 1.2.1     2017-11-14 [1] CRAN (R 3.6.0)        
 tweenr                 1.0.1     2018-12-14 [1] CRAN (R 3.6.0)        
 usethis                1.5.1     2019-07-04 [1] CRAN (R 3.6.0)        
 utf8                   1.1.4     2018-05-24 [1] CRAN (R 3.6.0)        
 vctrs                  0.2.0     2019-07-05 [1] CRAN (R 3.6.0)        
 vipor                  0.4.5     2017-03-22 [1] CRAN (R 3.6.0)        
 viridis              * 0.5.1     2018-03-29 [1] CRAN (R 3.6.0)        
 viridisLite          * 0.3.0     2018-02-01 [1] CRAN (R 3.6.0)        
 whisker                0.4       2019-08-28 [1] CRAN (R 3.6.0)        
 withr                  2.1.2     2018-03-15 [1] CRAN (R 3.6.0)        
 workflowr              1.4.0     2019-06-08 [1] CRAN (R 3.6.0)        
 xfun                   0.9       2019-08-21 [1] CRAN (R 3.6.0)        
 xml2                   1.2.2     2019-08-09 [1] CRAN (R 3.6.0)        
 xtable                 1.8-4     2019-04-21 [1] CRAN (R 3.6.0)        
 XVector                0.24.0    2019-05-02 [1] Bioconductor          
 yaml                   2.2.0     2018-07-25 [1] CRAN (R 3.6.0)        
 zeallot                0.1.0     2018-01-28 [1] CRAN (R 3.6.0)        
 zlibbioc               1.30.0    2019-05-02 [1] Bioconductor          

[1] /home/AD.SVI.EDU.AU/dmccarthy/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library