Shrinkage in DESeq2

2024-09-01 843 words 4 minutes

Contents

DESeq2 is a popular bioinformatics tool for identifying differentially expressed genes from RNA-Seq data.

In DESeq2, the calculation of log2 fold changes (LFC) is a key step. However, LFC can be misleading, particularly with extreme fold changes for genes with low counts or high variability.

DESeq2 addresses this noisy LFC by applying shrinkage to log2 fold changes (implemented as lfcShrink function) to reduce the impact of outliers and making the results more reliable for ranking and visualization.

The following example explains how to use shrinkage of log2 fold changes using DESeq2.

Load count dataset

We will use the sequence read count data from the pasilla library for performing differential gene expression analysis and applying shrinkage of log2 fold changes.

Load the read count data and sample information table,

# import librar
library("pasilla")

# load count data
counts <- system.file("extdata",
                      "pasilla_gene_counts.tsv",
                      package="pasilla", mustWork=TRUE)
					  
# convert to matrix
counts_mat <- as.matrix(read.csv(counts, sep="\t", row.names="gene_id"))

head(counts_mat)

# output
            untreated1 untreated2 untreated3 untreated4 treated1 treated2 treated3
FBgn0000003          0          0          0          0        0        0        1
FBgn0000008         92        161         76         70      140       88       70
FBgn0000014          5          1          0          0        4        0        0
FBgn0000015          0          2          1          2        1        0        0
FBgn0000017       4664       8714       3564       3150     6205     3072     3334
FBgn0000018        583        761        245        310      722      299      308
					  

# load sample information table
anot <- system.file("extdata",
                       "pasilla_sample_annotation.csv",
                       package="pasilla", mustWork=TRUE)
coldata <- read.csv(anot, row.names=1)
coldata <- coldata[,c("condition", "type")]
coldata$condition <- factor(coldata$condition)

# output
             condition        type
treated1fb     treated single-read
treated2fb     treated  paired-end
treated3fb     treated  paired-end
untreated1fb untreated single-read
untreated2fb untreated single-read
untreated3fb untreated  paired-end

In the read count dataset, we have count data from two conditions viz. treated and untreated samples. The treated and untreated samples have three and four replicates, respectively.

For DESeq2, the column names of the count table and row names of the sample information table must match and be in the same order.

# make column names from count data and row names from sample table similar
rownames(coldata) <- sub("fb", "", rownames(coldata))
all(rownames(coldata) %in% colnames(counts_mat))

# output
[1] TRUE

# make column names from count data and row names from sample table in same order
counts_mat <- counts_mat[, rownames(coldata)]
all(rownames(coldata) == colnames(counts_mat))

# output
[1] TRUE

Construct a DESeqDataSet and set reference level

Construct a DESeqDataSet for differential gene expression analysis using DESeqDataSetFromMatrix function.

In this step, you can also assign the control sample using relevel for calculating the expression changes between treated and untreated samples.

# import library
library("DESeq2")

# construct DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts_mat,
                              colData = coldata,
                              design = ~ condition)

# set untreated as control sample
dds$condition <- relevel(dds$condition, ref = "untreated")

Additionally, you can also apply filtering before differential gene expression analysis to remove the low count genes.

Differential gene expression analysis

Perform differential gene expression analysis using DESeq

dds <- DESeq(dds)
res <- results(dds)
res

head(res)
# output

log2 fold change (MLE): condition untreated vs treated 
Wald test p-value: condition untreated vs treated 
DataFrame with 6 rows and 6 columns
               baseMean log2FoldChange     lfcSE        stat    pvalue      padj
              <numeric>      <numeric> <numeric>   <numeric> <numeric> <numeric>
FBgn0000003    0.171569    -1.02604541  3.805503 -0.26962147 0.7874515        NA
FBgn0000008   95.144079    -0.00215142  0.223884 -0.00960955 0.9923328  0.996927
FBgn0000014    1.056572     0.49673557  2.160264  0.22994204 0.8181368        NA
FBgn0000015    0.846723     1.88276170  2.106432  0.89381546 0.3714206        NA
FBgn0000017 4352.592899     0.24002523  0.126024  1.90459450 0.0568328  0.282361
FBgn0000018  418.614931     0.10479911  0.148280  0.70676360 0.4797134  0.823907

The output reports log2 fold changes (log2FoldChange), p value (pvalue), and adjusted p value (padj).

Visualize the differential gene expression using MA plot,

plotMA(res, ylim=c(-2,2))

Shrinkage of log2 fold changes

The low count or highly variable count genes could impact the true estimation of log2 fold changes and can lead to unreliable results.

In this case, you can use shrinkage of log2 fold changes using lfcShrink function for reducing the impact of extreme fold changes and making the results more reliable for ranking and visualization.

The lfcShrink function needs a coefficient that we want to shrink, which can be obtained by resultsNames function.

resultsNames(dds)

# output
[1] "Intercept"                      "condition_untreated_vs_treated"

Perform shrinkage of log2 fold changes for condition_untreated_vs_treated comparison,

resLFC <- lfcShrink(dds, coef="condition_untreated_vs_treated", type="apeglm")

head(resLFC)

# output
log2 fold change (MAP): condition untreated vs treated 
Wald test p-value: condition untreated vs treated 
DataFrame with 6 rows and 5 columns
               baseMean log2FoldChange     lfcSE    pvalue      padj
              <numeric>      <numeric> <numeric> <numeric> <numeric>
FBgn0000003    0.171569   -0.007217218  0.205803 0.7874515        NA
FBgn0000008   95.144079   -0.000798434  0.151704 0.9923328  0.996927
FBgn0000014    1.056572    0.004660654  0.204893 0.8181368        NA
FBgn0000015    0.846723    0.018121745  0.206169 0.3714206        NA
FBgn0000017 4352.592899    0.190154460  0.120136 0.0568328  0.282361
FBgn0000018  418.614931    0.070624117  0.123741 0.4797134  0.823907

In the above example, we have used the apeglm function for effect size shrinkage. The apeglm is the default estimator in DESeq2.

In addition, you can also use ashr and normal methods for shrinkage estimation.

Visualize the differential gene expression after shrinkage of log2 fold changes using MA plot,

plotMA(resLFC, ylim=c(-2,2))

/images/posts/88_ma_shrinkage.png — MA plot with shrinkage of log2 fold changes

If you compare the MA plot with and without shrinkage of log2 fold changes, you can see that with shrinkage the extreme points are pulled towards the center, making it easier to identify truly differentially expressed genes while reducing the noise from low-count genes.