ggplot2でお絵かき:プロット+平均値+標準偏差を示す

バイオ系博士学生のGoro (@BioDr_goro) です。

以前書いた記事の続編です。

lpasteur.hatenablog.com

生命科学系の論文では，再現がとれないことがしばしばあります。
データは信用できるのか，統計手法に間違いはないか，をハッキリさせるため，生データの開示が義務付けられているジャーナルもあります。
下記の(A) のような図ではなく(B) が推奨されています (両者とも同じデータを扱っています)。

f:id:LPasteur:20190415014139p:plain — 遺伝子AとBの実験開始時および12時間後における発現量. (A) 棒グラフ, (B) 散布図.

それでも！私は！！棒グラフがいいんだ！！！
という方は前回の記事を参照してください。

本記事では (B) の図がRのggplot2で美しく書くためのコードを紹介します。
エクセルでは限界があるので，Rで描画することがオススメです。

1. 作業ディレクトリを変更する
2. 今回使うパッケージを呼び出す
3. データをRに読み込ませる
4. 平均値と標準偏差を求める
4. 生データをプロットして平均値も示す
5. 生データも平均値も標準偏差も示す
参考にしたサイト

1. 作業ディレクトリを変更する

setwd("ファイルが保存されているディレクトリを指定")

Rstudioを使っているのであれば，
ツールバー > Session > Set Working Directory > Choose Directory (control + shift + H)
でディレクトリを変更できます。

f:id:LPasteur:20200617004345p:plain

2. 今回使うパッケージを呼び出す

library(ggplot2)
library(ggthemes)
library(dplyr)
library(ggsignif)

3. データをRに読み込ませる

前回と同じデモデータを使用します。
demo_data - Google ドライブ

表全体をコピーして，クリップボードから読み込ませたい場合は，

#Windows
demo_1 <- read.table("clipborad", header = T, sep = "\t")

#Mac
demo_1 <- read.table(pipe("pbpaste"), sep = "\t", header = T)

となります。

もしcsvファイルを読み込ませたいのであれば，

demo_1 <- read.csv("demo_data.csv", header = T, sep = ",")

としてください。

4. 平均値と標準偏差を求める

前回はdplyrを用いて，"data_mean_sd" というオブジェクトに平均値と標準偏差を格納しました。

demo_mean_sd <- demo_1 %>% group_by(group, time) %>% summarize(mean = mean(value), sd = sd(value))

今回は図示の都合上，平均値と標準偏差をaggregate関数を用いて別々のオブジェクトへ格納します。

aggregate関数の使い方は，こんな感じです。
オブジェクト名 <- aggregate(集計列 ~ キー列その1:キー列その2, オブジェクト, 関数指定)
~ キー列その1:キー列その2で，キー列その1とその2のすべての組み合わせでグループ化しています。
そしてグループ化されたそれぞれ集計列に対して，指定した計算をしています。

mean <- aggregate(value ~ group:time, demo_1, mean)
sd <- aggregate(value ~ group:time, demo_1, sd)

4. 生データをプロットして平均値も示す

ひとまず標準偏差sdは置いておきます。

ggplotでベースを作り，
geom_pointで生データをプロットし，
stat_summaryで平均値をバーで示し，
labs() で軸ラベルの名前を変え，
theme_few()，theme()，scale_y_continous() で体裁を整えます。

plot <- ggplot(demo_1, aes(x = interaction(factor(time), factor(group)), y = value)) +
geom_point(shape = 21, size = 3, aes(fill = time)) +
scale_fill_manual(values = c("white", "gray")) +
stat_summary(fun.y = "mean", fun.ymin = function(x)mean(x), fun.ymax = function(x)mean(x), geom = "crossbar", colour = "black", width = 0.2) +
labs(x = "", y = "Gene expression [copies/ng]") +
theme_few() +
theme(axis.title.x = element_text(size = 10, family = "Helvetica"), axis.title.y = element_text(size = 10, family = "Helvetica"), axis.text.x = element_text(size = 10, colour = 1, family = "Helvetica"), axis.text.y = element_text(size = 10, colour = 1, family = "Helvetica")) +
scale_y_continuous(expand = c(0,0), limits = c(0,8))

plot

f:id:LPasteur:20200617010039j:plain

プロットが重なっていると見えにくいので，geom_plot() ではなくgeom_jitter() を使います。

plot <- ggplot(demo_1, aes(x = interaction(factor(time), factor(group)), y = value))  +
geom_jitter(shape = 21, size = 3, aes(fill = time)) +
scale_fill_manual(values = c("white", "gray")) +
stat_summary(fun.y = "mean", fun.ymin = function(x)mean(x), fun.ymax = function(x)mean(x), geom = "crossbar", colour = "black", width = 0.2) +
labs(x = "", y = "Gene expression [copies/ng]") +
theme_few() +
theme(axis.title.x = element_text(size = 10, family = "Helvetica"), axis.title.y = element_text(size = 10, family = "Helvetica"), axis.text.x = element_text(size = 10, colour = 1, family = "Helvetica"), axis.text.y = element_text(size = 10, colour = 1, family = "Helvetica")) + 
scale_y_continuous(expand = c(0,0), limits = c(0,8))

plot

f:id:LPasteur:20200617011428j:plain

プロットが横に広がりすぎているので，geom_jitterのwidthを変更します。

plot <- ggplot(demo_1, aes(x = interaction(factor(time), factor(group)), y = value))  +
geom_jitter(shape = 21, size = 3, aes(fill = time), width = 0.2) +
scale_fill_manual(values = c("white", "gray")) +
stat_summary(fun.y = "mean", fun.ymin = function(x)mean(x), fun.ymax = function(x)mean(x), geom = "crossbar", colour = "black", width = 0.2) +
labs(x = "", y = "Gene expression [copies/ng]") +
theme_few() +
theme(axis.title.x = element_text(size = 10, family = "Helvetica"), axis.title.y = element_text(size = 10, family = "Helvetica"), axis.text.x = element_text(size = 10, colour = 1, family = "Helvetica"), axis.text.y = element_text(size = 10, colour = 1, family = "Helvetica")) + 
scale_y_continuous(expand = c(0,0), limits = c(0,8))

plot

f:id:LPasteur:20200617011548j:plain

5. 生データも平均値も標準偏差も示す

平均値を示すためのstat_summaryと，標準偏差を示すためのstat_summaryを設定します。

plot <- ggplot(demo_1, aes(x = interaction(factor(time), factor(group)), y = value))  +
geom_jitter(shape = 21, size = 3, aes(fill = time), width = 0.2) +
scale_fill_manual(values = c("white", "gray")) +
stat_summary(fun.y = "mean", fun.ymin = function(x)mean(x), fun.ymax = function(x)mean(x), geom = "crossbar", colour = "black", width = 0.2) +
stat_summary(fun.ymin = function(x)mean(x) - sd(x), fun.ymax = function(x)mean(x) + sd(x), geom = "errorbar", colour = "black", width = 0.1) +
labs(x = "", y = "Gene expression [copies/ng]") +
theme_few() +
theme(axis.title.x = element_text(size = 10, family = "Helvetica"), axis.title.y = element_text(size = 10, family = "Helvetica"), axis.text.x = element_text(size = 10, colour = 1, family = "Helvetica"), axis.text.y = element_text(size = 10, colour = 1, family = "Helvetica")) + 
scale_y_continuous(expand = c(0,0), limits = c(0,8))

plot

f:id:LPasteur:20200617011512j:plain

計算したp値を図示します。計算方法は前回の記事を参照してください。

plot <- ggplot(demo_1, aes(x = interaction(factor(time), factor(group)), y = value))  +
geom_jitter(shape = 21, size = 3, aes(fill = time), width = 0.2) +
scale_fill_manual(values = c("white", "gray")) +
stat_summary(fun.y = "mean", fun.ymin = function(x)mean(x), fun.ymax = function(x)mean(x), geom = "crossbar", colour = "black", width = 0.2) +
stat_summary(fun.ymin = function(x)mean(x) - sd(x), fun.ymax = function(x)mean(x) + sd(x), geom = "errorbar", colour = "black", width = 0.1) +
labs(x = "", y = "Gene expression [copies/ng]") +
theme_few() +
theme(axis.title.x = element_text(size = 10, family = "Helvetica"), axis.title.y = element_text(size = 10, family = "Helvetica"), axis.text.x = element_text(size = 10, colour = 1, family = "Helvetica"), axis.text.y = element_text(size = 10, colour = 1, family = "Helvetica")) + 
scale_y_continuous(expand = c(0,0), limits = c(0,10)) +
geom_signif(y_position=c(7, 8), xmin=c(1, 3), xmax=c(2, 4), annotation=c("p < 0.001", "p < 0.05"), tip_length=0, colour = "black")

plot