R: How to plot a line graph of normalized differences from binned data using ggplot? (tags: r, ggplot2)


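I have several sets of data for which I compute binned, normalized differences. I would like to plot the results in a single line plot using ggplot. The lines representing the different combinations of pairwise differences should be distinguished by colour and linetype.

I am stuck on extracting the computed values from the bins, which should become the y-axis values, and plotting them against the bins on the x-axis.

Below is the code I use to import the data and compute the normalized differences.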

library(data.table)  # provides fread()

# Read data from column 3 as data table for different number of rows
# you could use replicate here for test
# dat1 <- data.frame(replicate(1,sample(25:50,10000,rep=TRUE)))
# dat2 <- data.frame(replicate(1,sample(25:50,9500,rep=TRUE)))
dat1 <- fread("/dir01/a/dat01.txt", header = FALSE, data.table=FALSE, select=c(3))
dat2 <- fread("/dir02/c/dat02.txt", header = FALSE, data.table=FALSE, select=c(3))

# Change column names
colnames(dat1) <- c("Dat1")
colnames(dat2) <- c("Dat2")

# Perhaps there is a better way to compute the following as all-in-one? I have broken these down step by step.
# 1) Sum for each bin
bin1 = cut(dat1$Dat1, breaks = seq(25, 50, by = 2))
sum1 = tapply(dat1$Dat1, bin1, sum)

bin2 = cut(dat2$Dat2, breaks = seq(25, 50, by = 2))
sum2 = tapply(dat2$Dat2, bin2, sum)

# 2) Total sum of all bins
sumt1 = sum(sum1)
sumt2 = sum(sum2)

# 3) Divide each bin by total sum of all bins
sumn1 = lapply(sum1, `/`, sumt1)
sumn2 = lapply(sum2, `/`, sumt2)

# 4) Convert to data frame as I'm not sure how to difference otherwise
df_sumn1 = data.frame(sumn1)
df_sumn2 = data.frame(sumn2)

# 5) Difference between the two as percentage
dbin = (df_sumn1 - df_sumn2)*100
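The code comment above asks whether steps 1 to 5 can be computed in one go. The following is a minimal base-R sketch of one way to do that, assuming simulated data in place of the fread() files. The names `norm_bin_sums`, `x1`, `x2`, and `dbin_df` are hypothetical and not part of the original post. Unlike the step-by-step code, this sketch treats empty bins as zero rather than NA.

```r
# Hypothetical helper (not from the original post): normalised bin sums
# returned as a named numeric array, one element per bin.
norm_bin_sums <- function(x, breaks = seq(25, 50, by = 2)) {
  bins <- cut(x, breaks = breaks)   # intervals are left-open: (25,27], (27,29], ...
  sums <- tapply(x, bins, sum)      # NA for bins that contain no values
  sums[is.na(sums)] <- 0            # assumption: empty bins contribute nothing
  sums / sum(sums)
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
x1 <- sample(25:50, 10000, replace = TRUE)
x2 <- sample(25:50, 9500, replace = TRUE)

# Difference in percentage points, one value per bin
dbin <- (norm_bin_sums(x1) - norm_bin_sums(x2)) * 100

# One row per bin, which ggplot can use directly
dbin_df <- data.frame(
  bin_start = head(seq(25, 50, by = 2), -1),  # lower edge of each bin
  dbin      = as.numeric(dbin)
)
```

Because the result has one row per bin instead of one column per bin, it can be passed to ggplot without further reshaping.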
Output of dput(dbin) (listed further down the page, after the comments):

Edit: the final code, which uses only dbin and plots multiple dbins:

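library(dplyr)     # as_tibble, mutate, group_by, summarise, full_join, select
library(tidyr)     # separate
library(reshape2)  # melt
library(ggplot2)

dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100)))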
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100)))
dat3 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 12:37/100)))
dat4 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 37:12/100)))

calc_bin_props <- function(data) {
  as_tibble(data) %>%
    mutate(bin = cut(a, breaks = seq(25, 50, by = 2))) %>%
    group_by(bin) %>%
    summarise(sum = sum(a), .groups = "drop") %>%
    filter(!is.na(bin)) %>%
    ungroup() %>%
    mutate(sum = sum / sum(sum))
}

diff_data <-
  full_join(
    calc_bin_props(data = dat1),
    calc_bin_props(dat2),
    by = "bin") %>%
  separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
  mutate(dbinA = (sum.x - sum.y) * 100) %>%  # difference first, then scale to percent
  select(-starts_with("trsh"))

diff_data2 <-
  full_join(
    calc_bin_props(data = dat3),
    calc_bin_props(dat4),
    by = "bin") %>%
  separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
  mutate(dbinB = (sum.x - sum.y) * 100) %>%  # difference first, then scale to percent
  select(-starts_with("trsh"))

# Combine two differences, and remove sum.x and sum.y
full_data <- cbind(diff_data, diff_data2[,4])
full_data <- full_data[,-c(2:3)]

# Melt the data to plot more than 1 variable on a plot
m <- melt(full_data, id.vars="bin")

theme_update(plot.title = element_text(hjust = 0.5))
ggplot(m, aes(as.numeric(bin), value, col=variable, linetype = variable)) +
  geom_line() +
  scale_linetype_manual(values=c("solid", "longdash")) +
  scale_color_manual(values = c("black", "black"))
dev.off()
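When many pairwise comparisons are needed, the cbind() and melt() steps can become awkward. The following is a minimal sketch of an alternative, assuming the same kind of simulated data: each comparison is built in long form and stacked with bind_rows(), so every comparison becomes one colour and linetype. The helpers `bin_props` and `pair_diff` and the list names are hypothetical, not part of the original post.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: normalised bin sums for one numeric vector
bin_props <- function(x, breaks = seq(25, 50, by = 2)) {
  tibble(a = x) %>%
    mutate(bin = cut(a, breaks = breaks, labels = head(breaks, -1))) %>%
    filter(!is.na(bin)) %>%                      # drop values outside the bins
    group_by(bin) %>%
    summarise(total = sum(a), .groups = "drop") %>%
    mutate(bin  = as.numeric(as.character(bin)), # lower bin edge as a number
           prop = total / sum(total))
}

# Hypothetical helper: percentage difference between two datasets, per bin
pair_diff <- function(x, y) {
  full_join(bin_props(x), bin_props(y), by = "bin") %>%
    transmute(bin, dbin = (prop.x - prop.y) * 100)
}

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
d1 <- sample(25:50, 10000, replace = TRUE, prob = 25:0 / 100)
d2 <- sample(25:50, 9500,  replace = TRUE, prob = 0:25 / 100)
d3 <- sample(25:50, 9500,  replace = TRUE, prob = 12:37 / 100)
d4 <- sample(25:50, 9500,  replace = TRUE, prob = 37:12 / 100)

# One list element per comparison; the names become the legend entries
comparisons <- list(
  "dat1 - dat2" = pair_diff(d1, d2),
  "dat3 - dat4" = pair_diff(d3, d4)
)
long <- bind_rows(comparisons, .id = "comparison")

p <- ggplot(long, aes(bin, dbin, colour = comparison, linetype = comparison)) +
  geom_line() +
  labs(x = "Bin (lower edge)", y = "Normalised difference (%)")
```

Calling print(p) draws the plot. Adding another comparison only requires another entry in `comparisons`.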
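library(dplyr)
library(tidyr)
library(ggplot2)

Create example data as shown in the question, but pass different probabilities to the two sample() calls so that there are noticeable differences between the two sets of random data.

dat1 <- data.frame(a = replicate(1, sample(25:50, 10000, rep = TRUE, prob = 25:0/100))) %>% as_tibble()
dat2 <- data.frame(a = replicate(1, sample(25:50, 9500, rep = TRUE, prob = 0:25/100))) %>% as_tibble()

Using dplyr, we can handle this in data.frames (tibbles) without needing to switch to other data types.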

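Let's define a function that can be applied to both datasets to do the preprocessing.

We use base::cut to create a new column that pairs each value with its bin. We then group the data by bin, compute the sum for each bin, and finally divide the bin sums by the total sum.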

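calc_bin_props <- function(data) {
  as_tibble(data) %>%
    mutate(bin = cut(a, breaks = seq(25, 50, by = 2), labels = seq(25, 48, by = 2))) %>%
    group_by(bin) %>%
    summarise(sum = sum(a), .groups = "drop") %>%
    filter(!is.na(bin)) %>%
    ungroup() %>%
    mutate(sum = sum / sum(sum))
}

Now we call calc_bin_props on both datasets and join the results by bin. This gives us a data frame with the columns bin, sum.x, and sum.y. The latter two hold the bin sums derived from dat1 and dat2, respectively. In the mutate line we compute the difference between the two columns.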

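diff_data <-
  full_join(
    calc_bin_props(data = dat1),
    calc_bin_props(dat2),
    by = "bin") %>%
  mutate(dbin = sum.x - sum.y,
         bin = as.numeric(as.character(bin))) %>%
  select(-starts_with("trsh"))

Before we feed the data into ggplot, we convert it to long format using pivot_longer. This lets us instruct ggplot to plot the results for sum.x, sum.y, and dbin as separate lines.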

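diff_data %>%
  pivot_longer(-bin) %>%
  ggplot(aes(as.numeric(bin), value, color = name, linetype = name)) +
  geom_line() +
  scale_linetype_manual(values = c("longdash", "solid", "solid")) +
  scale_color_manual(values = c("black", "purple", "green"))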
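If the x-axis should span the full 25 to 50 range of the underlying data regardless of how the bins are labelled, one option is to set the limits explicitly with scale_x_continuous(). The following is a minimal sketch using a made-up data frame (`toy`) as a stand-in for the long-format data. The values are invented purely for illustration.

```r
library(ggplot2)

# Toy stand-in for the long-format data; the values are made up
toy <- data.frame(
  bin   = rep(seq(25, 47, by = 2), times = 2),   # lower edge of each bin
  value = c(seq(-1, 1, length.out = 12), seq(1, -1, length.out = 12)),
  name  = rep(c("dbinA", "dbinB"), each = 12)
)

p <- ggplot(toy, aes(bin, value, colour = name, linetype = name)) +
  geom_line() +
  # Fix the x-axis to the range of the raw data, with a tick every 5 units
  scale_x_continuous(limits = c(25, 50), breaks = seq(25, 50, by = 5))
```

Note that `limits` drops any rows whose x value falls outside the given range, so the limits should cover every bin value that is meant to be drawn.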
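Please provide the data for dat1 and dat2, or for dbin. Run dput(dbin) or, if that object is very large, dput(head(dbin)) instead, and copy and paste its output into an additional code snippet in the question. This will allow readers of your question to provide a tested solution.

I have added the output for dbin and the two replicate() lines for the dat objects.

@till This does not seem correct and does not match my original calculation. I will add this as an edit to the question right away.

I had misunderstood how dplyr::ntile works. The answer is now updated and produces exactly the same numerical results as the original code. Regarding the plot, please state more precisely in the question what you are looking for, perhaps with a sketch of the plot you want. Note that I did not multiply the dbin values by 100, because that would make their scale too large to compare them visually with the bin sum values.

This works well, and the values compare well. In the last plot of the original code example, is there a way to set the x-axis between 25 and 50, as in the first plot example? The reason is that the data lie between 25 and 50; technically they are unbounded, but I truncate them at 50. Otherwise, I intend to plot only dbin, but there will be many dbins computed from multiple data comparisons, and I would like to plot them on the same plot as lines of different colours.

The values should be exactly the same. The print method for tibbles rounds the values, whereas calculations and visualizations use the actual, unrounded values. I have updated the cut command in my answer so that it uses the values 25 to 48 as bin labels. This ensures that the x-axis covers the range you are looking for.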
structure(list(X.25.27. = -0.0729132928804117, X.27.29. = -0.119044772581772,
    X.29.31. = 0.316016473225017, X.31.33. = -0.292812782147632,
    X.33.35. = 0.0776336591308158, X.35.37. = 0.0205584754637611,
    X.37.39. = -0.300768421159599, X.39.41. = -0.403235174844081,
    X.41.43. = 0.392510458816457, X.43.45. = 0.686758883448307,
    X.45.47. = -0.25387105113263, X.47.49. = -0.0508324553382303), class = "data.frame", row.names = c(NA,
-1L))
