How to make a great R reproducible example
When discussing performance with colleagues, teaching, sending a bug report, or searching for guidance on mailing lists and on Stack Overflow, a reproducible example is often asked for and is always helpful.
What are your tips for creating an excellent example? How do you paste data structures from R in a text format? What other information should you include?
Are there other tricks in addition to using dput(), dump(), or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc.?
How does one make a great R reproducible example?
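The most important point is: just make sure you write a small piece of code that we can run to see what the problem is. A useful function for this is dput(), but if you have very large data, you might want to make a small sample dataset or use only the first 10 rows or so.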
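The two ways of shrinking a large data set mentioned above can be sketched as follows. The data frame `big` is a hypothetical stand-in for your own data, not something from the original post.

```r
set.seed(1)                                   # make the random data repeatable
big <- data.frame(id = 1:1000,                # hypothetical stand-in for large data
                  value = rnorm(1000))

first_rows  <- head(big, 10)                  # option 1: the first 10 rows
random_rows <- big[sample(nrow(big), 10), ]   # option 2: 10 randomly chosen rows

dput(first_rows)                              # paste this output into the question
```

Taking the first rows is simplest; a random sample may be preferable when the first rows are not representative of the problem.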
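Edit:
Also make sure that you have identified where the problem is yourself. The example should not be an entire R script with "On line 200 there is an error". If you use the debugging tools in R (I love browser()) and Google, you should be able to pin down where the problem is and reproduce it in a trivial example where the same thing goes wrong. Personally, I prefer "one"-liners. Something along these lines: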
my.df <- data.frame(col1 = sample(c(1,2), 10, replace = TRUE),
col2 = as.factor(sample(10)), col3 = letters[1:10],
col4 = sample(c(TRUE, FALSE), 10, replace = TRUE))
my.list <- list(list1 = my.df, list2 = my.df[3], list3 = letters)
Don't forget to mention any special packages you might be using.
If you want to demonstrate something on larger objects, you can try
my.df2 <- data.frame(a = sample(10e6), b = sample(letters, 10e6, replace = TRUE))
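If you need a spatial object as implemented in sp, you can get some datasets from external files (such as ESRI shapefiles) shipped with "spatial" packages (see the Spatial view in Task Views).
library(rgdal)
ogrDrivers()
# dsn ...
Basically, a minimal reproducible example (MRE) should enable others to exactly reproduce your issue on their machines.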
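An MRE consists of the following items:
- a minimal dataset, necessary to demonstrate the problem
- the minimal runnable code necessary to reproduce the error, which can be run on the given dataset
- all necessary information on the packages used, the R version, and the operating system it runs on
- in the case of random processes, a seed (set by set.seed()) for reproducibility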
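Put together, the four items above might look like the following sketch. The question being illustrated ("why does mean() return NA?") is a made-up example, and sessionInfo() is one common way, assumed here, of collecting the version details.

```r
# Seed, so that random draws are identical on every machine
set.seed(42)

# Minimal dataset: nine random values plus one missing value added on purpose
x <- c(rnorm(9), NA)

# Minimal code that reproduces the behaviour in question
result <- mean(x)
result

# Environment details: R version, operating system, attached packages
info <- sessionInfo()
info$R.version$version.string
```

Pasting the printed sessionInfo() output below the code gives readers the package and platform details in one go.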
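For examples of good MREs, see the "Examples" section at the bottom of the help file of the function you are using. Simply type e.g. help(mean), or the short form ?mean, into your R console.
Providing a minimal dataset
Usually, sharing huge data sets is not necessary and may discourage others from reading your question. It is therefore better to use a built-in dataset or to create a small "toy" example that resembles your original data, which is actually what is meant by minimal. If for some reason you really need to share your original data, you should use a method, such as dput(), that allows others to get an exact copy of your data.
Built-in datasets
You can use one of the built-in datasets. A comprehensive list of built-in datasets can be seen with data(). There is a short description of every data set, and more information can be obtained, e.g. with ?iris for the "iris" data set that comes with R. Installed packages might contain additional datasets.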
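As a small illustration of browsing the built-in data, the sketch below lists the data sets in the `datasets` package and takes a few rows of iris. The variable names `available` and `small` are chosen here for illustration only.

```r
# Names of the data sets shipped with the 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Inspect a built-in data set before using it in a question
str(iris)

# A handful of rows is usually enough for an example
small <- head(iris, 3)
small
```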
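Creating example data sets
Preliminary note: sometimes you may need special formats (i.e. classes), such as factors, dates, or time series. For these, use functions like as.factor, as.Date, as.xts, and so on. Some examples follow.
Vectors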
x <- rnorm(10) ## random vector normal distributed
x <- runif(10) ## random vector uniformly distributed
x <- sample(1:100, 10) ## 10 random draws out of 1, 2, ..., 100
x <- sample(LETTERS, 10) ## 10 random draws out of built-in latin alphabet
Matrices
m <- matrix(1:12, 3, 4, dimnames=list(LETTERS[1:3], LETTERS[1:4]))
m
# A B C D
# A 1 4 7 10
# B 2 5 8 11
# C 3 6 9 12
Data frames
set.seed(42) ## for sake of reproducibility
n <- 6
dat <- data.frame(id=1:n,
date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
group=rep(LETTERS[1:2], n/2),
age=sample(18:30, n, replace=TRUE),
type=factor(paste("type", 1:n)),
x=rnorm(n))
dat
# id date group age type x
# 1 1 2020-12-26 A 27 type 1 0.0356312
# 2 2 2020-12-27 B 19 type 2 1.3149588
# 3 3 2020-12-28 A 20 type 3 0.9781675
# 4 4 2020-12-29 B 26 type 4 0.8817912
# 5 5 2020-12-30 A 26 type 5 0.4822047
# 6 6 2020-12-31 B 28 type 6 0.9657529
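Subsetting data
To share a subset, use head(), subset(), or indexing such as iris[1:4, ]. Then wrap it in dput() to give others something that can be put into R immediately. For example, dput(iris[1:4, ]) prints a structure(...) call; that console output is what you share in your question.
When using dput, you may also want to include only the relevant columns, e.g. dput(mtcars[1:3, c(2, 5, 6)]).
Note: if your data frame has a factor with many levels, the dput output can be unwieldy, because it still lists all possible factor levels even if they are not present in the subset of your data. To solve this, you can use the droplevels() function, e.g. dput(droplevels(iris[1:4, ])); Species is then a factor with only one level.
One other caveat of dput is that it will not work for keyed data.table objects or for grouped tbl_df (class grouped_df) from the tidyverse. In these cases you can convert back to a regular data frame before sharing: dput(as.data.frame(my_data)).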
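To see the effect of droplevels() and to check that the shared text really rebuilds the object, something like the following sketch can be run. Capturing the dput() output with capture.output() and re-evaluating it is a choice made here to mimic what a reader does when pasting your text.

```r
sub <- iris[1:4, ]

# Without droplevels(): all three species levels are still recorded
levels(sub$Species)

# With droplevels(): only the level actually present remains
sub_clean <- droplevels(sub)
levels(sub_clean$Species)

# Capture the dput() text and rebuild the object from it, as a reader would
txt <- paste(capture.output(dput(sub_clean)), collapse = "\n")
rebuilt <- eval(parse(text = txt))
all.equal(rebuilt, sub_clean)
```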
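Producing minimal code
Combined with the minimal data (see above), your code should reproduce the problem exactly on another machine by simply copying and pasting it.
This should be the easy part, but often it isn't. What you should not do:
- show all kinds of data conversions; make sure the provided data is already in the correct format (unless that is the problem, of course)
- copy-paste a whole script that gives an error somewhere. Try to locate which lines exactly cause the error. More often than not, you will find out what the problem is yourself
What you should do:
- add the packages you use (using library())
- test-run your code in a fresh R session to make sure it is runnable. People should be able to copy-paste your data and your code into the console and get exactly the same result as you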
- if you open connections, add code to close them again
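When an example opens a connection or writes a file, cleaning up afterwards keeps it safe for others to run. A minimal sketch, assuming a temporary CSV file as the thing being opened (the file and variable names are illustrative only):

```r
# Write a small file to a temporary location (the path is generated by R)
tmp <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 3), tmp)

# Open a connection explicitly, read the header line, then close it
con <- file(tmp, open = "r")
first_line <- readLines(con, n = 1)
close(con)

# Delete the file so nothing is left behind on the reader's machine
unlink(tmp)
file.exists(tmp)
```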
## a value created with as.Date() has class "Date"
d <- as.Date("2020-12-30")
class(d)
# [1] "Date"
id date group age type x
1 1 2020-12-26 A 27 type 1 0.0356312
2 2 2020-12-27 B 19 type 2 1.3149588
3 3 2020-12-28 A 20 type 3 0.9781675
dput(iris[1:4, ]) # first four rows of the iris data set
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5,
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2,
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa",
"versicolor", "virginica"), class = "factor")), row.names = c(NA,
4L), class = "data.frame")
set.seed(42)
rnorm(3)
# [1] 1.3709584 -0.5646982 0.3631284
df <- read.table(header=TRUE,
text="Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
")
code
code
code
code
code (40 or so lines of it)
data(mtcars)
names(mtcars)
# your problem demonstrated on the mtcars data set
dput(read.table("clipboard",sep="\t",header=TRUE))
dput(read.table("clipboard",sep="",header=TRUE))
install.packages("devtools")
library(devtools)
source_url("https://raw.github.com/rsaporta/pubR/gitbranch/reproduce.R")
reproduce(myData)
# sample data
DF <- data.frame(id=rep(LETTERS, each=4)[1:100], replicate(100, sample(1001, 100)), Class=sample(c("Yes", "No"), 100, TRUE))
reproduce(DF, cols=c("id", "X1", "X73", "Class")) # I could also specify the column number.
This is what the sample looks like:
id X1 X73 Class
1 A 266 960 Yes
2 A 373 315 No Notice the selection split
3 A 573 208 No (which can be turned off)
4 A 907 850 Yes
5 B 202 46 Yes
6 B 895 969 Yes <~~~ 70 % of selection is from the top rows
7 B 940 928 No
98 Y 371 171 Yes
99 Y 733 364 Yes <~~~ 30 % of selection is from the bottom rows.
100 Y 546 641 No
==X==============================================================X==
Copy+Paste this part. (If on a Mac, it is already copied!)
==X==============================================================X==
DF <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 25L, 25L, 25L), .Label = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y"), class = "factor"), X1 = c(266L, 373L, 573L, 907L, 202L, 895L, 940L, 371L, 733L, 546L), X73 = c(960L, 315L, 208L, 850L, 46L, 969L, 928L, 171L, 364L, 641L), Class = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L), .Label = c("No", "Yes"), class = "factor")), .Names = c("id", "X1", "X73", "Class"), class = "data.frame", row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 98L, 99L, 100L))
==X==============================================================X==
==X==============================================================X==
Copy+Paste this part. (If on a Mac, it is already copied!)
==X==============================================================X==
DF <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 25L,25L, 25L), .Label
= c("A", "B", "C", "D", "E", "F", "G", "H","I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U","V", "W", "X", "Y"), class = "factor"),
X1 = c(809L, 81L, 862L,747L, 224L, 721L, 310L, 53L, 853L, 642L),
X2 = c(926L, 409L,825L, 702L, 803L, 63L, 319L, 941L, 598L, 830L),
X16 = c(447L,164L, 8L, 775L, 471L, 196L, 30L, 420L, 47L, 327L),
X22 = c(335L,164L, 503L, 407L, 662L, 139L, 111L, 721L, 340L, 178L)), .Names = c("id","X1",
"X2", "X16", "X22"), class = "data.frame", row.names = c(1L,2L, 3L, 4L, 5L, 6L, 7L, 98L, 99L, 100L))
==X==============================================================X==
d <- read.table("http://pastebin.com/raw.php?i=m1ZJuKLH")
mydata <- data.frame(a=character(0), b=numeric(0), c=numeric(0), d=numeric(0))
>fix(mydata)
install.packages("SciencesPo")
dt <- data.frame(
Z = sample(LETTERS,10),
X = sample(1:10),
Y = sample(c("yes", "no"), 10, replace = TRUE)
)
> dt
Z X Y
1 D 8 no
2 T 1 yes
3 J 7 no
4 K 6 no
5 U 2 no
6 A 10 yes
7 Y 5 no
8 M 9 yes
9 X 4 yes
10 Z 3 no
> anonymize(dt)
Z X Y
1 b2 2.5 c1
2 b6 -4.5 c2
3 b3 1.5 c1
4 b4 0.5 c1
5 b7 -3.5 c1
6 b1 4.5 c2
7 b9 -0.5 c1
8 b5 3.5 c2
9 b8 -1.5 c2
10 b10 -2.5 c1
# sample two variables without replacement
> anonymize(sample.df(dt,5,vars=c("Y","X")))
Y X
1 a1 -0.4
2 a1 0.6
3 a2 -2.4
4 a1 -1.4
5 a2 3.6
dput(droplevels(head(mydata)))
set.seed(1) # important to make random data reproducible
myData <- data.frame(a=sample(letters[1:5], 20, rep=T), b=runif(20))
cyl mean.hp
1: 6 122.28571
2: 4 82.63636
3: 8 209.21429
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/wakefield")
r_data_frame(
n = 500,
id,
race,
age,
sex,
hour,
iq,
height,
died
)
ID Race Age Sex Hour IQ Height Died
1 001 White 33 Male 00:00:00 104 74 TRUE
2 002 White 24 Male 00:00:00 78 69 FALSE
3 003 Asian 34 Female 00:00:00 113 66 TRUE
4 004 White 22 Male 00:00:00 124 73 TRUE
5 005 White 25 Female 00:00:00 95 72 TRUE
6 006 White 26 Female 00:00:00 104 69 TRUE
7 007 Black 30 Female 00:00:00 111 71 FALSE
8 008 Black 29 Female 00:00:00 100 64 TRUE
9 009 Asian 25 Male 00:30:00 106 70 FALSE
10 010 White 27 Male 00:30:00 121 68 FALSE
.. ... ... ... ... ... ... ... ...
mydf1 <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)
class(mydf1)
# this shows the type of the data you have
dim(mydf1)
# this shows the dimension of your data
# found based on the following
typeof(mydf1)      # what it is
length(mydf1)      # how many elements it contains
attributes(mydf1)  # additional arbitrary metadata
# If you cannot share your original data, you can str it and give an idea about the structure of your data
str(mydf1)
If I have a matrix x as follows:
> x <- matrix(1:8, nrow=4, ncol=2,
dimnames=list(c("A","B","C","D"), c("x","y")))
> x
x y
A 1 5
B 2 6
C 3 7
D 4 8
>
How can I turn it into a dataframe with 8 rows, and three
columns named `row`, `col`, and `value`, which have the
dimension names as the values of `row` and `col`, like this:
> x.df
row col value
1 A x 1
...
(To which the answer might be:
> x.df <- reshape(data.frame(row=rownames(x), x), direction="long",
+ varying=list(colnames(x)), times=colnames(x),
+ v.names="value", timevar="col", idvar="row")
)
#If I have a matrix x as follows:
x <- matrix(1:8, nrow=4, ncol=2,
dimnames=list(c("A","B","C","D"), c("x","y")))
x
# x y
#A 1 5
#B 2 6
#C 3 7
#D 4 8
# How can I turn it into a dataframe with 8 rows, and three
# columns named `row`, `col`, and `value`, which have the
# dimension names as the values of `row` and `col`, like this:
#x.df
# row col value
#1 A x 1
#...
#To which the answer might be:
x.df <- reshape(data.frame(row=rownames(x), x), direction="long",
varying=list(colnames(x)), times=colnames(x),
v.names="value", timevar="col", idvar="row")
library(testthat)
# code defining x and y
if (y >= 10) {
expect_equal(x, 1.23)
} else {
expect_equal(x, 3.21)
}
library(reprex)
y <- 1:4
mean(y)
reprex()