How to split data into two sets using dplyr's setdiff

I'm using dplyr to do a simple split of some data into training and test sets.

On a simple example it works fine:

library(dplyr)

a = c(1, 2, 3, 4, 5, 6, 7, 8)
b = c("A", "B", "C", "D", "E", "F", "G", "H")

df = data.frame(a, b)

train = sample_frac(df, 0.8)
test = setdiff(df, train)

> nrow(train) + nrow(test) == nrow(df)
[1] TRUE
However, when I try the same thing with the classic UCI wine dataset, I don't get the same result:

wine = read.csv("http://www.nd.edu/~mclark19/learn/data/goodwine.csv")

wine_train = sample_frac(wine, 0.8)
wine_test = setdiff(wine, wine_train)

> nrow(wine_train) + nrow(wine_test) == nrow(wine)
[1] FALSE
> nrow(wine_train) + nrow(wine_test)
[1] 6105
> nrow(wine)
[1] 6497
Is there something I'm missing about the behavior of setdiff?

Thanks,
AG

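This is probably because the data contains duplicate rows, and setdiff treats its inputs as sets, so duplicated rows are collapsed:

> any(duplicated(wine))
[1] TRUE
If you clean the dataset first: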

drunk = wine[!duplicated(wine),]
drunk_train = sample_frac(drunk, 0.8)
drunk_test = setdiff(drunk, drunk_train)
> nrow(drunk_test) + nrow(drunk_train) == nrow(drunk)
[1] TRUE
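
If dropping the duplicates is not acceptable (for example, when repeated measurements are legitimate data), one possible alternative is to tag each row with a unique id and take the test set with anti_join on that id. The sketch below is only an illustration: the toy data frame and the row_id column name are assumptions, not part of the original question.

```r
library(dplyr)

# Hypothetical toy data with one duplicated row (rows 1 and 2 are identical)
df <- data.frame(a = c(1, 1, 2, 3, 4),
                 b = c("A", "A", "B", "C", "D"))

# setdiff works on sets of rows: four rows of df differ from row 5,
# but the duplicated row is collapsed, so fewer rows come back
collapsed <- setdiff(df, df[5, ])
nrow(collapsed)

set.seed(42)  # make the random sample reproducible

# Give every row a unique id so duplicated rows stay distinguishable
# (row_id is an illustrative column name)
df_id <- df %>% mutate(row_id = row_number())

train <- df_id %>% sample_frac(0.8)
# Keep exactly the rows whose id was not sampled into train
test <- df_id %>% anti_join(train, by = "row_id")

nrow(train) + nrow(test) == nrow(df)
```

Because the split is driven by row_id rather than by row contents, the two parts always add up to the original row count, whether or not the data contains duplicates. The row_id column can be dropped afterwards with select(-row_id).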

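Comments:

I agree, that's a sneaky error to track down!

Ah, thanks so much. I didn't think to check for duplicates. Thanks!

Sadly, I can attest from personal experience that the Colonel's answer is correct. It took me hours to find it!

I appreciate the new df name, connecting the data back to the real world, haha.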