R: aggregate based on "near" row values

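I have a very messy data frame (web-scraped) that unfortunately contains many duplicate and even triplicate entries. Most of the data frame looks like this: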

> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
  var1    var2      var3
1    a right.a correct.a
2    a    <NA> correct.a
3    b right.b correct.b
4    b    <NA> correct.b
5    c right.c correct.c
6    c    <NA> correct.c
7    d right.d correct.d
8    d    <NA> correct.d

The result I am after looks like this:

  var1    var2      var3
1    a right.a correct.a
2    b right.b correct.b
3    c right.c correct.c
4    d right.d correct.d

The main problem, however, is that not the whole data frame looks like this. In fact, I also have other sections like the following:

> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
  var1    var2      var3
1    e    <NA> correct.e
2    e    <NA> correct.e
3    e right.e      <NA>
4    f    <NA> correct.f
5    f    <NA>      <NA>
6    g    <NA> correct.g
7    g right.g   wrong.g
8    g    <NA>      <NA>
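
and other variants. In the end, each ID should have a single row containing the correct var2 and var3. This is where I am lost: my var1 is not unique. However, I do know that duplicate IDs that "belong" together are grouped together in the data frame (as in my example); e.g. there may be another "a" in rows 4102 and 4103.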

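I think the way to go is to use aggregate with var1 as the ID, but additionally tell R that the aggregation should only look within +2 rows of var1. Any idea how to code this?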


Thanks!

If var2 and var3 have only one unique value for each level of var1, then:

library(dplyr)

df = rbind(df1,df2)

# For each var1 group, keep the first non-missing value of every other column
df %>% group_by(var1) %>%
  summarise_all(~ .[!is.na(.)][1])
  var1    var2      var3
1    a right.a correct.a
2    b right.b correct.b
3    c right.c correct.c
4    d right.d correct.d
5    e right.e correct.e
6    f    <NA> correct.f
7    g right.g correct.g
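
Since the question mentions aggregate(), the same first-non-missing rule can also be sketched in base R. This is an illustration rather than part of the original answer: the helper name first_non_na is made up here, and stringsAsFactors = FALSE is an assumption added so that the columns stay character.

```r
# Data as defined in the question (stringsAsFactors = FALSE is an added assumption)
df1 <- data.frame(var1 = c("a","a","b","b","c","c","d","d"),
                  var2 = c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),
                  var3 = c("correct.a","correct.a","correct.b","correct.b",
                           "correct.c","correct.c","correct.d","correct.d"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(var1 = c("e","e","e","f","f","g","g","g"),
                  var2 = c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),
                  var3 = c("correct.e","correct.e",NA,"correct.f",NA,
                           "correct.g","correct.g",NA),
                  stringsAsFactors = FALSE)
df <- rbind(df1, df2)

# Hypothetical helper: first non-missing element, or NA if there is none
first_non_na <- function(v) v[!is.na(v)][1]

# The non-formula interface keeps rows whose var2/var3 are NA;
# only rows with a missing grouping value would be dropped
res <- aggregate(df[c("var2", "var3")],
                 by = list(var1 = df$var1),
                 FUN = first_non_na)
res
```

The formula interface of aggregate() is avoided on purpose, because its default na.action removes every row that contains an NA before the groups are formed.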

Here is an approach using data.table:

library(data.table)

setDT(df1)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

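The idea in, for example, var2[!is.na(var2)][1] is to take the first non-missing value of var2. If all values are missing, NA is returned. This is done for both variables within each var1 group.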
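
The behaviour of this idiom, including the all-missing case, can be checked on small vectors. The sketch below uses made-up vectors x and y and a helper name first_non_na, none of which come from the original post. It also shows that a bare NA is logical while NA_character_ is a character missing value, which matters when the results of different groups have to fit into one column.

```r
# Illustrative vectors (hypothetical, not part of the original data)
x <- c(NA, "right.e", NA)              # one non-missing value
y <- c(NA_character_, NA_character_)   # everything missing

# Hypothetical helper wrapping the idiom from the answer
first_non_na <- function(v) v[!is.na(v)][1]

first_non_na(x)          # "right.e"
first_non_na(y)          # NA (indexing past the end of an empty vector)

class(NA)                # "logical"
class(NA_character_)     # "character"
class(first_non_na(y))   # "character": the NA keeps the type of the column
```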

If there are more than two variables, you can switch to lapply, as in the following example:

df1[, lapply(.SD, function(i) i[!is.na(i)][1]), by=var1]
   var1    var2      var3
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d
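
In the case where var1 occurs in several places, each with one valid value, and that value is indicated by a non-missing var2, you can reach the desired result with a join.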

Data from the comments:

df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),
                var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
                var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))

Here, all non-missing var2 observations, by var1, are joined onto the original data set.
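
A minimal sketch of such a join, using the comment data. The object names dt, keys and res are illustrative, the join is split into two steps for readability, and the exact line the answerer ran may differ from this.

```r
library(data.table)

# Comment data: the ID "a" appears in two separate blocks
dt <- data.table(
  var1 = c("a","a","b","b","c","c","d","d","a","a"),
  var2 = c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
  var3 = c("correct.a","correct.a","correct.b","correct.b","correct.c",
           "correct.c","correct.d","correct.d","correct.a1","correct.a1")
)

# One key row per non-missing var2 value
keys <- dt[!is.na(var2), .(var1, var2)]

# For every key, pull the matching row of the original table
res <- dt[keys, on = c("var1", "var2")]
res
```

Passing the join columns as a character vector in on= is chosen here because that form is also accepted by older data.table releases.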

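Unfortunately, this carries "right.g" over into the f row (because my data set has no "right.f").

It turns out the problem arises when all values in a group are NA (as with var2 here where var1 == "f"). I have fixed it by using NA_character_ instead of NA.

Updated with @Imo's shorter code, which handles groups that are all NA.

Thank you very much, this looks promising. Any ideas on how to handle the duplicates within the data frame (i.e. observations with the same ID that are grouped at random positions in the data frame)? What if df1 looked like this:
df1

Please see the additional text at the end of the answer.

Your last line of code (the one with the join) throws this error:
Error in `[.data.table`(setDT(df1), df1[, .(var2 = var2[!is.na(var2)]), : 'on' argument should be a named atomic vector oc column names indicating which columns in 'i' should be joined with which columns in 'x'.

I just retried in a fresh R session and it worked. You may have to update the version of data.table you are using; I am on 1.10.4. If that is not an option, you could try replacing on=.(var1,var2) with on=c("var1","var2").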
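
For IDs that reappear in a later, separate block, as asked in the comments, one further option is to group on runs of consecutive identical var1 values instead of on var1 alone. This is a sketch that does not come from the answers above: it relies on data.table's rleid(), and the column name run and the object names dt and res are made up for the illustration.

```r
library(data.table)

# Comment data: the ID "a" appears in rows 1-2 and again in rows 9-10
dt <- data.table(
  var1 = c("a","a","b","b","c","c","d","d","a","a"),
  var2 = c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
  var3 = c("correct.a","correct.a","correct.b","correct.b","correct.c",
           "correct.c","correct.d","correct.d","correct.a1","correct.a1")
)

# rleid() gives every run of consecutive identical var1 values its own id,
# so the two "a" blocks end up in different groups
dt[, run := rleid(var1)]

# First non-missing value of each remaining column, per run
res <- dt[, lapply(.SD, function(v) v[!is.na(v)][1]), by = .(run, var1)]
res
```

This only separates blocks that are not adjacent; two unrelated entries with the same ID in directly neighbouring rows would still be merged into one run.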