R 根据规范标记重复(非唯一)值
R 根据规范标记重复(非唯一)值,r,data.table,R,Data.table,在此处输入代码我正在处理一个数据集“Final.Export”,看起来如下所示: LakeID LakeName SourceVariableName SourceVariableDescription SourceFlags 47 390 Moosehead Acolor(PCU) Apparent color <NA> 48 390 Moosehead Acolor(PCU
在此处输入代码
我正在处理一个数据集“Final.Export”,看起来如下所示:
LakeID LakeName SourceVariableName SourceVariableDescription SourceFlags
47 390 Moosehead Acolor(PCU) Apparent color <NA>
48 390 Moosehead Acolor(PCU) Apparent color <NA>
49 390 Moosehead Acolor(PCU) Apparent color <NA>
50 390 Moosehead Acolor(PCU) Apparent color <NA>
51 390 Moosehead Acolor(PCU) Apparent color <NA>
52 390 Moosehead Acolor(PCU) Apparent color <NA>
53 390 Moosehead Acolor(PCU) Apparent color <NA>
54 390 Moosehead Acolor(PCU) Apparent color <NA>
55 390 Moosehead Acolor(PCU) Apparent color <NA>
56 390 Moosehead Acolor(PCU) Apparent color <NA>
LagosVariableID LagosVariableName Value Units CensorCode DetectionLimit Date
47 11 Color, apparent 22 PCU NC NA 2003-08-26
48 11 Color, apparent 17 PCU NC NA 2003-08-26
49 11 Color, apparent 16 PCU NC NA 2003-08-26
50 11 Color, apparent 14 PCU NC NA 2003-08-26
51 11 Color, apparent 14 PCU NC NA 2003-08-26
52 11 Color, apparent 17 PCU NC NA 2003-08-26
53 11 Color, apparent 16 PCU NC NA 2003-08-26
54 11 Color, apparent 17 PCU NC NA 2003-08-26
55 11 Color, apparent 14 PCU NC NA 2003-08-26
56 11 Color, apparent 17 PCU NC NA 2003-08-26
LabMethodName LabMethodInfo SampleType SamplePosition SampleDepth MethodInfo
47 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
48 <NA> <NA> INTEGRATED SPECIFIED 7 <NA>
49 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
50 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
51 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
52 <NA> <NA> INTEGRATED SPECIFIED 9 <NA>
53 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
54 <NA> <NA> INTEGRATED SPECIFIED 8 <NA>
55 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
56 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
BasinType Subprogram Comments Dup
47 UNKNOWN NA NA NA
48 UNKNOWN NA NA NA
49 UNKNOWN NA NA NA
50 UNKNOWN NA NA NA
51 UNKNOWN NA NA NA
52 UNKNOWN NA NA NA
53 UNKNOWN NA NA NA
54 UNKNOWN NA NA NA
55 UNKNOWN NA NA NA
56 UNKNOWN NA NA NA
data1[,dup:=duplicated(.SD),
by=list(LakeID, LagosVariableID, Value, Date, SamplePosition, SampleDepth)]
“data1”的问题是,只有在第一个唯一行(标记为NA)之后的重复行(根据我对重复的定义)被标记为“1”。我需要将唯一行和关联的重复行标记为“1”。如何做到这一点
如果这令人困惑,请告诉我如何澄清 没有一个可复制的例子很难说,但你似乎想要这样的东西:
LakeID LakeName SourceVariableName SourceVariableDescription SourceFlags
47 390 Moosehead Acolor(PCU) Apparent color <NA>
48 390 Moosehead Acolor(PCU) Apparent color <NA>
49 390 Moosehead Acolor(PCU) Apparent color <NA>
50 390 Moosehead Acolor(PCU) Apparent color <NA>
51 390 Moosehead Acolor(PCU) Apparent color <NA>
52 390 Moosehead Acolor(PCU) Apparent color <NA>
53 390 Moosehead Acolor(PCU) Apparent color <NA>
54 390 Moosehead Acolor(PCU) Apparent color <NA>
55 390 Moosehead Acolor(PCU) Apparent color <NA>
56 390 Moosehead Acolor(PCU) Apparent color <NA>
LagosVariableID LagosVariableName Value Units CensorCode DetectionLimit Date
47 11 Color, apparent 22 PCU NC NA 2003-08-26
48 11 Color, apparent 17 PCU NC NA 2003-08-26
49 11 Color, apparent 16 PCU NC NA 2003-08-26
50 11 Color, apparent 14 PCU NC NA 2003-08-26
51 11 Color, apparent 14 PCU NC NA 2003-08-26
52 11 Color, apparent 17 PCU NC NA 2003-08-26
53 11 Color, apparent 16 PCU NC NA 2003-08-26
54 11 Color, apparent 17 PCU NC NA 2003-08-26
55 11 Color, apparent 14 PCU NC NA 2003-08-26
56 11 Color, apparent 17 PCU NC NA 2003-08-26
LabMethodName LabMethodInfo SampleType SamplePosition SampleDepth MethodInfo
47 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
48 <NA> <NA> INTEGRATED SPECIFIED 7 <NA>
49 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
50 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
51 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
52 <NA> <NA> INTEGRATED SPECIFIED 9 <NA>
53 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
54 <NA> <NA> INTEGRATED SPECIFIED 8 <NA>
55 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
56 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
BasinType Subprogram Comments Dup
47 UNKNOWN NA NA NA
48 UNKNOWN NA NA NA
49 UNKNOWN NA NA NA
50 UNKNOWN NA NA NA
51 UNKNOWN NA NA NA
52 UNKNOWN NA NA NA
53 UNKNOWN NA NA NA
54 UNKNOWN NA NA NA
55 UNKNOWN NA NA NA
56 UNKNOWN NA NA NA
data1[,dup:=duplicated(.SD),
by=list(LakeID, LagosVariableID, Value, Date, SamplePosition, SampleDepth)]
编辑:
OP澄清后,他们似乎只是想要:
data1[,dup:=duplicated(.SD),
.SDcols=c('LakeID', 'Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition')]
这将有助于给出一些重现问题的输入和输出示例。
Final.Export
的示例,Final.Export1
的示例显示您试图避免的NA
值,同时也显示您想要的Final.Export1
外观。从OP中很难分辨,但看起来他们希望通过键查找重复项,在这种情况下,这可以简化为DT[,dup:=duplicated(DT)]
(不需要by
)--同样,从版本1.8.11
duplicated.data.table
有自己的by
参数