R: most efficient way to remove duplicate rows from a data frame

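I have a very large data frame: over 6 million rows and 28 variables of any type (num, factors, characters). I need to remove duplicate rows. However, the only way to identify actual duplicates is to check one large character variable (roughly 1000 to 2000 characters per observation). I can use the standard duplicated() function without problems, but I am not sure it is the most time-efficient solution.

Is there any function or package that does this efficiently? Thanks in advance for your advice.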

structure(list(city = c("New York", "New York", "New York", "Brussels", 
"London", "Arlington"), prodCategory = structure(c(1L, 1L, 1L, 
1L, 1L, 1L), .Label = "4", class = "factor"), date = structure(c(16351, 
16352, 16351, 16353, 16354, 16355), class = "Date"), userID = c("ABCD", 
"XYZZ", "ABCD", "ABCD", "SDFG", "WEDGD"), review = c("in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat", 
"this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.", 
"in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat", 
"Each time I go to Brussels I stop by this typical brasserie located in the historical heart of Brussels downtown at a walking distance from almost every interesting place. Food is great and the menu is really rich and diversified service is sharp and fast and pricing very reasonable. Dont miss the typical chocolate cake. Actually I should write dont miss... everything included the rich list of Belgian beers", 
"That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop", 
"Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat"
)), .Names = c("city", "prodCategory", "date", "userID", "review"
), row.names = c(NA, -6L), class = "data.frame")
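
For reference, the baseline mentioned in the question can be sketched as follows. This is only an illustration: `toy` and its short review strings are made-up stand-ins for the real data, and the sketch assumes duplicates are defined by the review column alone. Passing just that column to duplicated() avoids comparing the other columns of every row.

```r
# Toy stand-in for the real data; names and strings are hypothetical.
toy <- data.frame(
  userID = c("ABCD", "XYZZ", "ABCD", "SDFG"),
  review = c("great pastrami", "top seafood", "great pastrami", "good pub food"),
  stringsAsFactors = FALSE
)

# duplicated() on the single column flags the second and later
# occurrences of each review text.
is_dup <- duplicated(toy$review)

# Keep only the first occurrence of each review.
deduped <- toy[!is_dup, ]
print(deduped)
```

Whether this is fast enough on 6 million long strings would have to be measured on the real data, for example with system.time().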
Try

library(data.table)
setkey(setDT(df), review)

res

Another approach, though not necessarily more efficient, is to count the data:

df <- structure(list(city = c("New York", "New York", "New York", "Brussels", 
"London", "Arlington"), prodCategory = structure(c(1L, 1L, 1L, 
1L, 1L, 1L), .Label = "4", class = "factor"), date = structure(c(16351, 
16352, 16351, 16353, 16354, 16355), class = "Date"), userID = c("ABCD", 
"XYZZ", "ABCD", "ABCD", "SDFG", "WEDGD"), review = c("in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat", 
"this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.", 
"in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat", 
"Each time I go to Brussels I stop by this typical brasserie located in the historical heart of Brussels downtown at a walking distance from almost every interesting place. Food is great and the menu is really rich and diversified service is sharp and fast and pricing very reasonable. Dont miss the typical chocolate cake. Actually I should write dont miss... everything included the rich list of Belgian beers", 
"That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop", 
"Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat"
)), .Names = c("city", "prodCategory", "date", "userID", "review"
), row.names = c(NA, -6L), class = "data.frame")

# do the count
df[with(df, ave(paste(prodCategory, city), userID, FUN=function(x) length(unique(x))))==1,]


       city prodCategory       date userID
2  New York            4 2014-10-09   XYZZ
5    London            4 2014-10-11   SDFG
6 Arlington            4 2014-10-12  WEDGD
                                                                                                                                                                                                          review
2            this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.
5             That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop
6 Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat
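
To make the count expression above easier to follow, here is a small sketch of what ave() computes per row. The `toy` frame is a hypothetical, shortened version of the example data. Note that the filter keeps users with exactly one distinct category/city combination, which is why every row of user ABCD is dropped rather than only the repeated one.

```r
# Hypothetical reduced version of the example data (no review column).
toy <- data.frame(
  city = c("New York", "New York", "New York", "Brussels", "London"),
  prodCategory = c("4", "4", "4", "4", "4"),
  userID = c("ABCD", "XYZZ", "ABCD", "ABCD", "SDFG"),
  stringsAsFactors = FALSE
)

# For every row, count the distinct category/city combinations of that
# row's user. ave() returns a character vector here because its input
# is character, so convert the counts back to integer.
n_combos <- with(toy, ave(paste(prodCategory, city), userID,
                          FUN = function(x) length(unique(x))))
n_combos <- as.integer(n_combos)

# Keep only the rows of users that have a single combination.
kept <- toy[n_combos == 1L, ]
print(kept)
```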

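You could use the by option of unique in data.table. Please consider posting a subset of the dataset (4 columns, 10 rows) that contains duplicates.

Or distinct with group_by from dplyr (haven't benchmarked it, though).

@Sal, are you looking for duplicates in city or in the whole row?

Thanks. I just tried dplyr's distinct on a sample and it works well and fast. I am not very fluent in data.table, but I will give it a try and follow up in the comments.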
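
The two suggestions in the comments can be sketched like this. It is a minimal illustration, not a benchmark: `toy` is a made-up stand-in for the real data, and it assumes duplicates are defined by the review column only. Both unique() with by in data.table and distinct() with .keep_all in dplyr keep the first row of each distinct review.

```r
library(data.table)
library(dplyr)

# Toy stand-in for the real data; names and strings are hypothetical.
toy <- data.frame(
  userID = c("ABCD", "XYZZ", "ABCD", "SDFG"),
  review = c("great pastrami", "top seafood", "great pastrami", "good pub food"),
  stringsAsFactors = FALSE
)

# data.table: unique() with `by` compares only the named column(s).
dt <- as.data.table(toy)
res_dt <- unique(dt, by = "review")

# dplyr: distinct() on one column; .keep_all = TRUE retains the other columns.
res_dplyr <- distinct(toy, review, .keep_all = TRUE)

print(res_dt)
print(res_dplyr)
```

On the full 6-million-row frame, timing both calls with system.time() would show which one is faster for this particular data.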