Merge multiple CSV files and remove duplicates in R

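I have almost 3000 CSV files in the same format (containing tweets), and I want to merge them into one new file and remove the duplicate tweets. I have come across various threads discussing similar problems, but the number of files involved is usually small. I hope you can help me write R code that does this both efficiently and effectively.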

The CSV files have the following format:

[Image of the CSV format]

In columns 2 and 3 I have changed the Twitter usernames to A-E and the "actual names" to A1-E1.

Raw text file:

"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 @A (A1):  Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 @C (C1):  Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 @D (D1):  LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 @E (E1):  Ik kijk Bureau sport op Nederland 3. #bureausport  #kijkes";"E (E1)";"2012-06-05 00:00:27"
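Somehow my headers are messed up; they should obviously be shifted one column to the right. Each CSV file contains up to 1500 tweets. I would like to remove the duplicates by checking the second column (which contains the tweets), because the tweets should be unique, whereas the author column can contain repeated values (e.g. one author posting several tweets).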

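Is it possible to combine merging the files and removing the duplicates, or is that asking for trouble and should the two steps be kept separate? As a starting point I have included two links, both to Hayward Godwin's blog, which discusses three approaches to merging CSV files.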

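There are obviously also some threads on this site related to my question, but I have not found any that discuss both merging and removing duplicates. I really hope you can help me, with my limited knowledge of R, to tackle this challenge.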

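Although I have tried some code I found on the web, it did not actually produce an output file. The roughly 3000 CSV files have the format shown above. I tried the following code (for the merging part):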

filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))              
Error in file(file, "rt") : cannot open the connection 
In addition: Warning message: 
In file(file, "rt") : 
  cannot open file '..': No such file or directory 
Update

I tried the following code:

 # grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',     col.names=c('ID','tweet','author','local.time'), colClasses=rep('character', 4)) }
 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))
 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet),]
After the fourth line I get:

  Error in read.table(file = file, header = header, sep = sep, quote = quote,  :  more columns than column names
  Error: object 'my.df' not found
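I suspect these errors are caused by failures in the process that wrote the CSV files, because in some cases the author/local.time columns are in the wrong place: they sit to the left or right of where they should be, which results in an extra column. I corrected 5 files by hand and tested the code on those; no errors were reported. However, nothing seemed to happen. I did not get any output from R.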

To deal with the extra column, I adjusted the code slightly:

 #grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',   col.names=c('ID','tweet','author','local.time','extra'), colClasses=rep('character', 5)) }
 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))
 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet),]

What am I doing wrong?

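First, simplify matters by working in the folder the files are in, and try setting the pattern so that only files ending in ".csv" are read, for example: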

filenames <- list.files(path = ".", pattern='^.*\\.csv$')
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
my.new.df <- my.df[!duplicated(my.df$tweet),]
Note: changed so that all columns are read as character and the separator is ";".
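As a minimal sketch of what such a reader could look like, assuming the four-field layout of the sample rows in the question: the temporary folder, the two sample files and the helper name read_tweets are hypothetical and exist only to make the example self-contained.

```r
# Create a throwaway folder with two small sample files (hypothetical data
# copied from the rows shown in the question).
csv_dir <- tempfile("tweets_")
dir.create(csv_dir)
header <- '"tweet";"author";"local.time"'
writeLines(c(header,
             '"1";"2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"',
             '"2";"2012-06-05 00:00:27 @E (E1):  Ik kijk Bureau sport op Nederland 3. #bureausport  #kijkes";"E (E1)";"2012-06-05 00:00:27"'),
           file.path(csv_dir, "file1.csv"))
writeLines(c(header,
             '"1";"2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"'),
           file.path(csv_dir, "file2.csv"))

# Hypothetical helper: skip the three-name header line, split on ';',
# supply four column names ourselves and keep every column as character.
read_tweets <- function(path) {
  read.csv(path, header = FALSE, skip = 1, sep = ";",
           col.names = c("ID", "tweet", "author", "local.time"),
           colClasses = "character")
}

# full.names = TRUE returns paths that can be opened from any working directory.
filenames <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
my.df <- do.call("rbind", lapply(filenames, read_tweets))
str(my.df)
```

Reading everything as character avoids type guessing that could differ from file to file, which would otherwise complicate the rbind step.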

I would parse out the time later, if needed.
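Should the time be needed later, one possible way to parse a character local.time column is sketched below. The small data frame is made up for illustration, and the time zone is an assumption, since the question does not say which zone the timestamps are in.

```r
# Hypothetical two-row data frame with local.time stored as character.
tweets <- data.frame(
  author = c("A (A1)", "E (E1)"),
  local.time = c("2012-06-05 00:01:45", "2012-06-05 00:00:27"),
  stringsAsFactors = FALSE
)

# Convert to date-time; tz = "UTC" is an assumption, not known from the data.
tweets$local.time <- as.POSIXct(tweets$local.time,
                                format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Once parsed, the rows can be sorted and time differences computed.
tweets <- tweets[order(tweets$local.time), ]
gap.secs <- as.numeric(difftime(max(tweets$local.time),
                                min(tweets$local.time), units = "secs"))
```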

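Another, separate issue is the uniqueness of the tweets in the data.frame, but it is not clear to me whether you want them to be unique per user or globally unique. For globally unique tweets, something like: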

filenames <- list.files(path = ".", pattern='^.*\\.csv$')
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
my.new.df <- my.df[!duplicated(my.df$tweet),]
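The difference between the two notions of uniqueness can be sketched on a small made-up data frame (the rows and the output file name below are hypothetical). The sketch also writes the result to disk: an assignment such as my.new.df <- ... prints nothing, and no file appears until a write function is called, which may be why it looked as if nothing happened.

```r
# Hypothetical merged data: the same text posted twice by A and once each by B and C.
my.df <- data.frame(
  ID = c("1", "2", "1", "2"),
  tweet = c("first tweet", "second tweet", "first tweet", "second tweet"),
  author = c("A (A1)", "B (B1)", "A (A1)", "C (C1)"),
  stringsAsFactors = FALSE
)

# Globally unique: keep the first occurrence of each tweet text.
global.unique <- my.df[!duplicated(my.df$tweet), ]

# Unique per user: a row is a duplicate only if author AND tweet both repeat.
per.user.unique <- my.df[!duplicated(my.df[, c("author", "tweet")]), ]

# Write the de-duplicated result to a ';'-separated file.
out_file <- file.path(tempdir(), "merged_tweets.csv")
write.csv2(global.unique, out_file, row.names = FALSE)
```

In the question's data the tweet column already embeds the timestamp and the author, so the two variants would probably give the same result there; the distinction matters mainly when only the bare text is compared.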

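Show some of the code you are using. It may be that you are passing the wrong header argument to read.csv().

Your question is clear enough, but it is not clear what you have done so far and why it does not work. Show the read.csv() call you use to read the files; then we can comment on what you are doing wrong.

I have edited my question; I hope this is what you were asking for.

Does filename contain the correct list of files to import? The code is clearly failing on a read.csv statement. You may need to change list.files() so that it returns full paths. What is your working directory?

The working directory is the folder that contains all the CSV files, so list.files() should "load" the CSV files I am looking for. As for the filenames part, that is given by the files that list.files loads, isn't it?

Thnx Sean, will give it a try tomorrow! The folder contains only .csv files, so the pattern part seems unnecessary.

I had some spare time, so I decided to test your suggestion, Sean. After trying the first part of the code I got the following error: Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names

Hi there, can you post the first few lines of one of the CSV files (assuming that is OK) and say whether they all have the same format?

Tim, I have edited my question and included an image as an example of my CSV files. I chose an image because a plain copy-paste ruined the layout of the question. All CSV files have the same format, and each contains at most 1500 tweets.

It looks like your CSV files do not line up into columns properly. Can you check that?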
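Two of the suggestions in these comments, returning full paths from list.files() and checking whether every file really has the expected columns, can be sketched as follows. The folder and both sample files are hypothetical, and the expected count of four fields per data line is an assumption based on the sample rows in the question.

```r
# Throwaway folder with one well-formed and one malformed file (hypothetical data).
csv_dir <- tempfile("tweets_check_")
dir.create(csv_dir)
good <- c('"tweet";"author";"local.time"',
          '"1";"some tweet";"A (A1)";"2012-06-05 00:01:45"')
bad <- c('"tweet";"author";"local.time"',
         '"1";"some tweet";"";"A (A1)";"2012-06-05 00:01:45"')
writeLines(good, file.path(csv_dir, "good.csv"))
writeLines(bad, file.path(csv_dir, "bad.csv"))

# full.names = TRUE makes the result independent of the working directory.
filenames <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# For each file, test whether every data line has exactly 4 ';'-separated fields.
# comment.char = "" matters here because tweets contain '#'.
is.ok <- vapply(filenames, function(path) {
  n <- count.fields(path, sep = ";", quote = "\"", skip = 1, comment.char = "")
  all(n == 4, na.rm = TRUE)
}, logical(1))

# Files that need a closer look before merging.
suspect <- basename(filenames[!is.ok])
print(suspect)
```

Running a check like this over all files first gives a list of the ones to repair or read with an extra column, instead of discovering them one error at a time during the merge.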