R: merge several csv files with different numbers of columns into one
I have loaded 20 csv files with the following code:
# list.files() expects a regular expression, not a shell glob
tbl = list.files(pattern="\\.csv$")
for (i in seq_along(tbl)) assign(tbl[i], read.csv(tbl[i]))
The list of files looks like this:
> head(tbl)
[1] "F1.csv" "F10_noS3.csv" "F11.csv" "F12.csv" "F12_noS7_S8.csv"
[6] "F13.csv"
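I need to merge all of these files into one. Let's call it the master file. To start, though, I want to build a single table containing all the names.
Every one of these csv files has a column named "Accession". I want a table of all the names from all of these csv files. Of course, many accessions are repeated across different csv files. I want to keep all the data that belongs to each accession.
Some problems:
Some names are identical, and I don't want duplicates.
Some names are almost identical. The only difference is that the name is followed by a dot and a number.
The csv files can have different numbers of columns.
Here is what the data looks like (the arrows mark two names that differ only in the number after the dot):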
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
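I have been trying for almost two weeks and have not managed it, so please help me.
Your question seems to contain several sub-questions. I encourage you to split them up.
Obviously, you first need to combine the data frames that have different columns. You can use rbind.fill from the plyr package.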
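To make the fill-and-bind idea concrete without installing plyr, here is a minimal base-R sketch. It is not how rbind.fill is implemented; bind_fill is a hypothetical helper, and the toy files and their Score, Coverage and Peptides columns are made up for illustration. It also reads the files into one list (list_of_data) rather than into separate variables.

```r
# Hypothetical toy data standing in for the csv files; column names are made up.
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
write.csv(data.frame(Accession = c("AT3G26450.1", "AT5G44520.2"), Score = c(10, 20)),
          file.path(tmp_dir, "F1.csv"), row.names = FALSE)
write.csv(data.frame(Accession = c("AT3G26450.2", "AT4G24770.1"),
                     Coverage = c(0.5, 0.7), Peptides = c(3L, 4L)),
          file.path(tmp_dir, "F2.csv"), row.names = FALSE)

# Read every csv file into one list instead of creating one variable per file.
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
list_of_data <- lapply(files, read.csv, stringsAsFactors = FALSE)

# bind_fill is a hypothetical helper, not part of any package.
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    # add every column this data frame lacks, filled with NA
    for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
    df[all_cols]  # same column order in every data frame
  })
  do.call(rbind, padded)
}

all_data <- bind_fill(list_of_data)
print(all_data)
```

Each row keeps the values from its own file, and the columns that file did not have are NA.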
Here is an example using some tidyverse functions and a custom function. It merges multiple csv files with missing columns into a single data frame:
library(tidyverse)
# specify the target directory (it must already exist; note the trailing slash)
dir_path <- '~/test_dir/'
# specify the naming format of the files.
# in this case csv files that begin with 'test' and a single digit, but it could be as simple as 'csv'
re_file <- '^test[0-9]\\.csv'
# create sample data with some missing columns
df_mtcars <- mtcars %>% rownames_to_column('car_name')
write.csv(df_mtcars %>% select(-am), paste0(dir_path, 'test1.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-wt, -gear), paste0(dir_path, 'test2.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-cyl), paste0(dir_path, 'test3.csv'), row.names = FALSE)
# custom function that takes the target directory and a single file name as arguments
read_dir <- function(dir_path, file_name){
  x <- read_csv(paste0(dir_path, file_name)) %>%
    mutate(file_name = file_name) %>% # add the file name as a column
    select(file_name, everything()) # reorder the columns so file name is first
  return(x)
}
# read the files from the target directory that match the naming format and combine into one data frame
df_panel <-
  list.files(dir_path, pattern = re_file) %>%
  map_df(~ read_dir(dir_path, .))
# files with missing columns are filled with NAs.
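rbind.fill(list_of_data) will be faster, and dplyr::rbind_all(list_of_data) will be faster still.
Both work well and are fast enough. Any idea how to remove the duplicates and the names that are identical except for the number after the dot? Thanks.
Any advice on what I did wrong to deserve the downvote?
Very useful answer, and it works! Thank you very much.
If you try dplyr::rbind_all(list_of_data) and the list elements do not all have the same length, the R session aborts.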
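A note for anyone trying the dplyr route from the comments: rbind_all has since been deprecated in dplyr in favour of bind_rows. The sketch below assumes dplyr is installed. The two data frames are hypothetical stand-ins for the csv files, and the Score and Coverage columns are made up.

```r
library(dplyr)

# Hypothetical stand-ins for two of the csv files; only Accession comes from the question.
list_of_data <- list(
  F1.csv = data.frame(Accession = c("AT3G26450.1", "AT5G44520.2"),
                      Score = c(10, 20), stringsAsFactors = FALSE),
  F2.csv = data.frame(Accession = "AT3G26450.2",
                      Coverage = 0.5, stringsAsFactors = FALSE)
)

# Columns missing from a data frame are filled with NA;
# .id stores the list names (here the file names) in a new column.
all_data <- bind_rows(list_of_data, .id = "file_name")
print(all_data)
```

Keeping the file name as a column makes it possible to trace each accession back to its source file after merging.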
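# Code from the first answer; list_of_data is the list of data frames read from the csv files.
# A plain rbind fails because the data frames have different numbers of columns:
all_data = do.call(rbind, list_of_data)
# Error in rbind(deparse.level, ...) :
#   The number of columns is not correct.
# rbind.fill from plyr fills the missing columns with NA instead:
library(plyr)
all_data = do.call(rbind.fill, list_of_data)
# Keep only the part of each accession before the dot, then drop repeated accessions:
library(stringr)
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))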
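The de-duplication step can be tried on a few accessions from the question with base R only. This sketch swaps str_extract for sub(), which removes a trailing dot-and-digits suffix. The Value column is hypothetical and only shows which rows survive.

```r
# A few accessions from the question, plus one exact repeat.
accessions <- c("AT3G26450.1", "AT5G44520.2", "AT4G24770.1",
                "AT3G26450.2", "AT3G26450.1")
all_data <- data.frame(Accession = accessions,
                       Value = seq_along(accessions),  # hypothetical payload column
                       stringsAsFactors = FALSE)

# Remove the ".<digits>" suffix so AT3G26450.1 and AT3G26450.2 share one key.
all_data$CleanedAccession <- sub("\\.[0-9]+$", "", all_data$Accession)

# Keep the first row seen for each cleaned accession.
deduped <- all_data[!duplicated(all_data$CleanedAccession), ]
print(deduped)

# Variant: drop only exact repeats and keep the different version numbers.
exact_only <- all_data[!duplicated(all_data$Accession), ]
```

Note that duplicated() keeps the first occurrence, so data from later rows with the same accession is discarded. If those rows carry different columns that all need to be kept, they would have to be merged rather than dropped.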