R - Subset a data frame to keep only rows that meet multiple conditions across all columns
Here is a quick summary of what I want:
I have thousands of .csv files in the same folder that contain phrases such as "discount rate" or "discounted cash flow", mostly in the first column but also scattered randomly across the first 10 columns.
Using some function (possibly grepl(), subset(), or filter()), I want to extract the rows that contain these phrases and put them into a new data frame together with the name of the file each row came from.
The problem I am running into is that every function I have tried only lets me look at one or two columns at a time. Here is the code I have been using:
#Reading in a single .csv file for now:
MyData <- read.csv("c:/____________/.csv", header = TRUE, sep=",")
#Assigning numbers to each column since each file I will be plugging in has different column headings:
colnames(MyData) <- c(1:ncol(MyData))
#Using subset to check the 1st column and 5th column for discount rate
#(only because I knew these 2 columns contained the phrase "discount rate" ahead of time.)
my.data.frame <- subset(MyData, MyData$`1`=="discount rate" | MyData$`5`=="discount rate")
Would something like this work? You can modify the regular expression (regex) 'discount' to suit your needs.
#Sample dataframe with 'discount rate', 'discounted rates', or 'discounted cash flow' randomly placed
df <- data.frame(a=c('discount rate', 'nothing', 'discounted cash flow', 'nothing', 'nothing'), b=1:5,
c=6:10, d=c('nothing', 'discounted rates', 'nothing', 'nothing', 'nothing'), stringsAsFactors = F)
df
                     a b  c                d
1        discount rate 1  6          nothing
2              nothing 2  7 discounted rates
3 discounted cash flow 3  8          nothing
4              nothing 4  9          nothing
5              nothing 5 10          nothing
#Get rows where the word 'discount' occurs in any column
discountRows <- unique(unlist(apply(df, 2, function(x) grep('discount', x))))
#Subset df with only rows where the word 'discount' occurs
df[discountRows,]
                     a b c                d
1        discount rate 1 6          nothing
3 discounted cash flow 3 8          nothing
2              nothing 2 7 discounted rates
#Assign subsetted df to new dataframe with original name in it
assign(paste0(deparse(substitute(df)), '_discountRows'), df[discountRows,])
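The subset above comes back in the order the matches were found (rows 1, 3, 2). If the original row order matters, a logical mask is one alternative. The following is only a minimal sketch on the same sample data; the names hits, keep and df_discount are made up for illustration.

```r
# Same sample data as above
df <- data.frame(
  a = c("discount rate", "nothing", "discounted cash flow", "nothing", "nothing"),
  b = 1:5,
  c = 6:10,
  d = c("nothing", "discounted rates", "nothing", "nothing", "nothing"),
  stringsAsFactors = FALSE
)

# One logical vector per column: TRUE where the cell contains 'discount'
hits <- lapply(df, function(col) grepl("discount", col))

# Combine the per-column vectors with OR to get one flag per row
keep <- Reduce(`|`, hits)

# Rows come back in their original order
df_discount <- df[keep, , drop = FALSE]
```

Because keep is a plain TRUE/FALSE vector with one entry per row, it can also be stored as a column or combined with other conditions before subsetting.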
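Consider using grepl (which returns TRUE/FALSE on a regular expression match) inside apply, and wrap everything in a larger lapply that goes through the csv files and builds a list of data frames containing the subset rows. Then row-bind them at the end: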
setwd("C:/path/to/my/folder")
# ONLY PICK UP .csv FILES IN THE FOLDER
myfiles <- list.files(path="C:/path/to/my/folder", pattern="\\.csv$")

dfList <- lapply(myfiles, function(file){
  df <- read.csv(file, header = TRUE, stringsAsFactors = FALSE)
  colnames(df) <- c(1:ncol(df))

  # TRUE IF ANY COLUMN IN THE ROW HAS A MATCH
  discountfound <- apply(df, 1, function(row)
    any(grepl("discount rate|discounted cash flow", row)))

  # SUBSET TO THE MATCHING ROWS
  df <- df[discountfound, , drop = FALSE]

  # ADD COLUMN FOR FILENAME (AFTER THE SEARCH, SO THE FILE NAME ITSELF IS NEVER MATCHED)
  df$filename <- rep(file, nrow(df))
  df
})

# ASSUMES ALL DATAFRAMES HAVE EQUAL NUMBER OF COLUMNS
finaldf <- do.call(rbind, dfList)
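The final rbind only works when every file has the same number of columns, which the question says is not the case. One possible workaround, shown here purely as a sketch, is to return a fixed set of columns per match (file name, row number, and the row collapsed into one string). The helper extract_matches, the " | " separator, the choice of header = FALSE, and the two temporary demo files are all assumptions for illustration, not part of the original answer.

```r
# Hypothetical helper: return one row per match with a fixed set of columns,
# so the results can be row-bound even when the files differ in width.
extract_matches <- function(file, pattern = "discount rate|discounted cash flow") {
  # header = FALSE so that the first line of the file is searched as well
  df <- read.csv(file, header = FALSE, stringsAsFactors = FALSE)
  keep <- apply(df, 1, function(row) any(grepl(pattern, row)))
  idx <- which(keep)
  data.frame(
    filename = rep(basename(file), length(idx)),
    row = idx,
    text = vapply(idx, function(i) paste(unlist(df[i, ]), collapse = " | "),
                  character(1)),
    stringsAsFactors = FALSE
  )
}

# Demo with two temporary files of different widths (made-up data)
demo_dir <- tempfile("csvdemo")
dir.create(demo_dir)
writeLines(c("item,value", "discount rate,0.08", "revenue,100"),
           file.path(demo_dir, "a.csv"))
writeLines(c("x,y,z", "cost,5,none", "total,9,discounted cash flow"),
           file.path(demo_dir, "b.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
finaldf <- do.call(rbind, lapply(files, extract_matches))
```

The trade-off is that the original cell layout is lost in the text column. If the separate cells are needed, padding each subset with NA columns up to a common width would be another option.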
You can try this. I hope this is what you are looking for.
# Sample data: the phrases appear in both columns
mydata = data.frame(a = c(1:3, "discount rate", "discounted rates", 2:5),
                    b = c("discount rate", "discounted rates", 2:8))
row = c()
# Check each row and record its index if any cell matches
for (i in seq_len(nrow(mydata))){
  good_row = grep(paste("discount rate", "discounted rates", sep = "|"), unlist(mydata[i,]))
  if (length(good_row) != 0){
    row = c(row, i)
  }
}
mydata = mydata[row,]
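Take a look at the grep function.
grep looks complicated to use, because you have to specify a column name to search, but the files I am working with have no consistent column names or number of columns.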
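On the concern about column names: grep and grepl work on any character vector, so columns can be addressed by position instead of by name. A minimal sketch, assuming only the first 10 columns need to be searched (as described in the question) and that matching should ignore case (an added assumption); rows_matching and the sample data frames are hypothetical.

```r
# Hypothetical helper: flag rows whose first `n` columns contain `pattern`.
# Columns are selected by position, so their names do not matter.
rows_matching <- function(df, pattern, n = 10) {
  cols <- seq_len(min(n, ncol(df)))
  unname(apply(df[, cols, drop = FALSE], 1, function(row)
    any(grepl(pattern, row, ignore.case = TRUE))))
}

# Made-up data frames with unrelated column names and widths
df1 <- data.frame(Foo = c("Discount Rate", "x"),
                  Bar = c("y", "z"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(q = 1:3,
                  r = c("a", "discounted cash flow", "b"),
                  s = c("c", "d", "e"),
                  stringsAsFactors = FALSE)

phrases <- "discount rate|discounted cash flow"
match1 <- rows_matching(df1, phrases)
match2 <- rows_matching(df2, phrases)
```

The returned logical vectors can be used directly for subsetting, for example df2[match2, ].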