Filtering for multiple strings in the same column in R
My large dataset (Groceries) has a column containing character data (Fruits). All of the data is lower case and none of it contains punctuation. It looks something like this:
# Groceries Data Frame
Id Groceries$Fruits
1 apple orange banana lemon grapefruit
2 grapes tomato passion fruit
3 strawberry orange kiwi
4 lemon orange passion fruit grapefruit lime
5 lemon orange passion fruit grapefruit lime peach
...
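I am trying to select, from the Fruits column, all rows (of which there are 3320) that contain 5 specific fruits (orange, lime, lemon, grapefruit and passion fruit). At first, I am only interested in rows that contain all 5 fruits and no others. So of these 5 rows, the only one that should be filtered/subset is row 4. The fruits do not have to be in any particular order.
The data are actually answers to a test, so ultimately I want to determine who got 0/5 of the fruits, who got 1/5, 2/5 and so on.
So far I have tried two methods, and neither worked.
First, I tried grep(), but the resulting data frame contained no rows: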
# 1st attempt with grep()
Correct_fruits <- Groceries[grep("orange, lemon, lime, passion fruit, grapefruit", Groceries$Fruits), ]
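A note on why this attempt returns no rows: grep() treats the whole quoted string as a single pattern, so it searches for the literal text "orange, lemon, lime, ..." including the commas, which never occurs in the data. A minimal sketch of one alternative follows, assuming the five sample rows shown above. The names targets, has_fruit, leftover and only_targets are illustrative and not from the original post. It checks each fruit separately and then verifies that nothing else is left in the answer:

```r
# Sample rows copied from the question (the real data has thousands of rows).
Groceries <- data.frame(
  Id = 1:5,
  Fruits = c("apple orange banana lemon grapefruit",
             "grapes tomato passion fruit",
             "strawberry orange kiwi",
             "lemon orange passion fruit grapefruit lime",
             "lemon orange passion fruit grapefruit lime peach"),
  stringsAsFactors = FALSE
)

targets <- c("orange", "lemon", "lime", "passion fruit", "grapefruit")

# The original pattern is one literal string, so no row matches it.
literal_hits <- grep("orange, lemon, lime, passion fruit, grapefruit",
                     Groceries$Fruits)
length(literal_hits)  # 0

# One logical column per fruit: TRUE where that fruit appears in the row.
has_fruit <- sapply(targets, function(f) grepl(f, Groceries$Fruits, fixed = TRUE))
has_all <- rowSums(has_fruit) == length(targets)

# Strip every target fruit; any letters left over mean an extra answer.
leftover <- Groceries$Fruits
for (f in targets) leftover <- gsub(f, "", leftover, fixed = TRUE)
only_targets <- !grepl("[a-z]", leftover)

Correct_fruits <- Groceries[has_all & only_targets, ]
Correct_fruits$Id  # 4
```

Because this sketch matches plain substrings, a longer word that happens to contain a target (for example one containing "lime") would also count, so it is only safe when the answers are known to be clean.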
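It may need more clarification on what you intend to do with answers that contain all 5 fruits plus some extras, but this should help you. I replaced every instance of "passion fruit" with "passionfruit" to make things simpler: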
library(stringr)  # provides str_count()

df$Fruits <- gsub("passion fruit", "passionfruit", df$Fruits)
CorrectFruits <- c("lemon", "orange", "passionfruit", "grapefruit",
"lime")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
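To make the snippet above concrete, here is a sketch that runs the same counting logic on the five sample rows from the question. It assumes base R only: count_matches is a hypothetical helper standing in for stringr's str_count, and n_words is an illustrative name, so treat it as an approximation of the answer rather than the answer's own code.

```r
# Sample rows copied from the question.
df <- data.frame(
  Id = 1:5,
  Fruits = c("apple orange banana lemon grapefruit",
             "grapes tomato passion fruit",
             "strawberry orange kiwi",
             "lemon orange passion fruit grapefruit lime",
             "lemon orange passion fruit grapefruit lime peach"),
  stringsAsFactors = FALSE
)

# Make "passion fruit" a single word so that word counts equal fruit counts.
df$Fruits <- gsub("passion fruit", "passionfruit", df$Fruits)
CorrectFruits <- c("lemon", "orange", "passionfruit", "grapefruit", "lime")

# Hypothetical base-R stand-in for str_count(): matches per element of x.
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

df$Count <- count_matches(df$Fruits, paste(CorrectFruits, collapse = "|"))
n_words <- count_matches(df$Fruits, "\\w+")

# All 5 correct but extra answers present: reset the score to 0.
df$Count <- ifelse(df$Count == 5 & n_words > 5, 0, df$Count)
df$Count  # 3 1 1 5 0
```

Row 5 names all five fruits but adds "peach", so it is the only row whose score is reset.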
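The first line does the passion fruit replacement, then str_count counts all the correct fruits that occur in df$Fruits. Finally, if all 5 are correct but there are extra answers, Count is reset to 0.

Here is a way to do it using grepl and a list of target keywords: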
df <- structure(list(v1 = structure(1:4, .Label = c("row1", "row2",
"row3", "row4"), class = "factor"), v2 = structure(c(2L, 4L,
1L, 3L), .Label = c("another invalid row", "apple banana mandarin orange pear",
"banana apple mandarin pear orange", "not a valid row"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
targets <- c("banana", "apple", "orange", "pear", "mandarin")
bool_df <- as.data.frame(sapply(targets, grepl, df$v2))
match_rows <- which(rowSums(bool_df) == 5)
df <- df[match_rows,]
df

Having seen the others' clever solutions, here is my answer:
library(stringr)  # for str_count()
ID <- c(1:5)
Age <- c("26-35", "26-35", "46-55", "55+", "56-45")
Nationality <- c("Canadian", "US", "Canadian", "US", "British")
Color <- c("Correct", "Incorrect", "Incorrect", "Correct", "Correect")
Fruits <- c("pineapple",
"apple",
"apple orange kiwi fifth",
"orange apple pineapple kiwi fifth",
"pineapple orange apple fifth kiwi"
)
df <- data.frame(ID, Age, Nationality, Color, Fruits)
df
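The answer provided by sumshyftw is excellent, and I enjoyed learning something from it. But to illustrate my point, an unrestricted string search can throw off your count: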
CorrectFruits <- c("apple")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 2
5 5 56-45 British Correect pineapple orange apple fifth kiwi 2
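Here "pineapple" is counted as "apple". With word boundaries around the pattern ("\\bapple\\b"), R no longer counts pineapple as apple.
For the record, though, sumshyftw solved the hard part of my example and deserves the credit: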
CorrectFruits <- c("\\bapple\\b", "\\borange\\b", "\\bpineapple\\b", "\\bfifth\\b", "\\bkiwi\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 4
4 4 55+ US Correct orange apple pineapple kiwi fifth 5
5 5 56-45 British Correect pineapple orange apple fifth kiwi 5
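The difference the \\b word boundaries make can be seen in isolation. The sketch below assumes base R only (regmatches and gregexpr standing in for str_count) and builds the bounded patterns with paste0 instead of typing them out. The names bounded and answers are illustrative.

```r
targets <- c("apple", "orange", "pineapple", "fifth", "kiwi")

# Wrap each keyword in word boundaries, then join into one alternation.
bounded <- paste0("\\b", targets, "\\b")
pattern <- paste(bounded, collapse = "|")

answers <- c("pineapple", "apple", "orange apple pineapple kiwi fifth")

# Plain substring search: "pineapple" contains "apple".
plain <- grepl("apple", answers)
plain    # TRUE TRUE TRUE

# Whole-word search: "pineapple" no longer matches "apple".
whole <- grepl("\\bapple\\b", answers, perl = TRUE)
whole    # FALSE TRUE TRUE

# Whole-word count of all targets per answer.
counts <- lengths(regmatches(answers, gregexpr(pattern, answers, perl = TRUE)))
counts   # 1 1 5
```

Generating the patterns from a plain vector keeps the keyword list readable and avoids typing \\b by hand for every entry.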
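Comments:
Could you provide a reproducible example of the Groceries dataset? Part of the data would help.
Hi Jim O, thanks for getting back to me so quickly. I have added a short example at the bottom of what some of the data looks like, although the dataset is thousands of rows long. Let me know if there are any other details I could usefully add and I will try.
To help you better: is your query always limited to 5 fruits?
Yes, the question was "name as many of these 5 fruits as you can", and separate pictures of the 5 fruits were shown above it.
If you are working in the tidyverse, I suggest not naming your function filter, to avoid potential conflicts.
This is really useful, thank you very much. If you are willing, I have two follow-ups: 1) For answers with extra fruits, how would you change the code so that, instead of the score being reset to 0, one point is lost for each extra fruit? 2) If I want to create a new df that keeps all the data in the Groceries data frame but only the rows with a score of 5, how would I do that?
I don't understand the first follow-up question.
Please see my answer below.
I hadn't even considered the pineapple problem! I like your answer.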
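The first follow-up was left unanswered in the thread. One possible reading is that an answer naming all 5 fruits plus extras should score 5 minus the number of extras, instead of 0. The sketch below works under that assumption only. It uses base R, the helper count_matches and the variables hits, words, penalty and score are hypothetical names, and flooring the score at 0 with pmax is a further assumption.

```r
# Example answers, with "passion fruit" already collapsed to one word.
fruits <- c("lemon orange passionfruit grapefruit lime",
            "lemon orange passionfruit grapefruit lime peach",
            "apple orange banana lemon grapefruit",
            "grapes tomato passionfruit")

targets <- c("lemon", "orange", "passionfruit", "grapefruit", "lime")
pattern <- paste0("\\b(", paste(targets, collapse = "|"), ")\\b")

# Hypothetical base-R stand-in for str_count().
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

hits  <- count_matches(fruits, pattern)   # correct fruits named
words <- count_matches(fruits, "\\w+")    # total fruits named

# Assumed rule: only fully correct answers are penalised, one point per extra.
penalty <- ifelse(hits == 5 & words > 5, words - 5, 0)
score <- pmax(hits - penalty, 0)
score  # 5 4 3 1

# Second follow-up: keep only the rows with a perfect score.
perfect <- fruits[score == 5]
```

If every wrong answer should cost a point, not only the extras on otherwise perfect answers, the penalty line would become words - hits.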
library(dplyr)    # for filter()
library(stringr)  # for str_count()
ID <- c(1:5)
Age <- c("26-35", "26-35", "46-55", "55+", "56-45")
Nationality <- c("Canadian", "US", "Canadian", "US", "British")
Color <- c("Correct", "Incorrect", "Incorrect", "Correct", "Correect")
Fruits <- c("pineapple",
"apple",
"apple orange kiwi fifth",
"orange apple pineapple kiwi fifth",
"pineapple orange apple fifth kiwi"
)
df <- data.frame(ID, Age, Nationality, Color, Fruits)
df
filter(df, grepl("apple", Fruits))
ID Age Nationality Color Fruits
1 1 26-35 Canadian Correct pineapple
2 2 26-35 US Incorrect apple
3 3 46-55 Canadian Incorrect apple orange kiwi fifth
4 4 55+ US Correct orange apple pineapple kiwi fifth
5 5 56-45 British Correect pineapple orange apple fifth kiwi
CorrectFruits <- c("\\bapple\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 0
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 1
5 5 56-45 British Correect pineapple orange apple fifth kiwi 1
CorrectFruits <- c("\\bapple\\b", "\\borange\\b", "\\bpineapple\\b", "\\bfifth\\b", "\\bkiwi\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 4
4 4 55+ US Correct orange apple pineapple kiwi fifth 5
5 5 56-45 British Correect pineapple orange apple fifth kiwi 5
# Keep every column, but only the rows with a perfect score
df2 <- filter(df, Count == 5)
df2
ID Age Nationality Color Fruits Count
1 4 55+ US Correct orange apple pineapple kiwi fifth 5
2 5 56-45 British Correect pineapple orange apple fifth kiwi 5
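The question also asks who scored 0/5, 1/5, 2/5 and so on. A minimal sketch of that tally, assuming the Count values printed above, uses table() on a factor with fixed levels so that scores nobody achieved still show up as zero. The name score_table is illustrative.

```r
# Count column as printed above for the five example respondents.
Count <- c(1, 1, 4, 5, 5)

# Fixing the levels at 0:5 keeps empty score categories in the table.
score_table <- table(factor(Count, levels = 0:5))
score_table
# 0 1 2 3 4 5
# 0 2 0 0 1 2
```

Without the explicit levels, table(Count) would list only the scores 1, 4 and 5.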