Filtering for multiple strings in the same column in R

My large dataset (Groceries) has a column of character data (Fruits). All of it is lowercase and none of it contains punctuation.

It looks something like this:

# Groceries Data Frame
Id    Groceries$Fruits
1     apple orange banana lemon grapefruit
2     grapes tomato passion fruit
3     strawberry orange kiwi
4     lemon orange passion fruit grapefruit lime
5     lemon orange passion fruit grapefruit lime peach
  ...
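I am trying to select all rows (there are 3320 in total) whose Fruits column contains 5 specific fruits: orange, lime, lemon, grapefruit and passion fruit. To begin with, I am only interested in rows that contain all 5 of these fruits and no others. So of the 5 rows above, the only one that should be filtered/subset is row 4. The fruits do not have to be in any particular order.

The data are actually answers to a test, so eventually I want to work out who got 0/5 fruits, who got 1/5, 2/5 and so on.

So far I have tried two approaches, and neither worked. First I tried grep(), but the resulting data frame contained no rows: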

# 1st attempt with grep()
Correct_fruits <- Groceries[grep("orange, lemon, lime, passion fruit, grapefruit",
                                 Groceries$Fruits), ]
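
One likely reason this attempt returns nothing: grep() treats the whole quoted string, commas included, as a single pattern, and no row contains that literal text. The following is a minimal sketch on two made-up rows (the vector names are illustrative only), contrasting the literal pattern with a per-fruit test:

```r
fruits <- c("lemon orange passion fruit grapefruit lime",
            "strawberry orange kiwi")

# The comma-separated string is one literal pattern; no row contains it
literal_hits <- grep("orange, lemon, lime, passion fruit, grapefruit", fruits)

# Testing each fruit separately and requiring all of them finds row 1
targets <- c("orange", "lemon", "lime", "passion fruit", "grapefruit")
has_all <- Reduce(`&`, lapply(targets, grepl, x = fruits, fixed = TRUE))
```

Note that this per-fruit test only checks that the five fruits are present; it does not exclude rows that also name other fruits.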

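It may need some more clarification about what you intend to do with answers that have the 5 fruits plus some extras, but this should help you. I replaced all instances of "passion fruit" with "passionfruit" to make things simpler: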

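library(stringr)  # provides str_count()

# Collapse the two-word fruit so every fruit is a single token
df$Fruits <- gsub("passion fruit", "passionfruit", df$Fruits)
CorrectFruits <- c("lemon", "orange", "passionfruit", "grapefruit", 
                   "lime")
# Count how many correct fruits appear in each row
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
# Five correct fruits plus extra words: reset the score to 0
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)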

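The first line does the passion fruit replacement, then str_count counts all the correct fruits that appear in df$Fruits. Finally, if all 5 are correct but there are extras, Count is reset to 0.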
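
A token-based alternative can avoid the regular expression entirely. The following is a minimal sketch, not part of the original answer: it assumes "passion fruit" has already been collapsed to "passionfruit" as above, that answers are separated by spaces, and the helper name score_fruits is made up for illustration.

```r
# Hypothetical helper: number of distinct target fruits named in each answer
score_fruits <- function(answers, targets) {
  tokens <- strsplit(answers, " +")   # one character vector of words per row
  vapply(tokens, function(tok) sum(targets %in% tok), integer(1))
}

fruits <- c("apple orange banana lemon grapefruit",
            "grapes tomato passionfruit",
            "strawberry orange kiwi",
            "lemon orange passionfruit grapefruit lime",
            "lemon orange passionfruit grapefruit lime peach")
targets <- c("lemon", "orange", "passionfruit", "grapefruit", "lime")

score <- score_fruits(fruits, targets)

# TRUE only where the answer is exactly the five targets, in any order
exact <- vapply(strsplit(fruits, " +"), setequal, logical(1), y = targets)
```

Because whole words are compared, a word such as "pineapple" cannot be mistaken for "apple", and setequal() ignores the order in which the fruits were written.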

Here is an approach using grepl and a list of target keywords:

df <- structure(list(v1 = structure(1:4, .Label = c("row1", "row2", 
"row3", "row4"), class = "factor"), v2 = structure(c(2L, 4L, 
1L, 3L), .Label = c("another invalid row", "apple banana mandarin orange pear", 
"banana apple mandarin pear orange", "not a valid row"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

targets <- c("banana", "apple", "orange", "pear", "mandarin")
bool_df <- as.data.frame(sapply(targets, grepl, df$v2))
match_rows <- which(rowSums(bool_df) == 5)
df <- df[match_rows,]
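
The same logical matrix can also give the per-answer score the question ultimately asks for (0/5, 1/5, and so on). This is a rough sketch on made-up answers rather than the answerer's code; the names answers, hits and score_table are illustrative, and the matching is still plain substring matching.

```r
answers <- c("apple banana mandarin orange pear",
             "not a valid row",
             "banana pear",
             "banana apple mandarin pear orange")
targets <- c("banana", "apple", "orange", "pear", "mandarin")

# Logical matrix: one row per answer, one column per target
hits <- sapply(targets, grepl, answers, fixed = TRUE)
score <- rowSums(hits)   # number of targets found in each answer

# Tally how many answers fall in each score bucket, keeping empty buckets
score_table <- table(factor(score, levels = 0:length(targets)))
```

Wrapping the scores in factor() with explicit levels keeps buckets that no answer falls into, so the tally always has one entry per possible score.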

Having seen the others' brilliant solutions, here is my answer:

library(stringr)  # str_count() is used below
ID <- c(1:5)
Age <- c("26-35", "26-35", "46-55", "55+", "56-45")
Nationality <- c("Canadian", "US", "Canadian", "US", "British")
Color <- c("Correct", "Incorrect", "Incorrect", "Correct", "Correect")
Fruits <- c("pineapple", 
            "apple", 
            "apple orange kiwi fifth",
            "orange apple pineapple kiwi fifth",
            "pineapple orange apple fifth kiwi"
            )
df <- data.frame(ID, Age, Nationality, Color, Fruits)
df
The answer provided by sumshyftw is excellent, and I enjoyed learning from it. But to prove my point, an unrestricted string search can throw off your count:

CorrectFruits <- c("apple")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df

  ID   Age Nationality     Color                            Fruits Count
1  1 26-35    Canadian   Correct                         pineapple     1
2  2 26-35          US Incorrect                             apple     1
3  3 46-55    Canadian Incorrect           apple orange kiwi fifth     1
4  4   55+          US   Correct orange apple pineapple kiwi fifth     2
5  5 56-45     British  Correect pineapple orange apple fifth kiwi     2
With word boundaries around the pattern, as in "\\bapple\\b", R no longer counts pineapple as apple.

But, for the record, sumshyftw solved the hard part in my example and deserves the credit:

CorrectFruits <- c("\\bapple\\b", "\\borange\\b", "\\bpineapple\\b", "\\bfifth\\b", "\\bkiwi\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df

  ID   Age Nationality     Color                            Fruits Count
1  1 26-35    Canadian   Correct                         pineapple     1
2  2 26-35          US Incorrect                             apple     1
3  3 46-55    Canadian Incorrect           apple orange kiwi fifth     4
4  4   55+          US   Correct orange apple pineapple kiwi fifth     5
5  5 56-45     British  Correect pineapple orange apple fifth kiwi     5

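Could you provide a reproducible example of the Groceries dataset? Part of the data would help.

Hi Jim O, thanks for getting back so quickly. I have added a short example of what some of the data looks like at the bottom, although the dataset is thousands of rows long. Let me know if there are any other details I can usefully add and I will try.

But to help you better: is your query always limited to 5 fruits?

Yes, the question was "name as many of these 5 fruits as you can", with 5 separate pictures of the fruits listed above it.

If you are working in the tidyverse, I would suggest not naming your function filter, to avoid potential conflicts.

This is very useful, thank you very much. If you are willing, I have two follow-ups... 1) For answers with extra fruits, how would you change the code so that their score is not reset to 0, but instead loses one point for each extra fruit? 2) If I wanted to create a new df that keeps all the data from the Groceries data frame, but only the rows with a score of 5, how would I do that?

I don't understand the first follow-up question. Please see my answer below.

I hadn't even considered the pineapple problem! I like your answer.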
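
One possible reading of the first follow-up is: score = number of correct fruits minus the number of extra words, never going below zero. The sketch below assumes that reading (the floor at zero is an assumption, not something stated in the thread), assumes every fruit is a single word, and uses a made-up base R helper, count_matches, in place of str_count. The last line illustrates the second follow-up.

```r
# Hypothetical helper: count regex matches per string using base R only
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

df <- data.frame(
  ID = 1:4,
  Fruits = c("pineapple",
             "apple orange kiwi fifth",
             "orange apple pineapple kiwi fifth",
             "pineapple orange apple fifth kiwi banana pear"),
  stringsAsFactors = FALSE
)
targets <- c("apple", "orange", "pineapple", "fifth", "kiwi")

# Whole-word pattern so "apple" does not match inside "pineapple"
pattern <- paste0("\\b(", paste(targets, collapse = "|"), ")\\b")

df$Correct <- count_matches(df$Fruits, pattern)              # target words
df$Extras  <- count_matches(df$Fruits, "\\w+") - df$Correct  # all other words
df$Score   <- pmax(df$Correct - df$Extras, 0)                # assumed floor at 0

# Follow-up 2: keep every column, but only the rows that scored 5
perfect <- df[df$Score == 5, ]
```

A fruit that is repeated within one answer would be counted each time it appears, so duplicates may need to be removed first if that matters for the scoring.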
library(dplyr)    # filter()
library(stringr)  # str_count()
ID <- c(1:5)
Age <- c("26-35", "26-35", "46-55", "55+", "56-45")
Nationality <- c("Canadian", "US", "Canadian", "US", "British")
Color <- c("Correct", "Incorrect", "Incorrect", "Correct", "Correect")
Fruits <- c("pineapple", 
            "apple", 
            "apple orange kiwi fifth",
            "orange apple pineapple kiwi fifth",
            "pineapple orange apple fifth kiwi"
            )
df <- data.frame(ID, Age, Nationality, Color, Fruits)
df
filter(df, grepl("apple", Fruits))

  ID   Age Nationality     Color                            Fruits
1  1 26-35    Canadian   Correct                         pineapple
2  2 26-35          US Incorrect                             apple
3  3 46-55    Canadian Incorrect           apple orange kiwi fifth
4  4   55+          US   Correct orange apple pineapple kiwi fifth
5  5 56-45     British  Correect pineapple orange apple fifth kiwi
CorrectFruits <- c("apple")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df

  ID   Age Nationality     Color                            Fruits Count
1  1 26-35    Canadian   Correct                         pineapple     1
2  2 26-35          US Incorrect                             apple     1
3  3 46-55    Canadian Incorrect           apple orange kiwi fifth     1
4  4   55+          US   Correct orange apple pineapple kiwi fifth     2
5  5 56-45     British  Correect pineapple orange apple fifth kiwi     2
CorrectFruits <- c("\\bapple\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df

  ID   Age Nationality     Color                            Fruits Count
1  1 26-35    Canadian   Correct                         pineapple     0
2  2 26-35          US Incorrect                             apple     1
3  3 46-55    Canadian Incorrect           apple orange kiwi fifth     1
4  4   55+          US   Correct orange apple pineapple kiwi fifth     1
5  5 56-45     British  Correect pineapple orange apple fifth kiwi     1
CorrectFruits <- c("\\bapple\\b", "\\borange\\b", "\\bpineapple\\b", "\\bfifth\\b", "\\bkiwi\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df

  ID   Age Nationality     Color                            Fruits Count
1  1 26-35    Canadian   Correct                         pineapple     1
2  2 26-35          US Incorrect                             apple     1
3  3 46-55    Canadian Incorrect           apple orange kiwi fifth     4
4  4   55+          US   Correct orange apple pineapple kiwi fifth     5
5  5 56-45     British  Correect pineapple orange apple fifth kiwi     5
df2 <- filter(df, Count == 5)
df2

  ID   Age Nationality    Color                            Fruits Count
1  4   55+          US  Correct orange apple pineapple kiwi fifth     5
2  5 56-45     British Correect pineapple orange apple fifth kiwi     5