R string matching: words + characters

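I am trying to search a data frame for string matches. I created an object, lc_notes, from a column that is full of notes.

For example, I am looking for any row whose note might match one of these:

mph_words<-c(">10", "> 10", ">20", "> 20")

As you can see, some notes have a space between the ">" and the number, so searching with strsplit is not ideal, because I really need to keep the ">" together with the number.

I tried: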

> mph_words %in% lc_notes[2000]
[1] FALSE FALSE FALSE FALSE

> pmatch(mph_words, lc_notes[1703])
[1] NA NA NA NA

> grepl(lc_notes[1703],mph_words)
[1] FALSE FALSE FALSE FALSE

> str_detect(mph_words,lc_notes[1703])
[1] FALSE FALSE FALSE FALSE

> for (word in 1:length(mph_words)){
+   print(str_extract(mph_words[word],lc_notes[1703]))
+ }
[1] NA
[1] NA
[1] NA
[1] NA

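I don't know what to do next. If the answer is a regular expression, could you explain it in your answer? I am trying to understand regex better.

Edit: I am trying to print the rows that contain one of the strings in mph_words. So the code would search every row of my lc_notes and print row 1703.

Thank you in advance.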

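Edited to match the edited question:
To find the row numbers, use grep:

grep("[<>]\\s*\\d+\\b",  lc_notes)

[<>] matches either < or >
\\s* allows optional whitespace
\\d+ matches the one or more digits that follow
\\b is a word boundary, so the match ends where the number ends

grep will give the row numbers (indices) of the matching notes.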
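As a quick check of the pattern, here is a minimal sketch. The three sample notes are borrowed from another answer in this thread and only stand in for the real lc_notes, which may look different.

```r
# Sample notes standing in for the real lc_notes column (an assumption).
lc_notes <- c(
  "collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph.",
  "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph.",
  "collected 1.667 man-hr total. mostly cloudy, windy with gusts of 20 mph."
)

# < or >, optional whitespace, one or more digits, then a word boundary.
pattern <- "[<>]\\s*\\d+\\b"

# grep() returns the indices of the matching elements.
hits <- grep(pattern, lc_notes)

# grepl() returns one TRUE/FALSE per element instead.
flags <- grepl(pattern, lc_notes)

# regmatches() with regexpr() pulls out the matched text itself.
found <- regmatches(lc_notes, regexpr(pattern, lc_notes))

print(hits)
print(found)
```

Note the argument order: the pattern comes first and the vector being searched comes second, for both grep and grepl.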

Here is an approach using strsplit and lapply:

# standardize (get rid of white space between <,> and digits) in mph_words
mph_words <- unique(gsub('([<>])\\s{0,}(\\d+)', '\\1\\2', mph_words, perl = TRUE))        
# match 
check <- lapply(1:length(lc_notes), 
                function (k) any(mph_words %in% unlist(strsplit(lc_notes[k], ' '))))
check
# [[1]]
# [1] TRUE

# [[2]]
# [1] TRUE

# [[3]]
# [1] FALSE

# Finally printing the indices with a match
which(unlist(check))
# [1] 1 2
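Splitting on single spaces breaks a note's "> 20" into the two tokens ">" and "20", so one option is to apply the same standardization to the notes before splitting. This is a minimal sketch under assumptions: the sample notes are borrowed from another answer in this thread, the helper name squash is made up for illustration, and vapply is swapped in for lapply so the result is already a logical vector.

```r
# Sample notes standing in for the real lc_notes column (an assumption).
lc_notes <- c(
  "collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph.",
  "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph.",
  "collected 1.667 man-hr total. mostly cloudy, windy with gusts of 20 mph."
)
mph_words <- c(">10", "> 10", ">20", "> 20")

# Hypothetical helper: remove whitespace between <,> and the digits.
squash <- function(x) gsub("([<>])\\s*(\\d+)", "\\1\\2", x, perl = TRUE)

# Standardize both the search words and the notes.
mph_clean <- unique(squash(mph_words))
notes_clean <- squash(lc_notes)

# Split each note on spaces and test whether any token is one of the words.
check <- vapply(
  strsplit(notes_clean, " ", fixed = TRUE),
  function(tokens) any(mph_clean %in% tokens),
  logical(1)
)

# Indices of the notes with a match.
matched_rows <- which(check)
print(matched_rows)
```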

I would use sapply with stringr::str_detect to do this:
library(stringr)

lc_notes <- c("collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph.",
              "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph.",
              "collected 1.667 man-hr total. mostly cloudy, windy with gusts of 20 mph.")
mph_words<-c(">10", "> 10", ">20", "> 20")

sapply(lc_notes, function(x) any(str_detect(x, mph_words)))

collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph. 
                                                                    TRUE 
collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph. 
                                                                    TRUE 
collected 1.667 man-hr total. mostly cloudy, windy with gusts of 20 mph. 
                                                                   FALSE 

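Wrapping that call in unname(which(...)) emphasizes that the result is a vector of the indices of the items in lc_notes that match any of the regex patterns. You can also do the opposite and call names on it to get the text of the rows: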
names(which(sapply(lc_notes, function(x) any(str_detect(x, mph_words)))))
[1] "collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph." 
[2] "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph."


If you want a simpler regex that matches with or without the space, use the ? optional quantifier on the space character:
mph_words<-c("> ?10", "> ?20")
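Because these entries are regular expressions, they can also be joined with | into a single pattern and matched in one vectorized call. This is a minimal base R sketch, with grepl swapped in for str_detect so that no package is needed, and with the sample lc_notes from this answer repeated so the block runs on its own.

```r
# Sample notes standing in for the real lc_notes column (an assumption).
lc_notes <- c(
  "collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph.",
  "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph.",
  "collected 1.667 man-hr total. mostly cloudy, windy with gusts of 20 mph."
)

# "> ?" means a ">" followed by an optional single space.
mph_words <- c("> ?10", "> ?20")

# Join the patterns with "|" so one regex covers every word.
pattern <- paste(mph_words, collapse = "|")

# One logical per note; which() turns that into row indices.
has_match <- grepl(pattern, lc_notes)
match_rows <- which(has_match)

print(pattern)
print(match_rows)
print(lc_notes[match_rows])
```

One caveat: "> ?20" also matches the start of ">200". If that matters for your data, appending \\b to each pattern would require the number to end there.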

I just tried to copy and paste your code, but got the following: > sub(".*([<>]\s*\d+\b).*", "\\1", lc_notes) Error: '\s' is an unrecognized escape in character string starting "".*([<>]\s" Could you explain your code? Thanks!

This is the error I get: Error: '\s' is an unrecognized escape in character string starting "".*([<>]\s"

Fixed now. Try the new version.

Thanks! I got a printout, but I am clarifying my question. Thanks for the regex explanation, it is very helpful.

Thanks! This is exactly what I was hoping for! I really appreciate the ? option, and especially the names/unname options!

Thanks everyone for the quick help and, more importantly, for the code explanations!

@G5W -- thanks for all the edits, but keeping the # is important because the mph value matters for the next part of the code.

@nate Thanks! I should have clarified better (still learning how to ask questions), but lc_notes has inconsistent spaces between the ">" and the number.
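The "unrecognized escape" error quoted in the comments comes from R's string parser, not from the regex engine: inside an ordinary string literal each backslash has to be doubled. Here is a short sketch of the two spellings, assuming R 4.0.0 or later for the raw-string form. The sample note is made up for illustration.

```r
# Made-up sample note (an assumption, not from the real data).
note <- "windy with gusts > 20 mph."

# Ordinary string literal: write \\s, \\d and \\b so the regex engine
# receives \s, \d and \b.
escaped <- ".*([<>]\\s*\\d+\\b).*"

# Raw string literal (R >= 4.0.0): backslashes are taken literally.
raw <- r"(.*([<>]\s*\d+\b).*)"

# Both spellings produce the same pattern string.
same_pattern <- identical(escaped, raw)

# sub() keeps only the captured group: the sign, optional space and number.
extracted <- sub(escaped, "\\1", note)

print(same_pattern)
print(extracted)
```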