Match two strings and extract the matching characters in R

Suppose I have the input characters shown below:

text_input <- c("ADOPT", "A", "FAIL", "FAST")
test <- c("TEST", "INPUT", "FAIL", "FAST")
Edit

Just adding one more question here: what happens when the input is a plain string of running text, like the one below?

text_input <- c("‘Data Scientist’ has been named the sexiest job of the 21st century by Harvard Business Review. The same article tells us that “demand has raced ahead of supply” and that the lack of data scientists “is becoming a serious constraint in some sectors.” A 2011 study by McKinsey Global Institute found that “there will be a shortage of talent necessary for organizations to take advantage of big data” – a shortage to the tune of 140,000 to 190,000 in the United States alone by 2018.")

test <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")

If we need to extract the matching characters:

library(stringr)
str_extract(text_input, paste0("[", test, "]+"))
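To make clear what the bracket pattern does, here is a minimal base R sketch of the same idea. It assumes the original four-element vectors; the helper name extract_first is hypothetical and not part of stringr. Each element of test becomes a character class such as "[TEST]+", so the pattern matches a run of any of those letters, not the whole word. This is also why a test entry containing "-" between two characters in descending order can produce the invalid-range error reported in the comments.

```r
text_input <- c("ADOPT", "A", "FAIL", "FAST")
test <- c("TEST", "INPUT", "FAIL", "FAST")

# Build one character class per element of test, e.g. "[TEST]+"
patterns <- paste0("[", test, "]+")

# Hypothetical helper: first match of `pattern` in `x`, or NA if there is none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x)
  if (m[1] == -1L) NA_character_ else regmatches(x, m)
}

# Pair text_input[i] with patterns[i], as the vectorised str_extract call does
result <- unname(mapply(extract_first, text_input, patterns))
result
# [1] "T"    NA     "FAIL" "FAST"
```

Note that "ADOPT" yields "T" because the class "[TEST]+" only needs one of the letters T, E or S to be present.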

If we are looking for complete string matches:

library(data.table)
fintersect(data.table(col1 = text_input), data.table(col1 = test))

This can also be done with the fuzzyjoin package, which includes a way to join data frames based on regex:

text_input <- c("ADOPT", "A", "FAIL", "FAST")
regex <- c("TEST", "INPUT", "FAIL", "FAST")

library(fuzzyjoin)
library(dplyr)

df <- tibble( text = text_input )
df.regex <- tibble( regex_name = regex )

# now we can regex match them
df %>%
  regex_left_join( df.regex, by = c( text = "regex_name" ) )

# # A tibble: 4 x 2
# text  regex_name
#   <chr> <chr>     
# 1 ADOPT NA        
# 2 A     NA        
# 3 FAIL  FAIL      
# 4 FAST  FAST 

#or only regex 'hits'
df %>%
  regex_inner_join( df.regex, by = c( text = "regex_name" ) )

# # A tibble: 2 x 2
# text  regex_name
#   <chr> <chr>     
# 1 FAIL  FAIL      
# 2 FAST  FAST   
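For readers without fuzzyjoin installed, the inner join above can be approximated in base R. This is only a sketch of the idea (test every pattern against every text and keep the pairs that match), not how fuzzyjoin is implemented; the names patterns, hits, idx and joined are assumptions made for the illustration.

```r
text_input <- c("ADOPT", "A", "FAIL", "FAST")
patterns <- c("TEST", "INPUT", "FAIL", "FAST")

# hits[i, j] is TRUE when patterns[j] matches somewhere in text_input[i]
hits <- sapply(patterns, function(p) grepl(p, text_input))

# Row/column positions of every TRUE cell, i.e. every (text, pattern) pair that matched
idx <- which(hits, arr.ind = TRUE)

# Keep only the matching pairs, like regex_inner_join()
joined <- data.frame(
  text = text_input[idx[, "row"]],
  regex_name = patterns[idx[, "col"]],
  stringsAsFactors = FALSE
)
joined
```

With these inputs only the FAIL/FAIL and FAST/FAST pairs survive, which agrees with the regex_inner_join output shown above.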

For the simple example, you can use intersect(), as already noted in the comments:

text_input1 <- c("ADOPT", "A", "FAIL", "FAST")
test1 <- c("TEST", "INPUT", "FAIL", "FAST")
intersect(text_input1, test1)
# [1] "FAIL" "FAST"
With tol=FALSE the matching is case-sensitive, so "ORGANIZATIONS" is not found:

matchPhrase(phrases, text_input2, tol=FALSE)
#   Data Scientist         McKinsey    ORGANIZATIONS             FAST 
# "Data Scientist"       "McKinsey"               NA               NA 
Case-insensitive matching (tol=TRUE) also finds "organizations".

To get clean output, just do:

as.character(na.omit(matchPhrase(phrases, text_input2, tol=TRUE)))
# [1] "data scientist" "mckinsey"       "organizations" 

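Note: you will probably need to adapt the function several times to fit your specific needs and desired output. In practice, the packages that do this kind of thing are quite sophisticated.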
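The definition of matchPhrase() is not shown here, so the following is only a hedged sketch of how such a function might look in base R. The name match_phrases, its ignore_case argument, and the shortened sample text are all assumptions for illustration. It searches for each phrase literally (fixed = TRUE), which sidesteps regex errors caused by special characters in a phrase, and lower-cases both sides when case should be ignored.

```r
# Hypothetical stand-in for matchPhrase(): returns a named character vector
# holding each phrase found in `text`, or NA where the phrase is absent.
match_phrases <- function(phrases, text, ignore_case = FALSE) {
  stopifnot(length(text) == 1)
  needles <- if (ignore_case) tolower(phrases) else phrases
  haystack <- if (ignore_case) tolower(text) else text
  # fixed = TRUE treats every phrase as plain text, not as a regex
  hit <- vapply(needles, function(p) grepl(p, haystack, fixed = TRUE), logical(1))
  out <- ifelse(hit, needles, NA_character_)
  names(out) <- phrases
  out
}

# Shortened sample text, made up for this sketch
txt <- "A study by McKinsey found that organizations need more Data Scientist talent."
phrases <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")

strict <- match_phrases(phrases, txt)
loose <- as.character(na.omit(match_phrases(phrases, txt, ignore_case = TRUE)))
loose
# [1] "data scientist" "mckinsey"       "organizations"
```

Here strict contains NA for "ORGANIZATIONS" and "FAST", while the case-insensitive call additionally finds "organizations".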

What is your expected output? Are text_input and test the same length?
?intersect maybe? So intersect(text_input, test)
It is the same. test has a different length; it consists of 5L records.
What is the expected output for the EDIT?
It shows an error: "In a character range [x-y], x is greater than y. (U_REGEX_INVALID_RANGE)". I have edited the question; can the same thing be done with the intersect function?
@JBH I found the comment from "LateMail" below your question.
@JBH Do you need str_extract_all(text_input, paste(test, collapse = "|"))?
Thanks, but, for example, when there is a phrase such as "ABOUT SERVICE PVT LTD", the output I get only matches my text input. Also, what does the "0" mean in regexpr(rx1, txt) > 0?
@JBH I can't reproduce this, but when you try phrases