Fast way to do string matching and replacement from another data frame in R


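I have two data frames like these (although the first has more than 90 million rows and the second slightly more than 14 million rows); the second data frame is also randomly ordered.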

df1 <- data.frame(
  datalist = c("wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
               "strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
               "the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
               "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
               "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
               "the first known use of this word was in 1539"),
  words = c("anarchist_schools_of_thought  individualism  collectivism", "social_anarchism  individualist_anarchism",
            "anarchy  -ism", "privative  privative_alpha", "infinitive", ""),

  stringsAsFactors=FALSE)

df2 <- data.frame(
  vocabword = c("anarchist_schools_of_thought", "individualism","collectivism" , "1965-66_nhl_season_by_team","social_anarchism","individualist_anarchism",                
                 "anarchy","-ism","privative","privative_alpha", "1310_the_ticket",  "infinitive"),
  token = c("Anarchist_schools_of_thought" ,"Individualism", "Collectivism",  "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism" ,"Anarchy",
            "-ism", "Privative" ,"Alpha_privative", "KTCK_(AM)" ,"Infinitive"), 
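  stringsAsFactors = FALSE)

I realize that many of the words simply have their first letter capitalized, but some are quite different. I could write a for loop, but I think that would take far too long; I would prefer a data.table approach, or possibly stringi or stringr. Normally I would just do a merge, but because a single row can contain several words that need replacing, that complicates things.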


Thanks in advance for your help.

You can use str_replace_all from stringr to do this:

library(stringr)

# Names are the patterns to look for (vocabword); values are the replacements (token)
str_replace_all(df1$datalist, setNames(df2$token, df2$vocabword))
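Basically, str_replace_all lets you supply a named vector in which the original strings are the names and the replacements are the elements. You have already done the hard work by building the "dictionary" of strings and replacements; str_replace_all simply takes it and performs the replacements automatically.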

Result:

[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"              
[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"                               
[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Privative_alpha an- i.e"                                              
[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                       
[6] "the first known use of this word was in 1539"
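Note that in the output above, element [4] starts with "Anarchy" and contains wiki/Privative_alpha rather than wiki/Alpha_privative. A minimal sketch of why this happens, using a made-up two-entry dictionary (the toy string and the variable names are assumptions for illustration only): the patterns are applied one after another, so a short entry can rewrite part of a longer one before the longer one is tried. Wrapping each pattern in \\b is one possible way around it.

```r
library(stringr)

# Toy dictionary: names are the patterns, values are the replacements
dictionary <- c(privative = "Privative", privative_alpha = "Alpha_privative")
x <- "the wiki/privative prefix wiki/privative_alpha"

# Patterns are applied in order, so "privative" also rewrites the start of
# "privative_alpha", and the second entry then finds nothing left to match
plain <- str_replace_all(x, dictionary)

# Same dictionary, but each pattern must match a whole word
# ("_" counts as a word character, so "privative_alpha" is one word)
bounded_dictionary <- setNames(dictionary, paste0("\\b", names(dictionary), "\\b"))
bounded <- str_replace_all(x, bounded_dictionary)

print(plain)
print(bounded)
```

With word boundaries the longer tag survives intact, at the cost of running every pattern as a regular expression.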

The solution from a related question seems to work well with your data.


I usually use straight stringi; here is how you could do that:

library(stringi)

Old <- df2[["vocabword"]]
New <- df2[["token"]]

stringi::stri_replace_all_regex(df1[["datalist"]],
                                "\\b"%s+%Old%s+%"\\b",
                                New,
                                vectorize_all = FALSE)

#[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"              
#[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
#[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"                               
#[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an- i.e"                                              
#[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                       
#[6] "the first known use of this word was in 1539"  
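
One thing to keep in mind with this approach: every vocabulary word is interpreted as a regular expression, so characters such as . or ( inside a word keep their regex meaning. A possible safeguard, sketched here on an invented entry (st._louis is an assumption, not part of the data above), is to wrap each word in \\Q...\\E, which the ICU regex engine used by stringi treats as a literal quote. It would not help for a word that itself contains \\E.

```r
library(stringi)

text <- "wiki/stx_louis and wiki/st._louis"

# Unquoted: "." matches any character, so both terms are rewritten
unquoted <- stri_replace_all_regex(text, "\\bst._louis\\b", "St._Louis")

# Quoted: everything between \\Q and \\E is matched literally
quoted <- stri_replace_all_regex(text, "\\b\\Qst._louis\\E\\b", "St._Louis")

print(unquoted)
print(quoted)
```

For the full data this would mean building the patterns as "\\b\\Q" %s+% Old %s+% "\\E\\b" instead of "\\b" %s+% Old %s+% "\\b".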
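Update 2: The approach below does not account for the possibility that one of your tags is a substring of another tag, i.e. wiki/personalist and wiki/personalist_anarchism could give you wrong results. The only way I really know to avoid that is to use a regex that replaces full words with a word boundary (\\b) before and after, which cannot be done with fixed strings.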

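One option that may give you some hope relies on the fact that you have actually marked all of the desired replacements with the prefix wiki/. If that holds for your real data, we can take advantage of it and use a fixed replacement instead of a regex replacement of full words with a word boundary (\\b) before and after. (The word boundaries are otherwise necessary to keep single words like "ism" from being replaced when they occur as part of a longer word.)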
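
To make the substring caveat concrete, here is a small sketch on a toy string (the string and the variable names old, new and ord are assumptions for illustration). It shows the fixed replacement going wrong when a shorter tag is a prefix of a longer one, followed by one possible mitigation that is not part of the answer itself: replacing the longest tags first.

```r
library(stringi)

x <- "the wiki/privative prefix wiki/privative_alpha"
old <- paste0("wiki/", c("privative", "privative_alpha"))
new <- paste0("wiki/", c("Privative", "Alpha_privative"))

# Fixed replacement in the given order: the shorter tag fires inside the longer one
in_order <- stri_replace_all_fixed(x, old, new, vectorize_all = FALSE)

# Possible mitigation (an assumption): process the longest tags first
ord <- order(nchar(old), decreasing = TRUE)
longest_first <- stri_replace_all_fixed(x, old[ord], new[ord], vectorize_all = FALSE)

print(in_order)
print(longest_first)
```

Ordering by length only helps between tags that are both in the vocabulary. A tag in the text that merely extends a vocabulary word, such as a hypothetical wiki/privatives, would still be partially rewritten, and sorting 14 million entries has its own cost.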

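Using the same Sentences, vocabword and token vectors as in the timing test, prepend "wiki/" to every vocabulary word and every token with paste0(), then pass the prefixed vectors to stri_replace_all_fixed() with vectorize_all = FALSE.


Since each of your terms begins with "wiki/", you can rearrange your data set to make the matches easier to create. The approach I propose is to move each "wiki/term" to its own row in a data frame, use a join to match the valid words, then reverse the steps to put the strings back together, but with the new terms.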

library(tidyverse)
df1a <- df1 %>%
  # Create a separator character to identify where to split
  mutate(datalist = str_replace_all(datalist, "wiki/", "|wiki/")) %>% 
  mutate(datalist = str_remove(datalist, "^\\|"))

# Split so that each instance gets its own column
df1a <- 
  str_split(df1a$datalist, "\\|", simplify = TRUE) %>% 
  # as.data.frame() names the columns V1, V2, ...; as_tibble() replaces the deprecated as.tibble()
  as.data.frame(stringsAsFactors = FALSE) %>% 
  as_tibble() %>% 
  # Add a rownum column to keep track of where to put the pieces back together later
  mutate(rownum = 1:n()) %>% 
  # Gather the data frame into a tidy form to prepare for joining
  gather("instance", "text", -rownum, na.rm = TRUE) %>% 
  # Create a column for joining to the lookup table
  mutate(keyword = text %>% str_extract("wiki/[^ ]+") %>% str_remove("wiki/")) %>% 
  # Join the keywords efficiently using left_join
  left_join(df2, by = c("keyword" = "vocabword")) %>% 
  # Keep the original keyword when df2 has no entry for it (otherwise it would become "wiki/NA")
  mutate(token = coalesce(token, keyword)) %>% 
  # Put the results back into the text string
  mutate(text = str_replace(text, "wiki/[^ ]+", paste0("wiki/", token))) %>%
  select(-token, -keyword) %>% 
  # Spread the data back out to the original number of rows
  spread(instance, text) %>% 
  # Re-combine the sentences/strings into their original form
  unite("datalist", starts_with("V"), sep = "") %>%
  select("datalist")

Result:

# A tibble: 6 x 1
  datalist                                                                                                 
  <chr>                                                                                                    
1 wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individ~
2 strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Indiv~
3 the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively~
4 anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an-~
5 authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive su~
6 the first known use of this word was in 1539    
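
The same idea (pull out each "wiki/term", look it up by exact match, put it back) can also be sketched in base R without reshaping the data. This is only an illustration under assumptions: the three sentences and the short vocabword/token vectors are stand-ins for the real columns, and wiki/unknown_term is an invented entry used to show what happens to a term with no match. How it scales to 90 million rows has not been measured here.

```r
# Stand-ins for df1$datalist, df2$vocabword and df2$token
x <- c("anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
       "a term that is not in the lookup table: wiki/unknown_term",
       "the first known use of this word was in 1539")
vocabword <- c("privative", "privative_alpha", "1310_the_ticket")
token <- c("Privative", "Alpha_privative", "KTCK_(AM)")

# Locate every "wiki/<term>" occurrence, once per string
m <- gregexpr("wiki/\\S+", x, perl = TRUE)

# Swap each matched term for its token via an exact lookup; unknown terms stay as they are
regmatches(x, m) <- lapply(regmatches(x, m), function(terms) {
  idx <- match(sub("^wiki/", "", terms), vocabword)
  hit <- !is.na(idx)
  if (any(hit)) terms[hit] <- paste0("wiki/", token[idx[hit]])
  terms
})

print(x)
```

Because each term is matched as a whole, a tag that is a substring of another tag cannot be replaced by accident, and the cost depends on the number of wiki/ terms in the text rather than on one pass per vocabulary entry.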

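How is this different from the question you asked yesterday?

I need to replace the text. I thought that if I separated some of the text out I could figure it out, but I have kept working at it and gotten nowhere.

Can you add the code from your previous post plus what you have done since then? That way we know where you left off and what needs to happen next.

I used this: data$words = trimws(gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE)), and that's it. Like I said, I could do a for loop, but it is very slow. I haven't really made any progress since yesterday.

That's fine, but it is important to include that code in the body of the question. A loop will certainly be slower than a vectorized operation, so I think the previous post put you on the right track.

Is there a stringi variant? I have been running it on 1000 rows for a minute and it is still going.

@Kayla I noticed that "1965-66_nhl_season_by_team" never appears in your datalist. Did you add it intentionally as a non-match?

Yes, I added a non-match. If all of df2 is completely random, is there a way to do this? It has about 14 million rows too.

@Kayla Could the logic automatically exclude non-matches if they do not appear in df1$words?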
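
A rough sketch of what such a pre-filter could look like, assuming the words column really lists every term that occurs in the matching datalist row (the small vectors and the names present and df2_used are stand-ins for illustration):

```r
# Stand-ins for df1$words and df2
words <- c("anarchy  -ism", "privative  privative_alpha", "")
df2 <- data.frame(
  vocabword = c("anarchy", "-ism", "privative", "privative_alpha", "1310_the_ticket"),
  token = c("Anarchy", "-ism", "Privative", "Alpha_privative", "KTCK_(AM)"),
  stringsAsFactors = FALSE)

# Every distinct word that actually occurs in the words column
present <- unique(unlist(strsplit(words, "\\s+")))
present <- present[nzchar(present)]

# Keep only the dictionary rows that can match at all
df2_used <- df2[df2$vocabword %in% present, ]

print(df2_used)
```

Any of the replacement approaches can then be run with df2_used instead of df2, which shrinks the number of patterns to those that have a chance of matching.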
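
# Timing test for the regex approach: the sentences and vocabulary from the question,
# padded with random filler sentences and random vocabulary entries
library(stringi)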

ExtraSentenceCount <- 1e3
ExtraVocabCount <- 1e4

Sentences <- c("wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
               "strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
               "the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
               "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
               "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
               "the first known use of this word was in 1539",
               stringi::stri_rand_lipsum(ExtraSentenceCount))

vocabword <- c("anarchist_schools_of_thought", "individualism","collectivism" , "1965-66_nhl_season_by_team","social_anarchism","individualist_anarchism",                
           "anarchy","-ism","privative","privative_alpha", "1310_the_ticket",  "infinitive",
           "a",
           stringi::stri_rand_strings(ExtraVocabCount,
                                      length = sample.int(8, ExtraVocabCount, replace = TRUE),
                                      pattern = "[a-z]"))

token <- c("Anarchist_schools_of_thought" ,"Individualism", "Collectivism",  "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism" ,"Anarchy",
           "-ism", "Privative" ,"Alpha_privative", "KTCK_(AM)" ,"Infinitive",
           "XXXX",
           stringi::stri_rand_strings(ExtraVocabCount,
                                      length = 3,
                                      pattern = "[0-9]"))

system.time({
  Cleaned <- stringi::stri_replace_all_regex(Sentences, "\\b"%s+%vocabword%s+%"\\b", token, vectorize_all = FALSE)
})

#   user  system elapsed 
# 36.652   0.070  36.768 

head(Cleaned)

# [1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism 749 complete wiki/Collectivism"                
# [2] "strains 454 anarchism have often been divided into the categories 454 wiki/Social_anarchism and wiki/Individualist_anarchism 094 similar dual classifications"
# [3] "the word 412 composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek 190.546"                             
# [4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative 358- 190.546"                                            
# [5] "authority sovereignty realm magistracy and the suffix 094 -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                         
# [6] "the first known use 454 this word was 201 1539" 

# Fixed-string variant: prefix both vectors with "wiki/" so that no regex is needed
prefixed_vocabword <- paste0("wiki/", vocabword)
prefixed_token <- paste0("wiki/",token)

system.time({
  Cleaned <- stringi::stri_replace_all_fixed(Sentences, prefixed_vocabword, prefixed_token, vectorize_all = FALSE)
})