Partial string matching & replacement in R (R, String Matching, Data Cleaning)


I have a data frame like this:

> myDataFrame
           company
1   Investment LLC
2    Hyperloop LLC
3 Invezzstment LLC
4   Investment_LLC
5   Haiperloop LLC
6   Inwestment LLC

I need to match all of these fuzzy strings, so the final result should look like this:

> myDataFrame
           company
1   Investment LLC
2    Hyperloop LLC
3   Investment LLC
4   Investment LLC
5    Hyperloop LLC
6   Investment LLC

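So, in effect, I have to solve a partial-matching-and-replacement task for a categorical variable. Base R and various packages have plenty of good functions for string matching, but I have been looking for a single solution that does both the matching and the replacement. I don't care which spelling replaces the others; for example, "Investment LLC" and "Invezzstment LLC" are equally good. They just need to be consistent.

Is there a single all-in-one function or loop for this?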

If you have a vector of correct spellings, agrep makes this fairly easy:

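# For each company name, look it up (approximately) in the vector of
# correct spellings and keep the spelling that matches
myDataFrame$company <- sapply(myDataFrame$company, 
                              function(val){agrep(val, 
                                                  c('Investment LLC', 'Hyperloop LLC'), 
                                                  value = TRUE)})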

myDataFrame
#          company
# 1 Investment LLC
# 2  Hyperloop LLC
# 3 Investment LLC
# 4 Investment LLC
# 5  Hyperloop LLC
# 6 Investment LLC
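
Note that agrep can return zero or several matches for a single value, in which case sapply yields a list rather than a character vector. A minimal sketch of one alternative, assuming the same vector of correct spellings is available: compute edit distances with adist (from the utils package that ships with R) and take the closest correct spelling for each row. The names company and correct are illustrative, not from the original post.

```r
# Sample data from the question, as a plain character vector
company <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
             "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")

# Assumed: a known vector of correct spellings
correct <- c("Investment LLC", "Hyperloop LLC")

# Edit-distance matrix: one row per observed name, one column per correct name
d <- adist(company, correct)

# For every row, pick the correct spelling with the smallest distance
cleaned <- correct[apply(d, 1, which.min)]
cleaned
```

Because which.min always returns exactly one index, every row gets exactly one replacement, even for values that are far from every correct spelling; a distance cutoff would be needed to leave such values untouched.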

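So, after a while, I ended up with this clumsy code. Note: it does not fully automate the replacement process, because every match has to be verified by hand, and agrep's max.distance argument needs fine-tuning each time. I am sure there is a better and faster way, but this helps get the job done.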
    ##########
    # Manual renaming with partial matches
    ##########

    # a) Take a look at the desired column of factor variables
    sort(unique(MYDATA$names))   # take a look

    # ****
    Sensthreshold <- 0.2   # sensitivity of agrep, usually 0.1-0.2 get it right
    Searchstring <- "Invesstment LLC"   # what should I search?
    # ****

    # User-defined function: returns similar string on query in column
    Searcher <- function(input, similarity = 0.1) {
      unique(agrep(input, 
                   MYDATA$names,   # <-- define your column here
                   ignore.case = TRUE, value = TRUE,
                   max.distance = similarity))
    }

    # b) Make a search of desired string
    Searcher(Searchstring, Sensthreshold)   # using user-def function 
    ### PLEASE INSPECT THE OUTPUT OF THE SEARCH
    ### Did it get it right?

    # =============================================================================#
    ## ACTION! This changes your dataframe!
    ## Please make backup before proceeding
    ## Please execute this code as a whole to avoid errors

    # c) Make a vector of cells indexes after checking output
    vector_of_cells <- agrep(Searchstring, 
                       MYDATA$names, ignore.case = TRUE,
                       max.distance = Sensthreshold)
    # d) Apply the changes
    MYDATA$names[vector_of_cells] <- Searchstring # <--- CHANGING STRING
    # e) Check result
    unique(agrep(Searchstring, MYDATA$names, 
                 ignore.case = TRUE, value = TRUE, max.distance = Sensthreshold))
    # =============================================================================#
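
One caveat about step d) when the column really is a factor, as the comment in step a) suggests: assigning a string that is not already one of the factor's levels produces NA with a warning. A minimal sketch of a workaround, assuming a toy MYDATA made up for illustration, is to do the replacement on a character copy and convert back afterwards.

```r
# Toy data (made up for illustration); the column is deliberately a factor
MYDATA <- data.frame(names = factor(c("Investment LLC", "Invezzstment LLC",
                                      "Hyperloop LLC")))
Searchstring  <- "Invesstment LLC"
Sensthreshold <- 0.2

# Work on a character column so that a new spelling is not turned into NA
MYDATA$names <- as.character(MYDATA$names)

# Same search as in step c): indexes of the cells that approximately match
hits <- agrep(Searchstring, MYDATA$names, ignore.case = TRUE,
              max.distance = Sensthreshold)

# Same replacement as in step d)
MYDATA$names[hits] <- Searchstring

# Convert back to a factor once the cleaning is done
MYDATA$names <- factor(MYDATA$names)
```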

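Can you describe what you have done so far? For example, why doesn't base::agrep work for you?

Dear @Calimo, base::agrep works very well for finding similar strings, but I could not get it to replace the strings row by row. I tried some for and while loops, without success. The algorithm should be: 1) R finds a string in the vector; 2) compares it with the other strings; 3) every string similar to it (given some distance measure) must be replaced with that string.

Please post the code you already have so we can take it from there. By the way, I gather from your comment on the answer that picking the misspelled Invezzstment LLC would be fine?

@Calimo, I deleted that code, and (unfortunately) it was not saved in my git when I committed. In any case it was of little use, because it did not work. I remember using sapply with a partial matching function (agrep, I think). "Invezzstment LLC" is perfectly fine. In fact, Invezzstment LLC and Investment LLC are the same thing; I need R to take either one of them and replace all the other occurrences, so that I end up with a clean categorical variable for this category.

When you have more than 50,000 records and 1,200 unique values, hunting down the misspelled values is tedious work.

Thanks for your reply, alistaire! I don't have a vector of correctly spelled entities. I tried your suggestion of the adist function, but R cannot compute the string distance matrix: with n = 59396 records, this large matrix object exceeds 26.3 Gb. table is a good idea, I'll try it.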
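
The three steps described in the comments above can be sketched as a greedy grouping. This is only an illustration under stated assumptions, not a tested solution for the real data: the helper name unify_spellings and the cutoff max_dist = 3 are hypothetical, and the cutoff would need tuning and manual checking. Running adist on the unique values only (about 1,200 here) instead of on all 59,396 records keeps the distance matrix small, and table supplies the frequencies so that more common spellings are tried first as group representatives.

```r
# Hypothetical helper: greedily groups similar spellings and maps every
# member of a group to the spelling that seeded the group.
unify_spellings <- function(x, max_dist = 3) {
  x <- as.character(x)                       # also handles factor input
  counts <- sort(table(x), decreasing = TRUE)
  vals <- names(counts)                      # unique values, most frequent first
  d <- adist(vals)                           # distances between unique values only
  canon <- setNames(rep(NA_character_, length(vals)), vals)
  for (i in seq_along(vals)) {
    if (!is.na(canon[i])) next               # already assigned to an earlier group
    # 1) take a string, 2) compare it with the others,
    # 3) map every still-unassigned similar string to it
    close <- which(d[i, ] <= max_dist & is.na(canon))
    canon[close] <- vals[i]
  }
  unname(canon[x])                           # look up each record's representative
}

company <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
             "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")
cleaned <- unify_spellings(company)
table(cleaned)
```

Which spelling ends up representing a group depends on frequency and tie order, which matches the stated requirement that any spelling will do as long as it is used consistently. A cutoff that is too generous will merge genuinely different companies, so the result of table(cleaned) still deserves a manual look.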