R: Cleaning a very long list of company names by applying a function to each row of a data.table (tags: r, stringr, stringi)


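I have a data.table with company names and address information. I want to remove legal entity designators and the most common words from the company names. To do that, I wrote a function and applied it to the data.table: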

library(data.table)
library(stringr)

# Legal-entity suffixes and common words to drop from company names
search_for_default <- c("inc", "corp", "co", "llc", "se", "\\&", "holding", "professionals",
                        "services", "international", "consulting", "the", "for")

clean_strings <- function(string, search_for = search_for_default) {
  clean_step1 <- str_squish(str_replace_all(string, "[:punct:]", " "))  # replace punctuation with spaces
  clean_step2 <- unlist(str_split(tolower(clean_step1), " "))  # split into lowercase tokens
  clean_step2 <- clean_step2[!str_detect(clean_step2, "^american|^canadian")]  # drop geographical tokens
  res <- str_squish(str_c(clean_step2[!clean_step2 %in% search_for], collapse = " "))  # remove legal entities and common words
  res <- paste(unique(unlist(str_split(res, " "))), collapse = " ")  # drop duplicate tokens and paste the string back together
  return(res)
}

# by = COMPANY_NAME calls the function once per distinct company name
datatable[, COMPANY_NAME_clean := clean_strings(COMPANY_NAME), by = COMPANY_NAME]
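
A note on why the call above is grouped with by: the function unlists the tokens and collapses them into a single string, so it appears to be written for one name at a time. The sketch below uses a simplified stand-in (collapse_tokens is a hypothetical helper, not part of the question) to show what happens when a whole vector of names is passed in at once.

```r
library(stringr)

# Hypothetical, simplified stand-in for the tokenising and pasting steps
# of clean_strings(); it only illustrates the unlist/collapse behaviour.
collapse_tokens <- function(string) {
  tokens <- unlist(str_split(tolower(string), " "))  # tokens of ALL inputs end up in one vector
  paste(unique(tokens), collapse = " ")              # and are pasted into ONE string
}

# Two names go in, a single merged string comes out
merged <- collapse_tokens(c("Apple Inc", "Walmart Inc"))
print(length(merged))
print(merged)
```

Without grouping, every row would therefore receive the same merged string, which is presumably why the function is applied per COMPANY_NAME.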

Could you add some examples of what COMPANY_NAME looks like, so that we can run some tests?
I will add examples.
Shouldn't "Amazon.com, Inc." become "Amazon.com", and not "Amazon.com" as you wrote?
Yes, "Amazon.com, Inc." becomes "Amazon.com".
Company_Name <- c("Walmart Inc.", "Amazon.com, Inc.", "Apple Inc.", "American Test Company for Consulting")
Company_name_clean <- c("walmart", "amazon.com", "apple", "test company")
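
To make the examples testable, here is a minimal sketch of a variant that works on a whole column in one call instead of one group at a time. It is an assumption about what a vectorised version could look like, not the asker's method: clean_strings_vec, stop_words and dt are hypothetical names, the word list is copied from the question without the "\\&" entry (ampersands are already removed with the other punctuation), and the punctuation class is written as "[[:punct:]]".

```r
library(data.table)
library(stringr)

# Word list from the question; "\\&" is left out because punctuation
# is replaced by spaces before the tokens are compared.
stop_words <- c("inc", "corp", "co", "llc", "se", "holding", "professionals",
                "services", "international", "consulting", "the", "for")

# Hypothetical vectorised variant: returns one cleaned string per input element.
clean_strings_vec <- function(x, search_for = stop_words) {
  x <- str_squish(str_replace_all(tolower(x), "[[:punct:]]", " "))  # punctuation -> space
  tokens <- str_split(x, " ")                                       # list: one token vector per name
  vapply(tokens, function(tok) {
    tok <- tok[!str_detect(tok, "^american|^canadian")]  # drop geographical tokens
    tok <- tok[!tok %in% search_for]                      # drop legal entities and common words
    paste(unique(tok), collapse = " ")                    # drop duplicates, rebuild the name
  }, character(1))
}

dt <- data.table(COMPANY_NAME = c("Walmart Inc.", "Amazon.com, Inc.", "Apple Inc.",
                                  "American Test Company for Consulting"))
dt[, COMPANY_NAME_clean := clean_strings_vec(COMPANY_NAME)]  # no by = needed here
print(dt)
```

Because punctuation is replaced before tokenising, this sketch yields "amazon com" rather than the "amazon.com" listed in the expected values; keeping the dot would need a different punctuation rule.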