R: Cleaning a very long list of company names - applying a function to each row of a data.table
I have a data.table with company names and address information. I want to remove legal entities and the most common words from the company names. To do this, I wrote a function and applied it to the data.table:
library(data.table)
library(stringr)

search_for_default <- c("inc", "corp", "co", "llc", "se", "\\&", "holding", "professionals",
                        "services", "international", "consulting", "the", "for")

clean_strings <- function(string, search_for = search_for_default) {
  clean_step1 <- str_squish(str_replace_all(string, "[:punct:]", " ")) # replace punctuation with spaces
  clean_step2 <- unlist(str_split(tolower(clean_step1), " ")) # lower-case and split into tokens
  clean_step2 <- clean_step2[!str_detect(clean_step2, "^american|^canadian")] # drop geographical names
  res <- str_squish(str_c(clean_step2[!clean_step2 %in% search_for], collapse = " ")) # remove legal entities and common words
  res <- paste(unique(unlist(str_split(res, " "))), collapse = " ") # drop duplicate tokens and paste the string back together
  return(res)
}

datatable[, COMPANY_NAME_clean := clean_strings(COMPANY_NAME), by = COMPANY_NAME]
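For comparison, here is a minimal sketch of the same cleaning steps written so that they run on the whole column at once instead of once per group. The name clean_strings_vec is hypothetical, and the sketch swaps stringr for base R regular expressions so that it has no package dependencies. The POSIX class [[:punct:]] used here may treat a few symbols differently from stringr's [:punct:], so take it as an approximation of the function above rather than a drop-in replacement.

```r
# Hypothetical vectorised variant of clean_strings(); base R only.
search_for_default <- c("inc", "corp", "co", "llc", "se", "\\&", "holding", "professionals",
                        "services", "international", "consulting", "the", "for")

clean_strings_vec <- function(x, search_for = search_for_default) {
  # Lower-case and replace punctuation with spaces for the whole vector at once
  x <- gsub("[[:punct:]]", " ", tolower(x))
  # Split every name into tokens on runs of whitespace
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  vapply(tokens, function(tok) {
    tok <- tok[!grepl("^american|^canadian", tok)] # drop geographical names
    tok <- tok[!tok %in% search_for]               # drop legal entities and common words
    paste(unique(tok), collapse = " ")             # drop duplicate tokens and re-join
  }, character(1))
}

clean_strings_vec(c("Walmart Inc.", "American Test Company for Consulting"))
```

Assuming the same datatable object as above, this could be applied without by, for example datatable[, COMPANY_NAME_clean := clean_strings_vec(COMPANY_NAME)]. Because the dot counts as punctuation, "Amazon.com, Inc." comes out as "amazon com" in this sketch.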
Could you add some examples of what COMPANY_NAME looks like, so that we can run some tests?
I will add examples.
Shouldn't "Amazon.com, Inc." become "Amazon.com" rather than what you wrote?
Yes, "Amazon.com, Inc." becomes "Amazon.com".
Company_Name <- c("Walmart Inc.", "Amazon.com, Inc.", "Apple Inc.", "American Test Company for Consulting")
Company_name_clean <- c("walmart", "amazon.com", "apple", "test company")