Identifying synonyms for a left join in R


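I have several fairly large data tables containing character strings, which I would like to join with the entries in my database. The spelling is often not quite right, so the join fails. I know there is no way around creating a synonym table to replace some of the misspelled strings. But is there a way to detect certain anomalies automatically (see the example below)?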

My data table looks similar to this:

library(data.table)

data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))

The character strings in my database look similar to this:

characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))

Currently, if I perform a left join, only "Apple" is matched:

library(dplyr)

data <- data %>%
  left_join(characters.database, by = c('products'))

Result:

products       ID
potatoe Chips  NA
potato Chips   NA
potato chips   NA
Potato-chips   NA
apple          NA
Apple          3
Appl           NA
Apple Gala     NA

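If I were you, I would do a few things:

I would strip all special characters, lowercase everything, remove spaces, and so on. That would help a lot: "potato Chips", "potato chips", and "Potato-chips" would all become "potatochips", and then you can join.

There is a package called fuzzyjoin that lets you join on regular expressions, by edit distance, and so on. That would help with "Apple" vs "Apple Gala", misspellings, etc.

You can strip special characters, keep only letters, and lowercase the result with something like: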

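library(stringr)
library(magrittr)

# `string` stands for any character vector, e.g. data$products
string %>% str_remove_all("[^A-Za-z]+") %>% tolower()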
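The normalize-then-join idea above can be sketched in base R on the example data from the question. This is only an illustration: the helper name `normalize_key` and the `key` column are made up for this sketch, and `match()` is used here as a simple stand-in for a left join on the normalized key.

```r
# Example data from the question, as plain data frames.
data <- data.frame(
  products = c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips",
               "apple", "Apple", "Appl", "Apple Gala"),
  stringsAsFactors = FALSE
)
characters.database <- data.frame(
  products = c("Potato Chips", "Potato Chips Paprika", "Apple"),
  ID = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop everything except letters, then lowercase.
normalize_key <- function(x) tolower(gsub("[^A-Za-z]+", "", x))

data$key <- normalize_key(data$products)
characters.database$key <- normalize_key(characters.database$products)

# Look up the ID for each row of `data` by normalized key;
# rows without a matching key get NA, as in a left join.
data$ID <- characters.database$ID[match(data$key, characters.database$key)]

print(data[, c("products", "ID")])
```

With this normalization, the case and punctuation variants join, while the misspellings "potatoe Chips" and "Appl" and the longer "Apple Gala" still come back as NA, which is where fuzzy matching comes in.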

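Thanks for the suggestion, Matt Kaye. I have now done something similar. Since I need the correct spelling from the database, and some of my strings contain relevant symbols and numbers, I did the following: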

library(data.table)
library(dplyr)

#data
data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))
characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))

#remove spaces and capital letters in data
data <- data %>%
  mutate(products= tolower(products)) %>%
  mutate(products= gsub(" ", "", products))

#add ID to database
characters.database <- characters.database %>%
  dplyr::mutate(ID = row_number())

#remove spaces and capital letters in database product names
characters.database_syn <- characters.database %>%
  mutate(products= tolower(products)) %>%
  mutate(products= gsub(" ", "", products))

#join and add correct spelling from database
data <- data %>%
  left_join(characters.database_syn, by = c('products')) %>%
  select(product_syn=products, 'ID') %>%
  left_join(characters.database, by = c('ID'))

#other synonyms have to be corrected manually or with the help of a synonym table (in MY data, special characters are relevant!)
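For the entries that still fail to join, edit distance can propose candidates for the synonym table instead of leaving everything to manual inspection. The following is a minimal sketch that uses base R's `adist()` (Levenshtein distance) rather than the fuzzyjoin package mentioned earlier, so that it runs without extra dependencies. The threshold `max_dist`, the helper `normalize_key`, and the `suggestions` table are assumptions made for this illustration. The normalization here drops all non-letters, so it would need adjusting if symbols and digits matter in your data.

```r
# Example data from the question.
data_products <- c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips",
                   "apple", "Apple", "Appl", "Apple Gala")
db <- data.frame(
  products = c("Potato Chips", "Potato Chips Paprika", "Apple"),
  ID = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep letters only and lowercase.
normalize_key <- function(x) tolower(gsub("[^A-Za-z]+", "", x))

# Rows = data entries, columns = database entries,
# cells = Levenshtein distance between the normalized strings.
dist_matrix <- adist(normalize_key(data_products), normalize_key(db$products))

# For each data entry, the closest database entry and its distance.
nearest <- apply(dist_matrix, 1, which.min)
nearest_dist <- dist_matrix[cbind(seq_along(nearest), nearest)]

# Assumed threshold: accept a match only within 2 edits.
max_dist <- 2

suggestions <- data.frame(
  products   = data_products,
  suggestion = db$products[nearest],
  distance   = nearest_dist,
  ID         = ifelse(nearest_dist <= max_dist, db$ID[nearest], NA_character_),
  stringsAsFactors = FALSE
)

print(suggestions)
```

Rows with a small nonzero distance (here "potatoe Chips" and "Appl") are likely misspellings, while rows above the threshold (here "Apple Gala") keep an NA ID and can be reviewed by hand or added to a synonym table.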

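You could join by column number instead. This link may help. I suggest creating a data.frame of the column names so that you can easily look up the column index numbers.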