R: Inverse document frequency (tf-idf) weighted similarity between strings [r, similarity, quanteda]

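I want to find the similarity between two strings, weighted by the inverse document frequency of each token (word). These frequencies are not taken from the strings themselves.

Using quanteda I can create a dfm_tfidf with the inverse frequency weights, but I do not know how to proceed from there.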

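Sample data:

ss = c(
  "ibm madrid limited research",
  "madrid limited research",
  "limited research",
  "research"
)
counts = list(ibm = 1, madrid = 2, limited = 3, research = 4)
cor = corpus(long_list_of_strings)  ## the documents from which we extract the words
df = dfm(cor, tolower = T, verbose = T)
dfi = dfm_tfidf(df)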

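The goal is to find a function similarity that would work like this:

res = similarity(dfi, "ibm limited", similarity_scheme = "simple matching")

where res has the form (example with random numbers):

Ideally, the function applied to these frequencies would be:

sim = sum(Wc) / sqrt(sum(Wi) * sum(Wj))

where Wc are the weights of the words common to both strings, and Wi and Wj are the weights of the words in string1 and string2 respectively.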
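
The formula above can be sketched in base R as follows. This is only an illustration under stated assumptions: the weights vector holds made-up per-word weights standing in for IDF values taken from an external corpus, the helper names tokenize() and weighted_similarity() are hypothetical, words without a known weight are given weight 0, and a word repeated within a string is counted once.

```r
# Hypothetical per-word weights (stand-ins for IDF values from an external corpus).
weights <- c(ibm = 4, madrid = 3, limited = 2, research = 1)

# Split a string into its unique lower-case words.
tokenize <- function(s) unique(strsplit(tolower(s), "\\s+")[[1]])

# sim = sum(Wc) / sqrt(sum(Wi) * sum(Wj))
weighted_similarity <- function(a, b, weights) {
  ta <- tokenize(a)
  tb <- tokenize(b)
  total_weight <- function(tokens) {
    w <- weights[tokens]
    w[is.na(w)] <- 0          # assumption: unknown words weigh nothing
    sum(w)
  }
  denom <- sqrt(total_weight(ta) * total_weight(tb))
  if (denom == 0) return(0)   # avoid dividing by zero when no word has a weight
  total_weight(intersect(ta, tb)) / denom
}

ss <- c(
  "ibm madrid limited research",
  "madrid limited research",
  "limited research",
  "research"
)

# One score per sample string, each compared against the target "ibm limited".
res <- sapply(ss, weighted_similarity, b = "ibm limited", weights = weights)
print(round(res, 3))
```

Because the weights are passed in as an argument, they can come from any source, which matches the requirement that the frequencies are not derived from the strings being compared.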
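
I had problems with the quanteda and qdap packages, so I wrote my own code to get a data frame of the individual words and their frequency counts. The code can certainly be improved, but I think it shows how to do this.
library(tm)  # provides removePunctuation() and stripWhitespace()
library(RecordLinkage)
library(stringr)
library(dplyr)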

searchstring = c(
  "ibm madrid limited research", 
  "madrid limited research", 
  "limited research",
  "research"
)

cleanInput <- function(x) {
  x <- tolower(x)
  x <- removePunctuation(x)  # from the tm package
  x <- stripWhitespace(x)    # from the tm package
  x <- gsub("-", "", x)
  x <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
  x <- gsub("[[:digit:]]+", "", x)
  x  # return the cleaned strings explicitly
}

searchstring <- cleanInput(searchstring)
splitted <- str_split(searchstring, " ", simplify = TRUE)
df <- as.data.frame(as.vector(splitted))
df <- df[df$`as.vector(splitted)` != "", , drop = FALSE]
colnames(df)[1] <- "string"
result <- df %>%
  group_by(string) %>%
  summarise(n = n())
result$string <- as.character(result$string)

I hope this is what you were looking for :-)
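
Here is a tidy solution to your problem.
I use tidytext for the NLP work and widyr to compute the cosine similarity between documents.
Note that I converted the original ss vector into a tidy data frame with an ID column. You can put whatever you like in that column, but it is what will be used at the end to report the similarities.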
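library(dplyr)   # %>% and count()
library(tibble)  # as_tibble() and rowid_to_column()
library(tidytext)
library(widyr)

# turn your original vector into a tibble with an ID column
# (as.tibble() is deprecated; as_tibble() names the column "value")
ss <- c(
  "ibm madrid limited research", 
  "madrid limited research", 
  "limited research",
  "research",
  "ee"
) %>% as_tibble() %>% 
  rowid_to_column("ID")


# create df of words & counts (tf-idf needs this)
ss_words <- ss %>% 
  unnest_tokens(words, value) %>% 
  count(ID, words, sort = TRUE)

# create tf-idf weights for your data
# NOTE: bind_tf_idf() takes its arguments as (term, document, n). The call
# below passes ID as the term and words as the document, and the output
# reported for this answer comes from that order. To weight words within
# each string instead, use bind_tf_idf(words, ID, n).
ss_tfidf <- ss_words %>% 
  bind_tf_idf(ID, words, n)

# return the pairwise document similarities
ss_tfidf %>% 
  pairwise_similarity(ID, words, tf_idf, sort = TRUE)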

The output of the above will be:

## A tibble: 12 x 3
#   item1 item2 similarity
#   <int> <int>      <dbl>
# 1     3     2      0.640
# 2     2     3      0.640
# 3     4     3      0.6  
# 4     3     4      0.6  
# 5     2     1      0.545
# 6     1     2      0.545
# 7     4     2      0.384
# 8     2     4      0.384
# 9     3     1      0.349
#10     1     3      0.349
#11     4     1      0.210
#12     1     4      0.210


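where item1 and item2 refer to the ID column we created earlier.
This answer comes with some odd caveats. For example, note that I added the ee token to the ss vector: pairwise_similarity fails when a document contains only a single token. Strange behavior, but hopefully this gets you started.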
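
To make clear what a pairwise cosine similarity over tf-idf vectors computes, here is a minimal base R sketch. It is not the widyr implementation: it assumes each word is a term and each string is a document, uses a natural-log IDF, and splits on single spaces. Its numbers are not meant to reproduce the output above, which depends on the argument order passed to bind_tf_idf().

```r
ss <- c(
  "ibm madrid limited research",
  "madrid limited research",
  "limited research",
  "research",
  "ee"
)

tokens <- strsplit(ss, " ", fixed = TRUE)
vocab <- unique(unlist(tokens))

# Term frequency: count of each word divided by the number of words in the string.
tf <- t(sapply(tokens, function(tk) {
  tabulate(match(tk, vocab), nbins = length(vocab)) / length(tk)
}))
colnames(tf) <- vocab

# Inverse document frequency: ln(number of strings / strings containing the word).
idf <- log(length(ss) / colSums(tf > 0))

# Weight every column (word) of tf by its idf.
tfidf <- sweep(tf, 2, idf, "*")

# Cosine similarity between every pair of rows (strings).
norms <- sqrt(rowSums(tfidf^2))
cosine <- tcrossprod(tfidf) / (norms %o% norms)
print(round(cosine, 3))
```

Row and column i of cosine correspond to the i-th string in ss, so cosine[2, 3] compares "madrid limited research" with "limited research".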

You want the textstat_simil() function from quanteda. You should add the target document to the corpus and then use the selection argument to focus on that document. "simple matching" is implemented as a similarity method, but be aware that it only looks at whether terms are present or absent, so the tf-idf weights will not affect the result.
library("quanteda")
## Package version: 1.4.3
## 

ssdfm %>%
  textstat_simil(method = "simple matching", selection = "text1") %>%
  as.matrix()
##       text1
## text1  1.00
## text2  0.50
## text3  0.25
## text4  0.50
## text5  0.25
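
The point that simple matching ignores the tf-idf weights can be checked with a small base R sketch. It does not call quanteda. It assumes that text1 is the target string "ibm limited" placed ahead of the four sample strings (the code that builds the dfm is not shown in the answer), and the weights used at the end are made up for illustration.

```r
docs <- c(
  text1 = "ibm limited",                  # assumed target document
  text2 = "ibm madrid limited research",
  text3 = "madrid limited research",
  text4 = "limited research",
  text5 = "research"
)

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- unique(unlist(tokens))

# Binary document-feature matrix: TRUE where a word occurs in a document.
present <- t(sapply(tokens, function(tk) vocab %in% tk))
colnames(present) <- vocab

# Simple matching coefficient against text1: the share of vocabulary words
# on which two documents agree (both contain the word, or both lack it).
smc <- apply(present, 1, function(row) mean(row == present["text1", ]))
print(smc)

# Multiplying by positive weights (hypothetical values here) leaves the
# presence/absence pattern, and therefore the coefficient, unchanged.
weighted <- sweep(present * 1, 2, c(1.6, 0.5, 0.9, 0.2), "*")
print(all((weighted > 0) == present))
```

Under these assumptions the coefficients agree with the matrix printed by the answer, which is consistent with the method looking only at presence and absence.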

Take a look at the "tidytext" package. I am not sure, but maybe you can find a solution there.