R: DTM sparsity differs between tf and tf-idf for the same corpus

Can someone explain this?

My understanding is:

tf >= 0 (absolute frequency value)

tfidf >= 0 (for negative idf, tf=0)



sparse entry = 0

nonsparse entry > 0

So the ratio of sparse to non-sparse entries should be exactly the same in the two DTMs created with the following code:

library(tm)
data(crude)

dtm <- DocumentTermMatrix(crude, control=list(weighting=weightTf))
dtm2 <- DocumentTermMatrix(crude, control=list(weighting=weightTfIdf))
dtm
dtm2

Output for dtm2:

Non-/sparse entries: 2215/23105
Sparsity           : 91%
Maximal term length: 17
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

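The sparsity can differ. A TF-IDF value is zero if either the TF or the IDF is zero, and the IDF is zero for any term that occurs in every document. Consider the following example: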

txts <- c("super World", "Hello World", "Hello super top world")
library(tm)
tf <- TermDocumentMatrix(Corpus(VectorSource(txts)), control=list(weighting=weightTf))
tfidf <- TermDocumentMatrix(Corpus(VectorSource(txts)), control=list(weighting=weightTfIdf))

inspect(tf)
# <<TermDocumentMatrix (terms: 4, documents: 3)>>
# Non-/sparse entries: 8/4
# Sparsity           : 33%
# Maximal term length: 5
# Weighting          : term frequency (tf)
# 
#        Docs
# Terms   1 2 3
#   hello 0 1 1
#   super 1 0 1
#   top   0 0 1
#   world 1 1 1

inspect(tfidf)
# <<TermDocumentMatrix (terms: 4, documents: 3)>>
# Non-/sparse entries: 5/7
# Sparsity           : 58%
# Maximal term length: 5
# Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
# 
#        Docs
# Terms           1         2         3
#   hello 0.0000000 0.2924813 0.1462406
#   super 0.2924813 0.0000000 0.1462406
#   top   0.0000000 0.0000000 0.3962406
#   world 0.0000000 0.0000000 0.0000000
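
The weights in the tf-idf table are consistent with the following normalized form, written out here as a reading aid. The symbols are my own notation, not names from tm: n_{t,d} is the count of term t in document d, N is the number of documents, and df_t is the number of documents containing t.

```latex
\mathrm{tfidf}(t,d) \;=\; \frac{n_{t,d}}{\sum_{k} n_{k,d}} \cdot \log_2\!\frac{N}{\mathrm{df}_t}
% Example: "top" occurs once in document 3 (4 terms) and in 1 of 3 documents:
% \frac{1}{4}\cdot\log_2\frac{3}{1} \approx 0.3962406
% A term with \mathrm{df}_t = N gives \log_2 1 = 0, so its whole row is zero.
```

An entry that is nonzero under tf therefore becomes zero under tf-idf exactly when its term appears in all N documents.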
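
The term super occurs once in document 1, which has 2 terms, and it occurs in 2 of the 3 documents:

1/2 * log2(3/2)
# [1] 0.2924813

The term world occurs once in document 3, which has 4 terms, and it occurs in all 3 documents:

1/4 * log2(3/3) # 1/4 * 0
# [1] 0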
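
To see where the extra sparse entries come from, the weighting can be redone by hand in base R. This is a minimal sketch rather than tm's own code: the counts are typed in from the tf table in the example, and the variable names (doc_freq, tf_norm and so on) are hypothetical.

```r
# Term-by-document counts copied from the tf table (rows: terms, columns: documents)
tf <- matrix(c(0, 1, 1,
               1, 0, 1,
               0, 0, 1,
               1, 1, 1),
             nrow = 4, byrow = TRUE,
             dimnames = list(Terms = c("hello", "super", "top", "world"),
                             Docs  = c("1", "2", "3")))

n_docs   <- ncol(tf)
doc_freq <- rowSums(tf > 0)                  # documents containing each term
idf      <- log2(n_docs / doc_freq)          # 0 for a term found in every document
tf_norm  <- sweep(tf, 2, colSums(tf), "/")   # divide each column by its document length
tfidf    <- tf_norm * idf                    # idf is recycled row-wise, one value per term

sum(tf != 0)           # 8 non-sparse entries under tf
sum(tfidf != 0)        # 5 non-sparse entries under tf-idf
names(idf)[idf == 0]   # "world": the term whose entries became zero
```

The same check should carry over to a real DocumentTermMatrix such as the crude one, where documents are rows rather than columns: after m <- as.matrix(dtm), the terms with colSums(m > 0) == nrow(m) are the ones whose entries turn into zeros under weightTfIdf.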