Clusters are not formed when text mining in R
I need to cluster texts (text mining of Russian-language text). The code is as follows:
mydat=read.csv("C:/Users/Admin/Downloads/kr_csv.csv", sep=";",dec=",")
View(mydat)
library("tm")
library("SnowballC")
library("textcat")
corpus=Corpus(VectorSource(mydat))
dtm=DocumentTermMatrix(corpus,
control=list(stemming=T, stopwords=F,
minWorldLenght=3,removeNumbers=T,
removePunctuation=T,
#stopwords=c(stopwords('SMART'))
weighting=function(x)
weightTf(x) ))
m<-as.matrix(dtm)
norm_eucl=function(m)
m/apply(m,1,function(x)sum(x^2)^.5)
m_norm=norm_eucl(m)
res=kmeans(m_norm,3,100)
I found that the error was in the preprocessing. Here is the working code:
mydat=read.csv("C:/Users/Admin/Downloads/kr_csv.csv", sep=";",dec=",")
tw.corpus <- Corpus(VectorSource(mydat$name))
tw.corpus <- tm_map(tw.corpus, stripWhitespace)
tw.corpus <- tm_map(tw.corpus, removePunctuation)
tw.corpus <- tm_map(tw.corpus, removeNumbers)
tw.corpus <- tm_map(tw.corpus, removeWords, stopwords("russian"))
tw.corpus = tm_map(tw.corpus, content_transformer(tolower))
tw.corpus = tm_map(tw.corpus, stemDocument)
doc.m <- DocumentTermMatrix(tw.corpus)
dtm_tfxidf<-weightTfIdf(doc.m)
m<-as.matrix(dtm_tfxidf)
rownames(m)<-1:nrow(m)
norm_eucl=function(m)
m/apply(m,1,function(x)sum(x^2)^.5)
m_norm=norm_eucl(m)
> dim(m_norm)
[1] 399 860
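The working code stops at the normalized matrix, so here is a minimal sketch of the clustering step that would follow. It assumes a small invented matrix in place of the real `m` (the toy values are not from the post) and reuses the `kmeans(m_norm, 3, 100)` call from the question. It also drops all-zero rows first: a text that becomes empty after stop-word removal has norm 0, and dividing by it yields NaN, which `kmeans()` rejects.

```r
# Toy stand-in for the TF-IDF matrix `m` (rows = documents, columns = terms).
# The values are invented for illustration only.
set.seed(42)
m <- matrix(c(1, 0, 0, 2,
              0, 3, 0, 0,
              0, 0, 0, 0,   # a document left empty after preprocessing
              2, 0, 1, 0,
              0, 1, 0, 4,
              0, 0, 5, 1,
              3, 3, 0, 0),
            ncol = 4, byrow = TRUE)
rownames(m) <- seq_len(nrow(m))

# Drop all-zero rows: their Euclidean norm is 0, so normalizing gives NaN,
# and kmeans() stops with an error on NA/NaN/Inf input.
keep <- rowSums(m^2) > 0
m <- m[keep, , drop = FALSE]

# Same row normalization as in the answer
norm_eucl <- function(m) m / apply(m, 1, function(x) sum(x^2)^.5)
m_norm <- norm_eucl(m)

# Same call as in the question: 3 centers, at most 100 iterations
res <- kmeans(m_norm, centers = 3, iter.max = 100)
table(res$cluster)
```

With the real data, `res$cluster` could then be attached back to the kept rows of `mydat` to see which texts ended up together.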
Can you post dim(m_norm) and head(m_norm, 20)?
@missue, I've edited it.
m_norm has 1 row and 4298 columns of strings. kmeans expects more rows than the specified number of clusters (hence the error). How do you expect to get multiple clusters?
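The point in the last comment can be sketched in base R. One likely reason for the single row, offered as an assumption rather than something confirmed in the thread: passing the whole data frame to `VectorSource()` makes each column one document, whereas `mydat$name` gives one document per text. The data frame and the 1 x 5 matrix below are invented; only the column name `name` comes from the post.

```r
# Made-up data frame standing in for mydat; only the column name `name`
# is taken from the post, the rows are invented.
mydat <- data.frame(name = c("text one", "text two", "text three", "text four"),
                    stringsAsFactors = FALSE)

length(mydat)        # 1: a data frame's length is its number of columns
length(mydat$name)   # 4: one element per text

# Hypothetical 1 x 5 matrix shaped like the m_norm described in the comment
m_one <- matrix(c(0.2, 0.4, 0.1, 0.8, 0.3), nrow = 1)
k <- 3

# kmeans() cannot pick 3 centers from a single row, so this call errors
msg <- tryCatch({
  kmeans(m_one, centers = k, iter.max = 100)
  "ok"
}, error = function(e) conditionMessage(e))
print(msg)

# Cheap guard before clustering: need more documents than clusters
n_docs <- length(mydat$name)
stopifnot(n_docs > k)
```

Checking `dim()` of the matrix right after `as.matrix()` catches this kind of problem before `kmeans()` is ever called.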