Unsupervised clustering of factors and strings in R

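I have a dataset that contains factors and some free-form fields (data entered by users). I am trying to cluster this dataset using k-means and hierarchical clustering. The data looks like this: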

prod_name   prod_detail_1   company   model   product_problem_description
1           motor pump      1         123A    broken
1           calculator      2         WE4     battery not working
2           motor pump      3         564     display broken
3           refrigerator    3         12E     regular inspection
4           calculator      4         453E    regular inspection
5           motor pump      1         123A    pump will not turn on
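
For anyone who wants to reproduce this, the sample rows above can be rebuilt as a data frame roughly as follows. This is only a sketch: the column types are an assumption (the real dataset is not shown), and `factor_cols` is a helper name introduced here for illustration.

```r
# Sample rows copied from the table above. `data` is the name used by the code later on.
data <- data.frame(
  prod_name     = c(1, 1, 2, 3, 4, 5),
  prod_detail_1 = c("motor pump", "calculator", "motor pump",
                    "refrigerator", "calculator", "motor pump"),
  company       = c(1, 2, 3, 3, 4, 1),
  model         = c("123A", "WE4", "564", "12E", "453E", "123A"),
  product_problem_description = c(
    "broken", "battery not working", "display broken",
    "regular inspection", "regular inspection", "pump will not turn on"
  ),
  stringsAsFactors = FALSE
)

# Assumption: every column except the free-text description is a factor.
factor_cols <- c("prod_name", "prod_detail_1", "company", "model")
data[factor_cols] <- lapply(data[factor_cols], factor)

str(data)
```

Keeping the description column as character rather than a factor matches how it is used as free text.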
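
I used the "tm" package to clean the "product_problem_description" field and build a document-term matrix from it, treating each entry as one document. (The problem descriptions I wrote above are very simple, but in the real data they are more complex and cannot be treated as factors.)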

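library(tm)  # stemming = TRUE also needs the SnowballC package installed

# Treat each problem description as one document
prob_desc_corpus <- Corpus(VectorSource(data$product_problem_description))

# Despite the name, this keeps digits: it drops everything except letters, digits, and whitespace
removeNumPunct <- function(x) gsub("[^[:alnum:][:space:]]+", "", x)
prob_desc_corpus <- tm_map(prob_desc_corpus, content_transformer(removeNumPunct))
# stripWhitespace is a tm transformation, not a DocumentTermMatrix control option
prob_desc_corpus <- tm_map(prob_desc_corpus, stripWhitespace)

prob_desc_matrix <- DocumentTermMatrix(prob_desc_corpus,
                                   control = list(
                                     wordLengths = c(2, Inf),  # current name for the old minWordLength option
                                     stopwords = TRUE,
                                     stemming = TRUE))

# Combine the factor columns with the term counts (the column is prod_detail_1, not detail_1)
full_data <- cbind(data$prod_name, data$prod_detail_1, data$company, data$model,
                   as.data.frame(as.matrix(prob_desc_matrix)), stringsAsFactors = FALSE)
distances <- dist(full_data, method = "euclidean")
hc <- hclust(distances, method = "ward.D")
plot(hc)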
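
Once the dendrogram is drawn, cluster labels can be read off the tree with `cutree()`. The sketch below uses a small made-up count matrix (`toy`, with hypothetical term columns) in place of `full_data`, so it only illustrates the mechanics and says nothing about the real data. The choice of `k = 3` is likewise an assumption.

```r
# Hypothetical term counts standing in for full_data: 6 documents, 3 terms.
toy <- rbind(
  c(1, 0, 0),
  c(0, 1, 0),
  c(1, 0, 0),
  c(0, 0, 1),
  c(0, 0, 1),
  c(0, 1, 0)
)
dimnames(toy) <- list(paste0("doc", 1:6), c("broken", "batteri", "inspect"))

# Same distance and linkage settings as in the code above.
hc_toy <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Cut the tree into k groups; the result is one integer label per document.
clusters <- cutree(hc_toy, k = 3)
table(clusters)
```

Documents with identical rows sit at distance zero, so they end up with the same label.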