Document classification in Python spark-mllib

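I want to classify documents about sports, entertainment, and politics. I created a bag of words that outputs something like this:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the Naive Bayes algorithm for classification using Spark MLlib. My question is how to convert this output into something NaiveBayes can use as classification input, such as an RDD. Alternatively, is there a trick for converting the HTML files directly into something MLlib's NaiveBayes can use?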

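For text classification, you need:

a word dictionary
to convert each doc to a vector using the dictionary
to label the doc vectors:
doc_vec1 -> label1
doc_vec2 -> label2

This is pretty simple.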
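
The three steps above can be sketched in plain Python before moving to Spark. This is only an illustration, not the answerer's code: the sample documents, the labels, and the helper names `build_dictionary` and `to_count_vector` are made up for the example. In MLlib, each (label, vector) pair would typically be wrapped in a `LabeledPoint` inside an RDD and passed to `NaiveBayes.train`; that wiring is an assumption about your setup and is not shown here.

```python
from collections import Counter

# Hypothetical tokenized documents with their class names.
docs = [
    (["cricket", "match", "cricket", "score"], "sports"),
    (["film", "actor", "music"], "entertainment"),
    (["election", "vote", "minister", "vote"], "politics"),
]


def build_dictionary(tokenized_docs):
    """Map every distinct word to an integer id (sorted for stable ids)."""
    vocab = sorted({word for tokens in tokenized_docs for word in tokens})
    return {word: idx for idx, word in enumerate(vocab)}


def to_count_vector(tokens, dictionary):
    """Dense vector: position = word id, value = count in this document."""
    vector = [0.0] * len(dictionary)
    for word, count in Counter(tokens).items():
        if word in dictionary:  # ignore words missing from the dictionary
            vector[dictionary[word]] = float(count)
    return vector


# Step 1: the word dictionary.
dictionary = build_dictionary(tokens for tokens, _ in docs)

# Numeric label ids, since classifiers expect numbers rather than strings.
label_names = sorted({label for _, label in docs})
label_ids = {name: float(i) for i, name in enumerate(label_names)}

# Steps 2 and 3: convert each doc to a vector and attach its label.
labeled_vectors = [
    (label_ids[label], to_count_vector(tokens, dictionary))
    for tokens, label in docs
]

for label, vector in labeled_vectors:
    print(label, vector)
```

Sorting the vocabulary is a design choice that keeps word ids reproducible between runs; any stable word-to-id mapping would work equally well.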

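Thanks for your answer. But I am a bit confused by the second part, converting the docs to vectors using the dictionary. Say I have the labels (music, politics, entertainment) and the top 50 words in each document. How do I map them to build the classifier? In the example, they already have a text file that can be used as input to Naive Bayes.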

The doc-vec is just a long array, of say 50 elements. The index into the array is the id of the word in the dict, and the value of the element is the count of the corresponding word. I suggest you normalize the counts.
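
To make the normalization suggestion concrete, here is a small sketch that divides each count by the document's total, so that long and short documents become comparable. The function name `normalize_counts` and the sample vector are hypothetical, and L1 normalization is only one possible choice; the comment does not say which scheme was meant.

```python
def normalize_counts(counts):
    """Scale a count vector so that its entries sum to 1 (L1 normalization)."""
    total = sum(counts)
    if total == 0:
        return list(counts)  # empty document: leave the zero vector as is
    return [c / total for c in counts]


# Hypothetical doc-vec: index = word id in the dictionary, value = word count.
doc_vec = [0, 2, 0, 0, 1, 0, 0, 1, 0]
normalized = normalize_counts(doc_vec)
print(normalized)
```

The normalized values stay non-negative, which multinomial Naive Bayes requires of its input features.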

There are some guidelines on how to write a good answer. The answer provided may be correct, but it could benefit from an explanation. Code-only answers are not considered "good" answers.

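    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import (CountVectorizer, RegexTokenizer,
                                    StopWordsRemover, StringIndexer)

    # Assumption: `data` is a DataFrame with a text column "Descript"
    # and a string class column "Category".

    # regular expression tokenizer: split on non-word characters
    regexTokenizer = RegexTokenizer(inputCol="Descript", outputCol="words",
                                    pattern="\\W")
    # stop words
    add_stopwords = ["http", "https", "amp", "rt", "t", "c", "the"]
    stopwordsRemover = StopWordsRemover(
        inputCol="words", outputCol="filtered").setStopWords(add_stopwords)
    # bag of words count
    countVectors = CountVectorizer(inputCol="filtered", outputCol="features",
                                   vocabSize=10000, minDF=5)
    # encode the string category as the numeric "label" column NaiveBayes expects
    labelIndexer = StringIndexer(inputCol="Category", outputCol="label")

    # run all feature stages to produce the "features" and "label" columns
    pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover,
                                countVectors, labelIndexer])
    dataset = pipeline.fit(data).transform(data)

    (trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
    nb = NaiveBayes(smoothing=1)
    model = nb.fit(trainingData)
    predictions = model.transform(testData)
    # show the rows predicted as class 0
    predictions.filter(predictions["prediction"] == 0) \
        .select("Descript", "Category", "probability", "label", "prediction") \
        .orderBy("probability", ascending=False) \
        .show(n=10, truncate=30)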
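
To show what the `smoothing=1` argument controls, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing on count vectors. It is an illustration under stated assumptions, not Spark's implementation: the training vectors, the labels, and the function names `train_multinomial_nb` and `predict` are all hypothetical.

```python
import math


def train_multinomial_nb(vectors, labels, smoothing=1.0):
    """Return log priors and per-class log word likelihoods."""
    n_features = len(vectors[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(vectors))
        # Total count of each word over all documents of class c.
        word_totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(word_totals) + smoothing * n_features
        # Smoothing keeps words unseen in a class from getting zero probability.
        log_likelihood[c] = [
            math.log((total + smoothing) / denom) for total in word_totals
        ]
    return log_prior, log_likelihood


def predict(vector, log_prior, log_likelihood):
    """Pick the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(vector, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical count vectors over a 4-word dictionary; 0 = sports, 1 = politics.
train_vectors = [[2, 1, 0, 0], [3, 0, 1, 0], [0, 0, 2, 1], [0, 1, 1, 3]]
train_labels = [0, 0, 1, 1]

log_prior, log_likelihood = train_multinomial_nb(train_vectors, train_labels)
prediction = predict([1, 1, 0, 0], log_prior, log_likelihood)
print(prediction)
```

Without smoothing, the first word never occurs in class 1, so its likelihood there would be zero and the logarithm would be undefined. Adding 1 to every word count avoids that.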