Text content analyzer in Python

python statistics

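I have written a text content analyzer in Python. It reads a file and outputs:

Total word count

Count of unique words

Number of sentences

The code is as follows: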

import re
import string
import sys

def function(s):
    # Lowercase the word and strip all ASCII punctuation from it.
    return re.sub("[%s]" % re.escape(string.punctuation), '', s.lower())

def main():
    words_list = []

    # Collect every whitespace-separated token from the file named on the command line.
    with open(sys.argv[1], "r") as f:
        for line in f:
            words_list.extend(line.split())

    print("Total word count:", len(words_list))

    new_words = map(function, words_list)

    print("Unique words:", len(set(new_words)))

    # Count one sentence for each word that contains '.', '!' or '?'.
    nb_sentence = 0
    for word in words_list:
        if re.search(r'[.!?][' "'" '"]*', word):
            nb_sentence += 1

    print("Sentences:", nb_sentence)

if __name__ == "__main__":
    main()

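I am now trying to compute the average sentence length in words, find frequently used phrases (a phrase of 3 or more words that is used more than 3 times), and list the words used in descending order of frequency. Can anyone help?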

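Here are some approaches that may help:

For the average sentence length in words, split the text on periods to get an array of sentences, then split each sentence on whitespace to get an array of the words in that sentence. Take the length of each word array and average those lengths.
To list the words used in descending order of frequency, split the text on whitespace, iterate over the words, and store the counts in a dictionary whose keys are the words and whose values are the number of occurrences. Then iterate over the dictionary's keys, build (word, count) tuples, and sort those tuples by count to find the most common words. A solution to a related question, finding the most common characters in a string, works along the same lines.
For frequently used phrases (3-word phrases used more than 3 times), you can do the same counting as above, but split on every third space (using a regular expression) instead of analyzing each word on its own, and keep only the phrases whose count exceeds 3. Counting frequent phrases of 3 or more words is trickier, but once you have solved the earlier parts the answer may become more obvious.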
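The three suggestions above can be sketched roughly as follows, assuming Python 3. All function names here (`tokenize`, `sentence_word_lists`, `average_sentence_length`, `word_frequencies`, `frequent_phrases`) are hypothetical helpers, not part of the question's code. Two choices differ from the description and are assumptions on my part: sentences are split on `!` and `?` as well as periods, and phrases are counted with a sliding window over each sentence rather than by splitting on every third space, since fixed chunks would miss phrases that start at other positions. `collections.Counter` stands in for the hand-built dictionary.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def sentence_word_lists(text):
    """Split on runs of '.', '!' or '?' and return one word list per sentence.

    Pieces that contain no words (for example the tail after the last period)
    are dropped so they do not distort the average.
    """
    pieces = re.split(r"[.!?]+", text)
    word_lists = [tokenize(piece) for piece in pieces]
    return [words for words in word_lists if words]


def average_sentence_length(text):
    """Mean number of words per sentence; 0.0 if the text has no sentences."""
    sentences = sentence_word_lists(text)
    if not sentences:
        return 0.0
    return sum(len(words) for words in sentences) / len(sentences)


def word_frequencies(text):
    """Return (word, count) tuples sorted by descending count."""
    return Counter(tokenize(text)).most_common()


def frequent_phrases(text, n=3, min_count=4):
    """Return (phrase, count) for n-word phrases seen at least min_count times.

    A window of n words slides over each sentence, so phrases never span a
    sentence boundary. min_count=4 corresponds to "used more than 3 times".
    """
    counts = Counter()
    for words in sentence_word_lists(text):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]


if __name__ == "__main__":
    sample = "The cat sat. The cat sat down! Did the cat sat? The cat sat again. Yes."
    print("Average sentence length:", average_sentence_length(sample))
    print("Word frequencies:", word_frequencies(sample))
    print("Frequent phrases:", frequent_phrases(sample))
```

To cover phrases longer than three words, one option is to call `frequent_phrases` for n = 3, 4, 5 and so on until no phrase reaches the threshold. Splitting on punctuation alone will miscount abbreviations such as "e.g.", so treat the sentence figures as approximate.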