Is there a better way to get 'important words' from a list in Python?

Tags: python, api, nltk, reddit

I wrote some code using the reddit PRAW API to find the most popular words in submission titles on reddit.

import nltk
import praw

picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')

print 'Getting the top %d most common words from r/%s:' % (many,picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)

hey = []

for x in submissions:
    hey.extend(str(x).split(' '))   

fdist = nltk.FreqDist(hey) # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()

common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who', 'I\'m','says', 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about', 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1

print '-----------------------'
for word in top_words:  
    if word.lower() not in common_words and word.lower() not in already:
        print str(number) + ". '" + word + "'"
        counter +=1
    number +=1
    already.append(word.lower())
if counter == many:
    break
print '-----------------------\n'
So entering the subreddit 'python' and asking for 10 words returns:

1. 'Python'
2. 'PyPy'
3. 'code'
4. 'use'
5. '136'
6. '181'
7. 'd...'
8. 'IPython'
9. '133'
10. '158'

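How can I make this script not return numbers and error words like 'd...'? The first 4 results are acceptable, but I would like to replace the rest with words that make sense. Making a list of common_words is unreasonable, and it doesn't filter out these errors. I'm relatively new to writing code, and I appreciate the help.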

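    I disagree. Making a list of common words is the right approach; there is no easier way to filter out the, for, I, am, etc. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you would have to include every possible unwanted non-word. Non-words should be filtered out in a different way.

    Some suggestions:

    1) common_words should be a set(). Since your list is long, this will speed things up: the in operation is O(1) on average for sets, while for lists it is O(n).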
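
As a minimal sketch of suggestion 1 (written for Python 3, with a shortened, hypothetical stop-word list standing in for the full common_words from the question), the list can be converted to a set once and then used for every membership test. The helper name is_common is an assumption made for illustration.

```python
# Hypothetical, shortened stop-word list; in the real script this would be
# the full common_words list from the question.
common_words_list = ['the', 'of', 'and', 'a', 'to', 'in', 'is', 'it']

# Building the set costs O(n) once; each later lookup is O(1) on average.
common_words = set(common_words_list)


def is_common(word):
    """Return True if the lower-cased word is in the stop-word set."""
    return word.lower() in common_words


print(is_common('The'))
print(is_common('Python'))
```

The rest of the script does not need to change, because `in` works the same way on a set as on a list.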

    2) Getting rid of all-digit strings is trivial. One way to do it is:

    all([w.isdigit() for w in word])
    
    If this returns True, then the word is just a series of digits.
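
A short sketch of how this check could be applied to a whole list of tokens (Python 3; the sample tokens are made up for illustration). It uses str.isdigit() on the whole string, which behaves like the expression above except for the empty string: all() over an empty list is True, while ''.isdigit() is False.

```python
# Sample tokens, chosen to resemble the output in the question
words = ['Python', '136', 'PyPy', '181', 'd...']

# str.isdigit() is True only for non-empty strings made up entirely of digits
non_numeric = [w for w in words if not w.isdigit()]
print(non_numeric)

# Edge case: the two checks disagree on the empty string
print(all([c.isdigit() for c in '']), ''.isdigit())
```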

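    3) Getting rid of d... is a little trickier. It depends on how you define a non-word. This:

    tf = [ c.isalpha() for c in word ]
    
    returns a list of True/False values (False where a character is not a letter). You can then count the values like so:

    t = tf.count(True)
    f = tf.count(False)
    
    You could then define a non-word as one that has more non-letter characters than letters, as one that has any non-letter characters at all, and so on. For example: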

    def check_wordiness(word):
        # This returns true only if a word is all letters
        return all([ c.isalpha() for c in word ])
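
The other definition mentioned above, "more non-letter characters than letters", could be sketched like this (Python 3; the function name mostly_letters and the sample tokens are assumptions for illustration). Unlike the all-letters check, it keeps words with an apostrophe such as don't.

```python
def mostly_letters(word):
    """Return True if the word has more letters than non-letter characters."""
    letters = sum(1 for c in word if c.isalpha())
    non_letters = len(word) - letters
    return letters > non_letters


for token in ['Python', "don't", 'd...', '136', '']:
    print(repr(token), mostly_letters(token))
```

Which rule is appropriate depends on the subreddit: the strict all-letters check is simpler, while the majority rule tolerates contractions and hyphenated words.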
    
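    4) In the for word in top_words: block, are you sure you haven't mixed up counter and number? Also, counter and number are pretty much redundant; you could rewrite the last bit as: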

    for word in top_words:
        # Since you are calling .lower() so much, 
        # you probably want to define it up here
        w = word.lower() 
        if w not in common_words and w not in already:
            # String formatting is preferred over +'s
            print "%i. '%s'" % (number, word)
            number +=1
        # This could go under the if statement. You only want to add
        # words that could be added again.  Why add words that are being
        # filtered out anyways?
        already.append(w)
    
        # this wasn't indented correctly before
        # number is the rank of the next word to print,
        # so stop once it has passed many
        if number > many:
            break
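
Putting the suggestions together, here is one possible end-to-end sketch in Python 3. It is not the original script: it swaps nltk.FreqDist for collections.Counter from the standard library, replaces the PRAW call with a few made-up titles, and uses a shortened stop-word set. The function name top_words_from and all sample data are assumptions. It also counts words case-insensitively, whereas the original counts 'Python' and 'python' separately and then skips the duplicate.

```python
from collections import Counter

# Hypothetical sample titles standing in for the PRAW submissions
titles = [
    'Python 3 is the future of Python',
    'PyPy 136 released',
    'Why the code in d... is slow',
    'IPython and Python code tips',
]

# Shortened stop-word set (assumption); use the full list in practice
common_words = {'is', 'the', 'of', 'why', 'in', 'and'}


def top_words_from(titles, stop_words, many):
    """Return the `many` most frequent alphabetic, non-stop words."""
    counts = Counter()
    for title in titles:
        for token in title.split():
            word = token.lower()
            # isalpha() rejects numbers, 'd...' and the empty string
            if word.isalpha() and word not in stop_words:
                counts[word] += 1
    return counts.most_common(many)


result = top_words_from(titles, common_words, 3)
for rank, (word, count) in enumerate(result, start=1):
    print("%d. '%s' (%d)" % (rank, word, count))
```

Filtering before counting means the already list and the manual counter are no longer needed, since most_common(many) returns at most many entries.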
    

    Hope this helps.

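    Instead of all([w.isdigit() for w in word]) you can just say word.isdigit(). The same goes for .isalpha().

    That's a good point. I decided to spell out the logic because the OP said he is new to coding, and a list comprehension doesn't get much clearer than this.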