Python bigram frequency counting

Tags: python, nlp, arff

I have written some code that basically counts word frequencies and inserts them into an ARFF file for use with Weka. I would like to modify it so that it counts bigram frequencies, i.e. pairs of words, rather than single words, but my attempts have been unsuccessful at best.

I realise there is a lot to look through, but any help with this would be greatly appreciated. Here is my code:

    import re
    import nltk

    # Quran subset
    filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

    # create list of lower case words
    word_list = re.split('\s+', file(filename).read().lower())
    print 'Words in text:', len(word_list)
    # punctuation and numbers to be removed
    punctuation = re.compile(r'[-.?!,":;()|0-9]')
    word_list = [punctuation.sub("", word) for word in word_list]

    word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]



    # create dictionary of word:frequency pairs
    freq_dic = {}


    for word in word_list2:

        # form dictionary
        try: 
            freq_dic[word] += 1
        except: 
            freq_dic[word] = 1


    print '-'*30

    print "sorted by highest frequency first:"
    # create list of (val, key) tuple pairs
    freq_list2 = [(val, key) for key, val in freq_dic.items()]
    # sort by val or frequency
    freq_list2.sort(reverse=True)
    freq_list3 = list(freq_list2)
    # display result as top 10 most frequent words
    freq_list4 =[]
    freq_list4=freq_list3[:10]

    words = []

    for item in freq_list4:
        a = str(item[1])
        a = a.lower()
        words.append(a)



    f = open(filename)

    newlist = []

    for line in f:
        line = punctuation.sub("", line)
        line = line.lower()
        newlist.append(line)

    f2 = open('Lines.txt','w')

    newlist2= []
    for line in newlist:
        line = line.split()
        newlist2.append(line)
        f2.write(str(line))
        f2.write("\n")


    print newlist2

    # ARFF Creation

    arff = open('output.arff','w')
    arff.write('@RELATION wordfrequency\n\n')
    for word in words:
        arff.write('@ATTRIBUTE ')
        arff.write(str(word))
        arff.write(' numeric\n')

    arff.write('@ATTRIBUTE class {endofworld, notendofworld}\n\n')
    arff.write('@DATA\n')
    # Counting word frequencies for each verse
    for line in newlist2:
        word_occurrences = str("")
        for word in words:
            matches = int(0)
            for item in line:
                if str(item) == str(word):
                    matches = matches + int(1)
                else:
                    continue
            word_occurrences = word_occurrences + str(matches) + ","
        word_occurrences = word_occurrences + "endofworld"
        arff.write(word_occurrences)
        arff.write("\n")

    print words

This should get you started:

    def bigrams(words):
        # yield (previous word, current word) pairs, starting with (None, first word)
        wprev = None
        for w in words:
            yield (wprev, w)
            wprev = w

Note that the first bigram is `(None, w1)`, where `w1` is the first word, so there is a special bigram marking the start of the text. If you also want an end-of-text bigram, add `yield (wprev, None)` after the loop.
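
As a minimal sketch of that suggestion (the name `bigrams_padded` is made up here), the end-padded variant would look like this:

    def bigrams_padded(words):
        # same generator as above, but also yields an end-of-text bigram
        wprev = None
        for w in words:
            yield (wprev, w)
            wprev = w
        yield (wprev, None)   # (last word, None) marks the end of the text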

I have rewritten the first part for you, because it was messy. Things to note:

  • List comprehensions are your friend, use more of them.
  • `collections.Counter` is great.

OK, the code:

    import re
    import nltk
    import collections
    
    # Quran subset
    filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
    
    # punctuation and numbers to be removed
    punctuation = re.compile(r'[-.?!,":;()|0-9]')
    
    # create list of lower case words
    word_list = re.split('\s+', open(filename).read().lower())
    print 'Words in text:', len(word_list)
    
    words = (punctuation.sub("", word).strip() for word in word_list)
    words = (word for word in words if word not in nltk.corpus.stopwords.words('english'))
    
    # create dictionary of word:frequency pairs
    frequencies = collections.Counter(words)
    
    print '-'*30
    
    print "sorted by highest frequency first:"
    # create list of (val, key) tuple pairs
    print frequencies
    
    # display result as top 10 most frequent words
    print frequencies.most_common(10)
    
    # keep just the words themselves (top 10 most frequent)
    words = [word for word, frequency in frequencies.most_common(10)]
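
This rewrite still counts single words. A minimal sketch of counting bigrams instead (assuming the `bigrams` generator from the answer above, and an illustrative cleaned word list) would be:

    # count pairs of consecutive words instead of single words
    cleaned = ['the', 'cat', 'sat', 'on', 'the', 'mat']   # stands in for the filtered word list
    bigram_frequencies = collections.Counter(bigrams(cleaned))
    print bigram_frequencies.most_common(10)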
    

Generalised to n-grams with optional padding; this also uses `defaultdict(int)` for the frequencies, and works in Python 2.6:

    from collections import defaultdict
    
    def ngrams(words, n=2, padding=False):
        "Compute n-grams with optional padding"
        pad = [] if not padding else [None]*(n-1)
        grams = pad + words + pad
        return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))
    
    # grab n-grams
    words = ['the','cat','sat','on','the','dog','on','the','cat']
    for size, padding in ((3, 0), (4, 0), (2, 1)):
        print '\n%d-grams padding=%d' % (size, padding)
        print list(ngrams(words, size, padding))
    
    # show frequency
    counts = defaultdict(int)
    for ng in ngrams(words, 2, False):
        counts[ng] += 1
    
    print '\nfrequencies of bigrams:'
    for c, ng in sorted(((c, ng) for ng, c in counts.iteritems()), reverse=True):
        print c, ng
    
Output:

    3-grams padding=0
    [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), 
     ('on', 'the', 'dog'), ('the', 'dog', 'on'), ('dog', 'on', 'the'), 
     ('on', 'the', 'cat')]
    
    4-grams padding=0
    [('the', 'cat', 'sat', 'on'), ('cat', 'sat', 'on', 'the'), 
     ('sat', 'on', 'the', 'dog'), ('on', 'the', 'dog', 'on'), 
     ('the', 'dog', 'on', 'the'), ('dog', 'on', 'the', 'cat')]
    
    2-grams padding=1
    [(None, 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), 
     ('on', 'the'), ('the', 'dog'), ('dog', 'on'), ('on', 'the'), 
     ('the', 'cat'), ('cat', None)]
    
    frequencies of bigrams:
    2 ('the', 'cat')
    2 ('on', 'the')
    1 ('the', 'dog')
    1 ('sat', 'on')
    1 ('dog', 'on')
    1 ('cat', 'sat')
    

Life would be much easier if you started using NLTK's FreqDist function to do the counting. NLTK also has bigram features; examples of both are given on the following page.
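
For illustration, a rough sketch of that approach (assuming an NLTK version in which `nltk.bigrams` and `FreqDist.most_common` are available):

    import nltk

    tokens = ['the', 'cat', 'sat', 'on', 'the', 'dog', 'on', 'the', 'cat']
    # frequency distribution over the bigrams of the token list
    fdist = nltk.FreqDist(nltk.bigrams(tokens))
    for bigram, count in fdist.most_common(10):
        print bigram, count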


I would like it better if your first item were (first word, second word) rather than (None, first word), so that the caller does not have to special-case the first item.

@Steven: the special bigram marking the start of the text is deliberate. In fact, for a real application I would also add a `yield (wprev, None)` line at the end.

This answer uses the same idea as the pairwise iterator recipe in the itertools module documentation.

-1. This is an excellent rewrite, but it does not answer the question about changing the code to count bigram frequencies.

I cannot figure out why, but it keeps raising an error: `frequencies = collections.Counter(words)` gives `AttributeError: 'module' object has no attribute 'Counter'`.

@Alex: you are using Python 2.6 or lower; `Counter` was introduced in 2.7. Upgrade, or write your own `Counter`...
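
For reference, the pairwise recipe mentioned above is roughly the following (Python 2 version, using `izip`):

    from itertools import izip, tee

    def pairwise(iterable):
        # s -> (s0, s1), (s1, s2), (s2, s3), ... with no None padding at either end
        a, b = tee(iterable)
        next(b, None)
        return izip(a, b)

And as a rough sketch of the last suggestion (a hand-rolled stand-in for `Counter` on Python 2.6; the helper names here are made up):

    from collections import defaultdict

    def count_items(items):
        # minimal stand-in for collections.Counter
        counts = defaultdict(int)
        for item in items:
            counts[item] += 1
        return counts

    def most_common(counts, n):
        # (item, count) pairs sorted by descending count, top n only
        return sorted(counts.iteritems(), key=lambda pair: pair[1], reverse=True)[:n]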