Python: finding bigrams in Unicode text with nltk

I am trying to find the most frequent bigrams in a Unicode text. Here is the code I am using:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk
from nltk.collocations import *
import codecs
line = ""
open_file = codecs.open('s.txt', 'r', encoding='utf-8').read()
for val in open_file:
    line += val.lower()
tokens = line.split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(1)
a = finder.ngram_fd.viewitems()
for i,j in a:
    print i,j

The file s.txt contains the following text:

çalışmak naber çösd bfkd

This is the output:

(u'\xe7\xf6sd', u'bfkd') 1
(u'naber', u'\xe7\xf6sd') 1
(u'\xe7al\u0131\u015fmak', u'naber') 1

But I want it in this format:

çalışmak naber 1
naber çösd 1
çösd bfkd 1

How can I solve this Unicode problem?

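You need to print the elements of the tuple explicitly rather than the whole tuple. In Python 2, printing a tuple shows the repr() of each element, which is why the non-ASCII characters appear as escape sequences.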

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk
from nltk.collocations import *
import codecs
line = ""
open_file = codecs.open('s.txt', 'r', encoding='utf-8').read()
for val in open_file:
    line += val.lower()
tokens = line.split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(1)
a = finder.ngram_fd.viewitems()
for i, j in a:
    print("{0} {1} {2}".format(i[0], i[1], j))

Running test.py:

14:58 $ python test.py
çösd bfkd 1
naber çösd 1
çalışmak naber 1
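
For readers on Python 3, the same idea can be sketched without nltk. This is only an illustration, not the code from the answer above: it swaps BigramCollocationFinder for collections.Counter, and the helper names count_bigrams and format_bigrams are made up for this example. Python 3 strings are already Unicode, so no unicode_literals import is needed.

```python
from collections import Counter


def count_bigrams(text):
    """Count adjacent word pairs in lowercased, whitespace-split text."""
    tokens = text.lower().split()
    # zip(tokens, tokens[1:]) pairs each token with its successor.
    return Counter(zip(tokens, tokens[1:]))


def format_bigrams(counts):
    """Return one 'word1 word2 count' string per bigram, in first-seen order."""
    # Formatting the two words individually avoids printing the tuple's repr.
    return ["{0} {1} {2}".format(w1, w2, n) for (w1, w2), n in counts.items()]
```

Calling format_bigrams(count_bigrams(text)) on the contents of s.txt and printing each returned string gives lines in the "word1 word2 count" format the question asks for.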

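After generating n-gram frequencies from a corpus (a text file), how can I look up the frequency of an input n-gram in the frequencies generated above? (The input n-gram may or may not occur in the corpus; if it does, its frequency should be returned, otherwise zero.) Thanks.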
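
One possible approach to this follow-up, sketched with the standard library only: collections.Counter returns 0 for a key it has never seen, so a lookup needs no special handling for missing n-grams. The helper names ngram_counts and ngram_frequency are hypothetical. As far as I know, nltk's FreqDist (the type of finder.ngram_fd) is a Counter subclass in NLTK 3, so indexing finder.ngram_fd with a tuple should behave the same way, but treat that as an assumption to check against your nltk version.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # For n=2 this is zip(tokens, tokens[1:]); larger n adds more shifted copies.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def ngram_frequency(counts, ngram):
    """Return the stored count for ngram, or 0 if it never occurred."""
    # Counter returns 0 for missing keys instead of raising KeyError.
    return counts[tuple(ngram)]
```

The n-gram is converted to a tuple before the lookup so that a list of words can be passed in as well.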
