Counting bigrams (pairs of two words) in a file using Python

python regex

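I want to count the number of occurrences of all bigrams (pairs of adjacent words) in a file using Python. I am dealing with very large files, so I am looking for an efficient way to do it. I tried using the count method with the regex "\w+\s\w+" on the file contents, but it did not prove to be efficient.

For example, suppose I want to count the bigrams in a file a.txt that has the following content: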

"the quick person did not realize his speed and the quick person bumped "

For the above file, the set of bigrams and their counts would be:

(the,quick) = 2
(quick,person) = 2
(person,did) = 1
(did, not) = 1
(not, realize) = 1
(realize,his) = 1
(his,speed) = 1
(speed,and) = 1
(and,the) = 1
(person, bumped) = 1

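I came across an example of the Counter object in Python, which is used to count unigrams (single words). It also uses a regex approach.

The example goes like this: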

>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> print(Counter(words).most_common())

The output of the above code is:

[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1),
 ('realize', 1),  ('his', 1), ('speed', 1), ('and', 1), ('bumped', 1)]

I would like to know whether the Counter object can be used to get the bigram counts.

Approaches other than the Counter object or regex are also welcome.

Some itertools magic:

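>>> import re
>>> from collections import Counter
>>> from itertools import islice
>>> words = re.findall(r"\w+",
   "the quick person did not realize his speed and the quick person bumped")
>>> # Python 3: the built-in zip is lazy; on Python 2 use itertools.izip instead
>>> print(Counter(zip(words, islice(words, 1, None))))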

Output:

Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, 
  ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1, 
  ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1, 
  ('realize', 'his'): 1})

Bonus

Get the frequency of any n-gram:

from itertools import tee, islice

def ngrams(lst, n):
  tlst = lst
  while True:
    a, b = tee(tlst)
    l = tuple(islice(a, n))
    if len(l) == n:
      yield l
      next(b)
      tlst = b
    else:
      break

>>> Counter(ngrams(words, 3))

Output:

Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1, 
  ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1, 
  ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1, 
  ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1, 
  ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})

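This works with lazy iterables and generators too. So you can write a generator that reads a file line by line, yields the words, and pass it to ngrams, which consumes it lazily without reading the whole file into memory.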
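As a hedged sketch of that idea, the generator below reads a file line by line and yields words, and a window-based n-gram generator consumes them one at a time. The names `iter_words`, `sliding_ngrams`, and `count_ngrams_in_file` are hypothetical, and the `deque`-based window is a swapped-in substitute for the `tee`-based `ngrams` function, not the answer's own code. Note that in this sketch an n-gram may span a line break.

```python
import re
from collections import Counter, deque


def iter_words(path):
    """Yield words one at a time, reading the file line by line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            for match in re.finditer(r"\w+", line):
                yield match.group()


def sliding_ngrams(iterable, n):
    """Yield consecutive n-tuples from any iterable, keeping only n items."""
    window = deque(maxlen=n)  # the oldest item drops out automatically
    for item in iterable:
        window.append(item)
        if len(window) == n:
            yield tuple(window)


def count_ngrams_in_file(path, n=2):
    """Count n-grams in a file without loading the whole file into memory."""
    return Counter(sliding_ngrams(iter_words(path), n))
```

Calling `count_ngrams_in_file("a.txt", 2)` would then return a Counter of bigram tuples. Only the current line and the last `n` words are held at any time, although the Counter itself still grows with the number of distinct n-grams.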

How about zip()?
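For illustration only, here is one way a zip()-based approach could look when the words already fit in memory as a list. The helper name `zip_ngrams` is hypothetical, and slicing copies the list, so this sketch trades memory for simplicity.

```python
import re
from collections import Counter


def zip_ngrams(words, n):
    """Return an iterator of n-tuples by zipping n shifted slices of a list."""
    return zip(*(words[i:] for i in range(n)))


text = "the quick person did not realize his speed and the quick person bumped "
words = re.findall(r"\w+", text)

bigram_counts = Counter(zip_ngrams(words, 2))   # same as zip(words, words[1:])
trigram_counts = Counter(zip_ngrams(words, 3))
```

Because zip stops at the shortest input, the trailing slices never produce incomplete tuples.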

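It has been a long time since this question was asked and successfully answered. I benefited from the responses and used them to build my own solution, which I would like to share: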

    import regex
    bigrams_tst = regex.findall(r"\b\w+\s\w+", open(myfile).read(), overlapped=True)

This yields all bigrams that are not interrupted by punctuation.
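The overlap matters because the standard `re.findall` returns non-overlapping matches, so each word can take part in only one pair. As a hedged sketch of an alternative that needs only the standard library, a capturing group inside a lookahead gives overlapping matches, similar in effect to `overlapped=True`. The variable names are illustrative and the sample text is the one from the question.

```python
import re
from collections import Counter

text = "the quick person did not realize his speed and the quick person bumped "

# Plain findall consumes both words of each match, so every second pair is skipped.
non_overlapping = re.findall(r"\b\w+\s\w+", text)

# The lookahead consumes nothing, so a match is attempted at the start of every word.
overlapping = re.findall(r"(?=\b(\w+\s\w+))", text)

bigram_counts = Counter(overlapping)
print(len(non_overlapping), len(overlapping))  # 6 12
```

Here the bigrams are strings such as `"the quick"` rather than tuples; calling `.split()` on each one converts it if tuples are preferred.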

You can simply use Counter for any n_gram, like so:

from collections import Counter
from nltk.util import ngrams 

text = "the quick person did not realize his speed and the quick person bumped "
n_gram = 2
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the'): 1,
         ('did', 'not'): 1,
         ('his', 'speed'): 1,
         ('not', 'realize'): 1,
         ('person', 'bumped'): 1,
         ('person', 'did'): 1,
         ('quick', 'person'): 2,
         ('realize', 'his'): 1,
         ('speed', 'and'): 1,
         ('the', 'quick'): 2})

For 3-grams, just change n_gram to 3:

n_gram = 3
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the', 'quick'): 1,
         ('did', 'not', 'realize'): 1,
         ('his', 'speed', 'and'): 1,
         ('not', 'realize', 'his'): 1,
         ('person', 'did', 'not'): 1,
         ('quick', 'person', 'bumped'): 1,
         ('quick', 'person', 'did'): 1,
         ('realize', 'his', 'speed'): 1,
         ('speed', 'and', 'the'): 1,
         ('the', 'quick', 'person'): 2})

Starting with Python 3.10, the new itertools.pairwise function provides a way to slide through pairs of consecutive elements, so your use case becomes:

from itertools import pairwise
import re
from collections import Counter

text = "the quick person did not realize his speed and the quick person bumped "
Counter(pairwise(re.findall(r'\w+', text)))
# Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, ('did', 'not'): 1, ('not', 'realize'): 1, ('realize', 'his'): 1, ('his', 'speed'): 1, ('speed', 'and'): 1, ('and', 'the'): 1, ('person', 'bumped'): 1})

Details of the intermediate results:

re.findall(r'\w+', text)
# ['the', 'quick', 'person', 'did', 'not', 'realize', 'his', ...]
pairwise(re.findall(r'\w+', text))
# iterator yielding ('the', 'quick'), ('quick', 'person'), ('person', 'did'), ...
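On interpreters older than 3.10, `itertools.pairwise` is not available. One possible fallback, assuming the `tee`-based recipe from the itertools documentation is acceptable, is sketched below. The try/except arrangement is an illustrative choice and not part of the original answer.

```python
import re
from collections import Counter

try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        """Fallback: yield (s0, s1), (s1, s2), ... from any iterable."""
        first, second = tee(iterable)
        next(second, None)  # advance the second copy by one element
        return zip(first, second)

text = "the quick person did not realize his speed and the quick person bumped "
bigram_counts = Counter(pairwise(re.findall(r"\w+", text)))
```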

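Comments:

Please paste the example text into the question. Do you have to handle multiple lines, or is each file on a single line?

Possible duplicate.

@mhawke, the text in the file is on a single line. @Ashwini Chaudhary, I have included the example text in code tags above. Sorry for the inconvenience!

The itertools ngram function is great! However, if you need to perform additional text analysis, it might be worth checking out TextBlob. It also has a TextBlob.ngrams() function that does basically the same thing. I have tested both the itertools and the TextBlob functions, and they are comparable in speed and results (with a tiny advantage for the itertools function).

Oops, I forgot to count the ngrams in the comparison; the TextBlob function does not do that by itself. I tried writing a function with a Counter, but overall that makes it a much slower option. So itertools wins.

This is clever. FWIW, here is what it does: L1 is words, and L2 is islice(words, 1, None), which gives the individual words of the sentence starting from the second word. izip(words, islice(words, 1, None)) then zips L1 with L2, so that "the" in L1 is paired with "quick" in L2, "quick" in L1 is paired with "person" in L2, and so on. Counter then counts the pairs.

For Python 3, you no longer need to import izip; just use zip.

The answer from @st0le below does effectively the same thing.

This is good, but the import is missing: you need to add from nltk.util import ngrams. FWIW, it seems to run a bit faster than the accepted solution.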
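The pairing described in the comment about islice above can be made visible by materializing the two sequences. This is a small sketch using the Python 3 built-in `zip` on a shortened sample; the names `first` and `second` are illustrative stand-ins for the comment's L1 and L2.

```python
from collections import Counter
from itertools import islice

words = ["the", "quick", "person", "did"]

first = words                            # starts at the first word
second = list(islice(words, 1, None))    # starts at the second word

pairs = list(zip(first, second))         # zip stops when the shorter input ends
print(second)  # ['quick', 'person', 'did']
print(pairs)   # [('the', 'quick'), ('quick', 'person'), ('person', 'did')]
print(Counter(pairs))
```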