How to count occurrences of a word in Python

python python-2.7

I'm trying to write a Python script that reads a log file and tells us how many times the user bin appears, so I have the following:

import re

# Open auth.log for reading
myAuthlog = open('auth.log', 'r')
for line in myAuthlog:
    if re.match("(.*)(B|b)in(.*)", line):
        print line

This prints each matching line in full:

>>> Feb  4 10:43:14 j4-be02 sshd[1212]: Failed password for bin from 83.212.110.234 port 42670 ssh2

But I only want to report the count, e.g. "User attempted to log in 26 times".

Option 1:

import re

count = 0
myAuthlog = open('auth.log', 'r')
for line in myAuthlog:
    if re.match("(.*)(B|b)in(.*)", line):
        count += 1  # increment instead of printing the line
print count

If your file isn't huge, you can use re.findall and take the length of the resulting list:

count = len(re.findall(your_regex, myAuthlog.read()))

Option 2:

If the file is very large, iterate over its lines in a generator expression and sum up the matches:

count = sum(1 for line in myAuthlog if re.search(your_regex, line))

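Both options assume you want to count the number of lines that match, as in your example code. The re.findall variant also assumes the username appears at most once per line, because it counts matches rather than lines.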
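To make the difference between counting matches and counting lines concrete, here is a minimal sketch written for Python 3. The names `sample_log`, `count_matches`, and `count_lines` are made up for illustration, the second and third log lines are invented, and `io.StringIO` stands in for the opened auth.log.

```python
import io
import re

# Hypothetical sample data standing in for auth.log; the second line
# mentions the user name twice.
sample_log = (
    "Feb  4 10:43:14 j4-be02 sshd[1212]: Failed password for bin from 83.212.110.234 port 42670 ssh2\n"
    "Feb  4 10:43:15 j4-be02 sshd[1213]: user bin tried again as bin\n"
    "Feb  4 10:43:16 j4-be02 sshd[1214]: Accepted password for alice\n"
)

pattern = r"\b[Bb]in\b"  # assumed regex; substitute your own


def count_matches(text):
    """findall style: count every match in the whole text at once."""
    return len(re.findall(pattern, text))


def count_lines(lines):
    """Generator style: count lines that contain at least one match."""
    return sum(1 for line in lines if re.search(pattern, line))


match_count = count_matches(sample_log)
line_count = count_lines(io.StringIO(sample_log))
print(match_count, line_count)
```

The two numbers differ whenever a line contains the name more than once, so pick the variant that matches what you actually want to report.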

A note on your regex:

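(.*)(B|b)in(.*) will also match strings such as "Carabinero", so consider using word boundaries, i.e. \b(B|b)in\b.

Beyond that, searching for bin with no other qualifiers is likely to produce a lot of false positives. Using grouping parentheses makes the check more expensive (it has to store the capture groups). Finally, you should always use raw strings for regular expressions, or it will eventually bite you. Putting it together, you can use

if re.search(r'\b[Bb]in\b', line):

to enforce word boundaries, avoid unnecessary captures, and still do what you want.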
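As a quick illustration of why the word boundaries matter, the following sketch compares the pattern from the question with the word-bounded one. The sample strings are hypothetical and not taken from a real log; the names `loose` and `strict` are only labels for this example.

```python
import re

loose = re.compile(r"(.*)(B|b)in(.*)")  # pattern from the question
strict = re.compile(r"\b[Bb]in\b")      # word-bounded, no capture groups

# Hypothetical sample strings for illustration only.
samples = [
    "Failed password for bin from 83.212.110.234",
    "Failed password for Bin from 83.212.110.234",
    "starting binary update",
    "invalid user robin",
]

loose_hits = [bool(loose.match(s)) for s in samples]
strict_hits = [bool(strict.search(s)) for s in samples]

for sample, loose_hit, strict_hit in zip(samples, loose_hits, strict_hits):
    print("%-45s loose=%-5s strict=%s" % (sample, loose_hit, strict_hit))
```

The loose pattern accepts all four strings, while the word-bounded one rejects "binary" and "robin". It would still count any line that mentions the standalone word bin in another context, which is the false-positive risk noted above.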

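You can even optimize it a little by precompiling the regex (Python caches compiled regular expressions, but it still has to run Python-level code on every call to check the cache; a compiled object goes straight into C with no such overhead).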

That lets you simplify to:

import re

# Compile and store the bound method under a useful name; use a character
# class to avoid capturing B/b, and word boundary assertions to avoid matching
# longer words containing the name, e.g. "binary" when you want bin
hasuser = re.compile(r'\b[Bb]in\b').search

# Open auth.log using a with statement to close the file deterministically
with open('auth.log') as myAuthlog:
    # Filter for lines with the specified user (in Py3, you would need to wrap
    # filter in list or use the sum(1 for _ in filter(hasuser, myAuthlog)) idiom)
    loginattempts = len(filter(hasuser, myAuthlog))
print "User attempted to log in", loginattempts, "times"
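The comment in the code above notes that Python 3 needs a different idiom, because filter returns a lazy iterator there and len() cannot be applied to it. A minimal sketch of that variant follows. It assumes Python 3; the function name `count_login_attempts` and the sample lines are hypothetical, used so the snippet runs without a real auth.log.

```python
import re

# Precompile once and keep the bound search method, as in the code above.
hasuser = re.compile(r"\b[Bb]in\b").search


def count_login_attempts(lines):
    """Count lines mentioning the user; accepts any iterable of strings,
    including an open file object."""
    # filter() is lazy in Python 3, so count items instead of calling len().
    return sum(1 for _ in filter(hasuser, lines))


# Hypothetical stand-in for the lines of auth.log.
sample_lines = [
    "sshd[1212]: Failed password for bin from 83.212.110.234 port 42670 ssh2\n",
    "sshd[1213]: Failed password for root from 83.212.110.234 port 42671 ssh2\n",
    "sshd[1214]: Failed password for Bin from 83.212.110.234 port 42672 ssh2\n",
]

loginattempts = count_login_attempts(sample_lines)
print("User attempted to log in", loginattempts, "times")
```

With a real log, the same function can be called as count_login_attempts(myAuthlog) inside the with open('auth.log') as myAuthlog: block, since a file object iterates over its lines.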

You don't need the surrounding (.*) groups unless you are explicitly trying to extract the surrounding text.

count = sum(1 for line in myAuthlog if re.match("(.*)(B|b)in(.*)", line))

So increment a counter instead of printing.

print len(re.findall("(.*)(B|b)in(.*)", myAuthlog.read()))