Finding the frequency of array strings in a file-sized string using Python

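I have looked at many answers, and they all aim to count the occurrences of every word in a file, in a large string, or even in an array. That is not what I want, and my string does not come from a text file.

Given a large string, say one the size of a file, how do I count the frequency of each array element within that string (including elements that contain spaces between words)?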

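from collections import Counter
import string

def calculate_commonness(context, links):
    c = Counter()
    # Strip punctuation, then split on whitespace (this breaks 'Father had' into two words)
    content = context.translate(str.maketrans("", "", string.punctuation)).split()

    for word in content:
        if word in links:
            c[word] += 1
    print(c)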

context = "It was November. Although it was November November November Passage not yet late, the sky was dark when I turned into Laundress Passage. Father had finished for the day, switched off the shop lights and closed the shutters; but so I would not come home to darkness he had left on the light over the stairs to the flat. Through the glass in the door it cast a foolscap rectangle of paleness onto the wet pavement, and it was while I was standing in that rectangle, about to turn my key in the door, that I first saw the letter. Another white rectangle, it was on the fifth step from the bottom, where I couldn\'t miss it."
links = ['November', 'Laundress', 'Passage', 'Father had']

# My output should look (something) like this:
# November = 4
# Laundress = 1
# Passage = 2
# Father had = 1
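Currently it finds November, Laundress and Passage, but not 'Father had'. I need to be able to find string elements that contain spaces. I know this happens because I am splitting the context, which returns 'Father' and 'had' as separate words, so how do I split the context appropriately, or how do I use this with regex findall?

EDIT: Using context as one big string, I have:

    for l in links:
        c[l] = context.lower().count(l)
    print(c)
Which returns: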

Counter({'Laundress': 0, 'November': 0, 'Father had': 0, 'Passage': 0})
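Have you tried this?

context = context.lower()  # str.lower() returns a new string, so assign the result
counts = {word: context.count(word.lower())  # lowercase the links too, or nothing will match
          for word in links}
Note: keep context as a single string (do not split it).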

Here is an implementation using regex findall:

import re
links = ['November', 'Laundress', 'Passage', 'Father had']
# Create a big regex catching all the links 
# Something like: "(November)|(Laundress)|(Passage)|(Father had)"
regex = "|".join("(" + re.escape(x) + ")" for x in links)  # re.escape keeps each link literal

context = "It was November. Although it was November November November Passage not yet late, the sky was dark when I turned into Laundress Passage. Father had finished for the day, switched off the shop lights and closed the shutters; but so I would not come home to darkness he had left on the light over the stairs to the flat. Through the glass in the door it cast a foolscap rectangle of paleness onto the wet pavement, and it was while I was standing in that rectangle, about to turn my key in the door, that I first saw the letter. Another white rectangle, it was on the fifth step from the bottom, where I couldn\'t miss it."

result = re.findall(regex, context)
# Result here is:
# [('November', '', '', ''), ('November', '', '', ''), ('November', '', '', ''), ('November', '', '', ''), ('', '', 'Passage', ''), ('', 'Laundress', '', ''), ('', '', 'Passage', ''), ('', '', '', 'Father had')]

# Now we count regex matches
counts = [0] * len(links)
for x in result:
    for i in range(len(links)):
        if x[i] != "":
            counts[i] += 1
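
The nested loop over the result tuples can be avoided. The following is a minimal sketch, not part of the original answer, that assumes Python 3 and a shortened copy of the sample context; it relies on the match object's lastindex attribute to tell which capturing group (and therefore which link) matched. The names pattern and counts are illustrative.

```python
import re
from collections import Counter

links = ['November', 'Laundress', 'Passage', 'Father had']
# Shortened version of the sample text from the question (an assumption for brevity).
context = ("It was November. Although it was November November November "
           "Passage not yet late, the sky was dark when I turned into "
           "Laundress Passage. Father had finished for the day.")

# One capturing group per link, escaped so each link is matched literally.
pattern = re.compile("|".join("(" + re.escape(link) + ")" for link in links))

# Start every link at zero so links that never occur are still reported.
counts = Counter({link: 0 for link in links})
for match in pattern.finditer(context):
    # Exactly one group participates in each match; lastindex is its 1-based number.
    counts[links[match.lastindex - 1]] += 1

for link in links:
    print(link, "=", counts[link])
```

Because the pattern is scanned once over the whole string, this should scale better to file-sized input than running one search per link, although that has not been measured here.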

Try this:

>>> import re
>>> for word in links:
...     print(word + '=' + str(len(list(re.finditer(re.escape(word), context)))))
...
November=4
Laundress=1
Passage=2
Father had=1
>>> 
You can also use re.IGNORECASE:

for word in links:
    print(word + '=' + str(len(list(re.finditer(re.escape(word), context, re.IGNORECASE)))))

What if some of the links are substrings of other links? For example, links = ['day', 'mayday', 'today'].

Then I should not count them. In that case, 'day' should be counted only once. A match has to be exactly the link.

Maybe I picked a bad example. What about links = ['November', 'was November']? Should the result be {November: 4, was November: 2} or {November: 2, was November: 2}?

In that case it would be the second result: November 2, was November 2. The goal is to count how many times a term (one word or several words) is used as a link to a particular place in a document.

This is only a test string for a much larger project, so doing it that way would mean going through my database and manually typing in every link that has more than one word.

No, you do not have to do it manually; it was only an example. I have updated my answer with a more complete demo.

I managed to get it working better than what I had! But it returns 0 for every term in links. I have added an edit.

Yes, because the links still contain uppercase letters. If case matters, remove the lower() call; otherwise lowercase all of the links.

Per the comments under the question, this works, but it counts substrings multiple times, which is not the desired result.
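
The exact-match behaviour asked for in the comments (a link must not be counted inside a longer word or inside a longer link that already matched) could be sketched as below. This is one possible approach under stated assumptions, not code from any of the answers: it assumes Python 3, case-sensitive matching, and that the words of a multi-word link are separated in the text exactly as they are in the link. The function name count_links is hypothetical.

```python
import re
from collections import Counter


def count_links(context, links):
    """Count whole-term, non-overlapping occurrences of each link in context."""
    # Try longer links first so 'Father had' is preferred over a shorter
    # link such as 'Father' when both could match at the same position.
    ordered = sorted(links, key=len, reverse=True)
    alternation = "|".join(re.escape(link) for link in ordered)
    # The lookarounds require that a match is not glued to other word
    # characters, so 'day' is not counted inside 'today' or 'mayday'.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    # finditer yields non-overlapping matches, so text consumed by
    # 'was November' is not counted again as 'November'.
    found = Counter(match.group(0) for match in pattern.finditer(context))
    # Report every link, including those that never occur.
    return {link: found[link] for link in links}


sample = "It was November. Although it was November November November Passage"
print(count_links(sample, ["November", "was November"]))
# {'November': 2, 'was November': 2}
print(count_links("today is mayday, what a day", ["day", "mayday", "today"]))
# {'day': 1, 'mayday': 1, 'today': 1}
```

Passing re.IGNORECASE to re.compile and normalising the matched text before counting would be one way to make the same sketch case-insensitive.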