Finding the frequency of array strings in a file-sized string with Python

python regex arrays string

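I have looked at many answers that aim to find the occurrences of every word in a file, a large string, or even an array. That is not what I want, and my string does not come from a text file either.

Given a large string, such as a file-sized string, how can I count the frequency of each array element in that string (including elements that contain spaces)?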

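import string
from collections import Counter

def calculate_commonness(context, links):
    c = Counter()
    # Strip punctuation, then split on whitespace (this breaks multi-word links apart)
    content = context.translate(str.maketrans("", "", string.punctuation)).split()

    for word in content:
        if word in links:
            c[word] += 1
    print(c)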

context = "It was November. Although it was November November November Passage not yet late, the sky was dark when I turned into Laundress Passage. Father had finished for the day, switched off the shop lights and closed the shutters; but so I would not come home to darkness he had left on the light over the stairs to the flat. Through the glass in the door it cast a foolscap rectangle of paleness onto the wet pavement, and it was while I was standing in that rectangle, about to turn my key in the door, that I first saw the letter. Another white rectangle, it was on the fifth step from the bottom, where I couldn\'t miss it."
links = ['November', 'Laundress', 'Passage', 'Father had']

# My output should look (something) like this:
# November = 4
# Laundress = 1
# Passage = 2
# Father had = 1

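Currently it finds November, Laundress and Passage, but not 'Father had'. I need to be able to find string elements that contain spaces. I know this happens because I am splitting the context, which returns 'Father' and 'had' as separate words. So how do I split the context appropriately, or how can I use this with regex findall?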

EDIT: Using context as one large string, I have:

    for l in links:
        c[l] = context.lower().count(l)
    print(c)

Counter({'Laundress': 0, 'November': 0, 'Father had': 0, 'Passage': 0})

Have you tried this?

# str.lower() returns a new string, so keep the result; lowercase the links too
lowered = context.lower()
counts = {word: lowered.count(word.lower())
          for word in links}

Note: keep context as a single string (do not split it).
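
As a rough illustration of the `str.count` approach above, here is a minimal sketch for Python 3. The helper name `count_links`, its `ignore_case` parameter, and the sample sentence are assumptions made for this example, not part of the original answer. It also shows that `str.count` matches plain substrings, so 'day' is counted inside 'Mayday'.

```python
def count_links(context, links, ignore_case=False):
    """Count non-overlapping occurrences of each link as a plain substring."""
    # Lowercase both sides when case should be ignored; otherwise compare as-is.
    haystack = context.lower() if ignore_case else context
    return {
        link: haystack.count(link.lower() if ignore_case else link)
        for link in links
    }


# Hypothetical sample text, chosen to show case and substring effects.
sample = "It was November. Father had finished; father had left. Mayday is not a day."
links = ["November", "Father had", "day"]

print(count_links(sample, links))
# {'November': 1, 'Father had': 1, 'day': 2}
print(count_links(sample, links, ignore_case=True))
# {'November': 1, 'Father had': 2, 'day': 2}
```

Because the context is never split, multi-word links such as 'Father had' are found without any special handling.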

Here is an implementation using regex findall:

import re
links = ['November', 'Laundress', 'Passage', 'Father had']
# Create one big regex catching all the links
# Something like: "(November)|(Laundress)|(Passage)|(Father had)", with each link escaped
regex = "|".join("(" + re.escape(x) + ")" for x in links)

context = "It was November. Although it was November November November Passage not yet late, the sky was dark when I turned into Laundress Passage. Father had finished for the day, switched off the shop lights and closed the shutters; but so I would not come home to darkness he had left on the light over the stairs to the flat. Through the glass in the door it cast a foolscap rectangle of paleness onto the wet pavement, and it was while I was standing in that rectangle, about to turn my key in the door, that I first saw the letter. Another white rectangle, it was on the fifth step from the bottom, where I couldn\'t miss it."

result = re.findall(regex, context)
# Result here is:
# [('November', '', '', ''), ('November', '', '', ''), ('November', '', '', ''), ('November', '', '', ''), ('', '', 'Passage', ''), ('', 'Laundress', '', ''), ('', '', 'Passage', ''), ('', '', '', 'Father had')]

# Now we count regex matches: each tuple has exactly one non-empty group
counts = [0] * len(links)
for x in result:
    for i in range(len(links)):
        if x[i] != "":
            counts[i] += 1
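
The same findall idea can probably be written more compactly by dropping the capture groups, so that `re.findall` returns the matched text itself, and tallying with `collections.Counter`. This is only a sketch for Python 3; the function name `count_with_findall` and the shortened sample text are assumptions for illustration.

```python
import re
from collections import Counter


def count_with_findall(context, links):
    """Tally how often each link is matched by one alternation regex."""
    # Without capture groups, findall returns the full matched strings.
    pattern = "|".join(re.escape(link) for link in links)
    found = Counter(re.findall(pattern, context))
    # Report every link, including those that never matched.
    return {link: found[link] for link in links}


# Shortened version of the question's text, used here only as sample input.
text = ("It was November. Although it was November November November Passage "
        "not yet late, the sky was dark when I turned into Laundress Passage. "
        "Father had finished for the day.")
print(count_with_findall(text, ["November", "Laundress", "Passage", "Father had"]))
# {'November': 4, 'Laundress': 1, 'Passage': 2, 'Father had': 1}
```

Note that if one link is a prefix of another, the order of the alternatives decides which one is reported at that position.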

Try this:

>>> import re
>>> for word in links:
...     print(word + '=' + str(len(re.findall(re.escape(word), context))))
...
November=4
Laundress=1
Passage=2
Father had=1
>>>

You can also use re.IGNORECASE:

for word in links:
    print(word + '=' + str(len(re.findall(re.escape(word), context, re.IGNORECASE))))
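
Searching for each link separately counts it wherever it appears, including inside longer words. If whole-word matches are wanted, one option is to wrap the escaped link in `\b` word boundaries. The sketch below is for Python 3 and assumes every link starts and ends with a word character; the name `count_whole_words` and the sample sentence are made up for this example.

```python
import re


def count_whole_words(context, link, ignore_case=False):
    """Count occurrences of link that are not part of a longer word."""
    flags = re.IGNORECASE if ignore_case else 0
    # \b only behaves as intended when the link begins and ends with a word character.
    pattern = r"\b" + re.escape(link) + r"\b"
    return len(re.findall(pattern, context, flags))


# Hypothetical sample: 'day' appears alone twice and inside two longer words.
text = "Mayday! Today is the day, and what a day."
print(len(re.findall("day", text)))      # 4: plain substring matches
print(count_whole_words(text, "day"))    # 2: whole-word matches only
```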

What if some links are substrings of other links? For example, links = ['day', 'mayday', 'today']?

Then I should not count them. In that case, day should be returned as one. It should match the link exactly.

Maybe I gave a bad example. What about links = ['November', 'was November']? Would the result be {November: 4, was November: 2} or {November: 2, was November: 2}?

In that case it would be the second set of results: November: 2, was November: 2. This counts the number of times a term (one word or several) is used as a link to a particular place in the document.

This is just a test string for a bigger project, so doing it that way would mean going through my database and manually typing every link that has more than one word.

No, you don't have to do it manually, this is just an example. I have updated my answer with a more complete demo.

I managed to get it working better than what I had! But it returns 0 for every term in links. I have added an edit.

Yes, because the links still contain uppercase letters. If case matters, remove lower(); otherwise lowercase all of the links.

As per the comments under the OP, this works but counts substrings multiple times, which is not the intended result.
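
One way to get the behaviour described in the comments, where each position in the text is attributed to a single link and the result for ['November', 'was November'] is 2 and 2, is to scan the text once with a single alternation. This is only a sketch under assumptions, not a method confirmed by the thread: it is written for Python 3, `count_exclusive` is a hypothetical name, and it assumes links start and end with word characters.

```python
import re
from collections import Counter


def count_exclusive(context, links):
    """Count links in one pass so that no piece of text is counted twice."""
    # Longest first: when two links start at the same position, try the longer one first.
    ordered = sorted(links, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(link) for link in ordered) + r")\b"
    # finditer yields non-overlapping matches, so matched text is consumed only once.
    found = Counter(m.group(0) for m in re.finditer(pattern, context))
    return {link: found[link] for link in links}


text = "It was November. Although it was November November November Passage not yet late."
print(count_exclusive(text, ["November", "was November"]))
# {'November': 2, 'was November': 2}
```

The word boundaries keep 'day' from matching inside 'mayday' or 'today', and the single pass means a 'November' that already belongs to a 'was November' match is not counted again.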