Find the frequency of array strings in a file-sized string using Python
I have seen plenty of answers aimed at finding the occurrences of each word in a file, in a large string, or even in an array. That is not what I want, and my string does not come from a text file.
Given a large string, such as a file-sized string, how can I count the frequency of each array element in that string (including elements that contain spaces)?
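import string
from collections import Counter

def calculate_commonness(context, links):
    c = Counter()
    # Python 2: strip punctuation, then split on whitespace
    content = context.translate(string.maketrans("", ""), string.punctuation).split(None)
    for word in content:
        if word in links:
            c[word] += 1
    print c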
context = "It was November. Although it was November November November Passage not yet late, the sky was dark when I turned into Laundress Passage. Father had finished for the day, switched off the shop lights and closed the shutters; but so I would not come home to darkness he had left on the light over the stairs to the flat. Through the glass in the door it cast a foolscap rectangle of paleness onto the wet pavement, and it was while I was standing in that rectangle, about to turn my key in the door, that I first saw the letter. Another white rectangle, it was on the fifth step from the bottom, where I couldn\'t miss it."
links = ['November', 'Laundress', 'Passage', 'Father had']
# My output should look (something) like this:
# November = 4
# Laundress = 1
# Passage = 2
# Father had = 1
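Currently it finds November, Laundress and Passage, but not 'Father had'. I need to be able to find string elements that contain spaces. I know this happens because splitting the context turns 'Father had' into 'Father' and 'had', so how can I split the context properly, or how can I use this with regex findall?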
EDIT:
Using context as one large string, I have:
for l in links:
    c[l] = context.lower().count(l)
print c
which returns:
Counter({'Laundress': 0, 'November': 0, 'Father had': 0, 'Passage': 0})
Have you tried:
context.lower()  # str.lower() returns a new string; the result is not assigned here, so the counts below stay case-sensitive
counts = {word: context.count(word)
          for word in links}
Note: keep context as a single string (do not split it).
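The str.count idea above, together with the case issue raised in the question's edit, can be sketched as follows. This is a minimal Python 3 sketch, not the answerer's code: the function name count_links, the ignore_case flag, and the shortened sample text are assumptions made for illustration.

```python
def count_links(context, links, ignore_case=False):
    """Count non-overlapping occurrences of each link in context."""
    # Lowercase BOTH sides when ignoring case; lowercasing only the
    # context makes capitalized links such as 'November' count as 0.
    haystack = context.lower() if ignore_case else context
    counts = {}
    for link in links:
        needle = link.lower() if ignore_case else link
        counts[link] = haystack.count(needle)
    return counts


# Shortened, hypothetical sample based on the question's text
sample = "It was November. Although it was November November November Passage. Father had finished."
sample_links = ["November", "Passage", "Father had"]

print(count_links(sample, sample_links))
print(count_links(sample.upper(), sample_links, ignore_case=True))
```

Because str.count matches plain substrings, a link that is part of a longer word or a longer link is counted as well.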
Here is an implementation using regex findall:
import re
links = ['November', 'Laundress', 'Passage', 'Father had']
# Create a big regex catching all the links
# Something like: "(November)|(Laundress)|(Passage)|(Father had)"
regex = "|".join(map(lambda x: "(" + x + ")", links))
context = "It was November. Although it was November November November Passage not yet late, the sky was dark when I turned into Laundress Passage. Father had finished for the day, switched off the shop lights and closed the shutters; but so I would not come home to darkness he had left on the light over the stairs to the flat. Through the glass in the door it cast a foolscap rectangle of paleness onto the wet pavement, and it was while I was standing in that rectangle, about to turn my key in the door, that I first saw the letter. Another white rectangle, it was on the fifth step from the bottom, where I couldn\'t miss it."
result = re.findall(regex, context)
# Result here is:
# [('November', '', '', ''), ('November', '', '', ''), ('November', '', '', ''), ('November', '', '', ''), ('', '', 'Passage', ''), ('', 'Laundress', '', ''), ('', '', 'Passage', ''), ('', '', '', 'Father had')]
# Now we count regex matches
counts = [0] * len(links)
for x in result:
    for i in range(len(links)):
        if x[i] != "":
            counts[i] += 1
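The per-group counting loop above can probably be condensed. The following Python 3 sketch is a variant rather than the answer's own code: it drops the capture groups so that findall returns the matched strings directly, tallies them with collections.Counter, and escapes each link with re.escape in case a link contains regex metacharacters. The name count_with_findall and the shortened sample text are assumptions.

```python
import re
from collections import Counter


def count_with_findall(context, links):
    """Count matches of each link using one alternation regex."""
    # re.escape keeps characters such as '.' or '(' in a link literal.
    pattern = re.compile("|".join(re.escape(link) for link in links))
    # Without capture groups, findall returns the matched substrings.
    found = Counter(pattern.findall(context))
    return {link: found[link] for link in links}


# Shortened, hypothetical sample based on the question's text
sample = "It was November. Although it was November November November Passage not yet late, I turned into Laundress Passage. Father had finished."
links = ["November", "Laundress", "Passage", "Father had"]
print(count_with_findall(sample, links))
```

Note that alternation tries the links in list order at each position, so when one link is a prefix of another, the order of the list affects which one is counted.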
Try this:
>>> import re
>>> for word in links:
...     print word + '=' + str(len([w.start() for w in re.finditer(word, context)]))
...
November=4
Laundress=1
Passage=2
Father had=1
>>>
You can also use re.IGNORECASE:
for word in links:
    print word + '=' + str(len([w.start() for w in re.finditer(word, context, re.IGNORECASE)]))
What if some links are substrings of other links? For example, links = ['day', 'mayday', 'today'].
Then I should not count them. In that case, 'day' should be returned as one. It should match the link exactly.
Maybe I gave a bad example, so what about links = ['November', 'was November']? Should the result be {November: 4, was November: 2} or {November: 2, was November: 2}?
In that case it would be the second set of results: November 2, was November 2. This is for counting how many times a term (one word or several words) is used as a link to a particular place in a document.
This is just a test string for a larger project, so doing it this way would mean going through my database and manually typing every link that has more than one word.
No, you do not have to do it manually; this is just an example. I have updated my answer with a more complete demo.
I managed to get it working better than what I had! But it returns 0 for every term in links. I added an edit.
Yes, because the links still contain uppercase letters. If case matters, remove lower(); otherwise lowercase all the links.
According to the comments under the OP, this works but counts substrings multiple times, which is not the desired result.
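The exact-match behaviour requested in these comments (a longer link takes precedence over a shorter one it contains, and a link inside a larger word is not counted) might be sketched as below. This is one possible approach in Python 3, not code from the thread. It assumes every link starts and ends with a word character so that \b word boundaries apply, and the name count_exact_links and the shortened sample text are hypothetical.

```python
import re
from collections import Counter


def count_exact_links(context, links):
    """Count whole-word, non-overlapping matches, longest link first."""
    # Try longer links first so 'was November' wins over 'November'.
    ordered = sorted(links, key=len, reverse=True)
    alternation = "|".join(re.escape(link) for link in ordered)
    # \b requires a word boundary on both sides, so 'day' does not
    # match inside 'mayday' or 'today'.
    pattern = re.compile(r"\b(?:" + alternation + r")\b")
    found = Counter(pattern.findall(context))
    return {link: found[link] for link in links}


# Shortened, hypothetical sample based on the question's text
text = "It was November. Although it was November November November Passage not yet late."
print(count_exact_links(text, ["November", "was November"]))
print(count_exact_links("mayday today day", ["day"]))
```

Each character of the text is consumed by at most one match, which is what gives November 2 and was November 2 for the sample above rather than November 4.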