Python: how to find the string with the most matches in a list of strings


python string list rss


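I'll try to explain in detail what I need:

I'm parsing an RSS feed in Python using feedparser. The feed has a list of items, each with a title, link, and description, just like any normal RSS feed.

On the other hand, I have a list of strings containing some keywords that I need to find in the item descriptions.

What I need to do is find the item with the most keyword matches.

For example:

RSS feed

So, in this example, the item with the most (unique) matches is the first one, since it contains all 4 keywords (it doesn't matter that it says "cats" rather than "cat"; I just need to find the literal keyword inside the string).

Let me clarify: even if some description contains the keyword "cat" 100 times (and none of the other keywords), it won't be the winner, because I'm looking for the item that contains the most distinct keywords, not the one where a keyword appears the most times.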
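To make that distinction concrete, here is a minimal sketch contrasting the two ways of counting. The helper names `distinct_hits` and `total_occurrences` and the two sample strings are made up for illustration; they are not part of the original question.

```python
keywords = ["cat", "lion", "panthera", "family"]


def distinct_hits(text, keywords):
    """Number of different keywords that occur at least once in text."""
    lowered = text.lower()
    return sum(1 for word in keywords if word in lowered)


def total_occurrences(text, keywords):
    """Total number of keyword occurrences, repeats included."""
    lowered = text.lower()
    return sum(lowered.count(word) for word in keywords)


# Hypothetical descriptions: one repeats a single keyword, one mixes several.
repetitive = "cat " * 100
varied = "Cats and lions both belong to the family Felidae."

print(distinct_hits(repetitive, keywords), total_occurrences(repetitive, keywords))  # 1 100
print(distinct_hits(varied, keywords), total_occurrences(varied, keywords))          # 3 3
```

Ranking by `distinct_hits` puts the varied description first, while ranking by `total_occurrences` would wrongly favor the repetitive one.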

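Right now I'm looping over the RSS items and doing it "manually", counting the number of times the keywords appear (but I run into the problem described in the paragraph above).

I'm very new to Python and I come from a different language (C#), so I'm sorry if this is a very trivial question.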

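How would you solve this?

texts = ["The lion (Panthera leo) ...", "Panthera ...", "..."]
keywords = ['cat', 'lion', 'panthera', 'family']

# gives the number of keywords found in the text
# (case-insensitive substring test, each keyword counted at most once)
def matches(text):
    return sum(word in text.lower() for word in keywords)

# or inline that helper function as a lambda:
# matches = lambda text: sum(word in text.lower() for word in keywords)

# print the text with the highest count of matches
# (Python 3 syntax; the original used the Python 2 print statement)
print(max(texts, key=matches))

The other answers are very elegant, but they may be too simple for the real world. Some of the ways they can break:

Partial word matching - does "cat" match "concatenate"? What about "cats"?
Case sensitivity - does "cat" match "CAT"? What about "Cat"?

My solution below takes both cases into account.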

import re

test_text = """
Cat?

The domestic cat is a small, usually furry, domesticated, 
carnivorous mammal. It is often called the housecat, or simply the 
cat when there is no need to distinguish it from other felids and felines.
"""

wordlist = ['cat','lion','feline']
# Construct regexp like r'\W(cat|lion|feline)(s?)\W'
# Matches cat, lion or feline as a whole word ('cat' matches, 'concatenate'
# does not match)
# also allow for an optional trailing 's', so that both 'cat' and 'cats' will
# match.
wordlist_re = r'\W(' + '|'.join(wordlist) + r')(s?)\W'

# Get list of all matches from text. re.I means "case insensitive".
matches = re.findall(wordlist_re, test_text, re.I)

# Build list of matched words. the `[0]` means first capture group of the regexp
matched_words = [ match[0].lower() for match in matches]

# See which words occurred
unique_matched_words = [word for word in wordlist if word in matched_words]

# Count unique words
num_unique_matched_words = len(unique_matched_words)

The output looks like this:

>>> wordlist_re
'\\W(cat|lion|feline)(s?)\\W'
>>> matches
[('Cat', ''), ('cat', ''), ('cat', ''), ('feline', 's')]
>>> matched_words
['cat', 'cat', 'cat', 'feline']
>>> unique_matched_words
['cat', 'feline']
>>> num_unique_matched_words
2
>>>
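The regexp above handles a single text. As a rough sketch of how the same idea could rank several feed items, the version below uses `\b` word boundaries instead of `\W` (so a keyword at the very start or end of the text also matches) and `re.escape` in case a keyword contains regexp metacharacters. The function name `count_distinct_keywords` and the sample `items` are assumptions for illustration; plain dictionaries stand in for parsed feed entries.

```python
import re


def count_distinct_keywords(text, keywords):
    """Count how many different keywords appear in text as whole words.

    Matching is case-insensitive and tolerates an optional trailing 's'.
    """
    found = set()
    for word in keywords:
        pattern = r"\b" + re.escape(word) + r"s?\b"
        if re.search(pattern, text, re.IGNORECASE):
            found.add(word)
    return len(found)


# Hypothetical stand-ins for feed items (title + description only).
items = [
    {"title": "Big cats", "description": "Lions and other cats of the Panthera family."},
    {"title": "Strings", "description": "How to concatenate strings."},
    {"title": "Pets", "description": "The cat sat on the mat. Cat, cat, cat."},
]
keywords = ["cat", "lion", "panthera", "family"]

# Pick the item whose description contains the most distinct keywords.
best = max(items, key=lambda item: count_distinct_keywords(item["description"], keywords))
print(best["title"])  # Big cats
```

Here "concatenate" scores zero because "cat" is not a whole word in it, and the item that repeats "cat" four times still scores only one.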

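The answers below are all good, but watch out for partial matches (do you count "concatenate" as an occurrence of "cat"?) and capitalization (do you count "Cat" as a match? What about "CAT"?)

Yes, for now counting "concatenate" as an occurrence of "cat" is fine, and the match doesn't have to be case-sensitive. Thanks for the warning.

This is an amazing solution. Would you mind explaining how this code works? I'm reading up on lambdas in Python right now.

A conversion to lowercase is missing.

@NiklasB. Yes, you're right. I just replaced the 'texts' argument in the max function with: [x.lower() for x in texts]

@emzero: you could also do sum(word in text.lower() for word in keywords)

@NiklasB. Right, that's better, since it prints the original string rather than the lowercased one. Thanks.

According to the question, partial matches are fine and the search should be case-insensitive.

Side note: case-insensitive regexps are sometimes a bad idea (they can sometimes be slow because of backtracking). You could lower() the whole string first, but be careful with unicode strings (what is 'Гааааааааааааа'.lower()?)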
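On the unicode point in the last comment: assuming Python 3, `str.casefold()` is intended for caseless comparison and is more aggressive than `lower()`. A small sketch follows; the `normalize` helper is a hypothetical name, not something from the thread.

```python
def normalize(text):
    """Return a form of text suitable for case-insensitive comparison."""
    return text.casefold()


# lower() leaves the German sharp s alone; casefold() expands it to "ss".
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse

# Normalize the text once, then test (already lowercase) keywords against it.
print("cat" in normalize("CAT food"))  # True
```

If the keywords themselves may contain uppercase or non-ASCII letters, they would need the same normalization before comparing.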
