Python: finding a recurring pattern in a list of strings

python regex string

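I'm looking for a way to strip the longest recurring pattern from a set of strings.

I have a list of about 1000 web page titles, all of which share a common suffix: the name of the website.

They follow this pattern: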

['art gallery - museum and visits | expand knowledge',
 'lasergame - entertainment | expand knowledge',
 'coffee shop - confort and food | expand knowledge',
 ...
]

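How can I automatically strip the common suffix "| expand knowledge" from all of these strings?

Thanks

Edit: sorry, I wasn't clear enough. I have no information about the "| expand knowledge" suffix beforehand.

I'd like to be able to clean a list of strings of a potential common suffix, even if I don't know what it is.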

If you're sure that all the strings have the common suffix, then this will do the trick:

strings = [
  'art gallery - museum and visits | expand knowledge',
  'lasergame - entertainment | expand knowledge']
suffixlen = len(" | expand knowledge")
print([s[:-suffixlen] for s in strings])

Output:

['art gallery - museum and visits', 'lasergame - entertainment']

If you do know the suffix you want to strip, you can simply do the following:

suffix = " | expand knowledge"

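your_list = ['art gallery - museum and visits | expand knowledge',
 'lasergame - entertainment | expand knowledge',
 'coffee shop - confort and food | expand knowledge',
 # ...
]

# Slice off the exact suffix. str.rstrip(suffix) would instead strip any
# trailing characters that occur in suffix, e.g. turning "food" into "f".
new_list = [name[:-len(suffix)] if name.endswith(suffix) else name
            for name in your_list]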
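
A pitfall worth spelling out when stripping a known suffix: str.rstrip treats its argument as a set of characters, not as a literal substring. The sketch below shows the difference on one of the titles from the question. The helper name remove_suffix is made up for illustration; on Python 3.9+ the built-in str.removesuffix does the same job.

```python
suffix = " | expand knowledge"
title = "coffee shop - confort and food | expand knowledge"

# rstrip removes every trailing character that appears anywhere in `suffix`,
# so it also eats the "ood" of "food" ('o' and 'd' both occur in the suffix).
print(title.rstrip(suffix))  # coffee shop - confort and f


def remove_suffix(text, suffix):
    """Return text without the exact trailing suffix, if it is present."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


print(remove_suffix(title, suffix))  # coffee shop - confort and food
```

The `if suffix` guard matters: with an empty suffix, `text[:-0]` would evaluate to an empty string instead of leaving the text untouched.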

Here is a solution using the os.path.commonprefix function on the reversed titles:

import os

titles = ['art gallery - museum and visits | expand knowledge',
 'lasergame - entertainment | expand knowledge',
 'coffee shop - confort and food | expand knowledge',
]

# Find the longest common suffix by reversing the strings and using a
# library function to find the common "prefix".
common_suffix = os.path.commonprefix([title[::-1] for title in titles])[::-1]

# Strip the common suffix from every title by slicing off its length.
# Slicing up to len(title) - len(common_suffix) also works when the suffix
# is empty, whereas title[:-0] would return an empty string.
stripped_titles = [title[:len(title) - len(common_suffix)] for title in titles]

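Result:

['art gallery - museum and visits', 'lasergame - entertainment', 'coffee shop - confort and food']

Because it finds the common suffix by itself, it should work on any set of titles, even if you don't know the suffix.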

Can you expand on your requirements a bit? Right now it looks like what you want would take some crazy computing time.

@SamIam I'm working on a crawler that needs minimal knowledge of the target website's HTML structure. I'm scraping the page title from the HTML markup. All of this website's pages contain a common pattern ("expand knowledge") that I'd very much like to remove, to avoid any redundancy. The main issue is that I have no information about the suffix beforehand, since the crawler will be unleashed on several websites.

@BalthazarRouberol Be careful: if, by coincidence, the last letter of every entry is the same, this will also return that last letter.
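
One possible way to address the caveat in the last comment is to accept only the part of the common suffix that begins at a title separator. This is a minimal sketch, not part of any answer above: the function name strip_common_suffix and the default separators (" | " and " - ") are assumptions chosen to match the example titles, and a real crawler would likely need a different list.

```python
import os


def strip_common_suffix(titles, separators=(" | ", " - ")):
    """Strip the longest common suffix, but only from a separator onwards."""
    # With fewer than two titles, "common" would be the whole title.
    if len(titles) < 2:
        return list(titles)

    # Longest common suffix via the reversed-strings trick.
    common = os.path.commonprefix([t[::-1] for t in titles])[::-1]

    # Keep only the part of the common suffix that starts at a separator,
    # so a coincidentally shared last letter or word is left alone.
    starts = [common.find(sep) for sep in separators if sep in common]
    if not starts:
        return list(titles)
    suffix = common[min(starts):]

    return [t[:-len(suffix)] for t in titles]


print(strip_common_suffix(["cats", "dogs"]))
# ['cats', 'dogs']
print(strip_common_suffix(["red cats | site", "blue cats | site"]))
# ['red cats', 'blue cats']
```

In the second call the raw common suffix is " cats | site", but only " | site" is removed because that is where the first separator starts. This is a heuristic: titles that all happen to share a trailing section after a separator will still lose that section.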