Fastest algorithm to get the most common prefix from a list of strings in Python


python python-3.x algorithm


I need a function:

def get_prefix(list_of_strings):
  # Should give me the most common prefix
  # out of the given list_of_strings
  # of the lowest order of time possible

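In addition, on subsequent calls it should be possible to get the second most common prefix, and so on. A prefix should be discarded if its length is less than a global variable (say, min_length_of_prefix).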

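You can use a trie (prefix tree) for this.
Inserting each string takes O(n), where n is the length of the string.
Finding all prefixes of at least the minimum length is done by running a DFS over the tree.
Here is how I implemented it. It returns (prefix, frequency) pairs for all prefixes that are at least min_length_of_prefix characters long, in descending order of frequency.
Output: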
[('not_a_file_', 2), ('not_a_file', 2), ('not_a_fil', 2), ('not_a_fi', 2), ('not_a_f', 2), ('not_a_', 2), ('not_a_file_2', 1), ('not_a_file_1', 1), ('file_3', 1), ('file_2', 1), ('file_1', 1)]
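
A minimal sketch of the trie approach described above, to make the insert-then-DFS idea concrete. This is not the answerer's original code: the names `build_trie` and `prefixes_by_frequency` are hypothetical, the nested-dict node layout is an assumption, and the order among prefixes with equal frequency may differ from the output shown above.

```python
def build_trie(strings):
    """Insert every string into a nested-dict trie.

    Each node stores how many input strings pass through it, which is
    exactly the frequency of the prefix that the node represents.
    """
    root = {"count": 0, "children": {}}
    for s in strings:
        node = root
        for ch in s:  # O(len(s)) work per inserted string
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def prefixes_by_frequency(strings, min_length):
    """Return (prefix, frequency) pairs for prefixes of at least min_length
    characters, sorted by descending frequency."""
    root = build_trie(strings)
    found = []
    stack = [("", root)]  # iterative depth-first traversal
    while stack:
        prefix, node = stack.pop()
        # Skip the empty root prefix and anything shorter than min_length.
        if prefix and len(prefix) >= min_length:
            found.append((prefix, node["count"]))
        for ch, child in node["children"].items():
            stack.append((prefix + ch, child))
    found.sort(key=lambda pair: pair[1], reverse=True)
    return found


names = ['file_1', 'file_2', 'file_3', 'not_a_file_1', 'not_a_file_2']
for prefix, frequency in prefixes_by_frequency(names, 6):
    print(prefix, frequency)
```

Keeping the count on each node means the DFS only has to read frequencies; it never re-scans the input strings.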

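First sort the list so that we can use itertools.groupby to group the strings by their first character, which is treated as a prefix. For every group with more than one member, concatenate that character with each prefix returned by a recursive call to the same get_prefix function on the remainders of the strings; when no further prefix is found, an empty string is returned. The number of members in each group at each recursion level is returned together with the prefix as a tuple, so that in the end we can use it as the sort key to make sure more common prefixes come first.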
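from itertools import groupby
from operator import itemgetter

list_of_strings = ['file_4', 'not_a_f', 'file_1', 'file_2', 'file_3', 'not_a_file_1', 'not_a_file_2']

def get_prefix(l, m):
    # l: list of strings; m: minimum prefix length (None in recursive calls)
    if not l: return []
    # Sort once, at the top level only, so groupby sees equal first characters adjacently
    if m is not None: l.sort()
    # For each group of 2+ strings sharing first character k, prepend k to every prefix
    # found in the remainders; f is 0 for the empty suffix, so the group size is used instead
    r = [(k + p, f or len(g)) for k, g in [(k, list(g)) for k, g in groupby(l, itemgetter(0))] if len(g) > 1 for p, f in get_prefix([s[1:] for s in g if len(s) > 1], None)] + [('', 0)]
    # Top level only: keep prefixes of at least m characters, most frequent first
    if m: return sorted([(p, f) for p, f in r if len(p) >= m], key=itemgetter(1), reverse=True)
    return r

print(get_prefix(list_of_strings, 4))
print(get_prefix(list_of_strings, 6))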

This outputs:
[('file_', 4), ('file', 4), ('not_a_f', 3), ('not_a_', 3), ('not_a', 3), ('not_', 3), ('not_a_file_', 2), ('not_a_file', 2), ('not_a_fil', 2), ('not_a_fi', 2)]
[('not_a_f', 3), ('not_a_', 3), ('not_a_file_', 2), ('not_a_file', 2), ('not_a_fil', 2), ('not_a_fi', 2)]
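
The question also asks that successive calls return the next most common prefix. A minimal sketch of one way to get that behaviour with a generator, assuming a plain `collections.Counter` over all prefixes rather than either answer's method. `iter_prefixes` is a hypothetical name, and breaking ties in favour of the longer prefix is an assumption the question does not state.

```python
from collections import Counter


def iter_prefixes(strings, min_length):
    """Yield (prefix, frequency) pairs, most frequent first."""
    # Count every prefix of every string that is long enough.
    counts = Counter(
        s[:end]
        for s in strings
        for end in range(max(min_length, 1), len(s) + 1)
    )
    # Sort by frequency (descending), then by prefix length (descending).
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], -len(pair[0])))
    yield from ranked


prefixes = iter_prefixes(
    ['file_1', 'file_2', 'file_3', 'not_a_file_1', 'not_a_file_2'], 6
)
print(next(prefixes))  # ('not_a_file_', 2)
print(next(prefixes))  # ('not_a_file', 2)
```

Each `next()` call hands back the next candidate, so the caller decides how many prefixes to consume. Slicing every prefix costs time quadratic in the string length, so for long strings a trie would likely be the better backing structure.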

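This reads more like a description of your goal than a question. Is there a problem with your current code that we can help you solve? Is it buggy? Too slow? Etc.
@RoadRunner Sorry, I edited my question; it should be not_a_file.
@DSM Yes, I am dealing with large datasets. The first working program I made used a very basic approach: first find the most frequent first character, then pick the next character so that it covers the largest number of strings. But there were too many conditions to handle. If I add more characters, the number of strings matching the prefix decreases, and it quickly gets murky no matter what I try to keep track of.
Can we assume the list of strings is already sorted? Since you want a different value on each call, should the routine be a generator rather than a function? Also, if you show a code attempt, you will avoid the suspicion that you are trying to get others to do your work for you.
@RoryDaulton No, unless the algorithm's running time is the same as or greater than the sorting time. I will edit the question and add my attempt at solving it.
Without your solution I would not even have thought of using such a DAG! Thanks, you inspired me, and I was fully able to modify your solution to make it my own. Also, if the prefixes are grouped with their frequencies...
If not_a_f is appended to the input, should your output change? It doesn't. P.S. I have not looked closely at your algorithm.
@IshanSrivastava I understand your requirement now. I have updated my answer accordingly. Please take another look.
The code can be made more concise and efficient by using recursion.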