Python: fastest way to find all positions of each distinct character in a string


python string


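Suppose we have a string, say "122113", and we need to find every position at which each distinct character of the string occurs.

The naive approach looks like this: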

string = str(raw_input())  # for example: "122113" (Python 2; use input() on Python 3)
distinct_char = list(set(string))
occurrences = []
for element in distinct_char:
    temp = []
    # rescan the whole string once for every distinct character
    for j in range(len(string)):
        if string[j] == element:
            temp.append(j)
    occurrences.append(temp)
print(occurrences)  # output for "122113" would be [[0, 3, 4], [1, 2], [5]]
                    # because 1 occurs at: 0, 3, 4
                    #         2 occurs at: 1, 2
                    #         3 occurs at: 5

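However, this is quite slow when the string is long. Is there a faster solution?

(Assume the string consists only of lowercase English letters and that its length may be as large as $10^{12}$.)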

You should use a defaultdict (whose default value is an empty list) and update the index lists while iterating over the string:

from collections import defaultdict
string = str(raw_input())  # Python 2; use input() on Python 3
occurences = defaultdict(list)  # a missing key starts out as an empty list
for i, c in enumerate(string):
    occurences[c].append(i)
print(occurences)

Then use a list comprehension to get the list of occurrences:

occurences = [l for l in occurences.values()]
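
For readers on Python 3, the same idea can be wrapped in a small function. This is a minimal sketch rather than code from the answer: the name `char_positions` is a hypothetical choice, and it relies on dicts preserving insertion order (Python 3.7 and later) for the order of the printed lists.

```python
from collections import defaultdict

def char_positions(text):
    """Return a dict mapping each distinct character to the list of its indices."""
    positions = defaultdict(list)
    for index, char in enumerate(text):
        positions[char].append(index)  # a missing key starts as an empty list
    return dict(positions)

result = char_positions("122113")
print(result)                 # {'1': [0, 3, 4], '2': [1, 2], '3': [5]}
print(list(result.values()))  # [[0, 3, 4], [1, 2], [5]]
```

The string is traversed once, so the work grows linearly with its length instead of once per distinct character.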

(Sorry, my earlier answer misunderstood the question.)

You can use the following approach:

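import collections
very_long_string = "abcdefghij" * 1000000
indices = collections.defaultdict(list)
for i, c in enumerate(very_long_string):
    indices[c].append(i)

indices will be a dict that maps each character to the list of its indexes (shown here for a shorter string, obviously not for the very long string above):
{
    "a": [0, 10],
    "b": [1, 11],
    "c": [2, 12],
    "d": [3, 13],
    "e": [4, 14],
    "f": [5, 15],
    "g": [6, 16],
    "h": [7, 17],
    "i": [8, 18],
    "j": [9, 19],
}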

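On my machine, doing this for 10000 characters takes about 3 seconds.

One possible solution is to convert the characters to numbers and use those numbers to increment values in an array. The code might look like this: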
import numpy as np

def alph_to_num(alph):
    return ord(alph.lower())-97

string='alsblnasdglasdaoerngeaglbneronbiwernblnerl'
array=np.zeros(26)

for alph in string:
    index=alph_to_num(alph)
    array[index]=array[index]+1
print(array)

It outputs:
[5. 4. 0. 2. 5. 0. 3. 0. 1. 0. 0. 6. 0. 6. 2. 0. 0. 4. 3. 0. 0. 0. 1. 0.
 0. 0.]

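Here I created an array of length 26 because you know the input contains only lowercase English letters. This also makes the resulting list easier to interpret.

A solution without imports: if you know the input contains only lowercase letters, you can create a list of 26 empty lists up front, then iterate over the string and append each character's index to the slot for that character.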
input_lst="abcdefgaabbfegddsa"
occurence_lst = [[] for i in range(26)]
for index in range(len(input_lst)):
    occurence_lst[ord(input_lst[index]) - 97].append(index)

print(occurence_lst)

[[0, 7, 8, 17], [1, 9, 10], [2], [3, 14, 15], [4, 12], [5, 11], [6, 13], [], [], [], [], [], [], [], [], [], [], [], [16], [], [], [], [], [], [], []]
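
The 26-slot list is indexed by letter offset, so the letter behind each slot has to be recovered from its position. A minimal sketch of turning the result into a letter-keyed dict, assuming the input contains only the lowercase letters a to z (the names `positions_by_letter` and `as_dict` are hypothetical, not from the answer):

```python
def positions_by_letter(text):
    """Return 26 index lists, one per lowercase letter 'a'..'z'."""
    slots = [[] for _ in range(26)]
    base = ord("a")  # 97
    for index, char in enumerate(text):
        # assumes 'a' <= char <= 'z'; other characters would hit the wrong slot or raise
        slots[ord(char) - base].append(index)
    return slots

def as_dict(slots):
    """Keep only the letters that occur, keyed by the letter itself."""
    return {chr(ord("a") + i): idxs for i, idxs in enumerate(slots) if idxs}

slots = positions_by_letter("abcdefgaabbfegddsa")
print(as_dict(slots))
```

Dropping the empty slots makes the output shorter, at the cost of one extra pass over the 26 entries.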

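Assuming Python 2.7, option 1 (I built a dictionary so that you can tell which letter each index list corresponds to):
Average time over 10000 runs on "122113": 2.55961418152e-06
Average time over 10000 runs on "a;LKDSFOWQUBTGAFDNGA;llkl;uihnbr,afdh;Glakhehjehjejeojeoguhaberna": 2.39794969559e-05
Average time over 500 runs on "alkdsfowquebtgafdngallkl"*1000: 0.00993875598907
Option 2: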
s = raw_input()
occurances = {}

for i,let in enumerate(s):
  if let in occurances:
    occurances[let].append(i)
  else:
    occurances[let] = [i]

print(occurances)

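Average time over 10000 runs on "122113": 7.02269077301e-06
Average time over 10000 runs on "a;LKDSFOWQUBTGAFDNGA;llkl;uihnbr,afdh;Glakhehjehjejeojeoguhaberna": 2.39794969559e-05
Average time over 500 runs on "alkdsfowquebtgafdngallkl"*1000: 0.00974810600281
(Timings were measured on repl.it running Python 2.7.)
Edit: depending on exactly how it is used in your script, defaultdict may be faster or slower than a plain dict.

Hint: iterate over the string once and update a dictionary that maps each character to its positions; that way the work scales linearly.
What if len(s) becomes very large? Is that fast enough?
@TuhinKarmakar Well, on my super-slow computer a string of length 10000000 takes 5 seconds. If you want to know whether that is fast, you can test it yourself.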
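
To compare `defaultdict` with a plain `dict` on your own machine, a `timeit` measurement along the following lines may help. This is a sketch written for Python 3, not the benchmark used in the thread: the function names, the repeat count, and the test string are assumptions, and the printed numbers will differ from the repl.it timings quoted above.

```python
import timeit
from collections import defaultdict

def with_defaultdict(s):
    """Single pass; missing keys are created as empty lists automatically."""
    occ = defaultdict(list)
    for i, c in enumerate(s):
        occ[c].append(i)
    return occ

def with_dict(s):
    """Single pass with an explicit membership test on a plain dict."""
    occ = {}
    for i, c in enumerate(s):
        if c in occ:
            occ[c].append(i)
        else:
            occ[c] = [i]
    return occ

sample = "alkdsfowquebtgafdngallkl" * 1000  # arbitrary test input
runs = 50                                   # arbitrary repeat count

for func in (with_defaultdict, with_dict):
    total = timeit.timeit(lambda: func(sample), number=runs)
    print(func.__name__, total / runs)  # average seconds per run
```

Both functions do the same linear amount of work, so any difference comes down to per-character overhead and may vary between Python versions.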