Python: counting the number of unique string sources for substrings


Suppose I have a list of 5-character strings, such as:

AAAAB
BBBBA
BBBBA
ABBBB

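I want to find and count every possible 4-character substring, and keep track of the number of unique 5-character strings each one comes from. This means that although BBBB occurs in three different source strings, it has only two unique sources.

Example output: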

    substring    repeats    unique sources
0     AAAA          1              1
1     AAAB          1              1
2     BBBB          3              2
3     BBBA          2              1
4     ABBB          1              1

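I have managed to do this on a small scale using only Python, with an updating dictionary and two lists for comparing existing substrings and full-length strings. However, when I apply this to my full data set (~160,000 full-length strings (12 characters) producing 150 million substrings (4 characters)), the constant dictionary updating and list comparison is too slow (my script has been running for a week now). Counting the number of substrings present across all full-length strings is easy and cheap in both Python and pandas.

So my question is: how do I efficiently count and update the number of unique full-length sources for each substring in a DataFrame?

TL;DR: Here is an attempt that takes roughly 2 hours on my machine for the scale of data you describe.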

import numpy as np
import pandas as pd

def substring_search(fullstrings, sublen=4):
    '''
    fullstrings: array like of strings
    sublen: length of substring to search
    '''
    # PART 1: FIND SUBSTRINGS

    # length of full strings, assumes all are same
    strsize = len(fullstrings[0])

    # get unique strings and the number of occurrences of each
    strs, counts = np.unique(fullstrings, return_counts=True)
    fullstrings = pd.DataFrame({'string':strs,
                                'count':counts})
    unique_n = len(fullstrings)

    # create array to hold substrings
    substrings = np.empty(unique_n * (strsize - sublen + 1), dtype=str)
    substrings = pd.Series(substrings)

    # slice to find each substring
    c = 0
    while c + sublen <= strsize:
        sliced = fullstrings['string'].str.slice(c, c+sublen)
        s = c * unique_n
        e = s + unique_n
        substrings[s: e] = sliced
        c += 1

    # take the set of substrings, save in output df
    substrings = np.unique(substrings)
    output = pd.DataFrame({'substrings':substrings,
                           'repeats': 0,
                           'unique_sources': 0})

    # PART 2: CHECKING FULL STRINGS FOR SUBSTRINGS

    for i, s in enumerate(output['substrings']):
        # check which fullstrings contain each substring
        # regex=False matches s literally instead of treating it as a regex
        idx = fullstrings['string'].str.contains(s, regex=False)
        count = fullstrings['count'][idx].sum()
        output.loc[i, 'repeats'] = count
        output.loc[i, 'unique_sources'] = idx.sum()
    print('Finished!')

    return output

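Explanation

The basic idea of the code above is to loop over all unique substrings and, for each one, check it against the list of full strings using the pandas str methods. This saves one for loop (i.e. you don't loop over every full string for every substring). The other idea is to check only the unique full strings (in addition to the unique substrings); you save the number of occurrences of each full string beforehand and correct the counts at the end.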

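The basic structure is:

Get the unique strings in the input, and record the number of occurrences of each.

Find all unique substrings in the input (the code above does this with str.slice followed by np.unique).

Loop over each substring and check it element-wise against the full strings (str.contains in the code above). Since the full strings are unique and we know how many times each one occurred, we can fill in both repeats and unique_sources at the same time.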
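The three steps above can be sketched on the four strings from the question. This is only a compact illustration, not a replacement for the function: the variable names (unique_full, occurrences, rows, result) are made up for this example, and literal matching with regex=False is an assumption that suits plain letter strings.

```python
import numpy as np
import pandas as pd

fullstrings = ['AAAAB', 'BBBBA', 'BBBBA', 'ABBBB']
sublen = 4

# Step 1: unique full strings and how often each one occurs
strs, counts = np.unique(fullstrings, return_counts=True)
unique_full = pd.Series(strs)
occurrences = pd.Series(counts)

# Step 2: every unique substring, found by slicing all strings at each offset
n_offsets = len(strs[0]) - sublen + 1
slices = [unique_full.str.slice(c, c + sublen) for c in range(n_offsets)]
substrings = np.unique(pd.concat(slices))

# Step 3: one vectorized containment check per unique substring
rows = []
for sub in substrings:
    mask = unique_full.str.contains(sub, regex=False)
    # repeats weights each unique string by its occurrence count;
    # unique_sources just counts the unique strings that matched
    rows.append((sub, int(occurrences[mask].sum()), int(mask.sum())))

result = pd.DataFrame(rows, columns=['substrings', 'repeats', 'unique_sources'])
print(result)
```

For BBBB this gives repeats 3 and unique_sources 2, matching the example output in the question.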

Testing

Here is the code I used to create larger input data:

import string

n = 100
size = 12

letters = list(string.ascii_uppercase[:20])
bigger = [''.join(np.random.choice(letters, size)) for i in range(n)]

So bigger is a list of strings of length size:

['FQHMHSOIEKGO',
 'FLLNCKAHFISM',
 'LDKKRKJROIRL',
 ...
 'KDTTLOKCDMCD',
 'SKLNSAQQBQHJ',
 'TAIAGSIEQSGI']

Using modified code that prints progress (posted below), I tried n=150000 and size=12, and got the following initial output:

Starting main loop...
5%, 344.59 seconds
10.0%, 685.28 seconds

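So 10 * 685 seconds / 60 (seconds/minute) = ~114 minutes. 2 hours is not ideal, but it is practically more useful than 1 week. I don't doubt there is some cleverer way to do this, but maybe this will help if nothing else is posted.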
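One candidate for a "cleverer way" is a single pass in plain Python. This is a minimal sketch, not the code from this answer, and it has not been timed on the full data set. The function name substring_sources and its internals are hypothetical. It assumes the same "contains" semantics as above: each full string counts at most once per substring.

```python
from collections import Counter

import pandas as pd


def substring_sources(fullstrings, sublen=4):
    """Count, per substring, how many source strings contain it."""
    # unique full string -> number of times it occurs in the input
    string_counts = Counter(fullstrings)

    repeats = Counter()         # sources containing the substring, duplicates included
    unique_sources = Counter()  # distinct sources containing the substring

    for full, n in string_counts.items():
        # a set, so a substring occurring twice in one string counts once
        subs = {full[i:i + sublen] for i in range(len(full) - sublen + 1)}
        for sub in subs:
            repeats[sub] += n
            unique_sources[sub] += 1

    rows = [(sub, repeats[sub], unique_sources[sub]) for sub in sorted(repeats)]
    return pd.DataFrame(rows, columns=['substrings', 'repeats', 'unique_sources'])


example = substring_sources(['AAAAB', 'BBBBA', 'BBBBA', 'ABBBB'])
print(example)
```

Each unique full string is sliced a fixed number of times, so the work grows linearly with the number of unique strings instead of with the product of substrings and full strings.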
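If you do use this code, you may want to verify that the results are correct on some smaller examples. One thing I was unsure of is whether you want to count only whether a substring appears in each full string (i.e. contains), or whether you want to count how many times it appears within each full string. Either way, that should hopefully be a small change.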
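The difference between the two interpretations can be seen on a tiny example. The method name for the second interpretation is missing from the text above, so using Series.str.count here is an assumption. str.count takes a regular expression and counts non-overlapping matches; the lookahead pattern is one way to count overlapping occurrences as well.

```python
import re

import pandas as pd

strings = pd.Series(['BBBBB', 'ABBBB', 'AAAAB'])
sub = 'BBBB'

# Interpretation 1: does the substring appear at all?
present = strings.str.contains(sub, regex=False)

# Interpretation 2a: non-overlapping occurrences per string
non_overlapping = strings.str.count(re.escape(sub))

# Interpretation 2b: overlapping occurrences, via a zero-width lookahead
overlapping = strings.str.count(f'(?={re.escape(sub)})')

print(pd.DataFrame({'string': strings,
                    'present': present,
                    'non_overlapping': non_overlapping,
                    'overlapping': overlapping}))
```

For BBBBB, the substring BBBB is present, occurs once without overlap, and occurs twice when overlapping positions are allowed.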
Here is the additional code that prints progress while searching; the only additions are a few statements in Part 2:

import time

import numpy as np
import pandas as pd

def substring_search_progress(fullstrings, sublen=4):
    '''
    fullstrings: array like of strings
    sublen: length of substring to search
    '''
    # PART 1: FIND SUBSTRINGS

    # length of full strings, assumes all are same
    strsize = len(fullstrings[0])

    # get unique strings and the number of occurrences of each
    strs, counts = np.unique(fullstrings, return_counts=True)
    fullstrings = pd.DataFrame({'string':strs,
                                'count':counts})
    unique_n = len(fullstrings)

    # create array to hold substrings
    substrings = np.empty(unique_n * (strsize - sublen + 1), dtype=str)
    substrings = pd.Series(substrings)

    # slice to find each substring
    c = 0
    while c + sublen <= strsize:
        sliced = fullstrings['string'].str.slice(c, c+sublen)
        s = c * unique_n
        e = s + unique_n
        substrings[s: e] = sliced
        c += 1

    # take the set of substrings, save in output df
    substrings = np.unique(substrings)
    output = pd.DataFrame({'substrings':substrings,
                           'repeats': 0,
                           'unique_sources': 0})

    # PART 2: CHECKING FULL STRINGS FOR SUBSTRINGS

    # for marking progress
    total = len(output)
    every = 5
    progress = every

    # main loop
    print('Starting main loop...')
    start = time.time()
    for i, s in enumerate(output['substrings']):

        # progress
        if (i / total * 100) > progress:
            now = round(time.time() - start, 2)
            print(f'{progress}%, {now} seconds')
            progress = (((i / total * 100) // every) + 1) * every

        # check which fullstrings contain each substring
        # regex=False matches s literally instead of treating it as a regex
        idx = fullstrings['string'].str.contains(s, regex=False)
        count = fullstrings['count'][idx].sum()
        output.loc[i, 'repeats'] = count
        output.loc[i, 'unique_sources'] = idx.sum()
    print('Finished!')

    return output

Is the character set of the values in the list known before you see the list values? E.g. A-Z, etc.

Yes. Each full-length string always consists of the same 20 characters (these are the one-letter codes of the 20 main amino acids, so the first 20 uppercase letters of the alphabet work just as well for the example).

The first 20 inclusive? A-T or A-S?

If we use the alphabet: ABCDEFGHIJKLMNOPQRST. Or if we choose the proper amino acid list: arndcqeghilkmfstwyv.