Fast replacement method in Python


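I wrote a function that takes an aligned DNA sequence and replaces each "-" with "Z" whenever the run of "-" has bases both before and after it. The goal is to turn these gaps into "Z" so that I can distinguish unsequenced regions of the genome from insertions/deletions. Here is the function: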

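import re

def find_insertion_deletion(sequence):
    # A run of bases, then hyphens, with another base required right after them
    pattern = r'[A-Z]+-+(?=[A-Z]+)'
    # Within each match, turn the hyphens into 'Z'
    new_sequence = re.sub(pattern, lambda x: x.group().replace('-', 'Z'), sequence)
    return new_sequence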

It uses a regular expression to find the pattern and then performs the replacement. Here is an example:

sequence = '-----------AGCATCGACGTCTAGTAC---CGTACGTA--CGTACGTAGCTA-GCTAGCTAGCTGATCGATGCTAGCA---------------'
new_sequence = find_insertion_deletion(sequence)

Output:

new_sequence = '-----------AGCATCGACGTCTAGTACZZZCGTACGTAZZCGTACGTAGCTAZGCTAGCTAGCTGATCGATGCTAGCA---------------'
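To see what the pattern actually matches, the matched spans can be listed with `re.finditer`. This is only an illustrative sketch that reuses the pattern and example sequence above; the name `matches` is introduced here for illustration.

```python
import re

pattern = r'[A-Z]+-+(?=[A-Z]+)'
sequence = '-----------AGCATCGACGTCTAGTAC---CGTACGTA--CGTACGTAGCTA-GCTAGCTAGCTGATCGATGCTAGCA---------------'

# Each match is a run of bases plus the hyphens after it. The lookahead
# requires a base right after the hyphens but does not consume it.
matches = [m.group() for m in re.finditer(pattern, sequence)]

for text in matches:
    print(text)
```

This prints three matches: `AGCATCGACGTCTAGTAC---`, `CGTACGTA--` and `CGTACGTAGCTA-`. Each one includes the bases in front of the gap, so the replacement callback also copies those bases, which may account for some of the overhead.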

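It works exactly the way I want, but it is very slow when I run it on many aligned sequences. Is there a way to speed it up substantially? I thought a regular expression would be the fastest approach, but maybe there is another way I don't know about.

Thanks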

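I'm assuming your example is representative, i.e. that you simply need to replace all hyphens except the leading and trailing ones. This uses basic string functions and is faster: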

def find_insertion_deletion(sequence):
    stripped = sequence.strip('-')
    if not stripped:
        return sequence
    start = sequence.index(stripped[0])
    end = len(sequence) - start - len(stripped)
    return '-' * start + stripped.replace('-', 'Z') + '-' * end
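The assumption above can be checked with a quick comparison of the two approaches. The following is a minimal sketch, not part of the original answer: the names `regex_version` and `strip_version` are chosen here for illustration, and the random inputs are assumed to use only the characters `A`, `C`, `G`, `T` and `-`.

```python
import random
import re

def regex_version(sequence):
    # The regex approach from the question.
    pattern = r'[A-Z]+-+(?=[A-Z]+)'
    return re.sub(pattern, lambda x: x.group().replace('-', 'Z'), sequence)

def strip_version(sequence):
    # The strip-based approach from the answer.
    stripped = sequence.strip('-')
    if not stripped:
        return sequence
    start = sequence.index(stripped[0])
    end = len(sequence) - start - len(stripped)
    return '-' * start + stripped.replace('-', 'Z') + '-' * end

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
for _ in range(1000):
    length = rng.randint(0, 30)
    seq = ''.join(rng.choice('ACGT-') for _ in range(length))
    # On uppercase bases and hyphens the two versions should agree.
    assert regex_version(seq) == strip_version(seq), seq

# They differ once a hyphen is not surrounded by uppercase letters:
print(regex_version('a-c'), strip_version('a-c'))
```

The last line shows where the assumption matters: the regex only treats `[A-Z]` as bases, so `'a-c'` is left unchanged by the regex version but becomes `'aZc'` with the strip-based one.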

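Or, using `lstrip` and `rstrip` to count the leading and trailing hyphens (this is `Kelly_2` in the benchmark code).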

Benchmark

With your example sequence:

   80.5 us  original
   20.5 us  Wiktor_1
   14.4 us  Wiktor_2
    3.6 us  Kelly_1
    3.3 us  Kelly_2

With a longer sequence (`sequence *= 1000`):

 5931.9 us  original
20896.0 us  Wiktor_1
 7498.8 us  Wiktor_2
  150.5 us  Kelly_1
  160.9 us  Kelly_2

Code:

from timeit import repeat
import re
import regex  # third-party package, installed separately from PyPI

def original(sequence):
    pattern = r'[A-Z]+-+(?=[A-Z]+)'
    new_sequence = re.sub(pattern, lambda x: x.group().replace('-', 'Z'), sequence)
    return new_sequence

def Wiktor_1(sequence):
    return regex.sub(r'(?:\G(?!\A)|[A-Z](?=-+[A-Z]))\K-', 'Z', sequence)

def Wiktor_2(sequence):
    return re.sub(r'\b-+\b', lambda x: x.group().replace('-', 'Z'), sequence)

def Kelly_1(sequence):
    stripped = sequence.strip('-')
    if not stripped:
        return sequence
    start = sequence.index(stripped[0])
    end = len(sequence) - start - len(stripped)
    return '-' * start + stripped.replace('-', 'Z') + '-' * end

def Kelly_2(sequence):
    lstripped = sequence.lstrip('-')
    start = len(sequence) - len(lstripped)
    stripped = lstripped.rstrip('-')
    end = len(lstripped) - len(stripped)
    return '-' * start + stripped.replace('-', 'Z') + '-' * end

funcs = original, Wiktor_1, Wiktor_2, Kelly_1, Kelly_2

sequence = '-----------AGCATCGACGTCTAGTAC---CGTACGTA--CGTACGTAGCTA-GCTAGCTAGCTGATCGATGCTAGCA---------------'
sequence *= 1   # or 1000 with number = 10**2
number = 10**5

# Check that every function gives the same result as the original
expect = original(sequence)
for func in funcs:
    print(func(sequence) == expect, func.__name__)

# Three rounds; report the best per-call time of each function in microseconds
for _ in range(3):
    print()
    for func in funcs:
        t = min(repeat(lambda: func(sequence), number=number)) / number
        print('%7.1f us ' % (t * 1e6), func.__name__)

Are you sure this function is the performance bottleneck?

Yes. This is part of a larger pipeline I built for analyzing genome sequences, and I have multiple progress bars and print statements that let the user know where they are in the process. This is the one function that adds a lot of time.

Wiktor's function: I just tried it and it does seem to speed things up a lot! Thank you everyone!

As far as I can see, "everyone" is at most me (if you count my answer as a hint). Why were those original tags removed? To me they looked both appropriate and useful.

Thanks! This is perfect. I had no idea how much faster basic string functions would be.