Python loop optimization for extracting a substring from one field based on another field

python performance optimization pandas

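I'm working with a large dataset (4 million rows by 4 columns). The first column is a name identifier, and many rows share the same name. The second column is a position that starts at -6 and counts up until a new identifier is encountered, at which point the count starts over. The third column is a random number that doesn't matter here. The fourth column is a long sequence of digits, like a long barcode. The data looks something like this: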

YKLOI    -6    01    123456789012345678901234
YKLOI    -5    25    123456789012345678901234
YKLOI    -4    05    123456789012345678901234
YKLOI    -3    75    123456789012345678901234
YKLOI    -2    83    123456789012345678901234
YKLOI    -1    05    123456789012345678901234
YKLOI     0    34    123456789012345678901234
YKLOI     1    28    123456789012345678901234
YKLJW    -6    87    569845874254658425485
YKLJW    -5    87    569845874254658425485
...

I'd like it to look like this:

YKLOI    -6    01    123   #puts 1st triplet in position -6
YKLOI    -5    25    456   #puts 2nd triplet in position -5
YKLOI    -4    05    789   #puts 3rd triplet in position -4
YKLOI    -3    75    012   #puts 4th triplet in position -3
YKLOI    -2    83    345                ...
YKLOI    -1    05    678
YKLOI     0    34    901
YKLOI     1    28    234   #puts last triplet in the last position
YKLJW    -6    87    569   #puts 1st triplet in position -6
YKLJW    -5    87    845   #puts 2nd triplet in position -5
...

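The length of the fourth column varies a lot, but the numbers in the second column are always in sequential order.

The code below is what I came up with. It does the job, but it takes forever to finish: it has been running for almost 18 hours and has only just passed 1 million rows.

I tried a few alternatives, such as building the map only when the name in the first column differs between consecutive rows, but that just added an if statement and made the code much slower.

Does anyone have suggestions on how to improve the performance of this task?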

import pandas as pd

#imports data
d = pd.read_csv('INPUT_FILE', sep='\t') 

#acknowledges that data was imported
print "Import Okay" 

#sets output path
output='OUTPUT_FILE' 

#loops from the first row till the end
for z in xrange(0,len(d)-1): 

    #goes to the fourth column, split the content every 3 characters and creates 
    #a list of these triplets.
    mop=map(''.join, zip(*[iter(d.loc[z][3])]*3)) 

    #substitutes the content of the fourth column in the z line by the triplet in 
    #the z+6 position
    d.ix[z,3] = mop[int(d.loc[z][1])+6]

    #writes the new z line into the output file
    d.loc[[z]].to_csv(output, sep='\t', header=False, index=False, mode='a')

#acknowledges that the code is through
print "Done"

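Two simple changes to start. First, don't write to the output file incrementally. It adds a lot of unnecessary overhead and is by far your biggest problem.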
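To illustrate the point about incremental writes, here is a minimal sketch. It assumes a tiny made-up DataFrame and uses in-memory `io.StringIO` buffers in place of the real output file. It only shows that one `to_csv` call per row and a single `to_csv` call on the whole frame produce the same text, so the per-row calls add overhead without changing the result.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real 4-million-row data.
d = pd.DataFrame({
    0: ["YKLOI", "YKLOI", "YKLJW"],
    1: [-6, -5, -6],
    2: [1, 25, 87],
    3: ["123", "456", "569"],
})

# Row-by-row: one to_csv call per row, as in the question's loop.
slow = io.StringIO()
for z in range(len(d)):
    d.iloc[[z]].to_csv(slow, sep="\t", header=False, index=False)

# Whole frame: a single to_csv call after all processing is finished.
fast = io.StringIO()
d.to_csv(fast, sep="\t", header=False, index=False)

# Both approaches yield identical output text.
same_output = slow.getvalue() == fast.getvalue()
```

With a real file path and `mode='a'`, each per-row call also has to open and close the file, which is where much of the extra cost is likely to come from.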

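Second, you seem to go through a lot of steps to pull out the triplet. Something like this would be more efficient, and .apply removes some of the loop overhead.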

def triplet(row):
    loc = (row[1] + 6) * 3
    return row[3][loc:loc+3]

d[3] = d.apply(triplet, axis=1)

# save the whole file once
d.to_csv(output2, sep='\t', header=False, index=False)
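For readers on a current pandas and Python 3, the snippet above can be tried end to end with the following sketch. Several things here are assumptions rather than part of the original answer: the sample rows are inlined through `io.StringIO`, the file is read with `header=None` so the columns get the integer labels 0 to 3, and the third and fourth columns are forced to strings with `dtype` so that long digit sequences are not parsed as numbers. The list comprehension at the end is an alternative formulation of the same slice, not something proposed in the thread.

```python
import io
import pandas as pd

# Hypothetical inline sample in the same tab-separated layout as the question.
raw = (
    "YKLOI\t-6\t01\t123456789012345678901234\n"
    "YKLOI\t-5\t25\t123456789012345678901234\n"
    "YKLOI\t1\t28\t123456789012345678901234\n"
    "YKLJW\t-6\t87\t569845874254658425485\n"
    "YKLJW\t-5\t87\t569845874254658425485\n"
)

# header=None gives integer column labels 0..3 (assumption: the file has no header row).
# dtype keeps columns 2 and 3 as text so they can be sliced as strings.
d = pd.read_csv(io.StringIO(raw), sep="\t", header=None, dtype={2: str, 3: str})

def triplet(row):
    # Position -6 maps to characters 0..2, position -5 to 3..5, and so on.
    start = (row[1] + 6) * 3
    return row[3][start:start + 3]

# Same slice written as a list comprehension over the two columns,
# computed before column 3 is overwritten.
by_comprehension = [s[(p + 6) * 3:(p + 6) * 3 + 3] for p, s in zip(d[1], d[3])]

# Replace the fourth column with the extracted triplets in one call.
d[3] = d.apply(triplet, axis=1)
```

The comprehension avoids building a Series object for every row, so it may be worth timing against `.apply` on the real data.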

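Probably the only way to make the program more efficient is to rewrite it in a more efficient programming language. You would gain a speedup of 10x to 100x or more. It is well known that Python is not a particularly efficient language, and its loops are very inefficient.

Is the data in the last column a string or a long integer?

@Renzo - So once a Python program has been written, there is no way to make it faster other than rewriting it in another language? Is your Python IDE a stone tablet?

@Renzo - It is well known that many people don't understand how to use Python efficiently.

@Alexander, it is quite true that many people have trouble optimizing Python code, and there is usually a lot of room to improve any program in any language. But I think it is also true that many people use the wrong tool for a given job. If the OP says "this takes forever", then in my opinion only two things can really improve the situation: either change the algorithm fundamentally to reduce its complexity, or change the language and use one that compiles to machine code. If the answer given reduces the complexity, the problem is solved. Otherwise...

Indeed, the constant saving was a drag. That change alone already helped: each loop iteration now takes only 3 seconds instead of 15. But the function somehow isn't working. Either I'm doing something wrong (very likely!) or it actually increases the loop cycle to 8 seconds. Any ideas about what I might be doing wrong?

Can you post the code that isn't working, or the error you're getting? Keep in mind that .apply already runs over every row by itself, so it should be called only once and not inside a loop. You should see another significant speedup (depending on the size of the data, I got a 20-100x speedup).

I've just re-edited it, and somehow it ran so fast that I didn't see the result at first. It adds a new column instead of replacing the existing one, and given the size of the data it's easy to lose track of it in there. But it works like a charm! All the data was processed in under 1 minute. I'm shocked at how fast it is. Thanks.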
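The last comment mentions that the result showed up as a new column rather than replacing the fourth one. A likely explanation, offered here as an assumption, is that the file was read without `header=None`: the first data row then becomes the column labels, and assigning to `d[3]` creates a column labelled `3` instead of overwriting the fourth column. The sketch below reproduces this on a small made-up sample and shows one way to target the fourth column by position.

```python
import io
import pandas as pd

# Hypothetical sample with no header row, as in the question's data.
raw = (
    "YKLOI\t-6\t01\t123456789\n"
    "YKLOI\t-5\t25\t123456789\n"
    "YKLOI\t-4\t05\t123456789\n"
)

# Without header=None the first data row is consumed as the column labels.
d = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)

# Extract the triplet for each remaining row using positional column access.
positions = d.iloc[:, 1].astype(int)
triplets = [s[(p + 6) * 3:(p + 6) * 3 + 3] for p, s in zip(positions, d.iloc[:, 3])]

# Assigning to the label 3 appends a fifth column, because no column is named 3.
added = d.copy()
added[3] = triplets

# Assigning through the existing label of the fourth column replaces it.
replaced = d.copy()
replaced[replaced.columns[3]] = triplets
```

Note that reading without `header=None` also drops the first data row from the result, since it is treated as the header.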