Python match takes a long time to finish


python regex pandas


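I am new to Python and wrote the following code, which runs very slowly.

I debugged it and found that the last re.match() is what makes the code so slow. The previous match returns quickly, even though it performs the same kind of match against the same DataFrame.

The code is as follows: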

import re
import pandas as pd

My_Cells = pd.read_csv('SomeFile',index_col = 'Gene/Cell Line(row)').T
My_Cells_Others = pd.DataFrame(index=My_Cells.index,columns=[col for col in My_Cells if re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col)])
My_Cells_Genes = pd.DataFrame(index=My_Cells.index,columns=[col for col in My_Cells if re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col) is None ])
for col in My_Cells.columns:
   if  re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col):
          My_Cells_Others [col] = pd.DataFrame(My_Cells[col])
   if  re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col) is None:
          My_Cells_Genes [col] =  pd.DataFrame(My_Cells[col])

I don't think the problem is related to the regex. The code below still runs slowly:

# lst is the list of tissue names defined in the answer below
for col in My_Cells_Others.columns:
    if (col in lst) or col.endswith(' CN') or col.endswith(' MUT'):
        My_Cells_Others[col] = My_Cells[col]
for col in My_Cells_Genes.columns:
    if not ((col in lst) or col.endswith(' CN') or col.endswith(' MUT')):
        My_Cells_Genes[col] = My_Cells[col]

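A "poorly designed" regex can be very slow.

My guess is that .*\sCN and .*\sMUT, combined with a large string that doesn't match, are what make it so slow, because they force the script to check every possible combination.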
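One way to check this guess is to time the pattern against a long name that does not match and compare it with a plain string test. This is only a rough sketch: the test string, the repetition count, and the helper names `regex_test` and `string_test` are made up for illustration, and the timings will differ from machine to machine.

```python
import re
import timeit

# Same alternatives as the pattern in the question, compiled once.
PATTERN = re.compile(
    r'.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$'
    r'|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$'
    r'|^thyroid$|^upper aerodigestive$|^uterus$'
)
TISSUES = {
    'bladder', 'blood', 'bone', 'breast', 'CNS', 'GI tract', 'kidney', 'lung',
    'other', 'ovary', 'pancreas', 'skin', 'soft tissue', 'thyroid',
    'upper aerodigestive', 'uterus',
}

# Hypothetical worst case: a long name that matches none of the alternatives.
long_name = 'GENE ' * 2000 + 'X'


def regex_test(col):
    """Return True if the question's pattern matches the column name."""
    return PATTERN.match(col) is not None


def string_test(col):
    """Plain-string version of the same check.

    Note: \\s accepts any whitespace, while endswith(' CN') only accepts a
    plain space, so the two agree only for names with ordinary spaces.
    """
    return col in TISSUES or col.endswith((' CN', ' MUT'))


regex_seconds = timeit.timeit(lambda: regex_test(long_name), number=100)
string_seconds = timeit.timeit(lambda: string_test(long_name), number=100)
print(f'regex:  {regex_seconds:.4f} s for 100 calls')
print(f'string: {string_seconds:.4f} s for 100 calls')
```

If the two timings come out close for names of realistic length, the regex is probably not the bottleneck.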

As @jedwards said, you can replace this code

if  re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col):
          My_Cells_Others [col] = pd.DataFrame(My_Cells[col])

with:

lst = ['bladder', 'blood', 'bone', 'breast', 'CNS', 'GI tract', 'kidney', 'lung', 'other', 'ovary', 'pancreas', 'skin',
       'soft tissue', 'thyroid', 'upper aerodigestive', 'uterus']

if (col in lst) or col.endswith(' CN') or col.endswith(' MUT'):
    # Do stuff
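Since the same test is needed for both target frames, it may be simpler to classify the column names once and keep the two lists. The sketch below works on plain strings only; `is_other`, `split_columns`, and the sample names are hypothetical and not part of the original code.

```python
TISSUES = {
    'bladder', 'blood', 'bone', 'breast', 'CNS', 'GI tract', 'kidney', 'lung',
    'other', 'ovary', 'pancreas', 'skin', 'soft tissue', 'thyroid',
    'upper aerodigestive', 'uterus',
}
SUFFIXES = (' CN', ' MUT')


def is_other(col):
    """True for tissue-name columns and columns ending in ' CN' or ' MUT'."""
    return col in TISSUES or col.endswith(SUFFIXES)


def split_columns(columns):
    """Partition column names into (others, genes), preserving order."""
    others = [c for c in columns if is_other(c)]
    genes = [c for c in columns if not is_other(c)]
    return others, genes


# Made-up column names, for illustration only.
sample = ['TP53', 'TP53 CN', 'KRAS MUT', 'lung', 'BRCA1']
others, genes = split_columns(sample)
print(others)
print(genes)
```

With the two lists in hand, My_Cells[others] and My_Cells[genes] would each select a group in a single indexing step rather than assigning roughly 14,000 columns one at a time. Whether that removes the slowdown for your data is something you would have to measure.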

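Or, if for some reason you want to use re, moving .*\sCN and .*\sMUT to the end of the regex may help, depending on your data, because the engine is then not forced to check all those combinations unless it really has to.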
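A minimal sketch of that reordering, assuming the same alternatives as in the question: the exact tissue names are tried first and the two .* branches last. The names `ORIGINAL`, `REORDERED`, and the sample column names are invented for illustration. Alternation order does not change which names match here, only how much work is done before a match is found, and the loop below checks that on a few samples.

```python
import re

TISSUE_ALTERNATIVES = (
    r'^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$'
    r'|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$'
    r'|^upper aerodigestive$|^uterus$'
)
SUFFIX_ALTERNATIVES = r'.*\sCN$|.*\sMUT$'

# Pattern as written in the question: the .* branches come first.
ORIGINAL = re.compile(SUFFIX_ALTERNATIVES + '|' + TISSUE_ALTERNATIVES)
# Reordered pattern: cheap anchored literals first, .* branches last.
REORDERED = re.compile(TISSUE_ALTERNATIVES + '|' + SUFFIX_ALTERNATIVES)

# Made-up column names, for illustration only.
names = ['TP53', 'TP53 CN', 'KRAS MUT', 'lung', 'soft tissue', 'CNS',
         'BRCA1', 'CN', 'lungs']

for name in names:
    before = ORIGINAL.match(name) is not None
    after = REORDERED.match(name) is not None
    # Both patterns should classify every name the same way.
    assert before == after, name

matched = [name for name in names if REORDERED.match(name)]
print(matched)
```

Note that a name matching none of the alternatives still has to fail every branch, so the reordering only helps the names that do match a tissue literal.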

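What about: if col.endswith(' CN') or col.endswith(' MUT') or col in ['bladder', 'blood', 'bone', …]:

You can compile the regex like this

p = re.compile(r'.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$')

outside the loop. Then use it as

if p.match(col):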

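…especially the second for loop above. The DataFrame is large, about 14,000 columns, but I'm not sure that is the real reason.

@user1050702 What actual timings are you getting?

Iterating over the two DataFrames takes at least 15 minutes, but it does finish eventually :)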