Python/Pandas - Performance improvement - Splitting a column into multiple parts and converting a Series of strings to lists


I have a DataFrame called target. It has a column named "CNAE2".

If I run print(target.CNAE2), I get the following:

id
3                                                       NaN
7                                                       NaN
17           50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05
18                                               32.67-1-00
19                                   46.93-1-00, 49.40-0-00
20                                                      NaN
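
The non-NaN values in the column are strings. They follow a kind of relational logic, and my goal is to: a) turn each value into a list, and b) split the codes into several levels (which I call "pai", "vo" and "bisavo") and store each level in its own column: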

id                                  CNAE2                     CNAE2pai                CNAE2vo   CNAE2bisavo
3                                     NaN                          NaN                    NaN           NaN  
7                                     NaN                          NaN                    NaN           NaN  
17   [50.30-1-02, 52.32-0-00, 52.50-8-05]  [50.30-1, 52.32-0, 52.50-8]  [50.30, 52.32, 52.50]  [50, 52, 52]
18                           [32.67-1-00]                    [32.67-1]                [32.67]          [32]
19               [46.93-1-00, 46.40-0-00]           [46.93-1, 46.40-0]         [46.93, 46.40]      [46, 46]
20                                    NaN                          NaN                    NaN           NaN 
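
I was able to get this result, but my code relies on a lot of loops, and because I'm running it on a fairly large DataFrame it takes a long time. That's not feasible. I used the following code: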

    for i in target.index:
        cnaes=str(target['CNAE2'][i]).split(', ')
        target.CNAE2[i]=cnaes
        if cnaes == ['nan'] or cnaes == 'NaN' or cnaes == "":
            target.CNAE2[i]='NaN'
        else:
            target.CNAE2pai[i]=[]
            target.CNAE2vo[i]=[]
            target.CNAE2bisavo[i]=[]

            for k in range(len(cnaes)):
                y=cnaes[k][:7]
                target['CNAE2pai'][i].append(y)
            for k in range(len(cnaes)):
                y=cnaes[k][:5]
                target['CNAE2vo'][i].append(y)
            for k in range(len(cnaes)):
                y=cnaes[k][:2]
                target['CNAE2bisavo'][i].append(y)
            target.CNAE2pai[i]=list(set(target.CNAE2pai[i]))
            target.CNAE2vo[i]=list(set(target.CNAE2vo[i]))
            target.CNAE2bisavo[i]=list(set(target.CNAE2bisavo[i]))

Can anyone suggest a more efficient way to achieve this result?

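I haven't tried it, but it's better to avoid .append here. Better to build a list first and append to that, and once the result is complete, put it into the DataFrame.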
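A rough sketch of what that comment seems to suggest: collect the results in ordinary Python lists and assign each one to the DataFrame in a single step. The sample frame below is made up for illustration, and it assumes the missing values are real NaN, as the printed Series in the question suggests.

```python
import pandas as pd

# Hypothetical sample mirroring the question's column; missing values are real NaN here.
target = pd.DataFrame(
    {"CNAE2": [float("nan"), "50.30-1-02, 52.11-7-01", "32.67-1-00"]},
    index=[3, 17, 18],
)

# Collect results in plain Python lists instead of appending to DataFrame cells.
full, pai, vo, bisavo = [], [], [], []
for value in target["CNAE2"]:
    if pd.isna(value):
        # Keep missing rows missing in every derived column.
        for out in (full, pai, vo, bisavo):
            out.append(float("nan"))
        continue
    codes = value.split(", ")
    full.append(codes)
    pai.append([c[:7] for c in codes])     # e.g. "50.30-1"
    vo.append([c[:5] for c in codes])      # e.g. "50.30"
    bisavo.append([c[:2] for c in codes])  # e.g. "50"

# Assign each finished list to the DataFrame once.
target["CNAE2"] = full
target["CNAE2pai"] = pai
target["CNAE2vo"] = vo
target["CNAE2bisavo"] = bisavo
```

This still loops in Python, but it touches the DataFrame only four times instead of once per cell, which is usually where most of the time goes.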


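I've used the apply function here, which should be faster than iterating over the rows; a set lookup, which should be faster than chained or comparisons; and list comprehensions, which tend to be faster than nested for loops. I haven't tested this, but hopefully it helps.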

import pandas as pd

# Create dummy data and DataFrame (missing values are the string "NaN" here)
d = {"3": "NaN", "7": "NaN", "17": "50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05", "18": "32.67-1-00",
     "19": "46.93-1-00, 49.40-0-00", "20": "NaN"}
target = pd.DataFrame([[k, d[k]] for k in d], columns=["id", "CNAE"])

# Loop over the desired columns and the prefix length each level keeps
nans = {"nan", "NaN", ""}
for name, width in [("CNAE2pai", 7), ("CNAE2vo", 5), ("CNAE2bisavo", 2)]:
    target[name] = target.CNAE.apply(lambda x: "NaN" if x in nans else [i[:width] for i in x.split(", ")])
# The full codes, split into a list
target["CNAE2"] = target.CNAE.apply(lambda x: "NaN" if x in nans else x.split(", "))
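
The snippet above assumes the missing values are the string "NaN". In the question's printed Series they look like real NaN values, and calling .split on a float would fail. Here is a hedged variant that handles both cases. The helper name split_level and the sample data are my own, not from the question. It also de-duplicates each list, as the question's loop does with list(set(...)), but keeps the original order; drop the dict.fromkeys call if duplicates should stay.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: missing entries are real NaN rather than the string "NaN".
target = pd.DataFrame(
    {"CNAE2": [np.nan, "46.93-1-00, 46.40-0-00", "32.67-1-00"]},
    index=[3, 19, 18],
)

def split_level(value, width=None):
    """Split one cell into a de-duplicated list of codes cut to `width` characters."""
    if not isinstance(value, str) or value.strip() in {"", "nan", "NaN"}:
        return np.nan
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(code[:width] for code in value.split(", ")))

raw = target["CNAE2"]
for name, width in [("CNAE2pai", 7), ("CNAE2vo", 5), ("CNAE2bisavo", 2)]:
    # Extra keyword arguments to Series.apply are forwarded to the function.
    target[name] = raw.apply(split_level, width=width)
target["CNAE2"] = raw.apply(split_level)  # width=None keeps the full code
```
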

Edit: On my system, using lambda functions and list comprehensions gives faster results than groupby:

d = {"3":"NaN","7":"NaN","17":"50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05","18":"32.67-1-00",
     "19":"46.93-1-00, 49.40-0-00","20":"NaN"}
target = pd.DataFrame([[k, d[k]] for k in d], columns = ["id","CNAE"])

def lambda_func(target):
    # Loop across desired columns
    nans = set(["nan","NaN",""])
    for col in [("CNAE2pai",7),("CNAE2vo",5),("CNAE2bisavo",2)]:
        target[col[0]] = target.CNAE.apply(lambda x: "NaN" if x in nans else [i[:col[1]] for i in x.split(", ")])
    target["CNAE2"] = target.CNAE.apply(lambda x: "NaN" if x in nans else [i for i in x.split(", ")])
    return target

def groupby_func(target):
    s = target.CNAE.str.split(', ', expand=True).stack()

    # n is passed by keyword; pandas 2.0 and later no longer accept it positionally
    pai = s.str.rsplit('-', n=1).str[0].groupby(level=0).apply(list)
    vo = s.str.split('-', n=1).str[0].groupby(level=0).apply(list)
    bisavo = s.str.split('.').str[0].groupby(level=0).apply(list)
    base = s.groupby(level=0).apply(list)

    target = pd.concat(
        [base, pai, vo, bisavo], axis=1,
        keys=['', 'pai', 'vo', 'bisavo']
    ).add_prefix('CNAE2').reindex(target.index)

    return target

Results:

%timeit lambda_func(target)
1000 loops, best of 3: 930 µs per loop
%timeit groupby_func(target)
100 loops, best of 3: 6.3 ms per loop
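
Those timings are on six rows, so they may not reflect a large DataFrame. Below is a small sketch for repeating the measurement outside IPython with the standard timeit module. The 6,000-row frame and the apply_version function are my own stand-ins (the same idea as lambda_func, but working on a copy), so treat the numbers it prints as indicative only.

```python
import timeit

import pandas as pd

# Hypothetical larger sample: the six dummy values repeated to 6,000 rows.
values = ["NaN", "NaN", "50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05",
          "32.67-1-00", "46.93-1-00, 49.40-0-00", "NaN"] * 1000
big = pd.DataFrame({"CNAE": values})

NANS = {"nan", "NaN", ""}

def apply_version(df):
    """Same idea as lambda_func, but returns a new frame instead of mutating the input."""
    out = df.copy()
    for name, width in [("CNAE2pai", 7), ("CNAE2vo", 5), ("CNAE2bisavo", 2)]:
        out[name] = out["CNAE"].apply(
            lambda x, w=width: "NaN" if x in NANS else [c[:w] for c in x.split(", ")]
        )
    return out

# number=10 keeps the run short; raise it for steadier numbers.
runs = 10
elapsed = timeit.timeit(lambda: apply_version(big), number=runs)
print(f"apply version: {elapsed / runs * 1e3:.2f} ms per call")
```

The same timeit call should work for groupby_func by swapping the function inside the lambda.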
