Dynamically adding columns while processing data in a Python DataFrame

python pandas dataframe


I have the following problem. Suppose this is my CSV:

id f1 f2 f3
1  4  5  5
1  3  1  0
1  7  4  4
1  4  3  1
1  1  4  6
2  2  6  0
..........

So I have rows that can be grouped by id. I want to create a CSV like the one below as output:

f1 f2 f3 f1_n f2_n f3_n f1_n_n f2_n_n f3_n_n f1_t f2_t f3_t
4  5  5   3   1    0    7      4      4      1   4     6

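So I want to be able to choose how many rows get turned into columns (always starting from the first row of an id). In this case I grabbed 3 rows. I then skip one or more rows (just one in this example) and take the final columns from the last row of the same id group. For various reasons I want to use a DataFrame.

After struggling for 3-4 hours I came up with the solution shown below, but it is very slow. I have about 700,000 rows and perhaps 70,000 id groups. With model = 3, the code below takes almost an hour on my 4 GB, 4-core Lenovo. I need to go up to model = 10 or 15. I am still new to Python and I am sure some changes would speed this up. Can someone explain in depth how I can improve the code?

Thanks a lot.

model: the number of rows to grab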

import pandas as pd

# filename and wd (the output path prefix) are defined elsewhere
model = 3  # number of leading rows to grab per id

# train data frame from reading the csv
train = pd.read_csv(filename)

# Get groups of rows with same id
csv_by_id = train.groupby('id')

# Lists rather than sets, so that the column order is deterministic
modelTarget = ['f1_t', 'f2_t', 'f3_t']

# modelFeatures is a list of the features I am interested in from the csv.
# The csv actually has hundreds
modelFeatures = ['f1', 'f2', 'f3']

coreFeatures = list(modelFeatures)  # cloning

selectedFeatures = list(modelFeatures)  # cloning

newFeatures = list(selectedFeatures)  # cloning

finalFeatures = list(selectedFeatures)  # cloning

# Now create the column list depending on the number of rows I will grab
for x in range(2, model + 1):
    newFeatures = [s + '_n' for s in newFeatures]
    finalFeatures = finalFeatures + newFeatures

# This is the final column list for my one row in the final data frame
selectedFeatures = finalFeatures + modelTarget

# Empty dataframe which I want to populate
model_data = pd.DataFrame(columns=selectedFeatures)

for id_group in csv_by_id:
    # id_group is a tuple: the first element is the id itself,
    # the second is a dataframe with the rows of that group
    group_data = id_group[1]

    # hmm - can this be better? I am picking up the rows I need, from the first row onwards
    df = group_data[coreFeatures][0:model]

    # initialize a list
    tmp = []

    # now keep adding the column values into the list
    for index, row in df.iterrows():
        tmp = tmp + list(row)

    # Wow, this one below surely should have something better.
    # I am picking up the feature column values from the last row of the group for a particular id
    targetValues = group_data[coreFeatures][len(group_data.index)-1:len(group_data.index)].values

    # Think this can be done easier too? Basically adding the values to the tmp list again
    tmp = tmp + list(targetValues.flatten())

    # converting the list to a dict
    tmpDict = dict(zip(selectedFeatures, tmp))

    # then the dict to a dataframe
    tmpDf = pd.DataFrame(tmpDict, index=[1])

    # I just could not find a better way of adding a dict or list directly into a dataframe.
    # And I went through lots and lots of blogs on this topic, including some on StackOverflow.

    # finally I add the frame to my main frame
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call)
    model_data = pd.concat([model_data, tmpDf])

# and write it
model_data.to_csv(wd + 'model_data' + str(model) + '.csv', index=False)

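groupby, together with head and tail, is your friend.

This scales well: the cost has only a small constant factor in the number of features and is roughly O(number of groups).

Create some test data with 70k groups and group sizes of 7-12 (np.random.randint(7, 12) excludes the upper bound, so the sizes actually run from 7 to 11):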

In [29]: def create_df(i):
   ....:     l = np.random.randint(7,12)
   ....:     df = DataFrame(dict([ (f,np.arange(l)) for f in features ]))
   ....:     df['A'] = i
   ....:     return df
   ....: 

In [30]: df = concat([ create_df(i) for i in xrange(70000) ])

In [39]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 629885 entries, 0 to 9
Data columns (total 4 columns):
f1    629885 non-null int64
f2    629885 non-null int64
f3    629885 non-null int64
A     629885 non-null int64
dtypes: int64(4)
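The transcript above is from a Python 2 session (xrange) and never shows where features, DataFrame, or concat come from. A minimal Python 3 sketch of the same test-data setup follows. The features list and the seeded random generator are assumptions added here, and the group count is reduced so the example runs quickly.

```python
import numpy as np
import pandas as pd

# Assumption: the transcript never shows this definition
features = ["f1", "f2", "f3"]


def create_df(i, rng):
    """Build one group: each feature column counts 0..n-1, column 'A' holds the group id."""
    n = rng.integers(7, 12)  # upper bound is exclusive, so group sizes are 7..11
    df = pd.DataFrame({f: np.arange(n) for f in features})
    df["A"] = i
    return df


rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# 1,000 groups keeps this quick; the transcript uses 70,000
df = pd.concat([create_df(i, rng) for i in range(1000)])
```

As in the transcript, the resulting index repeats 0..n-1 within every group, because each per-group frame keeps its own default index.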

And the selection is quite fast:

In [32]: %timeit concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
1 loops, best of 3: 1.16 s per loop
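To make the head/tail selection concrete, here is a small sketch on the sample data from the question. The rows for id 2 beyond the first one are made up for illustration, since the question elides them, and the variable name picked is hypothetical.

```python
import pandas as pd

# Rows for id 1 come from the question; rows 2..5 of id 2 are invented for illustration
train = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "f1": [4, 3, 7, 4, 1, 2, 5, 8, 9, 0],
    "f2": [5, 1, 4, 3, 4, 6, 0, 2, 7, 1],
    "f3": [5, 0, 4, 1, 6, 0, 3, 1, 2, 9],
})

model = 3  # number of leading rows to keep per id

grouped = train.groupby("id")

# head(model) keeps the first `model` rows of each id, tail(1) keeps the last row;
# sort_index restores the original row order
picked = pd.concat([grouped.head(model), grouped.tail(1)]).sort_index()
print(picked)
```

One caveat: if an id has `model` rows or fewer, its last row appears in both the head and the tail selection, so that row shows up twice in the result.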

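For further operations you should generally stop here and work with this result, because the long, grouped format is easy to handle.

If you do want to convert it to wide format (groups below presumably refers to the concatenated head/tail frame from the previous step; the transcript does not show the assignment):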

In [35]: dfg = groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))

In [36]: %timeit groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
1 loops, best of 3: 14.5 s per loop

In [40]: dfg.columns = [ "{0}_{1}".format(f,i) for i in range(1,5) for f in features ]

In [41]: dfg.head()
Out[41]: 
   f1_1  f2_1  f3_1  f1_2  f2_2  f3_2  f1_3  f2_3  f3_3  f1_4  f2_4  f3_4
A                                                                        
0     0     0     0     1     1     1     2     2     2     7     7     7
1     0     0     0     1     1     1     2     2     2     9     9     9
2     0     0     0     1     1     1     2     2     2     8     8     8
3     0     0     0     1     1     1     2     2     2     8     8     8
4     0     0     0     1     1     1     2     2     2     9     9     9

[5 rows x 12 columns]

In [42]: dfg.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70000 entries, 0 to 69999
Data columns (total 12 columns):
f1_1    70000 non-null int64
f2_1    70000 non-null int64
f3_1    70000 non-null int64
f1_2    70000 non-null int64
f2_2    70000 non-null int64
f3_2    70000 non-null int64
f1_3    70000 non-null int64
f2_3    70000 non-null int64
f3_3    70000 non-null int64
f1_4    70000 non-null int64
f2_4    70000 non-null int64
f3_4    70000 non-null int64
dtypes: int64(12)
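The groupby(level=0) call above relies on the index that older pandas versions produced for head and tail, so it may not run unchanged today. The sketch below is an alternative that swaps the per-group apply for a single NumPy reshape; it is not the answer's method. It assumes every id has more than `model` rows, so each id contributes exactly model + 1 rows. The sample rows for id 2 are again invented.

```python
import pandas as pd

features = ["f1", "f2", "f3"]
model = 3  # number of leading rows to keep per id

# Rows for id 1 come from the question; rows 2..5 of id 2 are invented for illustration
train = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "f1": [4, 3, 7, 4, 1, 2, 5, 8, 9, 0],
    "f2": [5, 1, 4, 3, 4, 6, 0, 2, 7, 1],
    "f3": [5, 0, 4, 1, 6, 0, 3, 1, 2, 9],
})

grouped = train.groupby("id")
picked = pd.concat([grouped.head(model), grouped.tail(1)]).sort_index()

# A stable sort makes each id contiguous while preserving row order within the id
picked = picked.sort_values("id", kind="stable")

# Assumption check: every id must contribute exactly model + 1 rows
counts = picked.groupby("id").size()
assert (counts == model + 1).all()

# One reshape lays the model + 1 rows of each id side by side
values = picked[features].to_numpy().reshape(len(counts), -1)
columns = ["{0}_{1}".format(f, i) for i in range(1, model + 2) for f in features]
wide = pd.DataFrame(values, index=counts.index, columns=columns)
print(wide)
```

For id 1 this yields 4 5 5 3 1 0 7 4 4 1 4 6, the row the question asks for. The column names follow the answer's f1_1 ... f3_4 pattern rather than the question's _n and _t suffixes.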

Wow! This is why I absolutely love you. Jeff: I will work through your answer slowly and get back to you soon.

I made one mistake: I left out the first line of my code, where I use groupby to get csv_by_id. I am adding that line to my code now.

Jeff, this works. It cut my code down to 6 lines. Thank you. The two lines dfg = groups.groupby(level=0).apply(lambda x: pd.Series(x.values.ravel())) and dfg.columns = ["{0}_{1}".format(f, i) for i in range(1, 5) for f in coreFeatures] are the killer. Python is an art.

The trick is to always vectorize, avoid loops at all costs, and do each thing only once.

I see. Think in vectors. Thanks.