Pandas (Python 3.x): create columns from rows in another DataFrame using conditions

python-3.x pandas

Given the following DataFrames:

import pandas as pd
import numpy as np
pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
                    'Position':['Contractor','President'],
                    'Site(s)':['A,B','A'],
                    'Item(s)':['1','1,2']
                   })

pos[['Position','Site(s)','Station(s)','Item(s)']]

pos

    Position    Site(s)     Station(s)  Item(s)
0   Contractor  A,B         ,1,2,,      1
1   President   A          0,1,2,3,4    1,2

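and a second DataFrame, sd, with the columns Site, Station(s), Item 1 and Item 2 (its definition appears in the complete example near the end of this page),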

in the end I would like to get:

    Contractor  President   Site(s)     Station(s)  Item 1  Item 2
0      1           1           A         ,1,2,,       1     1
1      1           0           B         ,1,2,,       1     0
2      0           0           B         ,,,,         0     0
3      0           0           C         ,1,2,,       0     1
4      0           1           A         0,1,2,,      1     1
5      1           1           A         ,,2,,        0     1

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
                    'President':[1,0,0,0,1,1],
                   'Site(s)':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,0],
                   'Item 2':[1,0,0,1,1,1]})
results[['Contractor','President','Site(s)','Station(s)','Item 1','Item 2']]

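based on this logic:

For each position:

Create a new column in sd named after that position.

Set its value to 1 for every row that meets both of the following conditions (and to 0 for all other rows):

a. sd['Site'] matches at least one of the values in pos['Site(s)']

b. sd['Station(s)'] contains at least one of the numbers found in pos['Station(s)'], and no extra numbers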
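Read for a single row, the two conditions amount to a membership test and a non-empty subset test. The following is only a minimal sketch of that reading; the helper names parse_csv and position_flag are hypothetical and do not come from the original post.

```python
def parse_csv(text):
    """Split a comma-separated string and drop the empty fields."""
    return {part for part in text.split(',') if part}


def position_flag(site, stations, pos_sites, pos_stations):
    """Return 1 if a row of sd qualifies for a position, else 0.

    (a) the row's site is one of the position's sites;
    (b) the row lists at least one station, and every station it
        lists also belongs to the position (no extra numbers).
    """
    row_stations = parse_csv(stations)
    site_ok = site in parse_csv(pos_sites)
    stations_ok = bool(row_stations) and row_stations <= parse_csv(pos_stations)
    return int(site_ok and stations_ok)


# Contractor covers sites A,B and stations 1,2
print(position_flag('A', ',1,2,,', 'A,B', ',1,2,,'))   # 1
print(position_flag('A', '0,1,2,,', 'A,B', ',1,2,,'))  # 0: station 0 is extra
print(position_flag('B', ',,,,', 'A,B', ',1,2,,'))     # 0: no stations at all
```

Using sets makes the "no extra numbers" rule a plain subset check (<=), and the bool(row_stations) guard covers the "at least one" part.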

I started on this, but was quickly defeated and gave up:

for i in pos['Position']:
    sd[i]= 1 if lambda x: 'x' if x for x in pos['Site(s)'] if x in sd['Site']

Here is a rough attempt; you can improve on the code below:

sd['Contractor'] = 0
sd['President'] = 0

def check(x):
    for p in pos['Position'].tolist():
        if x['Site'] in pos.set_index('Position').loc[p, 'Site(s)'].split(','):
            ss = pd.Series(x['Station(s)'].split(',')).replace('', np.nan).dropna()
            ps = pd.Series(pos.set_index('Position').loc[p, 'Station(s)'].split(',')).replace('', np.nan).dropna()
            if not ss.empty and ss.isin(ps).all():
                x[p] = 1

    return x

print(sd.apply(check, axis=1))


   Item 1  Item 2 Site Station(s)  Contractor  President
0       1       1    A     ,1,2,,           1          1
1       1       0    B     ,1,2,,           1          0
2       0       0    B       ,,,,           0          0
3       0       1    C     ,1,2,,           0          0
4       1       1    A    0,1,2,,           0          1
5       0       1    A      ,,2,,           1          1
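One possible refinement of the rough attempt is to parse pos only once instead of calling set_index and split for every row. The sketch below assumes the same pos and sd as in the question; the names to_set, rules and flagged are illustrative only. It still loops in Python, so it is a tidier variant of the same idea rather than a vectorized solution.

```python
import pandas as pd

pos = pd.DataFrame({'Station(s)': [',1,2,,', '0,1,2,3,4'],
                    'Position': ['Contractor', 'President'],
                    'Site(s)': ['A,B', 'A'],
                    'Item(s)': ['1', '1,2']})

sd = pd.DataFrame({'Site': ['A', 'B', 'B', 'C', 'A', 'A'],
                   'Station(s)': [',1,2,,', ',1,2,,', ',,,,', ',1,2,,', '0,1,2,,', ',,2,,'],
                   'Item 1': [1, 1, 0, 0, 1, 0],
                   'Item 2': [1, 0, 0, 1, 1, 1]})


def to_set(text):
    """Split a comma-separated string, dropping empty fields."""
    return {part for part in text.split(',') if part}


# Parse pos once: {position: (allowed sites, allowed stations)}
rules = {row['Position']: (to_set(row['Site(s)']), to_set(row['Station(s)']))
         for _, row in pos.iterrows()}

# Parse the station strings of sd once as well
station_sets = sd['Station(s)'].map(to_set)

flagged = sd.copy()
for position, (sites, stations) in rules.items():
    flagged[position] = [
        # site must match, stations must be non-empty and a subset
        int(site in sites and bool(row_stations) and row_stations <= stations)
        for site, row_stations in zip(sd['Site'], station_sets)
    ]

print(flagged[['Site', 'Station(s)', 'Contractor', 'President']])
```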

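Because the data is stored as comma-separated strings, the code has to iterate over the rows, split the values, iterate over the other DataFrame and extract its values, then compare the two, and so on. I don't see a way to really improve on that as long as the input stays in comma-separated form.

Given those constraints, I think the apply-based solution above is quite good.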

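However, if you accept the idea that "tidy data" is better, and if you allow us to change the starting point to DataFrames in tidy format, then a different approach may be more efficient, especially when sd has many rows. The problem with sd.apply(check, axis=1) is that, under the hood, it uses a Python loop to iterate over the rows of sd. Calling check once per row can be much slower than equivalent code that takes advantage of Pandas' faster vectorized methods (such as merge or groupby).

However, to use merge and groupby you need the data in tidy format. So suppose that instead of pos and sd we start with tidypos and tidysd. (At the end of this post you will find a runnable example showing how to convert pos and sd into their tidy equivalents.)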
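tidypos and tidysd contain the same information as pos and sd (ignoring the Items, since they play no role in this problem). The main difference is that each row of tidypos and tidysd corresponds to one "observation", and each observation is independent of the others. In essence, this boils down to splitting the comma-separated values so that each value sits on its own row.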
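To make "each value on its own row" concrete, here is a small sketch that builds tidysd from the Site and Station(s) columns of sd. It uses DataFrame.explode, which assumes pandas 0.25 or newer; the original post builds the same frame with a list comprehension instead.

```python
import pandas as pd

sd = pd.DataFrame({'Site': ['A', 'B', 'B', 'C', 'A', 'A'],
                   'Station(s)': [',1,2,,', ',1,2,,', ',,,,', ',1,2,,', '0,1,2,,', ',,2,,']})

tidysd = (
    sd.assign(Station=sd['Station(s)'].str.findall(r'\d+'))  # list of digit strings per row
      .explode('Station')              # one row per list element
      .dropna(subset=['Station'])      # rows with an empty list explode to NaN
      .reset_index()                   # keep the original row label as an 'index' column
      .loc[:, ['index', 'Site', 'Station']]
)
print(tidysd)
```

Row 2 of sd has no stations, so it does not appear in tidysd at all; the station values stay strings, exactly as they were in the comma-separated input.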
Now we can join the two DataFrames on their common columns, Site and Station:
In [241]: merged = pd.merge(tidysd, tidypos, how='left'); merged
Out[241]: 
    index Site Station    Position
0       0    A       1  Contractor
1       0    A       1   President
2       0    A       2  Contractor
3       0    A       2   President
4       1    B       1  Contractor
5       1    B       2  Contractor
6       3    C       1         NaN
7       3    C       2         NaN
8       4    A       0   President
9       4    A       1  Contractor
10      4    A       1   President
11      4    A       2  Contractor
12      4    A       2   President
13      5    A       2  Contractor
14      5    A       2   President

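Now, each row in merged represents a match between a row of tidysd and a row of tidypos. So the existence of a row means there is a match between sd['Site'] and pos['Site(s)'], and also a match between tidysd['Station'] and tidypos['Station']. In other words, for that row, sd['Station(s)'] must contain a number found in pos['Station(s)']. The only criterion we cannot be sure of yet is whether sd['Station(s)'] has extra numbers that do not appear in pos['Station(s)'].

We can find that out by counting the rows in merged for each index and Position, since each row corresponds to a different Station. If this number equals the total number of Stations for that index, then sd['Station(s)'] contains no "extra numbers".

We can use groupby/nunique to count the number of Stations for each index and Position: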
In [256]: pos_count = merged.groupby(['index', 'Position'])['Station'].nunique().unstack(); pos_count
Out[256]: 
Position  Contractor  President
index                          
0                2.0        2.0
1                2.0        NaN
4                2.0        3.0
5                1.0        1.0

And we can count the total number of Stations for each index:
In [243]: total_count = tidysd.groupby(['index'])['Station'].nunique(); total_count
Out[243]: 
index
0    2
1    2
3    2
4    3
5    1
Name: Station, dtype: int64

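Finally, we can assign 1s and 0s to the Contractor and President columns according to the criterion pos_count[col] == total_count; the complete example at the end of this post does this with reindex and astype(int).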
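As a standalone illustration of that criterion, the sketch below rebuilds pos_count and total_count by hand from the outputs printed earlier and compares them. It uses DataFrame.eq with axis=0 in place of the per-column loop from the original post, and the names aligned and flags are illustrative.

```python
import numpy as np
import pandas as pd

# Values copied from the pos_count and total_count outputs in this post
pos_count = pd.DataFrame(
    {'Contractor': [2.0, 2.0, 2.0, 1.0], 'President': [2.0, np.nan, 3.0, 1.0]},
    index=pd.Index([0, 1, 4, 5], name='index'))
total_count = pd.Series([2, 2, 2, 3, 1],
                        index=pd.Index([0, 1, 3, 4, 5], name='index'),
                        name='Station')

# Index 3 matched no position, so it is missing from pos_count; add it as NaN
aligned = pos_count.reindex(total_count.index)

# A position qualifies when every station of the row matched it.
# eq(..., axis=0) aligns on the row index; NaN never equals a count.
flags = aligned.eq(total_count, axis=0).astype(int)

# Index 2 has no stations at all, so it is missing from total_count; fill with 0
flags = flags.reindex(range(6), fill_value=0)
print(flags)
```

Index 4 shows why the comparison matters: two of its three stations match Contractor, so Contractor is 0, while all three match President.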
If you really wish, you can then concatenate this result with the original sd to produce exactly the desired result:
In [246]: result = pd.concat([sd, pos_count], axis=1); result
Out[246]: 
   Item 1  Item 2 Site Station(s)  Contractor  President
0       1       1    A     ,1,2,,           1          1
1       1       0    B     ,1,2,,           1          0
2       0       0    B       ,,,,           0          0
3       0       1    C     ,1,2,,           0          0
4       1       1    A    0,1,2,,           0          1
5       0       1    A      ,,2,,           1          1

But again, if you accept the view that data should be tidy, you should avoid packing multiple rows of data into comma-separated strings.

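How to tidy up pos and sd:

You can use the vectorized string methods .str.findall and .str.split to convert the comma-separated strings into lists of values. Then use list comprehensions to iterate over the rows and the lists to build tidypos and tidysd.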
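For tidypos, every site of a position has to be paired with every station of that position. The sketch below gets the same cross product by exploding the two list columns one after the other, assuming pandas 0.25 or newer for DataFrame.explode; the intermediate name lists is illustrative. The original post uses itertools.product inside a list comprehension for this step.

```python
import pandas as pd

pos = pd.DataFrame({'Station(s)': [',1,2,,', '0,1,2,3,4'],
                    'Position': ['Contractor', 'President'],
                    'Site(s)': ['A,B', 'A']})

# Turn the comma-separated strings into lists of values
lists = pos.assign(
    Site=pos['Site(s)'].str.split(','),
    Station=pos['Station(s)'].str.findall(r'\d+'),
)[['Position', 'Site', 'Station']]

# Exploding Site and then Station pairs each site with each station.
# The index is reset between the two steps so that the second explode
# never sees duplicate index labels.
tidypos = (
    lists.explode('Site').reset_index(drop=True)
         .explode('Station').reset_index(drop=True)
)
print(tidypos)
```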

Putting it all together:
import itertools as IT
import pandas as pd

pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
                    'Position':['Contractor','President'],
                    'Site(s)':['A,B','A'],
                    'Item(s)':['1','1,2']})

sd = pd.DataFrame({'Site':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,0],
                   'Item 2':[1,0,0,1,1,1]})

mypos = pos.copy()
mypos['Station(s)'] = mypos['Station(s)'].str.findall(r'(\d+)')
mypos['Site(s)'] = mypos['Site(s)'].str.split(r',')
tidypos = pd.DataFrame(
    [(row['Position'], site, station) 
     for index, row in mypos.iterrows() 
     for site, station in IT.product(
             *[row[col] for col in ['Site(s)', 'Station(s)']])], 
    columns=['Position', 'Site', 'Station'])

mysd = sd[['Site', 'Station(s)']].copy()
mysd['Station(s)'] = mysd['Station(s)'].str.findall(r'(\d+)')

tidysd = pd.DataFrame(
    [(index, row['Site'], station)
     for index, row in mysd.iterrows() 
     for station in row['Station(s)']], 
    columns=['index', 'Site', 'Station'])

merged = pd.merge(tidysd, tidypos, how='left')
pos_count = merged.groupby(['index', 'Position'])['Station'].nunique().unstack()
total_count = tidysd.groupby(['index'])['Station'].nunique()
pos_count = pos_count.reindex(total_count.index, fill_value=0)
for col in pos_count:
    pos_count[col] = (pos_count[col] == total_count).astype(int)
pos_count = pos_count.reindex(sd.index, fill_value=0)
result = pd.concat([sd, pos_count], axis=1)
print(result)

yields
   Item 1  Item 2 Site Station(s)  Contractor  President
0       1       1    A     ,1,2,,           1          1
1       1       0    B     ,1,2,,           1          0
2       0       0    B       ,,,,           0          0
3       0       1    C     ,1,2,,           0          0
4       1       1    A    0,1,2,,           0          1
5       0       1    A      ,,2,,           1          1
