Python: add values from a DataFrame only if another value in the same row is True

I am new to pandas.

I have a dataset that looks like this:

Date_1       Hour_1    id_1    Date_2       Hour_2    id_2    Date_3       Hour_3    id_3    
2019-12-04   00        ABC     2019-12-04   01        ABC     2019-12-04   02        ABC
2019-12-04   00        ABCD    2019-12-04   01        ABCD    2019-12-04   02        ABCD
2019-12-04   00        ABCDEF  2019-12-04   01        ABCDE   2019-12-04   02        ABCDEF
2019-12-04   03        ABCDEFG 2019-12-04   01        ABCDEFG 2019-12-04   02        ABCDEF
...

My goal

Check whether id_2 and id_3 exist in id_1, and create a new DataFrame with the following structure:

Date_1       Hour_1    id_1    Date_2       Hour_2    Exists   Date_3       Hour_3    Exists    
2019-12-04   00        ABC     2019-12-04   01        True     2019-12-04   02        True
2019-12-04   00        ABCD    2019-12-04   01        True     2019-12-04   02        True
2019-12-04   00        ABCDEF                         False    2019-12-04   02        True
2019-12-04   03        ABCDEFG 2019-12-04   01        True                            False
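The problem I am facing is that I do not know how to include Date_2, Hour_2, Date_3 and Hour_3, or exclude them, depending on whether the check for id_2 and id_3 is True or False.

When I build the DataFrame I simply add all sources of information (date, hour, id), and I end up with one large DataFrame containing Date_1-10, Hour_1-10 and id_1-10.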

final_export['Exists in id_2'] = final_data['id_1'].isin(final_data['id_2'])
final_export['Date from id_2'] = final_data['Date from id_2 other source']
final_export['Hour from id_2'] = final_data['Hour from id_2 other source']
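When I use the .isin() method it filters the data correctly, but it has no effect on whether the hour and date in the same row are included. For example, if id_1 exists in id_3, the result is True and its date and hour are kept; if id_1 does not exist, the result is False and the date and hour are empty.

With .isin(), the date and hour are not linked to the id value.

Please let me know if the question is explained clearly.

Thank you for your suggestions.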

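I suggest splitting the DataFrame into three DataFrames, each holding an id, a date and an hour, and then combining them again (the code below uses pd.concat along the columns) so that null values are assigned where an id does not exist.

import pandas as pd

df = pd.DataFrame({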
    "Date_1": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_1": ["00", "00", "00", "03"],
    "id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_2": ["01", "01", "01", "01"],
    "id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_3": ["02", "02", "02", "02"],
    "id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

ids = df["id_1"]
# You can choose whichever columns you want
df_1 = df.loc[df["id_1"].isin(ids), ["Date_1", "Hour_1", "id_1"]]
df_2 = df.loc[df["id_2"].isin(ids), ["Date_2", "Hour_2", "id_2"]]
df_3 = df.loc[df["id_3"].isin(ids), ["Date_3", "Hour_3", "id_3"]]

df_concat = pd.concat([df_1, df_2, df_3], axis=1)
Output:

       Date_1 Hour_1     id_1      Date_2 Hour_2     id_2      Date_3 Hour_3    id_3
0  2019-12-04     00      ABC  2019-12-04     01      ABC  2019-12-04     02     ABC
1  2019-12-04     00     ABCD  2019-12-04     01     ABCD  2019-12-04     02    ABCD
2  2019-12-04     00   ABCDEF         NaN    NaN      NaN  2019-12-04     02  ABCDEF
3  2019-12-04     03  ABCDEFG  2019-12-04     01  ABCDEFG  2019-12-04     02  ABCDEF
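The description above mentions merging on the id, while the code relies on pd.concat aligning rows by index. Below is a minimal sketch of the merge variant, assuming each source's id column can be de-duplicated before joining. The column names Exists_2 and Exists_3 are illustrative and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "Date_1": ["2019-12-04"] * 4,
    "Hour_1": ["00", "00", "00", "03"],
    "id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04"] * 4,
    "Hour_2": ["01", "01", "01", "01"],
    "id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04"] * 4,
    "Hour_3": ["02", "02", "02", "02"],
    "id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

# One frame per source; duplicates are dropped so the left join cannot multiply rows
# (assumption: one date/hour per id is enough for this illustration).
left = df[["Date_1", "Hour_1", "id_1"]]
src2 = df[["Date_2", "Hour_2", "id_2"]].drop_duplicates("id_2")
src3 = df[["Date_3", "Hour_3", "id_3"]].drop_duplicates("id_3")

# Left joins keep every id_1 row; unmatched ids get NaN in the joined columns.
merged = (
    left
    .merge(src2, how="left", left_on="id_1", right_on="id_2")
    .merge(src3, how="left", left_on="id_1", right_on="id_3")
)

# Hypothetical flag columns: True where the join found a match.
merged["Exists_2"] = merged["id_2"].notna()
merged["Exists_3"] = merged["id_3"].notna()
print(merged)
```

Because the join is keyed on the id value rather than on the row position, this sketch also matches ids that appear in a different row of another source.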

Something like this should work:

mask_id2 = df.id_1 == df.id_2
mask_id3 = df.id_1 == df.id_3

df.id_2 = mask_id2
df.id_3 = mask_id3

df.loc[~mask_id2, ['Date_2', 'Hour_2']] = ""
df.loc[~mask_id3, ['Date_3', 'Hour_3']] = ""
Output:

       Date_1  Hour_1     id_1      Date_2 Hour_2   id_2      Date_3 Hour_3   id_3
0  2019-12-04       0      ABC  2019-12-04      1   True  2019-12-04      2   True
1  2019-12-04       0     ABCD  2019-12-04      1   True  2019-12-04      2   True
2  2019-12-04       0   ABCDEF                     False  2019-12-04      2   True
3  2019-12-04       3  ABCDEFG  2019-12-04      1   True                     False
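The question mentions column groups numbered up to 10, so the same masking can be written as a loop. This is only a sketch under the assumption that the columns follow the Date_n / Hour_n / id_n naming pattern; the helper name blank_unmatched is hypothetical.

```python
import pandas as pd

def blank_unmatched(df, last_source):
    """Return a copy where Date_n/Hour_n are blanked when id_n differs from id_1 in the same row."""
    out = df.copy()
    for n in range(2, last_source + 1):
        # Element-wise comparison: row i of id_1 against row i of id_n.
        same_row = out["id_1"] == out[f"id_{n}"]
        for col in (f"Date_{n}", f"Hour_{n}"):
            # Keep the value where the ids match, otherwise use an empty string.
            out[col] = out[col].where(same_row, "")
        # Replace the id column with the boolean flag, as in the answer above.
        out[f"id_{n}"] = same_row
    return out

df = pd.DataFrame({
    "Date_1": ["2019-12-04"] * 4,
    "Hour_1": ["00", "00", "00", "03"],
    "id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04"] * 4,
    "Hour_2": ["01", "01", "01", "01"],
    "id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04"] * 4,
    "Hour_3": ["02", "02", "02", "02"],
    "id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

result = blank_unmatched(df, last_source=3)
print(result)
```

Working on a copy leaves the original df untouched, which makes it easier to compare the result against the input.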

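If I understand your question correctly, isin() is the wrong function: it checks whether the value of id_1 appears anywhere in the id_2 (or id_3) column. It does not check whether id_1 is a substring of the id_2 value in the same row. Try the following code: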

import pandas as pd
testdf = pd.DataFrame({
    "hour_1": ["00", "01"],
    "id_1":["ABC", "ABC"], 
    "id_2":["ABCD", "AB"], 
})
testdf["exists_in_2"] = testdf['id_1'].isin(testdf['id_2'])
testdf
To fix that part first, do the following:

eltwise_contains =  lambda frag, text: frag in text
testdf["exists_in_2"] = testdf[['id_1', 'id_2']].apply(lambda x : eltwise_contains(*x), axis = 1)

testdf
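Next, your actual question: if id_1 is not present in the id_2 or id_3 value of the same row, set the date and hour to an empty string. We can use the same pattern as above: define a lambda that takes two inputs, then on the next line pull the two columns out of the DataFrame and apply another lambda to that sub-DataFrame, which passes the unpacked row values to the original lambda.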

empty_string_if_false = lambda a_bool, val: val if a_bool else ""
testdf["hour_1"] = testdf[['exists_in_2', 'hour_1']].apply(lambda x : empty_string_if_false(*x), axis = 1)

testdf
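The two apply calls above can also be sketched without lambdas, using a list comprehension for the row-wise substring test and Series.where for the blanking. This is an alternative formulation of the same idea, not the answerer's code; like the example above, it blanks hour_1 based on exists_in_2.

```python
import pandas as pd

testdf = pd.DataFrame({
    "hour_1": ["00", "01"],
    "id_1": ["ABC", "ABC"],
    "id_2": ["ABCD", "AB"],
})

# Column-wide membership: is each id_1 value equal to ANY value in the id_2 column?
column_wide = testdf["id_1"].isin(testdf["id_2"])
print(column_wide.tolist())  # [False, False]

# Row-wise substring test: is id_1 contained in the id_2 value of the same row?
testdf["exists_in_2"] = [a in b for a, b in zip(testdf["id_1"], testdf["id_2"])]

# Keep hour_1 where the flag is True, otherwise replace it with an empty string.
testdf["hour_1"] = testdf["hour_1"].where(testdf["exists_in_2"], "")
print(testdf)
```

Pairing the two columns with zip avoids building a row Series for every row, which is what apply with axis=1 does.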

If 铁腕's answer is not what you were after, this gives you a df in the following format:

import pandas as pd

final_data = pd.DataFrame({
    "Date_1": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_1": ["00", "00", "00", "03"],"id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_2": ["01", "01", "01", "01"],"id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_3": ["02", "02", "02", "02"],"id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

final_data['Exists in id_2'] = final_data['id_1'].isin(final_data['id_2'])
final_data['Exists in id_3'] = final_data['id_1'].isin(final_data['id_3'])

# Blank the date and hour of a source when id_1 was not found in that source's id column
final_data['Date_2'] = final_data.apply(lambda r: r['Date_2'] if r['Exists in id_2'] else '', axis=1)
final_data['Hour_2'] = final_data.apply(lambda r: r['Hour_2'] if r['Exists in id_2'] else '', axis=1)
final_data['Date_3'] = final_data.apply(lambda r: r['Date_3'] if r['Exists in id_3'] else '', axis=1)
final_data['Hour_3'] = final_data.apply(lambda r: r['Hour_3'] if r['Exists in id_3'] else '', axis=1)
print(final_data[['id_1', 'id_2', 'id_3', 'Hour_2', 'Date_2', 'Hour_3', 'Date_3']])
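This gives a df with all of the original information, except that Date_2 and Hour_2 are blanked when id_1 is not found in id_2, and likewise for id_3. The selected columns look like this: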

      id_1     id_2    id_3 Hour_2      Date_2 Hour_3      Date_3
0      ABC      ABC     ABC     01  2019-12-04     02  2019-12-04
1     ABCD     ABCD    ABCD     01  2019-12-04     02  2019-12-04
2   ABCDEF    ABCDE  ABCDEF                        02  2019-12-04
3  ABCDEFG  ABCDEFG  ABCDEF     01  2019-12-04              

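Why is Date_3/Hour_3 True for the id? In a Jupyter notebook my output contains rows such as "2019-12-04 00 ABC False", "2019-12-04 00 ABCDEF False" and "3 2019-12-04 03 ABCDEFG False False".

Can you check again? I am using your exact dataset. This should not depend on the environment.

Does this method run over the whole column, or does it only check the values in the same row? In my DataFrame the values are scattered and will not be found in the same row.

The equality check is element-wise. If a column-wide check is what you want, you can build the mask with df.id_1.isin(df.id_2) instead, for example.

So it would look like this: mask_id2 = df.id_1.isin(df.id_2), right?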
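To make the difference discussed in these comments concrete, here is a small sketch with made-up data in which the matching ids sit in different rows. The frame name scattered and its values are invented for illustration only.

```python
import pandas as pd

# Invented data: "A" and "B" occur in both columns, but never in the same row.
scattered = pd.DataFrame({
    "id_1": ["A", "B", "C"],
    "id_2": ["B", "A", "X"],
})

# Element-wise equality compares row i with row i only.
same_row = scattered["id_1"] == scattered["id_2"]

# isin checks each id_1 value against the whole id_2 column.
anywhere = scattered["id_1"].isin(scattered["id_2"])

print(same_row.tolist())   # [False, False, False]
print(anywhere.tolist())   # [True, True, False]
```

Note that a mask built with isin says only that the id exists somewhere in the other column; the date and hour in the same row may still belong to a different id.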