Python: add values from a DataFrame only if another value in the same row is True
I am new to pandas. I have a dataset that looks like this:
Date_1 Hour_1 id_1 Date_2 Hour_2 id_2 Date_3 Hour_3 id_3
2019-12-04 00 ABC 2019-12-04 01 ABC 2019-12-04 02 ABC
2019-12-04 00 ABCD 2019-12-04 01 ABCD 2019-12-04 02 ABCD
2019-12-04 00 ABCDEF 2019-12-04 01 ABCDE 2019-12-04 02 ABCDEF
2019-12-04 03 ABCDEFG 2019-12-04 01 ABCDEFG 2019-12-04 02 ABCDEF
...
My goal is to check whether id_1 exists in id_2 and id_3, and to create a new DataFrame structured like this:
Date_1      Hour_1  id_1     Date_2      Hour_2  Exists  Date_3      Hour_3  Exists
2019-12-04  00      ABC      2019-12-04  01      True    2019-12-04  02      True
2019-12-04  00      ABCD     2019-12-04  01      True    2019-12-04  02      True
2019-12-04  00      ABCDEF                       False   2019-12-04  02      True
2019-12-04  03      ABCDEFG  2019-12-04  01      True                        False
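The problem I am facing is that I do not know how to include or exclude Date_2, Hour_2, Date_3 and Hour_3 depending on whether the checks against id_2 and id_3 are True or False.
When I build the DataFrame I simply add every source (date, hour, id), so I end up with one large DataFrame containing Date_1-10, Hour_1-10 and id_1-10: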
final_export['Exists in id_2'] = final_data['id_1'].isin(final_data['id_2'])
final_export['Date from id_2'] = final_data['Date from id_2 other source']
final_export['Hour from id_2'] = final_data['Hour from id_2 other source']
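When I use the .isin() method it filters the ids correctly, but it has no effect on whether the hour and date in the same row are included. What I want is: if id_1 exists in id_3, the flag is True and the date and hour are kept; if it does not exist, the flag is False and the date and hour are empty.
With .isin(), the date and hour are not linked to the id value.
Please let me know if the problem is not explained clearly.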
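Thanks in advance for your suggestions.

I suggest splitting the DataFrame into three DataFrames, each with its own id, date and hour, and then combining them back into one (the code below uses pd.concat), which assigns null values where the id does not exist.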
import pandas as pd

df = pd.DataFrame({
    "Date_1": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_1": ["00", "00", "00", "03"],
    "id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_2": ["01", "01", "01", "01"],
    "id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_3": ["02", "02", "02", "02"],
    "id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

# Reference ids that the other sources are checked against
ids = df["id_1"]

# You can choose whichever columns you want
df_1 = df.loc[df["id_1"].isin(ids), ["Date_1", "Hour_1", "id_1"]]
df_2 = df.loc[df["id_2"].isin(ids), ["Date_2", "Hour_2", "id_2"]]
df_3 = df.loc[df["id_3"].isin(ids), ["Date_3", "Hour_3", "id_3"]]

# Concatenating on the index leaves NaN in rows that were filtered out
df_concat = pd.concat([df_1, df_2, df_3], axis=1)
Output:
Date_1 Hour_1 id_1 Date_2 Hour_2 id_2 Date_3 Hour_3 id_3
0 2019-12-04 00 ABC 2019-12-04 01 ABC 2019-12-04 02 ABC
1 2019-12-04 00 ABCD 2019-12-04 01 ABCD 2019-12-04 02 ABCD
2 2019-12-04 00 ABCDEF NaN NaN NaN 2019-12-04 02 ABCDEF
3 2019-12-04 03 ABCDEFG 2019-12-04 01 ABCDEFG 2019-12-04 02 ABCDEF
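The answer above describes combining the sources on the id, while its code aligns them on the row index. For the case where matching ids may sit in different rows, a key-based left merge is one possible sketch. It assumes each id appears at most once per source after dropping duplicates, and the `Exists_2` / `Exists_3` column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date_1": ["2019-12-04"] * 4,
    "Hour_1": ["00", "00", "00", "03"],
    "id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04"] * 4,
    "Hour_2": ["01", "01", "01", "01"],
    "id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04"] * 4,
    "Hour_3": ["02", "02", "02", "02"],
    "id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

# One frame per source; duplicates are dropped so the merge cannot multiply rows
# (assumption: keeping the first occurrence of each id is acceptable).
src_1 = df[["Date_1", "Hour_1", "id_1"]]
src_2 = df[["Date_2", "Hour_2", "id_2"]].drop_duplicates("id_2")
src_3 = df[["Date_3", "Hour_3", "id_3"]].drop_duplicates("id_3")

# Left merges keep every id_1 row and fill NaN where no matching id exists.
merged = (
    src_1
    .merge(src_2, how="left", left_on="id_1", right_on="id_2")
    .merge(src_3, how="left", left_on="id_1", right_on="id_3")
)

# Hypothetical flag columns: True where a match was found in that source.
merged["Exists_2"] = merged["id_2"].notna()
merged["Exists_3"] = merged["id_3"].notna()

print(merged[["id_1", "Hour_2", "Exists_2", "Hour_3", "Exists_3"]])
```

With this approach the date and hour come from the row where the id was actually found, so they stay linked to the id even when the sources are not aligned row by row.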
Something like this should work:
mask_id2 = df.id_1 == df.id_2
mask_id3 = df.id_1 == df.id_3
df.id_2 = mask_id2
df.id_3 = mask_id3
df.loc[~mask_id2, ['Date_2', 'Hour_2']] = ""
df.loc[~mask_id3, ['Date_3', 'Hour_3']] = ""
Output:
   Date_1      Hour_1  id_1     Date_2      Hour_2  id_2   Date_3      Hour_3  id_3
0  2019-12-04  0       ABC      2019-12-04  1       True   2019-12-04  2       True
1  2019-12-04  0       ABCD     2019-12-04  1       True   2019-12-04  2       True
2  2019-12-04  0       ABCDEF                       False  2019-12-04  2       True
3  2019-12-04  3       ABCDEFG  2019-12-04  1       True                       False
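Since the question mentions up to ten sources, the masking above could be wrapped in a loop. The following is only a sketch: the helper name `flag_and_blank`, its parameters, and the assumption that columns follow the `Date_i` / `Hour_i` / `id_i` naming pattern are illustrative, not part of the original answer.

```python
import pandas as pd

def flag_and_blank(df, n_sources, use_isin=False):
    """Hypothetical helper: replace id_i with a match flag, blank unmatched date/hour."""
    out = df.copy()  # leave the caller's frame untouched
    for i in range(2, n_sources + 1):
        if use_isin:
            # Column-wide check: id_1 appears anywhere in id_i
            mask = out["id_1"].isin(out[f"id_{i}"])
        else:
            # Row-wise check: id_1 equals id_i in the same row
            mask = out["id_1"] == out[f"id_{i}"]
        out[f"id_{i}"] = mask
        out.loc[~mask, [f"Date_{i}", f"Hour_{i}"]] = ""
    return out

df = pd.DataFrame({
    "Date_1": ["2019-12-04"] * 4,
    "Hour_1": ["00", "00", "00", "03"],
    "id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04"] * 4,
    "Hour_2": ["01", "01", "01", "01"],
    "id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04"] * 4,
    "Hour_3": ["02", "02", "02", "02"],
    "id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

result = flag_and_blank(df, n_sources=3)
print(result)
```

Working on a copy keeps the original ids available, which matters if the same frame is checked again later.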
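If I understand your question correctly, isin() is the wrong function: it checks whether the value of id_1 appears anywhere in the id_2 (or id_3) column. It does not check whether id_1 is a substring of the id_2 value in the same row. Try the following code: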
import pandas as pd
testdf = pd.DataFrame({
"hour_1": ["00", "01"],
"id_1":["ABC", "ABC"],
"id_2":["ABCD", "AB"],
})
testdf["exists_in_2"] = testdf['id_1'].isin(testdf['id_2'])
testdf
To fix that part first, do the following:
eltwise_contains = lambda frag, text: frag in text
testdf["exists_in_2"] = testdf[['id_1', 'id_2']].apply(lambda x : eltwise_contains(*x), axis = 1)
testdf
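To make the difference concrete, here is a small sketch that computes both checks on the same test frame. The variable names `column_wide` and `row_wise` are made up for illustration, and the list comprehension is just one alternative to `apply`.

```python
import pandas as pd

testdf = pd.DataFrame({
    "hour_1": ["00", "01"],
    "id_1": ["ABC", "ABC"],
    "id_2": ["ABCD", "AB"],
})

# Column-wide membership: is each id_1 value equal to ANY value in id_2?
column_wide = testdf["id_1"].isin(testdf["id_2"])

# Row-wise substring test: is id_1 contained in the id_2 of the SAME row?
row_wise = pd.Series(
    [frag in text for frag, text in zip(testdf["id_1"], testdf["id_2"])],
    index=testdf.index,
)

print(column_wide.tolist())  # [False, False]
print(row_wise.tolist())     # [True, False]
```

"ABC" is not an exact member of the id_2 column, so isin() reports False for both rows, whereas the substring test finds it inside "ABCD" in the first row.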
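Next, your actual question: if id_1 is not found in the id_2 and id_3 values of the same row, set the day and hour to an empty string. We can use the same pattern as above: define a lambda that takes two inputs, then on the next line select the two columns from the DataFrame and apply a second lambda to that sub-DataFrame, which unpacks each row and passes its values to the first lambda.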
empty_string_if_false = lambda a_bool, val: val if a_bool else ""
testdf["hour_1"] = testdf[['exists_in_2', 'hour_1']].apply(lambda x : empty_string_if_false(*x), axis = 1)
testdf
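The same blanking step can probably be written without `apply`. A minimal sketch using `Series.where`, assuming the boolean column `exists_in_2` has already been computed as above:

```python
import pandas as pd

testdf = pd.DataFrame({
    "hour_1": ["00", "01"],
    "exists_in_2": [True, False],
})

# Series.where keeps values where the condition is True and
# replaces the remaining values with the second argument.
testdf["hour_1"] = testdf["hour_1"].where(testdf["exists_in_2"], "")

print(testdf["hour_1"].tolist())  # ['00', '']
```

This avoids a Python-level function call per row, which tends to matter on larger frames.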
If 铁腕's answer is not what you are looking for, this gives you a DataFrame in the following format:
import pandas as pd
final_data = pd.DataFrame({
"Date_1": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_1": ["00", "00", "00", "03"],"id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
"Date_2": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_2": ["01", "01", "01", "01"],"id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
"Date_3": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_3": ["02", "02", "02", "02"],"id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})
# Column-wide check: does id_1 appear anywhere in id_2 / id_3?
final_data['Exists in id_2'] = final_data['id_1'].isin(final_data['id_2'])
final_data['Exists in id_3'] = final_data['id_1'].isin(final_data['id_3'])
# Blank the date and hour of a source when id_1 was not found in it
final_data['Date_2'] = final_data.apply(lambda r: r['Date_2'] if r['Exists in id_2'] else '', axis=1)
final_data['Hour_2'] = final_data.apply(lambda r: r['Hour_2'] if r['Exists in id_2'] else '', axis=1)
final_data['Date_3'] = final_data.apply(lambda r: r['Date_3'] if r['Exists in id_3'] else '', axis=1)
final_data['Hour_3'] = final_data.apply(lambda r: r['Hour_3'] if r['Exists in id_3'] else '', axis=1)
print(final_data[['id_1', 'id_2', 'id_3', 'Hour_2', 'Date_2', 'Hour_3', 'Date_3']])
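This gives a DataFrame with all the original information, except that Date_2 and Hour_2 are blanked when id_1 is not found in id_2, and likewise for id_3. The selected columns look like this: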
   id_1     id_2     id_3    Hour_2  Date_2      Hour_3  Date_3
0  ABC      ABC      ABC     01      2019-12-04  02      2019-12-04
1  ABCD     ABCD     ABCD    01      2019-12-04  02      2019-12-04
2  ABCDEF   ABCDE    ABCDEF                      02      2019-12-04
3  ABCDEFG  ABCDEFG  ABCDEF  01      2019-12-04
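Why is Date_3/Hour_3 True for this id? In a Jupyter notebook I get rows such as "2019-12-04 00 ABC False", "2019-12-04 00 ABCDEF False" and "3 2019-12-04 03 ABCDEFG False False".
Could you check again? I am using your exact dataset. This should not depend on the environment.
Does this method run over the whole column, or does it only check values in the same row? In my DataFrame the values are scattered and will not be found in the same row.
The equality check is element-wise. You could instead build the mask with df.id_1.isin(df.id_2), for example, if that is what you want.
So it would look like this: mask_id2 = df.id_1.isin(df.id_2), right?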
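A small sketch of the two masks discussed in these comments, on made-up data where the matching ids sit in different rows (the frame and variable names here are illustrative only):

```python
import pandas as pd

# Illustrative data: the same ids appear in both columns, but in different rows.
df = pd.DataFrame({
    "id_1": ["ABC", "ABCD", "XYZ"],
    "id_2": ["ABCD", "ABC", "ABCDE"],
})

# Element-wise equality only compares values within the same row.
same_row = df["id_1"] == df["id_2"]

# isin() checks each id_1 value against the whole id_2 column.
anywhere = df["id_1"].isin(df["id_2"])

print(same_row.tolist())  # [False, False, False]
print(anywhere.tolist())  # [True, True, False]
```

Note that with the isin() mask, any date or hour kept in a row still belongs to that row's own id_2, not necessarily to the id that matched, so a key-based lookup may be needed when the sources are not aligned.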