Conditional regex function on a Python DataFrame
I have the following df and function (see below). I may be overcomplicating things. A fresh pair of eyes would be much appreciated.
df:
Site Name   Plan Unique ID   Atlas Placement ID
Affectv     we11080301       11087207850894
Mashable    we14880202       11087208009031
Alphr       uk10790301       11087208005229
Alphr       uk19350201       11087208005228
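The goal is to:
1. Go through df['Plan Unique ID'] and search for a specific value (we_match or uk_match); if there is a match, check whether it is greater than the threshold for that group (we12720203 or uk11350200).
2. If it is greater, add that we or uk value to a new column, df['Consolidated ID'].
3. Otherwise, search df['Atlas Placement ID'] using new_id_search.
4. If that matches, add the result to df['Consolidated ID'].
5. If nothing matches, write 0 to df['Consolidated ID'].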
def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search= "(\d{14})"):
    if type(df['Plan Unique ID']) is str:
        we_match = re.search(we_search, df['Plan Unique ID'])
        if we_match:
            if we_match > "we12720203":
                return we_match.group(0)
            else:
                uk_match = re.search(uk_search, df['Plan Unique ID'])
                if uk_match:
                    if uk_match > "uk11350200":
                        return uk_match.group(0)
                    else:
                        match_new = re.search(new_id_search, df['Atlas Placement ID'])
                        if match_new:
                            return match_new.group(0)
                        return 0

mediaplan_df['Consolidated ID'] = mediaplan_df.apply(placement_extract, axis=1)
Edit: cleaned up the formula
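I modified gzl's function as follows (see below): first, check whether there is a 14-digit number in df1. If so, add that.
Ideally, the next step is to take the MediaPlanUnique column from df2 and turn it into a Series, filtered_placements: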
we11080301
we12880304
we14880202
uk19350201
uk11560205
uk11560305
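and check whether any value in filtered_placements is present in df['Plan Unique ID']. If there is a match, add df['Plan Unique ID'] to our final column, df['ConsolidatedID'].
The current problem is that it results in all 0s. I think this is because the comparison is 1:1 (the first result of new_match vs the first result of filtered_placements) rather than 1:many (the first result of new_match vs all results of filtered_placements).
Any ideas?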
def placement_extract(df="mediaplan_df", new_id_search="[a-zA-Z]{2}\d{8}", old_id_search= "(\d{14})"):
    if type(df['PlacementID']) is str:
        old_match = re.search(old_id_search, df['PlacementID'])
        if old_match:
            return old_match.group(0)
        else:
            if type(df['Plan Unique ID']) is str:
                if type(filtered_placements) is str:
                    new_match = re.search(new_id_search, df['Plan Unique ID'])
                    if new_match:
                        if filtered_placements.str.contains(new_match.group(0)):
                            return new_match.group(0)
    return 0

mediaplan_df['ConsolidatedID'] = mediaplan_df.apply(placement_extract, axis=1)
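I'd suggest not using such complicated nested if statements. As Phil pointed out, the checks are mutually exclusive. So you can check "we" and "uk" in if statements at the same indentation level, and then fall back to the default process: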
import re

def placement_extract(df="mediaplan_df", we_search=r"we\d{8}", uk_search=r"uk\d{8}", new_id_search=r"(\d{14})"):
    # With apply(..., axis=1), df is a single row here
    if type(df['Plan Unique ID']) is str:
        we_match = re.search(we_search, df['Plan Unique ID'])
        if we_match:
            # Compare the matched text, not the match object itself
            if we_match.group(0) > "we12720203":
                return we_match.group(0)

        uk_match = re.search(uk_search, df['Plan Unique ID'])
        if uk_match:
            if uk_match.group(0) > "uk11350200":
                return uk_match.group(0)

        # Default: fall back to the 14-digit Atlas Placement ID
        match_new = re.search(new_id_search, df['Atlas Placement ID'])
        if match_new:
            return match_new.group(0)

    return 0
Test:
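A minimal check against the sample rows from the question might look like the following. It is only a sketch: the function is a compact restatement of the flattened checks with its first parameter renamed to row, and it assumes both ID columns are stored as strings (re.search raises TypeError when handed an integer).

```python
import re

import pandas as pd


def placement_extract(row, we_search=r"we\d{8}", uk_search=r"uk\d{8}",
                      new_id_search=r"(\d{14})"):
    """Check 'we', then 'uk', then fall back to the Atlas Placement ID."""
    plan_id = row["Plan Unique ID"]
    if isinstance(plan_id, str):
        we_match = re.search(we_search, plan_id)
        if we_match and we_match.group(0) > "we12720203":
            return we_match.group(0)
        uk_match = re.search(uk_search, plan_id)
        if uk_match and uk_match.group(0) > "uk11350200":
            return uk_match.group(0)
    # Assumption: Atlas Placement ID is text, not an integer column.
    match_new = re.search(new_id_search, row["Atlas Placement ID"])
    if match_new:
        return match_new.group(0)
    return 0


# Sample rows copied from the question, with every value kept as a string.
mediaplan_df = pd.DataFrame({
    "Site Name": ["Affectv", "Mashable", "Alphr", "Alphr"],
    "Plan Unique ID": ["we11080301", "we14880202", "uk10790301", "uk19350201"],
    "Atlas Placement ID": ["11087207850894", "11087208009031",
                           "11087208005229", "11087208005228"],
})

mediaplan_df["Consolidated ID"] = mediaplan_df.apply(placement_extract, axis=1)
print(mediaplan_df["Consolidated ID"].tolist())
# ['11087207850894', 'we14880202', '11087208005229', 'uk19350201']
```

Rows 0 and 2 fall below their thresholds, so they pick up the Atlas Placement ID; rows 1 and 3 keep their plan IDs.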
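I reorganized the logic and simplified the regex operations to show another approach. The reorganization isn't strictly necessary for the answer, but since you asked for another opinion/approach, I thought it might help you in the future: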
import re

# Inline comments explain the main changes.
def placement_extract(row, we_search="we12720203", uk_search="uk11350200"):
    # Extracted to a shorter temp variable
    plan_id = row["Plan Unique ID"]

    # It looked like this column was used whatever happened at the
    # end, so there's no need to check it against a regex.
    #
    # The Atlas Placement ID is the default if the row fails either
    # the prefix check OR the "greater than" test. Setting it up front
    # keeps return_val defined even when the regex does not match.
    return_val = row["Atlas Placement ID"]

    # Using parentheses to get two separate groups (code and numeric)
    # means you can do the match just once
    result = re.match("(we|uk)(.+)", plan_id)
    if result:
        code, numeric = result.groups()
        # We can get away with these simple tests as the earlier regex
        # guarantees that the string starts with either "we" or "uk"
        if code == "we" and plan_id > we_search:
            return_val = plan_id
        elif code == "uk" and plan_id > uk_search:
            return_val = plan_id

    # A single return statement is often easier to debug
    return return_val
Then use it in an apply statement (also take a look at assign), as in the run shown further down.
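@Matt, could you provide a sample of the data so we can verify this?
Hi Phil, sure, let me upload it.
Also, regarding the first "if": are there cases where "Plan Unique ID" is not a string? I.e. is this just an error check, or do you explicitly want to skip non-string values (e.g. integers)?
Thanks Matt. The logic does look complicated, so I'm looking at it from a different angle and doing it in a few separate passes; since each pass is mutually exclusive there's no risk of overwriting anything. Bear with me…
Hi Matt. If one of the answers solved your problem, could you mark one of them as accepted so that either I or @gzc gets the credit on our profile? Thanks.
You could replace the regex operations with simpler string operations (e.g. if plan_id.startswith("we"): etc.), but that's a further step beyond the scope of your question.
Hi, I've added an extra variable in the edit. I think I've found the problem, but I'm still working out how to fix it. Any help would be appreciated.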
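The startswith idea from the comments could be sketched roughly as follows. The function name placement_extract_simple and the THRESHOLDS dictionary are illustrative names, not part of the original answers, and the sketch assumes the IDs are fixed-width strings so that plain string comparison orders them correctly.

```python
import pandas as pd

# Thresholds taken from the question; the dictionary itself is illustrative.
THRESHOLDS = {"we": "we12720203", "uk": "uk11350200"}


def placement_extract_simple(row, thresholds=THRESHOLDS):
    """Return the plan ID if it beats its prefix threshold, else the Atlas ID."""
    plan_id = row["Plan Unique ID"]
    if isinstance(plan_id, str):
        for prefix, threshold in thresholds.items():
            # No regex needed: a prefix test plus a string comparison.
            if plan_id.startswith(prefix) and plan_id > threshold:
                return plan_id
    return row["Atlas Placement ID"]


# Sample rows copied from the question, kept as strings.
mediaplan_df = pd.DataFrame({
    "Plan Unique ID": ["we11080301", "we14880202", "uk10790301", "uk19350201"],
    "Atlas Placement ID": ["11087207850894", "11087208009031",
                           "11087208005229", "11087208005228"],
})

consolidated = mediaplan_df.apply(placement_extract_simple, axis=1)
print(consolidated.tolist())
# ['11087207850894', 'we14880202', '11087208005229', 'uk19350201']
```

Keeping the thresholds in a dictionary means a new prefix only needs a new entry rather than another elif branch.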
$ mediaplan_df["Consolidated ID"] = mediaplan_df.apply(placement_extract, axis=1)
$ mediaplan_df
>
  Site Name Plan Unique ID Atlas Placement ID Consolidated ID
0   Affectv     we11080301     11087207850894  11087207850894
1  Mashable     we14880202     11087208009031      we14880202
2     Alphr     uk10790301     11087208005229  11087208005229
3     Alphr     uk19350201     11087208005228      uk19350201
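On the all-zeros result described in the edit, one likely contributor is the line if type(filtered_placements) is str: that test is never true for a pandas Series, so the block under it never runs. Separately, filtered_placements.str.contains(...) returns a boolean Series, and using a Series directly in an if raises ValueError. A minimal sketch of a one-to-many check using set membership follows. The names allowed_ids and allowed are illustrative, and the PlacementID values in the sample rows are invented to exercise each branch.

```python
import re

import pandas as pd

# Values listed in the edit; in the question they come from df2["MediaPlanUnique"].
filtered_placements = pd.Series([
    "we11080301", "we12880304", "we14880202",
    "uk19350201", "uk11560205", "uk11560305",
])
# A set turns the one-to-many comparison into a single `in` test.
allowed_ids = set(filtered_placements)


def placement_extract(row, allowed=allowed_ids,
                      new_id_search=r"[a-zA-Z]{2}\d{8}",
                      old_id_search=r"\d{14}"):
    # First preference: a 14-digit ID in PlacementID.
    placement_id = row["PlacementID"]
    if isinstance(placement_id, str):
        old_match = re.search(old_id_search, placement_id)
        if old_match:
            return old_match.group(0)
    # Second preference: a plan ID that appears in the filtered list.
    plan_id = row["Plan Unique ID"]
    if isinstance(plan_id, str):
        new_match = re.search(new_id_search, plan_id)
        if new_match and new_match.group(0) in allowed:
            return new_match.group(0)
    return 0


# Hypothetical rows: the PlacementID values are made up for illustration.
mediaplan_df = pd.DataFrame({
    "Plan Unique ID": ["we11080301", "uk10790301", "uk19350201"],
    "PlacementID": ["n/a", "n/a", "11087208005228"],
})
mediaplan_df["ConsolidatedID"] = mediaplan_df.apply(placement_extract, axis=1)
print(mediaplan_df["ConsolidatedID"].tolist())
# ['we11080301', 0, '11087208005228']
```

The set is built once outside the function, so each row costs one lookup instead of a scan over the whole Series.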