Conditional regex function on a Python DataFrame

python regex pandas if-statement dataframe

I have the following df and function (see below). I may have overcomplicated things. A fresh pair of eyes would be much appreciated.

df:

Site Name   Plan Unique ID  Atlas Placement ID
Affectv     we11080301      11087207850894
Mashable    we14880202      11087208009031
Alphr       uk10790301      11087208005229
Alphr       uk19350201      11087208005228

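The goal is:

1. Iterate over df['Plan Unique ID'] first, searching for a specific value (we_match or uk_match). If there is a match:
   - check whether the string value is greater than a given value in that group (we12720203 or uk11350200);
   - if the value is greater, add that we or uk value to a new column, df['consolidatedid'].
2. If the value is lower or there is no match, use new_id_search:
   - if there is a match, add it to df['consolidatedid'];
   - if not, return 0 to df['consolidatedid'].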
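
The "greater than" step in the list above relies on plain string comparison. A minimal sketch of that step, assuming every ID has the same two-letter prefix followed by exactly eight digits (so lexicographic order agrees with numeric order); the variable names are only illustrative:

```python
import re

we_threshold = "we12720203"  # threshold taken from the question

results = []
for plan_id in ["we11080301", "we14880202"]:
    match = re.search(r"we\d{8}", plan_id)
    if match:
        # Compare the matched text, not the Match object itself
        results.append((match.group(0), match.group(0) > we_threshold))

print(results)
# [('we11080301', False), ('we14880202', True)]
```

Note that the comparison is made on `match.group(0)`; a `re.Match` object cannot be meaningfully compared with a string.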

The current problem is that it returns an empty column:

def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search= "(\d{14})"):

    if type(df['Plan Unique ID']) is str:
        we_match = re.search(we_search, df['Plan Unique ID'])
        if we_match:
            if we_match > "we12720203":
                return we_match.group(0)
            else:
                uk_match =  re.search(uk_search, df['Plan Unique ID'])
                if uk_match:
                    if uk_match > "uk11350200":
                        return uk_match.group(0)
                    else:
                        match_new =  re.search(new_id_search, df['Atlas Placement ID'])
                        if match_new:
                            return match_new.group(0)

                        return 0


mediaplan_df['Consolidated ID'] = mediaplan_df.apply(placement_extract, axis=1)

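Edit: cleaning up the formula
I modified gzl's function as follows (see below): first check whether df1 contains a 14-digit number. If so, add it.

Ideally, the next step is to grab the column MediaPlanUnique from df2 and convert it to a Series, filtered_placements, then check whether any value in filtered_placements exists in df['Plan Unique ID']. If there is a match, add df['Plan Unique ID'] to our final column, df['ConsolidatedID'].

The current problem is that it results in all 0s. I think this is because the comparison is done 1:1 (the first result of new_match vs the first result of filtered_placements) rather than 1:many (the first result of new_match vs all results of filtered_placements).

Any ideas?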

def placement_extract(df="mediaplan_df", new_id_search="[a-zA-Z]{2}\d{8}", old_id_search= "(\d{14})"):

    if type(df['PlacementID']) is str:

        old_match =  re.search(old_id_search, df['PlacementID'])
        if old_match:
            return old_match.group(0)

        else:

            if type(df['Plan Unique ID']) is str:
                if type(filtered_placements) is str:


                    new_match = re.search(new_id_search, df['Plan Unique ID'])
                    if new_match:
                        if filtered_placements.str.contains(new_match.group(0)):
                            return new_match.group(0)          


        return 0

mediaplan_df['ConsolidatedID'] = mediaplan_df.apply(placement_extract, axis=1)
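
One possible reading of the all-zeros result, offered as a sketch rather than a confirmed diagnosis: if filtered_placements is a pandas Series, then `type(filtered_placements) is str` is never true, so the function skips the lookup and falls through to `return 0`. A 1:many check can be written as a membership test against a set of the Series' values. The contents of filtered_placements and PlacementID below are made up for illustration, since the real df2 data is not shown, and unlike the function above this sketch returns 0 (not None) when PlacementID is not a string.

```python
import re

import pandas as pd

# Hypothetical stand-in for the MediaPlanUnique column of df2
filtered_placements = pd.Series(["we14880202", "uk19350201"])
# A set makes "is this ID anywhere in the Series?" a single lookup
valid_ids = set(filtered_placements.dropna())

mediaplan_df = pd.DataFrame({
    "Plan Unique ID": ["we11080301", "we14880202", "uk10790301", "uk19350201"],
    "PlacementID": ["11087207850894", "n/a", "n/a", "n/a"],
})


def placement_extract(row, new_id_search=r"[a-zA-Z]{2}\d{8}", old_id_search=r"\d{14}"):
    # First preference: a 14-digit ID in PlacementID
    placement_id = row["PlacementID"]
    if isinstance(placement_id, str):
        old_match = re.search(old_id_search, placement_id)
        if old_match:
            return old_match.group(0)

    # Second preference: a plan ID that also appears in filtered_placements
    plan_id = row["Plan Unique ID"]
    if isinstance(plan_id, str):
        new_match = re.search(new_id_search, plan_id)
        if new_match and new_match.group(0) in valid_ids:
            return new_match.group(0)

    return 0


mediaplan_df["ConsolidatedID"] = mediaplan_df.apply(placement_extract, axis=1)
print(mediaplan_df["ConsolidatedID"].tolist())
# ['11087207850894', 'we14880202', 0, 'uk19350201']
```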

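I'd suggest not using such complicated nested if statements. As Phil pointed out, each check is mutually exclusive. So you can check for "we" and "uk" in if statements at the same indentation level, and then fall back to the default process: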

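import re

def placement_extract(df="mediaplan_df", we_search=r"we\d{8}", uk_search=r"uk\d{8}", new_id_search=r"(\d{14})"):
    # With apply(..., axis=1), df here is a single row (a Series), not the whole frame
    if type(df['Plan Unique ID']) is str:
        # The checks sit at the same indentation level, so a failed "we" test
        # falls through to "uk" and then to the default
        we_match = re.search(we_search, df['Plan Unique ID'])
        if we_match:
            if we_match.group(0) > "we12720203":
                return we_match.group(0)

        uk_match = re.search(uk_search, df['Plan Unique ID'])
        if uk_match:
            if uk_match.group(0) > "uk11350200":
                return uk_match.group(0)

        # Default: the 14-digit ID (assumes 'Atlas Placement ID' holds strings)
        match_new = re.search(new_id_search, df['Atlas Placement ID'])
        if match_new:
            return match_new.group(0)

        return 0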

Test:
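
The test output from the original answer is not reproduced here. As a stand-in, a small check on the question's sample data might look like the following. It assumes both ID columns hold strings, and it repeats the function in condensed form (with the parameter renamed to row) so the snippet runs on its own.

```python
import re

import pandas as pd


def placement_extract(row, we_search=r"we\d{8}", uk_search=r"uk\d{8}", new_id_search=r"(\d{14})"):
    if type(row["Plan Unique ID"]) is str:
        we_match = re.search(we_search, row["Plan Unique ID"])
        if we_match and we_match.group(0) > "we12720203":
            return we_match.group(0)

        uk_match = re.search(uk_search, row["Plan Unique ID"])
        if uk_match and uk_match.group(0) > "uk11350200":
            return uk_match.group(0)

        # Fall back to the 14-digit Atlas ID
        match_new = re.search(new_id_search, row["Atlas Placement ID"])
        if match_new:
            return match_new.group(0)

        return 0


# Sample rows copied from the question
mediaplan_df = pd.DataFrame({
    "Site Name": ["Affectv", "Mashable", "Alphr", "Alphr"],
    "Plan Unique ID": ["we11080301", "we14880202", "uk10790301", "uk19350201"],
    "Atlas Placement ID": ["11087207850894", "11087208009031", "11087208005229", "11087208005228"],
})

mediaplan_df["Consolidated ID"] = mediaplan_df.apply(placement_extract, axis=1)
print(mediaplan_df["Consolidated ID"].tolist())
# ['11087207850894', 'we14880202', '11087208005229', 'uk19350201']
```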

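I reorganized the logic and simplified the regex operations to show another approach. The reorganization isn't strictly necessary for the answer, but since you asked for another opinion/approach, I thought it might help you in the future: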

# Inline comments to explain the main changes.
import re

def placement_extract(row, we_search="we12720203", uk_search="uk11350200"):
    # Extracted to shorter temp variable
    plan_id = row["Plan Unique ID"]
    # It looked like this column was used whatever happened at the
    # end, so there's no need to check against a regex
    #
    # The Atlas Placement is the default option if it either fails
    # the prefix check OR the "greater than" test. Setting it up front
    # also keeps return_val defined when the regex does not match.
    return_val = row["Atlas Placement ID"]
    # Using parentheses to get two separate groups - code and numeric
    # Means you can do the match just once
    result = re.match(r"(we|uk)(.+)", plan_id)
    if result:
        code, numeric = result.groups()
        # We can get away with these simple tests as the earlier regex guarantees
        # that the string starts with either "we" or "uk"
        if code == "we" and plan_id > we_search:
            return_val = plan_id
        elif code == "uk" and plan_id > uk_search:
            return_val = plan_id
    # A single return statement is often easier to debug
    return return_val

Then use it in an apply statement (also take a look at assign):
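
A minimal sketch of the assign variant on the question's sample data. It assumes both ID columns hold strings and uses a condensed version of the row-based function so the snippet runs on its own; the name result_df is only illustrative.

```python
import re

import pandas as pd


def placement_extract(row, we_search="we12720203", uk_search="uk11350200"):
    plan_id = row["Plan Unique ID"]
    result = re.match(r"(we|uk)(.+)", plan_id)
    if result:
        code = result.group(1)
        if (code == "we" and plan_id > we_search) or (code == "uk" and plan_id > uk_search):
            return plan_id
    # Default when the prefix or the "greater than" test fails
    return row["Atlas Placement ID"]


mediaplan_df = pd.DataFrame({
    "Site Name": ["Affectv", "Mashable", "Alphr", "Alphr"],
    "Plan Unique ID": ["we11080301", "we14880202", "uk10790301", "uk19350201"],
    "Atlas Placement ID": ["11087207850894", "11087208009031", "11087208005229", "11087208005228"],
})

# assign returns a new DataFrame and leaves mediaplan_df unchanged.
# The column name contains a space, so it is passed via dict unpacking.
result_df = mediaplan_df.assign(
    **{"Consolidated ID": mediaplan_df.apply(placement_extract, axis=1)}
)
print(result_df)
```

Compared with assigning to `mediaplan_df["Consolidated ID"]` directly, assign is convenient when the original frame should stay untouched or when the step is part of a method chain.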

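Could you provide a data sample, @Matt, so we can verify it?

Hi Phil, sure, let me upload one.

Also, regarding the first "if": is there ever a case where "Plan Unique ID" is not a string? I.e. is this just error checking, or do you explicitly want to skip non-string values (e.g. integers)?

Thanks Matt. The logic did look complicated, so I looked at it from a different angle and split it into several separate steps; because each step is mutually exclusive, there is no risk of one overwriting another. Bear with me...

Hi Matt. If one of the answers solved your problem, could you mark one of them as the answer so that I or @gzc get the result added to our profiles? Thanks.

You could replace the regex operations with simpler string operations (e.g. if plan_id.startswith("we"): etc.), but that is another step beyond the scope of your question.

Hi, I added an extra variable in the edit. I think I've found the problem, but I'm still working out how to solve it. Any help would be much appreciated.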
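
The string-method idea mentioned in the comments could look roughly like this. It is only a sketch, assuming every "Plan Unique ID" is a string and that the Atlas ID is the fallback, as in the answers above.

```python
import pandas as pd


def placement_extract(row, we_search="we12720203", uk_search="uk11350200"):
    plan_id = row["Plan Unique ID"]
    # str.startswith replaces the regex prefix check
    if plan_id.startswith("we") and plan_id > we_search:
        return plan_id
    if plan_id.startswith("uk") and plan_id > uk_search:
        return plan_id
    # Default when the prefix or the "greater than" test fails
    return row["Atlas Placement ID"]


mediaplan_df = pd.DataFrame({
    "Site Name": ["Affectv", "Mashable", "Alphr", "Alphr"],
    "Plan Unique ID": ["we11080301", "we14880202", "uk10790301", "uk19350201"],
    "Atlas Placement ID": ["11087207850894", "11087208009031", "11087208005229", "11087208005228"],
})

consolidated = mediaplan_df.apply(placement_extract, axis=1)
print(consolidated.tolist())
# ['11087207850894', 'we14880202', '11087208005229', 'uk19350201']
```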

$ mediaplan_df["Consolidated ID"] = mediaplan_df.apply(placement_extract, axis=1)
$ mediaplan_df
>
  Site Name Plan Unique ID Atlas Placement ID Consolidated ID
0   Affectv     we11080301     11087207850894  11087207850894
1  Mashable     we14880202     11087208009031      we14880202
2     Alphr     uk10790301     11087208005229  11087208005229
3     Alphr     uk19350201     11087208005228      uk19350201