Conditional regex function on a Python DataFrame

Tags: python, regex, pandas, if-statement, dataframe

I have the following df and function (see below). I may be overcomplicating things. A fresh pair of eyes would be appreciated.

df:

Site Name   Plan Unique ID  Atlas Placement ID
Affectv     we11080301      11087207850894
Mashable    we14880202      11087208009031
Alphr       uk10790301      11087208005229
Alphr       uk19350201      11087208005228
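
The goal is to:

  • First iterate through df['Plan Unique ID'] and search for specific values (we_match or uk_match). If there is a match:

  • Check whether the string value is greater than a certain value in that group (we12720203 or uk11350200).

  • If the value is greater, add that we or uk value to a new column, df['Consolidated ID'].

  • If the value is lower, or there is no match, search the row with new_id_search instead.

  • If that search matches, add the result to df['Consolidated ID'].

  • If not, return 0 to df['Consolidated ID'].

The current problem is that it returns an empty column.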

    def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search= "(\d{14})"):

        if type(df['Plan Unique ID']) is str:
            we_match = re.search(we_search, df['Plan Unique ID'])
            if we_match:
                if we_match > "we12720203":
                    return we_match.group(0)
                else:
                    uk_match =  re.search(uk_search, df['Plan Unique ID'])
                    if uk_match:
                        if uk_match > "uk11350200":
                            return uk_match.group(0)
                        else:
                            match_new =  re.search(new_id_search, df['Atlas Placement ID'])
                            if match_new:
                                return match_new.group(0)

                            return 0


    mediaplan_df['Consolidated ID'] = mediaplan_df.apply(placement_extract, axis=1)
    
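Edit: cleaned-up formula

I modified gzl's function (see below) as follows: first check whether df1 contains a 14-digit number. If it does, add it.

Ideally, the next step is to grab the column MediaPlanUnique from df2 and convert it into a series, filtered_placements:

    we11080301
    we12880304
    we14880202
    uk19350201
    uk11560205
    uk11560305

and then see whether any value in filtered_placements exists in df['Plan Unique ID']. If there is a match, add df['Plan Unique ID'] to our final column, df['ConsolidatedID'].

The current problem is that it results in all 0s. I think this is because the comparison is done 1:1 (first result of new_match vs. first result of filtered_placements) rather than 1:many (first result of new_match vs. all results of filtered_placements).

Any ideas?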

    def placement_extract(df="mediaplan_df", new_id_search="[a-zA-Z]{2}\d{8}", old_id_search= "(\d{14})"):
    
        if type(df['PlacementID']) is str:
    
            old_match =  re.search(old_id_search, df['PlacementID'])
            if old_match:
                return old_match.group(0)
    
            else:
    
                if type(df['Plan Unique ID']) is str:
                    if type(filtered_placements) is str:
    
    
                        new_match = re.search(new_id_search, df['Plan Unique ID'])
                        if new_match:
                            if filtered_placements.str.contains(new_match.group(0)):
                                return new_match.group(0)          
    
    
            return 0
    
    mediaplan_df['ConsolidatedID'] = mediaplan_df.apply(placement_extract, axis=1)
    

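I would suggest not using such deeply nested if statements. As Phil pointed out, the checks are mutually exclusive. You can therefore test for "we" and "uk" in if statements at the same indentation level, and then fall back to the default procedure.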

    import re

    def placement_extract(df="mediaplan_df", we_search=r"we\d{8}", uk_search=r"uk\d{8}", new_id_search=r"(\d{14})"):
    
        if type(df['Plan Unique ID']) is str:
            we_match = re.search(we_search, df['Plan Unique ID'])
            if we_match:
                if we_match.group(0) > "we12720203":
                    return we_match.group(0)
    
            uk_match =  re.search(uk_search, df['Plan Unique ID'])
            if uk_match:
                if uk_match.group(0) > "uk11350200":
                    return uk_match.group(0)
    
    
            match_new =  re.search(new_id_search, df['Atlas Placement ID'])
    
            if match_new:
                return match_new.group(0)
    
            return 0
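
The key change in the function above is that it compares we_match.group(0), the matched text, rather than the match object itself. A minimal sketch of the difference, assuming Python 3 (the helper name is_after is hypothetical and exists only for this illustration):

```python
import re

def is_after(plan_id, pattern, threshold):
    """Return True if plan_id matches pattern and the matched text sorts after threshold."""
    match = re.search(pattern, plan_id)
    if match is None:
        return False
    # match is an re.Match object; match.group(0) is the matched string.
    # The IDs share a prefix and a length, so plain string comparison
    # orders them the same way as their numeric parts.
    return match.group(0) > threshold

print(is_after("we14880202", r"we\d{8}", "we12720203"))  # True
print(is_after("we11080301", r"we\d{8}", "we12720203"))  # False

# Comparing the match object itself with a string is not supported in Python 3.
try:
    re.search(r"we\d{8}", "we14880202") > "we12720203"
except TypeError as exc:
    print("TypeError:", exc)
```

Note that the string comparison only mirrors numeric order while every ID has the same length; IDs of differing lengths would need the numeric part converted with int() first.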
    
Test:
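
A quick check against the four sample rows from the question could look like the sketch below. It assumes both ID columns hold strings, and it uses plain dicts as stand-ins for DataFrame rows so that it runs without pandas. The function is a condensed copy of the one above, with the parameter renamed to row.

```python
import re

def placement_extract(row, we_search=r"we\d{8}", uk_search=r"uk\d{8}", new_id_search=r"(\d{14})"):
    # Condensed copy of the answer's function; row is any mapping of
    # column name -> value (a dict here, a pandas row under apply).
    if type(row["Plan Unique ID"]) is str:
        we_match = re.search(we_search, row["Plan Unique ID"])
        if we_match and we_match.group(0) > "we12720203":
            return we_match.group(0)

        uk_match = re.search(uk_search, row["Plan Unique ID"])
        if uk_match and uk_match.group(0) > "uk11350200":
            return uk_match.group(0)

        # Fall back to the 14-digit Atlas Placement ID.
        match_new = re.search(new_id_search, row["Atlas Placement ID"])
        if match_new:
            return match_new.group(0)

        return 0

# The sample data from the question, one dict per row.
sample_rows = [
    {"Site Name": "Affectv", "Plan Unique ID": "we11080301", "Atlas Placement ID": "11087207850894"},
    {"Site Name": "Mashable", "Plan Unique ID": "we14880202", "Atlas Placement ID": "11087208009031"},
    {"Site Name": "Alphr", "Plan Unique ID": "uk10790301", "Atlas Placement ID": "11087208005229"},
    {"Site Name": "Alphr", "Plan Unique ID": "uk19350201", "Atlas Placement ID": "11087208005228"},
]

results = [placement_extract(row) for row in sample_rows]
for row, consolidated in zip(sample_rows, results):
    print(row["Site Name"], row["Plan Unique ID"], consolidated)
```

With a real DataFrame, the same function would be passed to mediaplan_df.apply(placement_extract, axis=1).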


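I have reorganised the logic and simplified the regex operations to show another approach. The reorganisation is not strictly necessary for the answer, but since you asked for another opinion/approach, I thought it might help you in future: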

    import re

    # Inline comments to explain the main changes.
    def placement_extract(row, we_search="we12720203", uk_search="uk11350200"):
        # Extracted to shorter temp variable
        plan_id = row["Plan Unique ID"]
        # The Atlas Placement is the default option if the plan ID either fails
        # the prefix check OR the "greater than" test. It looked like this column
        # was used whatever happened at the end, so there's no need to check it
        # against a regex. Setting it up front also means return_val is always
        # defined, even when the regex below does not match.
        return_val = row["Atlas Placement ID"]
        # Using parentheses to get two separate groups - code and numeric
        # Means you can do the match just once
        result = re.match("(we|uk)(.+)", plan_id)
        if result:
            code, numeric = result.groups()
            # We can get away with these simple tests as the earlier regex guarantees
            # that the string starts with either "we" or "uk"
            if code == "we" and plan_id > we_search:
                return_val = plan_id
            elif code == "uk" and plan_id > uk_search:
                return_val = plan_id
        # A single return statement is often easier to debug
        return return_val
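
The prefix checks can also be written with plain string methods such as str.startswith instead of a regex. The following is only a sketch of that variant: the name placement_extract_simple is hypothetical, the isinstance guard is an added assumption about how non-string values should be handled, and plain dicts stand in for DataFrame rows so that it runs without pandas.

```python
def placement_extract_simple(row, we_search="we12720203", uk_search="uk11350200"):
    plan_id = row["Plan Unique ID"]
    # str.startswith replaces the regex prefix check. Non-string values
    # (for example NaN or integers) skip the checks and get the default.
    if isinstance(plan_id, str):
        if plan_id.startswith("we") and plan_id > we_search:
            return plan_id
        if plan_id.startswith("uk") and plan_id > uk_search:
            return plan_id
    # Default: the Atlas Placement ID, returned unchanged.
    return row["Atlas Placement ID"]

# The sample data from the question, one dict per row.
sample_rows = [
    {"Plan Unique ID": "we11080301", "Atlas Placement ID": "11087207850894"},
    {"Plan Unique ID": "we14880202", "Atlas Placement ID": "11087208009031"},
    {"Plan Unique ID": "uk10790301", "Atlas Placement ID": "11087208005229"},
    {"Plan Unique ID": "uk19350201", "Atlas Placement ID": "11087208005228"},
]

simple_results = [placement_extract_simple(row) for row in sample_rows]
print(simple_results)
# ['11087207850894', 'we14880202', '11087208005229', 'uk19350201']
```

Because the function only indexes the row by column name, it can be passed to DataFrame.apply with axis=1 in the same way as the regex version.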
    
Then use placement_extract in an apply statement (also take a look at assign):

    $ mediaplan_df["Consolidated ID"] = mediaplan_df.apply(placement_extract, axis=1)
    $ mediaplan_df
    >
      Site Name Plan Unique ID Atlas Placement ID Consolidated ID
    0   Affectv     we11080301     11087207850894  11087207850894
    1  Mashable     we14880202     11087208009031      we14880202
    2     Alphr     uk10790301     11087208005229  11087208005229
    3     Alphr     uk19350201     11087208005228      uk19350201

Comments:

  • Could you provide a sample of the data, @Matt, so that we can verify?

  • Hi Phil, OK, let me upload it.

  • Also, regarding the first "if": is there ever a case where "Plan Unique ID" is not a string? I.e. is this just an error check, or do you explicitly want to skip non-string values (e.g. integers)?

  • Thanks Matt. The logic did look quite complex, so I looked at it from a different angle and did it in several separate passes; since each one is mutually exclusive, there is no risk of overwriting. Bear with me…

  • Hi Matt. If one of the answers solved your problem, could you mark one of them as the answer, so that I or @gzc get the result added to our profiles? Thanks.

  • You could replace the regex operations with simpler string operations (e.g. if plan_id.startswith("we"): etc.), but that is another step beyond the scope of your question.

  • Hi, I added an extra variable in the edit. I think I have found the problem, but I am still thinking about how to solve it. Any help would be appreciated.
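
For the follow-up in the question's edit (checking a matched ID against every value in filtered_placements rather than only the first), a one-to-many membership test might be sketched as follows. This is not from the answers above. It assumes filtered_placements can be turned into a Python set (with pandas, for example, set(df2["MediaPlanUnique"])), and it again uses plain dicts in place of DataFrame rows. In the edited function, the check type(filtered_placements) is str is false for a Series, so the lookup branch never runs and every row falls through to 0.

```python
import re

# Hypothetical stand-in for the MediaPlanUnique column of df2, held as a
# set so that "in" tests the values themselves.
filtered_placements = {
    "we11080301", "we12880304", "we14880202",
    "uk19350201", "uk11560205", "uk11560305",
}

def placement_extract(row, new_id_search=r"[a-zA-Z]{2}\d{8}", old_id_search=r"\d{14}"):
    # Prefer the 14-digit ID when the PlacementID column contains one.
    placement_id = row["PlacementID"]
    if isinstance(placement_id, str):
        old_match = re.search(old_id_search, placement_id)
        if old_match:
            return old_match.group(0)

    # Otherwise accept the plan ID only if it is one of the filtered placements.
    plan_id = row["Plan Unique ID"]
    if isinstance(plan_id, str):
        new_match = re.search(new_id_search, plan_id)
        if new_match and new_match.group(0) in filtered_placements:
            return new_match.group(0)

    return 0

# Made-up rows, for illustration only.
print(placement_extract({"PlacementID": "11087207850894", "Plan Unique ID": "we11080301"}))  # 11087207850894
print(placement_extract({"PlacementID": "n/a", "Plan Unique ID": "we14880202"}))  # we14880202
print(placement_extract({"PlacementID": "n/a", "Plan Unique ID": "uk10790301"}))  # 0
```

One pitfall to keep in mind: applying the in operator directly to a pandas Series tests its index labels, not its values, which is why the sketch converts to a set first. For a whole column at once, Series.isin performs the same one-to-many test.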