Python: how to create a new column if the text in a column contains a specific string pattern?


My current data looks like this:

+-------+----------------------------+-------------------+-----------------------+
| Index |             0              |         1         |           2           |
+-------+----------------------------+-------------------+-----------------------+
|     0 | Reference Curr             | Daybook / Voucher | Invoice Date Due Date |
|     1 | V50011 Tech Comp           | nan               | Phone:0177222222      |
|     2 | Regis Place                | nan               | Fax:017757575789      |
|     3 | Catenberry                 | nan               | nan                   |
|     4 | Manhattan, NY              | nan               | nan                   |
|     5 | V7484 Pipe                 | nan               | Phone:                |
|     6 | Japan                      | nan               | nan                   |
|     7 | nan                        | nan               | nan                   |
|     8 | 4543.34GBP (British Pound) | nan               | nan                   |
+-------+----------------------------+-------------------+-----------------------+

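I am trying to create a new column, df['Company']. If df[0] starts with "V" and df[2] contains "Phone", it should hold the contents of df[0]; if that condition is not met, it can be nan. Below is what I am looking for: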

+-------+----------------------------+-------------------+-----------------------+------------+
| Index |             0              |         1         |           2           | Company    |
+-------+----------------------------+-------------------+-----------------------+------------+
|     0 | Reference Curr             | Daybook / Voucher | Invoice Date Due Date | nan        |
|     1 | V50011 Tech                | nan               | Phone:0177222222      |V50011 Tech |
|     2 | Regis Place                | nan               | Fax:017757575789      | nan        |
|     3 | Catenberry                 | nan               | nan                   | nan        |
|     4 | Manhattan, NY              | nan               | nan                   | nan        |
|     5 | V7484 Pipe                 | nan               | Phone:                | V7484 Pipe |
|     6 | Japan                      | nan               | nan                   | nan        |
|     7 | nan                        | nan               | nan                   | nan        |
|     8 | 4543.34GBP (British Pound) | nan               | nan                   | nan        |
+-------+----------------------------+-------------------+-----------------------+------------+

I am trying the script below, but I get the error

ValueError: Wrong number of items passed 1420, placement implies 1

df['Company'] = pd.np.where(df[2].str.contains("Ph"), df[0].str.extract(r"(^V[A-Za-z0-9]+)"), "stop")

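I put "stop" as the else part because I don't know how to make Python use nan when the condition is not met.

I would also like to be able to parse out just part of df[0], e.g. only the v5001 part, and not the rest of the cell contents. I tried something like this using AMC's answer, but got an error:

df.loc[df[0].str.startswith('V') & df[2].str.contains('Phone'), 'Company'] = df[0].str.extract(r"(^V[A-Za-z0-9]+)")

Thanks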
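A possible explanation for the error, offered as a sketch rather than a confirmed diagnosis: by default str.extract returns a one-column DataFrame of shape (n, 1), and np.where broadcasts the (n,) condition against it into an (n, n) array, which cannot be stored in a single column. If the frame has 1420 rows, that would account for the "1420" in the message. The five-row frame below is made up for illustration; passing expand=False and na=False are the assumed fixes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data modelled on the question's table.
df = pd.DataFrame({
    0: ["Reference Curr", "V50011 Tech Comp", "Regis Place", "V7484 Pipe", np.nan],
    2: ["Invoice Date Due Date", "Phone:0177222222", "Fax:017757575789", "Phone:", np.nan],
})

# na=False turns missing cells into False instead of NaN.
has_phone = df[2].str.contains("Ph", na=False)

# expand=True (the default) gives a one-column DataFrame of shape (n, 1).
as_frame = df[0].str.extract(r"(^V[A-Za-z0-9]+)")
# The (n,) condition broadcasts against (n, 1), producing an (n, n) array.
broadcast = np.where(has_phone, as_frame, "stop")
print(broadcast.shape)

# expand=False gives a Series of shape (n,), so the result stays one-dimensional.
as_series = df[0].str.extract(r"(^V[A-Za-z0-9]+)", expand=False)
df["Company"] = np.where(has_phone, as_series, np.nan)
print(df)
```

Using np.nan as the third argument is what puts a missing value in the rows that fail the condition, in place of the "stop" string.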

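One possible way to solve this is with a list comprehension. You might get a speedup from some of pandas' built-in functionality, but this will get you there.

#!/usr/bin/env python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    0: ["reference", "v5001 tech comp", "catenberry", "very different"],
    1: ["not", "phone", "other", "text"]
})
# Keep x when it starts with "v" and y mentions "phone"; otherwise use NaN.
df["new_column"] = [x if (x[0].lower() == "v") & ("phone" in y.lower())
                    else np.nan for x, y in df.loc[:, [0, 1]].values]
print(df)

which produces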

                 0      1       new_column
0        reference    not              NaN
1  v5001 tech comp  phone  v5001 tech comp
2       catenberry  other              NaN
3   very different   text              NaN

All I did was take your two conditions, build a new list, and assign it to your new column.


You haven't given us an easy way to test potential solutions, but this should do the job:

df.loc[df[0].str.startswith('V') & df[2].str.contains('Phone'), 'Company'] = df[0]
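Since no test data was provided, here is a minimal sketch of the same .loc approach on a made-up frame. Two things in it are assumptions, not part of the answer above: na=False is added so that missing cells do not put NaN into the boolean mask, and the extra "Code" column (a hypothetical name) shows one way to keep only the leading V code with expand=False.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data modelled on the question's table.
df = pd.DataFrame({
    0: ["Reference Curr", "V50011 Tech Comp", "Regis Place", "V7484 Pipe", np.nan],
    1: [np.nan] * 5,
    2: ["Invoice Date Due Date", "Phone:0177222222", "Fax:017757575789", "Phone:", np.nan],
})

# Both conditions as one boolean mask; na=False maps missing cells to False.
mask = df[0].str.startswith("V", na=False) & df[2].str.contains("Phone", na=False)

# Whole cell: rows outside the mask are left as NaN in the new column.
df.loc[mask, "Company"] = df[0]

# Leading code only: expand=False makes extract return a Series, not a DataFrame.
df.loc[mask, "Code"] = df[0].str.extract(r"^(V[A-Za-z0-9]+)", expand=False)

print(df)
```

Assigning a Series through .loc aligns on the index, so only the masked rows receive a value.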


You can use the pandas apply function to do this:

import re
import numpy as np
import pandas as pd
df['Company'] = df.apply(lambda x: x[0].split()[0] if re.match(r'^v[A-Za-z0-9]+', x[0].lower()) and 'phone' in x[1].lower() else np.nan, axis=1)
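The one-liner above assumes every cell is a string; a missing value (a float NaN) has no .lower() method. As a hedged variation, the same row-wise logic can be written as a named function that guards against non-string cells. The helper name company_code, the sample frame, and the choice of column 2 as the phone column (as in the question's table) are all assumptions for illustration.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data modelled on the question's table.
df = pd.DataFrame({
    0: ["Reference Curr", "V50011 Tech Comp", "Regis Place", "V7484 Pipe", np.nan],
    1: [np.nan] * 5,
    2: ["Invoice Date Due Date", "Phone:0177222222", "Fax:017757575789", "Phone:", np.nan],
})


def company_code(row):
    """Return the leading V code when the row qualifies, else NaN."""
    name, contact = row[0], row[2]
    # Missing cells are floats, so skip anything that is not a string.
    if not isinstance(name, str) or not isinstance(contact, str):
        return np.nan
    match = re.match(r"v[a-z0-9]+", name, flags=re.IGNORECASE)
    if match and "phone" in contact.lower():
        return match.group(0)
    return np.nan


df["Company"] = df.apply(company_code, axis=1)
print(df)
```

This is still a Python-level loop over rows, so on large frames the vectorised string methods are likely to be faster.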

Edit: adjusted to address the comments under @AMC's answer.


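IIUC, we can use a boolean condition with some basic regex to extract the V number, or we can apply the same formula inside a where statement.

To set the values to NaN, we can use np.nan.

If you want to grab the whole string after the V, we can use [V]\w+.* to get everything after the first match.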
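To make the difference between the two patterns concrete, here is a small sketch on a made-up Series. The third, anchored pattern is an addition for comparison, not part of the answer: without ^ the pattern can match a capital V anywhere in the cell.

```python
import pandas as pd

# Hypothetical sample values taken from the question's table.
s = pd.Series(["V50011 Tech Comp", "Daybook / Voucher", "Regis Place"])

# Only the first run of word characters starting at a capital V.
code_only = s.str.extract(r"([V]\w+)", expand=False)

# Same start, but .* continues to the end of the cell.
whole = s.str.extract(r"([V]\w+.*)", expand=False)

# Anchored with ^ so that only cells beginning with V match.
anchored = s.str.extract(r"^([V]\w+)", expand=False)

print(pd.DataFrame({"code_only": code_only, "whole": whole, "anchored": anchored}))
```

Note that the unanchored patterns pick up "Voucher" from "Daybook / Voucher", which is why combining the regex with a startswith check or a ^ anchor may be safer.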

import pandas as pd
from io import StringIO

d = """+-------+----------------------------+-------------------+-----------------------+
| Index |             0              |         1         |           2           |
+-------+----------------------------+-------------------+-----------------------+
|     0 | Reference Curr             | Daybook / Voucher | Invoice Date Due Date |
|     1 | V50011 Tech Comp           | nan               | Phone:0177222222      |
|     2 | Regis Place                | nan               | Fax:017757575789      |
|     3 | Catenberry                 | nan               | nan                   |
|     4 | Manhattan, NY              | nan               | nan                   |
|     5 | Ultilagro, CT              | nan               | nan                   |
|     6 | Japan                      | nan               | nan                   |
|     7 | nan                        | nan               | nan                   |
|     8 | 4543.34GBP (British Pound) | nan               | nan                   |
+-------+----------------------------+-------------------+-----------------------+"""

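# Parse the pipe-delimited table, then drop the border rows and the empty edge columns.
df = pd.read_csv(StringIO(d), sep='|', skiprows=1)
df = df.iloc[1:-1, 2:-1]
df.columns = df.columns.str.strip()

# Extract the V number only from rows whose column "2" mentions "phone"; other rows become NaN.
has_phone = df["2"].str.contains("phone", case=False, na=False)
df["3"] = df.loc[has_phone, "0"].str.extract(r"([V]\w+)", expand=False)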

If you want to do it as a where statement:

import numpy as np

# The condition must be the boolean Series itself, and expand=False keeps the extract one-dimensional.
df["3"] = np.where(
    df["2"].str.contains("phone", case=False, na=False),
    df["0"].str.extract(r"([V]\w+)", expand=False),
    np.nan,
)
print(df[['0','2','3']])
                            0                      2       3
1              Reference Curr  Invoice Date Due Date     NaN
2            V50011 Tech Comp       Phone:0177222222  V50011
3                 Regis Place       Fax:017757575789     NaN
4                  Catenberry                    nan     NaN
5               Manhattan, NY                    nan     NaN
6               Ultilagro, CT                    nan     NaN
7                       Japan                    nan     NaN
8                         nan                    nan     NaN
9  4543.34GBP (British Pound)                    nan     NaN


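Here is another way to get the result:

import numpy as np

# na=False treats missing cells as not matching.
condition1 = df['0'].str.startswith('V', na=False)
condition2 = df['2'].str.contains('Phone', na=False)

df['Company'] = np.where((condition1 & condition2), df['0'], np.nan)
# Keep only the first word; expand=True would return several columns and fail on assignment.
df['Company'] = df['Company'].str.split(' ').str[0]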
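As a hedged alternative to np.where, pandas' own Series.where keeps the values where the mask is True and fills NaN elsewhere, so the two steps can be combined. The sample frame and its string column labels "0" and "2" are assumptions chosen to match this answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with string column labels, as in this answer.
df = pd.DataFrame({
    "0": ["Reference Curr", "V50011 Tech Comp", "Regis Place", "V7484 Pipe", np.nan],
    "2": ["Invoice Date Due Date", "Phone:0177222222", "Fax:017757575789", "Phone:", np.nan],
})

mask = df["0"].str.startswith("V", na=False) & df["2"].str.contains("Phone", na=False)

# First whitespace-separated word of each cell (NaN stays NaN).
first_word = df["0"].str.split().str[0]

# Series.where keeps entries where mask is True and replaces the rest with NaN.
df["Company"] = first_word.where(mask)
print(df)
```

Because Series.where returns a Series with the original index, no conversion back from a NumPy array is needed.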


I don't think you even need to use where() here.

I think this is very pythonic, but also a bit controversial.

@AMC Deleted to avoid an argument.

I didn't vote on this, but I think AMC's approach is cleaner and more... pandas? I prefer using the built-in pandas filtering over filtering inside a lambda.

I see, but that doesn't warrant a downvote; my answer correctly accomplishes the task the OP asked for, and it is valid...

Thanks for the reply. "My answer correctly accomplishes the task the OP asked for, and it is valid": but not all solutions that produce the correct output are equal, right? That's why you should get more votes. However, a downvote means (at least to me) that something is wrong with the answer.

But this is clearly open to a lot of discussion. For example, if someone asks how to sum a list of numbers my_list = [1, 2, 3], is the answer eval('sum(my_list)') worth a downvote? It works fine, doesn't it?