Python: extracting numbers from strings in a DataFrame column

python regex pandas

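I have a dataframe called data, and I am trying to clean up one of its columns so that I can convert the prices to numeric values. This is how I filter the column to find the incorrect values:

data[data['Incorrect_Price'].astype(str).str.contains('[A-Za-z]')]

I tried

data['Incorrect_Price'][20:51].str.findall(r"(\d+) dollars")

and

data['Incorrect_Price'][20:51].str.findall(r"(\d+) cents")

to find the rows that contain "cents" and "dollars" so that I can extract the dollar and cent amounts, but I have not been able to combine the two when iterating over all the rows of the dataframe.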

I would like the results to look like this:

    Incorrect_Price        Desired    Occurences    errors
23  99 cents                .99           732         1
50  3 dollars and 49 cents  3.49          211         1
72  the price is 625        625           128         3
86  new price is 4.39       4.39           19         2
138 4 bucks                 4.00           3          1
199 new price 429           429            13         1
225 price is 9.99           9.99           5          1
240 new price is 499        499            8          2
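To make the starting point concrete, here is a minimal sketch of the filter and the two findall attempts on a tiny sample frame. The frame, its index values, and the names as_text, has_letters, dollars and cents are assumptions made up for illustration; only the column name Incorrect_Price and the patterns come from the question.

```python
import pandas as pd

# Hypothetical sample built from a few rows of the table above;
# the last value is already numeric and should not be flagged.
data = pd.DataFrame(
    {"Incorrect_Price": ["99 cents", "3 dollars and 49 cents", "the price is 625", 4.39]},
    index=[23, 50, 72, 86],
)

# Cast to str first so that numeric entries do not break the .str accessor.
as_text = data["Incorrect_Price"].astype(str)

# Rows whose price contains at least one letter.
has_letters = as_text.str.contains("[A-Za-z]")
suspect = data[has_letters]

# Each call returns one list of matches per row, so dollars and cents
# end up in two separate Series that still have to be combined.
dollars = as_text.str.findall(r"(\d+) dollars")
cents = as_text.str.findall(r"(\d+) cents")

print(suspect)
print(dollars.loc[50], cents.loc[50])  # ['3'] ['49']
```

Because findall yields a list per row, the two results cannot simply be added together, which is the part that still needs solving.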

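As long as the strings in Incorrect_Price keep the structure shown in your examples (the numbers are not written out as words), the task is relatively easy to solve.

With a regular expression you can extract the numeric part together with the optional "cent"/"cents" or "dollar"/"dollars". The two main differences from your attempt are that you look for pairs of a numeric value and "cent" or "dollar", and that such pairs may occur more than once.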
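As a small illustration of what "pairs that may occur more than once" means, the sketch below runs the pattern over a single string and collects the named groups of every match. The names PRICE_PART and pairs are assumptions chosen for this example, and the character classes are written in a slightly shorter but equivalent form.

```python
import re

# Same idea as the pattern used in this answer: a number followed by
# optional whitespace and the unit "cent" or "dollar" (plural "s" optional).
PRICE_PART = re.compile(r"(?P<value>\d*\.?\d{1,2})\s*(?P<currency>cent|dollar)s?")

# finditer yields one match per (number, unit) pair found in the string.
pairs = [m.groupdict() for m in PRICE_PART.finditer("3 dollars and 49 cents")]

print(pairs)
# [{'value': '3', 'currency': 'dollar'}, {'value': '49', 'currency': 'cent'}]
```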

import re


def extract_number_currency(value):
    # Find every (number, unit) pair, e.g. ('3', 'dollar') and ('49', 'cent').
    prices = re.findall(r'(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?', value)

    result = 0.0
    for amount, currency in prices:
        partial = float(amount)
        if currency == 'cent':
            # Cents are hundredths of a dollar.
            result += partial / 100
        else:
            result += partial

    return result


print(extract_number_currency('3 dollars and 49 cent'))

The example call prints 3.49.

Now you need to apply this function to all the incorrect values in the column that contains the textual prices. For simplicity, I apply it to all values here (but I trust you will be able to handle a subset):

data['Desired'] = data['Incorrect_Price'].apply(extract_number_currency)

Voilà.

Breakdown of the regex '(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?'

There are two named capturing groups, (?P<name>...).

The first capturing group, (?P<value>[\d]*[.]?[\d]{1,2}), captures:
[\d] - a digit
[\d]* - repeated 0 or more times
[.]? - followed by an optional (?) dot
[\d]{1,2} - followed by a digit repeated 1 to 2 times

\s* - matches 0 or more whitespace characters

Now the simpler second capturing group, (?P<currency>cent|dollar):
cent|dollar - an alternation between the strings cent and dollar; whichever one matches is captured
s? - the optional plural "s" of "cents" or "dollars" (it sits outside the group, so it is not captured)

For starters, there is no "price entered by shopper" column in your example.

Wow, this is impressive. Thank you so much! Could you provide a breakdown of the regex logic?
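The answer applies the function to every row and leaves the subset to the reader. As written, the function returns 0.0 for any string without a "cent" or "dollar" unit, so one possible way to restrict it is a boolean mask, sketched below. The sample frame, the mask pattern, and the names PRICE_PART, as_text and mask are assumptions for illustration, not part of the original answer.

```python
import re

import pandas as pd

PRICE_PART = re.compile(r"(?P<value>\d*\.?\d{1,2})\s*(?P<currency>cent|dollar)s?")


def extract_number_currency(text):
    """Sum all '<number> dollar(s)' and '<number> cent(s)' parts of text."""
    total = 0.0
    for amount, currency in PRICE_PART.findall(text):
        total += float(amount) / 100 if currency == "cent" else float(amount)
    return total


# Hypothetical sample rows taken from the table in the question.
data = pd.DataFrame(
    {"Incorrect_Price": ["99 cents", "3 dollars and 49 cents", "the price is 625"]},
    index=[23, 50, 72],
)

as_text = data["Incorrect_Price"].astype(str)

# Only rows that contain a digit followed by one of the two units.
mask = as_text.str.contains(r"\d\s*(?:cent|dollar)")

# Rows outside the mask keep NaN in the new column instead of a misleading 0.0.
data.loc[mask, "Desired"] = as_text[mask].apply(extract_number_currency)

print(data)
```

Rows such as "the price is 625" stay empty here and would need their own rule, for example a plain numeric extraction.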