Python 3.x: How do I find a specific numeric pattern in a string column and replace that value with the text version of the ordinal?
Please bear with me, I'm new to Python. I'm building a function that I can use to clean up text from various surveys. I feel I'm close to converting the numeric form of ordinals into their text form, but I can't quite work it out. Below is the function I'm trying to build (note that I tried two ways of finding the regex pattern on the nbr= line in the function, and I explain the errors from both below):

Errors: when I run words.str.findall on the nbr= line in the function, I get the error AttributeError: 'str' object has no attribute 'str'. When I run re.findall, I do get a DataFrame back, but the the_string_clean column does not reflect the string on each row. Instead, I get this:
   record  the_string                                 the_string_clean
0  47      This is the first string                   "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th Name: the_string, dtype: object"
1  56      This is the 2nd string                     "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th Name: the_string, dtype: object"
2  59      nothing to see here                        "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th Name: the_string, dtype: object"
3  134     4th string has the date: today is the 8th  "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th Name: the_string, dtype: object"
4  454     this has a typo10th                        "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th Name: the_string, dtype: object"
Expected output:

record  the_string                                 the_string_clean
47      this is the first string                   this is the first string
56      this is the 2nd string                     this is the second string
59      nothing to see here                        nothing to see here
134     4th string has the date: today is the 8th  fourth string has the date: today is the eighth
454     this has a typo10th                        this has a typotenth
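I hope I was clear enough. I'm new to Python and would really appreciate any help.

You can simplify your replace_ordinal_numbers function by using re.sub and calling num2words in a lambda function as the replacement. Then just use apply to run the function on the column: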
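import re

import pandas as pd
from num2words import num2words

my_df = pd.DataFrame({"record": [47, 56, 59, 134, 454],
                      "the_string": ["This is the first string",
                                     "This is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    # Replace each number followed by st/nd/rd/th with its ordinal word.
    # int() is used so the call does not depend on num2words accepting strings.
    return re.sub(r'(\d+)(?:st|nd|rd|th)',
                  lambda m: num2words(int(m.group(1)), ordinal=True),
                  words)

# apply passes each cell to the function as a plain str
my_df['the_string'] = my_df['the_string'].apply(replace_ordinal_numbers)
my_df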
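Output:

   record  the_string
0  47      This is the first string
1  56      This is the second string
2  59      nothing to see here
3  134     fourth string has the date: today is the eighth
4  454     this has a typotenth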
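To make the callback mechanism behind re.sub easier to see in isolation, here is a minimal sketch that needs neither pandas nor num2words. The names ORDINAL_WORDS, ordinal_to_word and replace_ordinals are hypothetical, and the small lookup table is only a stand-in for num2words covering a handful of values. It is an illustration under those assumptions, not a replacement for the library.

```python
import re

# Hypothetical stand-in for num2words: a tiny lookup table, for illustration only.
ORDINAL_WORDS = {1: "first", 2: "second", 3: "third",
                 4: "fourth", 8: "eighth", 10: "tenth"}

# Digits followed by one of the ordinal suffixes; only the digits are captured.
ORDINAL_PATTERN = re.compile(r'(\d+)(?:st|nd|rd|th)')


def ordinal_to_word(match):
    """Called by re.sub once per match; receives the Match object."""
    number = int(match.group(1))
    # Leave the original text untouched when the number is not in the table.
    return ORDINAL_WORDS.get(number, match.group(0))


def replace_ordinals(text):
    """Replace every numeric ordinal in a single string."""
    return ORDINAL_PATTERN.sub(ordinal_to_word, text)


rows = ["This is the 2nd string",
        "4th string has the date: today is the 8th",
        "this has a typo10th",
        "the 21st entry"]

# Each element is a plain str, just like each cell that Series.apply passes in.
cleaned = [replace_ordinals(row) for row in rows]
for row in cleaned:
    print(row)
```

Because the function works on one plain string at a time, it can be handed to apply unchanged. That is also why calling words.str.findall inside such a function fails: the .str accessor belongs to a pandas Series, not to the individual str values that apply passes in.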
Note that you need to use an alternation, (?:st|nd|rd|th), in the regex to match one of st, nd, rd or th. The character class you were using, [st|nd|rd|th], matches a single character from the set d, h, n, r, s, t or |, not one of the two-letter suffixes.
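The difference between the character class and the alternation can be checked with the re module alone. The sample sentence and variable names below are made up for illustration.

```python
import re

text = "today is the 8th and this is the 2nd string"

# Character class: matches exactly ONE character out of d, h, n, r, s, t or |
char_class = re.compile(r'(\d+)[st|nd|rd|th]')

# Non-capturing alternation: matches one whole suffix (st, nd, rd or th)
alternation = re.compile(r'(\d+)(?:st|nd|rd|th)')

class_matches = [m.group(0) for m in char_class.finditer(text)]
alt_matches = [m.group(0) for m in alternation.finditer(text)]

print(class_matches)  # ['8t', '2n']   only the first suffix letter is consumed
print(alt_matches)    # ['8th', '2nd'] the full suffix is consumed

# The character class even accepts a digit followed by a literal pipe.
print(char_class.fullmatch("5|") is not None)   # True
print(alternation.fullmatch("5|") is not None)  # False
```

With the character class, a substitution would leave the second suffix letter behind (for example "8th" would become "eighthh"), which is why the alternation is needed.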