Regex: adding a new column to a pandas DataFrame based on another column
I am trying to add a new column to a pandas DataFrame. The new column, df['Year_Prod'], is derived from another column, df['title'], from which I extract the year. Sample data:
country   designation   title
Italy     Vulkà Bianco  Nicosia 2013 Vulkà Bianco (Etna)
Portugal  Avidagos      Quinta dos Avidagos 2011 Avidagos Red (Douro)
Code:
import re
import pandas as pd
df=pd.read_csv(r'test.csv', index_col=0)
df['Year_Prod']=re.findall('\\d+', df['title'])
print(df.head(10))
I get the following error:
File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3119, in __setitem__
    self._set_item(key, value)
File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3194, in _set_item
    value = self._sanitize_column(key, value)
File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3391, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
File "C:\Python37\lib\site-packages\pandas\core\series.py", line 4001, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
**ValueError: Length of values does not match length of index**
Please let me know your thoughts on this, thanks.
You are not specifying a delimiter; pd.read_csv defaults to a comma (,).
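To illustrate the delimiter point, here is a minimal sketch that parses a small in-memory sample. The '||' separator and the sample rows are assumptions based on the answer's code and the question's data, not a confirmed description of test.csv. pandas interprets separators longer than one character as regular expressions, so the pipes are escaped and the Python parsing engine is requested explicitly.

```python
import io
import pandas as pd

# Hypothetical file contents: a literal '||' between fields (an assumption)
raw = (
    "country||designation||title\n"
    "Italy||Vulkà Bianco||Nicosia 2013 Vulkà Bianco (Etna)\n"
    "Portugal||Avidagos||Quinta dos Avidagos 2011 Avidagos Red (Douro)\n"
)

# sep is treated as a regex here, so each '|' must be escaped
sample = pd.read_csv(io.StringIO(raw), sep=r"\|\|", engine="python")
print(sample.columns.tolist())
print(sample.shape)
```

If the file is actually comma-separated, the default separator already applies and no sep argument is needed.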
You can use pd.Series.apply:
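import re
import pandas as pd

def year_finder(x):
    # Return the first run of digits found in the title string
    return re.findall(r'\d+', x)[0]

# A multi-character delimiter is treated as a regular expression, so the
# literal '||' is escaped; adjust this to whatever separator your file uses
df = pd.read_csv(r'test.csv', delimiter=r'\|\|', engine='python', index_col=0)
df['Year_Prod'] = df['title'].apply(year_finder)
print(df.head(10))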
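One caveat with indexing the result of re.findall with [0] is that it raises IndexError for a title that contains no digits. The sketch below is one possible defensive variant, not part of the original answer; the function name safe_year_finder and the second sample title are hypothetical.

```python
import re
import pandas as pd

def safe_year_finder(text):
    """Return the first four-digit run in text, or None if there is none."""
    match = re.search(r"\d{4}", text)
    return match.group(0) if match else None

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Unnamed Red (Douro)",  # hypothetical title without a year
])

# Rows without a match become missing values instead of raising
years = titles.apply(safe_year_finder)
print(years)
```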
Edit: for the str.extract approach, see @Vaishali's answer.
You can use pandas' str.extract method.
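A minimal sketch of the str.extract approach, assuming the two sample rows from the question are split into columns as shown here (the column split is an assumption). The pattern needs a capture group, and expand=False makes a single group return a Series rather than a DataFrame.

```python
import pandas as pd

# Sample rows rebuilt from the question; the column boundaries are assumed
df = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "designation": ["Vulkà Bianco", "Avidagos"],
    "title": [
        "Nicosia 2013 Vulkà Bianco (Etna)",
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    ],
})

# Capture the first run of exactly four digits in each title
df["Year_Prod"] = df["title"].str.extract(r"(\d{4})", expand=False)
print(df[["title", "Year_Prod"]])
```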
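Edit: as @Paul H. suggested in the comments, your code does not work because re.findall expects a string, but you are passing a Series. It can be done with apply, where the value passed for each row is a string, but that makes little sense because str.extract is more efficient.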
df.title.apply(lambda x: re.findall(r'\d{4}', x)[0])
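The difference between passing the whole Series and passing one string per row can be sketched as follows. This is an illustration on made-up in-memory data, run against a current Python and pandas, where re.findall rejects a Series with a TypeError; the exact exception text may vary by version.

```python
import re
import pandas as pd

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
])

# re.findall works on a single string, not on a Series
try:
    re.findall(r"\d{4}", titles)
    failed = False
except TypeError as exc:
    failed = True
    print("re.findall rejected the Series:", exc)

# apply hands the function one string at a time, so this works
years = titles.apply(lambda s: re.findall(r"\d{4}", s)[0])
print(years.tolist())
```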
pandas also has findall:
df.title.str.findall(r'\d+').str[0]
Out[239]:
0    2013
1    2011
Name: title, dtype: object
# To assign the result (suggested by @pygo in the comments):
# df['Year_Prod'] = df.title.str.findall(r'\d+').str[0]
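It may help to see what the two steps return separately. In this sketch (the second title is hypothetical), str.findall yields one list of matches per row, and .str[0] then picks the first element of each list, giving a missing value where the list is empty.

```python
import pandas as pd

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Unnamed Red (Douro)",  # hypothetical title without any digits
])

# Step 1: one list of all digit runs per row (possibly empty)
matches = titles.str.findall(r"\d+")
print(matches)

# Step 2: first element of each list; empty lists become missing values
first = matches.str[0]
print(first)
```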
Another approach is based on the iloc method:
>>> df['Year_Prod'] = df.iloc[:, 2].str.extract(r'(\d{4})', expand=False)
>>> df
    country   designation                                          title Year_Prod
0     Italy  Vulkà Bianco               Nicosia 2013 Vulkà Bianco (Etna)      2013
1  Portugal      Avidagos  Quinta dos Avidagos 2011 Avidagos Red (Douro)      2011
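Note that the extracted years are strings (dtype object in the output above). If numeric years are wanted, one option is to convert them afterwards. The sketch below is an assumption about what a reader might want, not part of the original answer; it selects the column by name rather than by position and uses a hypothetical third title to show how a missing year is handled.

```python
import pandas as pd

df = pd.DataFrame({
    "title": [
        "Nicosia 2013 Vulkà Bianco (Etna)",
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
        "Unnamed Red (Douro)",  # hypothetical title without a year
    ],
})

# Extract the year as text, then convert; unparseable values become missing
year_text = df["title"].str.extract(r"(\d{4})", expand=False)
df["Year_Prod"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")
print(df.dtypes)
print(df)
```

The nullable Int64 dtype keeps the years as integers even when some rows have no year.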
Another suggestion: str.translate instead of regex.
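Do any of your titles contain more than one number?
@G.Anderson, good question. I checked beforehand, and each title contains only one.
It may be worth explaining that re.findall expects a string as its second argument, but the OP passed a pandas.Series. Also, the OP should know that functions in the standard library generally won't accept pandas objects.
Excellent @W-B, added to my list :) +1. However, could you add this to the answer to complete the desired output, so that others can benefit from it: df['Year_Prod'] = df.title.str.findall(r'\d+').str[0]
@pygo sure :-) added
@W-B, thnx Dude :-)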
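The question about multiple numbers matters because r'\d+' returns whichever digit run comes first. The OP reports only one number per title, so this does not affect the data in the question; the sketch below uses a hypothetical title to show the difference, together with a stricter pattern that assumes years fall in the 1900s or 2000s.

```python
import pandas as pd

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Cuvée 42 Reserve 2015 Red",  # hypothetical title with two numbers
])

# First digit run of any length: picks up "42" in the second title
first_number = titles.str.findall(r"\d+").str[0]

# Stricter pattern (assumes years start with 19 or 20 and stand alone)
year = titles.str.extract(r"\b((?:19|20)\d{2})\b", expand=False)

print(first_number.tolist())
print(year.tolist())
```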