Regex: adding a new column to a pandas DataFrame based on another column

regex python-3.x pandas dataframe


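I am trying to add a new column to a pandas DataFrame. The new column, df['Year_Prod'], is derived from another column, df['title'], from which I want to extract the year.

Sample data: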

country    designation     title
Italy      Vulkà Bianco    Nicosia 2013 Vulkà Bianco (Etna)         
Portugal   Avidagos        Quinta dos Avidagos 2011 Avidagos Red (Douro)

Code:

import re
import pandas as pd

df=pd.read_csv(r'test.csv', index_col=0)
df['Year_Prod']=re.findall('\\d+', df['title'])
print(df.head(10))

I get the following error:

  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3119, in __setitem__
    self._set_item(key, value)
  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3194, in _set_item
    value = self._sanitize_column(key, value)
  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3391, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "C:\Python37\lib\site-packages\pandas\core\series.py", line 4001, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')

**ValueError: Length of values does not match length of index**


Please let me know your thoughts on this. Thanks.

You did not specify a delimiter; for read_csv the default is a comma (,).
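To illustrate the delimiter point, here is a minimal sketch of reading a file whose fields are not comma-separated. The '||' separator and the inline sample text are assumptions made for illustration; the actual layout of test.csv is not shown in the question.

```python
import io

import pandas as pd

# Hypothetical file contents: the '||' separator is an assumption for illustration.
raw = (
    "id||country||designation||title\n"
    "0||Italy||Vulkà Bianco||Nicosia 2013 Vulkà Bianco (Etna)\n"
    "1||Portugal||Avidagos||Quinta dos Avidagos 2011 Avidagos Red (Douro)\n"
)

# A separator longer than one character is interpreted as a regular expression,
# so the pipes are escaped and the Python parsing engine is requested explicitly.
df = pd.read_csv(io.StringIO(raw), sep=r"\|\|", engine="python", index_col=0)

print(df.columns.tolist())
```

If the columns come out merged into one, the separator passed to read_csv most likely does not match the file.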

You can use pd.Series.apply:

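import re
import pandas as pd

def year_finder(x):
    # Return the first run of digits found in the string
    return re.findall(r'\d+', x)[0]

# Assumes the file uses a literal '||' separator. Multi-character separators
# are treated as regular expressions, so the pipes are escaped and the
# Python parsing engine is used.
df = pd.read_csv(r'test.csv', delimiter=r'\|\|', engine='python', index_col=0)
df['Year_Prod'] = df['title'].apply(year_finder)

print(df.head(10))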
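Note that indexing the result of re.findall with [0] raises an IndexError for any title that contains no digits. A possible variant, sketched here with a hypothetical helper named find_year and the assumption that vintages fall between 1900 and 2099, returns None in that case instead:

```python
import re

import pandas as pd

# Assumption: a vintage is a four-digit number between 1900 and 2099.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def find_year(title):
    """Return the first four-digit year in title, or None if there is none."""
    match = YEAR.search(title)
    return match.group(0) if match else None


titles = pd.Series(
    [
        "Nicosia 2013 Vulkà Bianco (Etna)",
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
        "Some Wine Without Vintage",  # made-up row with no year
    ]
)

years = titles.apply(find_year)
print(years)
```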

Edit: for the str.extract approach, see @Vaishali's answer.

You can use pandas str.extract.
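As a minimal sketch of the str.extract approach, assuming the year is the first run of four digits in each title (the DataFrame below is rebuilt from the sample rows in the question):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "title": [
            "Nicosia 2013 Vulkà Bianco (Etna)",
            "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
        ]
    }
)

# The pattern needs a capture group; expand=False returns a Series
# rather than a one-column DataFrame.
df["Year_Prod"] = df["title"].str.extract(r"(\d{4})", expand=False)

print(df)
```

Titles without a match get a missing value rather than raising an error.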

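Edit: as @Paul H. suggested in the comments, your code fails because re.findall expects a string, but you are passing a Series. It can be done with apply, where the value passed for each row is a string, but there is little point because str.extract is more efficient: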

df.title.apply(lambda x: re.findall(r'\d{4}', x)[0])

pandas also has findall:

df.title.str.findall(r'\d+').str[0]
Out[239]: 
0    2013
1    2011
Name: title, dtype: object

# To assign the result to the new column (suggested by @pygo):
# df['Year_Prod'] = df.title.str.findall(r'\d+').str[0]

Another approach, based on the iloc method:

>>> df['Year_Prod'] = df.iloc[:,2].str.extract(r'(\d{4})', expand=False)
>>> df
    country   designation                                          title Year_Prod
0     Italy  Vulkà Bianco               Nicosia 2013 Vulkà Bianco (Etna)      2013
1  Portugal      Avidagos  Quinta dos Avidagos 2011 Avidagos Red (Douro)      2011
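The extracted values are strings. If the year is needed as a number, one option is sketched below; the third row is a made-up title without a year, added to show how missing values behave, and the nullable Int64 dtype is my choice rather than something from the answers.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "title": [
            "Nicosia 2013 Vulkà Bianco (Etna)",
            "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
            "Non-vintage Brut",  # made-up row with no year
        ]
    }
)

years = df["title"].str.extract(r"(\d{4})", expand=False)

# to_numeric turns the strings into numbers (missing stays missing);
# the nullable Int64 dtype keeps whole numbers alongside missing values.
df["Year_Prod"] = pd.to_numeric(years, errors="coerce").astype("Int64")

print(df.dtypes)
```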

str.translate instead of regex
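One way a str.translate approach could look is sketched below. It assumes each title contains exactly one number, as confirmed in the comments; with several numbers, their digits would be concatenated. The translation table is built from the data itself, which is my own choice for illustration.

```python
import string

import pandas as pd

titles = pd.Series(
    [
        "Nicosia 2013 Vulkà Bianco (Etna)",
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    ]
)

# Collect every character that appears in the data and is not an ASCII digit.
non_digits = "".join(set("".join(titles)) - set(string.digits))

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans("", "", non_digits)

# Deleting all non-digit characters leaves only the year.
years = titles.str.translate(table)

print(years)
```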

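Are there multiple numbers in any of your titles?

@G.Anderson, good question. I checked earlier, and each title has only one occurrence.

It may be worth explaining that re.findall expects a string as its second argument, but the OP passed a pandas.Series. Also, the OP should know that functions in the standard library generally will not accept pandas objects.

Excellent @W-B, adding this to my list :) +1. However, could you add this to the answer to complete the desired output, so that someone can benefit from it: df['Year_Prod'] = df.title.str.findall('\d+').str[0]

@pygo sure :-) added

@W-B, thnx Dude :-)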