Python: cleaning a DataFrame more efficiently than with loc


My code looks like this:

import pandas as pd
df = pd.read_excel("Energy Indicators.xls", header=None, footer=None)
c_df = df.copy()
c_df = c_df.iloc[18:245, 2:]
c_df = c_df.rename(columns={2: 'Country', 3: 'Energy Supply', 4:'Energy Supply per Capita', 5:'% Renewable'})
c_df['Energy Supply'] = c_df['Energy Supply'].apply(lambda x: x*1000000)
c_df.loc[c_df['Country'] == 'Korea, Rep.'] = 'South Korea'
c_df.loc[c_df['Country'] == 'United States of America20'] = 'United States'
c_df.loc[c_df['Country'] == 'United Kingdom of Great Britain and Northern Ireland'] = 'United Kingdom'
c_df.loc[c_df['Country'] == 'China, Hong Kong Special Administrative Region'] = 'Hong Kong'
c_df.loc[c_df['Country'] == 'Venezuela (Bolivarian Republic of)'] = 'Venezuela'
c_df.loc[c_df['Country'] == 'Bolivia (Plurinational State of)'] = 'Bolivia'
c_df.loc[c_df['Country'] == 'Switzerland17'] = 'Switzerland'
c_df.loc[c_df['Country'] == 'Australia1'] = 'Australia'
c_df.loc[c_df['Country'] == 'China2'] = 'China'
c_df.loc[c_df['Country'] == 'Falkland Islands (Malvinas)'] = 'Bolivia'
c_df.loc[c_df['Country'] == 'Greenland7'] = 'Greenland'
c_df.loc[c_df['Country'] == 'Iran (Islamic Republic of'] = 'Iran'
c_df.loc[c_df['Country'] == 'Italy9'] = 'Italy'
c_df.loc[c_df['Country'] == 'Japan10'] = 'Japan'
c_df.loc[c_df['Country'] == 'Kuwait11'] = 'Kuwait'
c_df.loc[c_df['Country'] == 'Micronesia (Federal States of)'] = 'Micronesia'
c_df.loc[c_df['Country'] == 'Netherlands12'] = 'Netherlands'
c_df.loc[c_df['Country'] == 'Portugal13'] = 'Portugal'
c_df.loc[c_df['Country'] == 'Saudi Arabia14'] = 'Saudi Arabia'
c_df.loc[c_df['Country'] == 'Serbia15'] = 'Serbia'
c_df.loc[c_df['Country'] == 'Sint Maarteen (Dutch part)'] = 'Sint Marteen'
c_df.loc[c_df['Country'] == 'Spain16'] = 'Spain'
c_df.loc[c_df['Country'] == 'Ukraine18'] = 'Ukraine'
c_df.loc[c_df['Country'] == 'Denmark5'] = 'Denmark'
c_df.loc[c_df['Country'] == 'France6'] = 'France'
c_df.loc[c_df['Country'] == 'Indonesia8'] = 'Indonesia'

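I feel there must be a simpler way to change the values for countries whose names contain parentheses and digits. Which method can I use to find the names with parentheses in the column? isin?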
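On the narrower point of locating such names: isin only tests exact membership, so a pattern test is probably closer to what is wanted. A minimal sketch, assuming a small made-up frame in place of the real spreadsheet (the rows and the name needs_cleaning are illustrative only):

```python
import pandas as pd

# Made-up sample rows; the values are illustrative, not taken from the real file.
c_df = pd.DataFrame({
    'Country': ['China2', 'Bolivia (Plurinational State of)', 'France6', 'Brazil'],
    'Energy Supply': [1.0, 2.0, 3.0, 4.0],
})

# True where the name contains a digit or an opening parenthesis.
needs_cleaning = c_df['Country'].str.contains(r'\d|\(', regex=True)
print(c_df.loc[needs_cleaning, 'Country'].tolist())

# Passing the column label to .loc changes only 'Country';
# without it, every column of the matching rows would be overwritten.
c_df.loc[c_df['Country'] == 'China2', 'Country'] = 'China'
```

Note that the assignments in the code above omit the column label, so they would write the new name into every column of the matching row, not just 'Country'.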

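You can first strip the digits and the parenthesized text. After that, for everything else that needs a non-trivial replacement, declare a mapping and apply it with pd.Series.replace: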

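# Keys must match the names as they look after stripping (the trailing dot in 'Korea, Rep.' is kept).
mapper = {'Korea, Rep.' : 'South Korea', 'Falkland Islands' : 'Bolivia', ...}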

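df['Country'] = (
    df['Country']
    .str.replace(r'\d+|\s*\(.*\)', '', regex=True)  # regex=True is needed on pandas >= 2.0
    .str.strip()                                    # drop leftover surrounding whitespace
    .replace(mapper)                                # exact-match renames for the remaining cases
)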
Simple, and done.

Details

\d+     # one or more digits
|       # regex OR pipe
\s*     # zero or more whitespace characters
\(      # literal opening parenthesis
.*      # match anything (greedy)
\)      # literal closing parenthesis
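To see the pattern at work, here is a minimal runnable sketch on a handful of names copied from the question. The Series name countries and the two-entry mapper are assumptions made for illustration; the real mapping would hold whatever renames remain after stripping.

```python
import pandas as pd

# A few raw names taken from the question, used here as sample input.
countries = pd.Series([
    'United States of America20',
    'Venezuela (Bolivarian Republic of)',
    'Korea, Rep.',
    'Falkland Islands (Malvinas)',
    'Japan10',
])

# Hypothetical mapping for names that still need a rename after stripping.
mapper = {
    'Korea, Rep.': 'South Korea',
    'United States of America': 'United States',
}

cleaned = (
    countries
    .str.replace(r'\d+|\s*\(.*\)', '', regex=True)  # drop digits and "(...)" parts
    .str.strip()                                    # trim leftover whitespace
    .replace(mapper)                                # exact-match renames
)
print(cleaned.tolist())
# ['United States', 'Venezuela', 'South Korea', 'Falkland Islands', 'Japan']
```

Because the digits are removed before the mapping is applied, the mapper keys are written without the footnote numbers.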

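Use a dictionary, then:

dict_to_replace = {'Korea, Rep.': 'South Korea',
                   'United States of America20': 'United States',
                   'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
                   ...}

# Replace only within the 'Country' column; names not in the dict are left unchanged.
df['Country'] = df['Country'].replace(dict_to_replace)

You should make a dict that maps the country names and use the map function.
@TwistedSim psst, map replaces entries that are not in the dict with NaN, which is probably undesirable here (you want replace).
@COLDSPEED, didn't know that! Thanks :)
+1, but keep in mind that the replacement dict becomes needlessly complicated because of the specific digits and parentheses in each entry. Also, the replacement only concerns the country column, yet the rest of the DataFrame gets duplicated for no reason :o)
True! I edited my answer to address the second issue, but stripping the names the way you did and removing the unneeded parts in the original DataFrame is probably smarter, and ends up with a cleaner command.
Would you mind explaining the pattern in the replace call (r'\d+|\s*\(.*\)', '')?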
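The difference between map and replace raised in the comments can be checked with a small sketch. The two-row Series and the name lookup are made up for illustration:

```python
import pandas as pd

countries = pd.Series(['Korea, Rep.', 'Brazil'])
lookup = {'Korea, Rep.': 'South Korea'}  # hypothetical one-entry mapping

# map looks every value up in the dict; values without a key become NaN.
mapped = countries.map(lookup)

# replace swaps only the values that appear as keys and keeps the rest.
replaced = countries.replace(lookup)

print(mapped.tolist())
print(replaced.tolist())
```

With a partial mapping like this one, replace is the safer choice, since countries that need no rename pass through untouched.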