Python: merging two dataframes without a common column
Tags: python, loops, dataframe, multiple-columns
I am adding a "state" column to an existing dataframe that shares no common column with my other dataframe. I therefore need to convert zipcodes to states (for example, 00704 would be PR) and load them into the dataframe that has the new state column.
reviewers = pd.read_csv('reviewers.txt',
                        sep='|',
                        header=None,
                        names=['user id','age','gender','occupation','zipcode'])
reviewers['state'] = ""
   user id  age gender  occupation zipcode state
0        1   24      M  technician   85711
1        2   53      F       other   94043
zipcodes = pd.read_csv('zipcodes.txt',
                       usecols = [1,4],
                       converters={'Zipcode':str})
  Zipcode State
0   00704    PR
1   00704    PR
2   00704    PR
3   00704    PR
4   00704    PR
zipcodes1 = zipcodes.set_index('Zipcode') ###Setting the index to zipcode
dfzip = zipcodes1
print(dfzip)
        State
Zipcode
00704      PR
00704      PR
00704      PR
zips = (pd.Series(dfzip.values.tolist(), index = zipcodes1['State'].index))
states = []
for zipcode in reviewers['Zipcode']:
    if re.search('[a-zA-Z]+', zipcode):
        append.states['canada']
    elif zipcode in zips.index:
        append.states(zips['zipcode'])
    else:
        append.states('unkown')
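I am also not sure whether my loop is correct. I have to classify the zipcodes as U.S. zipcodes (numeric), Canadian postal codes (containing letters), and everything else, which we define as "unknown". Let me know if you need the data files.
Your loop needs fixing: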
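import re

states = []
for zipcode in reviewers['zipcode']:
    # Canadian postal codes contain letters; U.S. zipcodes are digits only
    if re.search('[a-zA-Z]', zipcode):
        states.append('Canada')
    # assumes zips holds exactly one state per zipcode
    elif zipcode in zips.index:
        states.append(zips[zipcode])
    else:
        states.append('Unknown')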
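Also, I assume you want to put the list of states back into the dataframe. In that case you do not need the for loop. You can use apply on the zipcode column to get the new column: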
def findState(code):
    res = 'Unknown'
    # Canadian postal codes contain letters; U.S. zipcodes are digits only
    if re.search('[a-zA-Z]', code):
        res = 'Canada'
    elif code in zips.index:
        res = zips[code]
    return res

reviewers['State'] = reviewers['zipcode'].apply(findState)
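Both the loop and the apply version rely on zips[code] returning a single state string, which only holds when zips is a Series with one entry per zipcode. The following is a minimal sketch of building such a lookup and applying it. The two small dataframes are made-up stand-ins for zipcodes.txt and reviewers.txt, and the name find_state is only illustrative:

```python
import re

import pandas as pd

# Hypothetical sample rows standing in for zipcodes.txt and reviewers.txt.
zipcodes = pd.DataFrame({'Zipcode': ['00704', '00704', '85711', '94043'],
                         'State': ['PR', 'PR', 'AZ', 'CA']})
reviewers = pd.DataFrame({'user id': [1, 2, 3, 4],
                          'zipcode': ['85711', '94043', 'V3N4P', '99999']})

# Keep one row per zipcode so that zips[code] yields a single string.
zips = zipcodes.drop_duplicates('Zipcode').set_index('Zipcode')['State']


def find_state(code):
    """Return 'Canada' for codes with letters, the state if known, else 'Unknown'."""
    if re.search('[a-zA-Z]', code):
        return 'Canada'
    if code in zips.index:
        return zips[code]
    return 'Unknown'


reviewers['State'] = reviewers['zipcode'].apply(find_state)
print(reviewers['State'].tolist())
```

Without the drop_duplicates step, a zipcode that appears several times in the index (such as 00704 in the question's data) would make zips[code] return a Series instead of a string.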
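Use a vectorized approach with str.match, numpy.where and map (the same statements that appear in the timings).
Loop version using apply: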
import re

def f(code):
    res = "unknown"
    # if lowercase letters are possible, change the pattern to [a-zA-Z]+
    if re.match('[A-Z]+', code):
        res = 'canada'
    elif code in zips.index:
        res = zips[code]
    return res

reviewers['State1'] = reviewers['zipcode'].apply(f)
print (reviewers.tail(10))
     user id  age gender     occupation zipcode state State1
933      934   61      M       engineer   22902    VA     VA
934      935   42      M         doctor   66221    KS     KS
935      936   24      M          other   32789    FL     FL
936      937   48      M       educator   98072    WA     WA
937      938   38      F     technician   55038    MN     MN
938      939   26      F        student   33319    FL     FL
939      940   32      M  administrator   02215    MA     MA
940      941   20      M        student   97229    OR     OR
941      942   48      F      librarian   78209    TX     TX
942      943   22      M        student   77841    TX     TX
#test if same output
print ((reviewers['State1'] == reviewers['state']).all())
True
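The state column that State1 is compared against comes from the vectorized statements. A self-contained sketch of that variant might look like the following; the sample dataframes are made up for illustration, and building zips with drop_duplicates is an assumption about how the lookup Series was prepared:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for zipcodes.txt and reviewers.txt.
zipcodes = pd.DataFrame({'Zipcode': ['00704', '00704', '85711', '94043'],
                         'State': ['PR', 'PR', 'AZ', 'CA']})
reviewers = pd.DataFrame({'user id': [1, 2, 3, 4],
                          'zipcode': ['85711', '94043', 'V3N4P', '99999']})

# Assumed lookup: one state per unique zipcode.
zips = zipcodes.drop_duplicates('Zipcode').set_index('Zipcode')['State']

# True where the code starts with an uppercase letter (Canadian postal code).
mask = reviewers['zipcode'].str.match('[A-Z]+')

# map() looks every zipcode up in zips and leaves NaN where there is no match.
reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))

# Whatever is still missing is neither Canadian nor a known U.S. zipcode.
reviewers['state'] = reviewers['state'].fillna('unknown')
print(reviewers['state'].tolist())
```

Because every step works on the whole column at once, no Python-level function is called per row.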
Timings:
In [56]: %%timeit
...: mask = reviewers['zipcode'].str.match('[A-Z]+')
...: reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
...: reviewers['state'] = reviewers['state'].fillna('unknown')
...:
100 loops, best of 3: 2.08 ms per loop
In [57]: %%timeit
...: reviewers['State1'] = reviewers['zipcode'].apply(f)
...:
100 loops, best of 3: 17 ms per loop
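As an alternative that neither answer uses, the two frames can also be joined directly even though their key columns have different names, by passing left_on and right_on to merge. This is only a sketch under the same assumptions as before: made-up sample rows, and one state per zipcode after dropping duplicates.

```python
import pandas as pd

# Hypothetical sample rows standing in for zipcodes.txt and reviewers.txt.
zipcodes = pd.DataFrame({'Zipcode': ['00704', '00704', '85711', '94043'],
                         'State': ['PR', 'PR', 'AZ', 'CA']})
reviewers = pd.DataFrame({'user id': [1, 2, 3, 4],
                          'zipcode': ['85711', '94043', 'V3N4P', '99999']})

# One row per zipcode, otherwise the merge would duplicate reviewer rows.
lookup = zipcodes.drop_duplicates('Zipcode')

# Left merge keeps every reviewer; unmatched zipcodes get NaN in 'State'.
merged = reviewers.merge(lookup, how='left',
                         left_on='zipcode', right_on='Zipcode')
merged = merged.drop(columns='Zipcode')

# Label the rows the lookup cannot resolve.
is_canadian = merged['zipcode'].str.contains('[A-Za-z]')
merged['State'] = merged['State'].fillna('unknown')
merged.loc[is_canadian, 'State'] = 'canada'
print(merged['State'].tolist())
```

The left merge preserves the order and number of reviewer rows, so the result lines up with the original dataframe.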
Please provide some data from the two text files. Also, it should be states.append('Canada'), or states.extend(['Canada']) with a list; states.extend('Canada') would add each character as a separate item.
Brilliant as always!