Python: case-insensitive replace (mapping)
I posted a "Part 1" question that led me to the answer for the function I needed, but I think this warrants its own question. If not, I will delete it.
I want to apply a function to a DataFrame that replaces full state names with their abbreviations (New York -> NY). However, I noticed in my dataset that if a state name is capitalized differently (for example, all uppercase), it does not match the dictionary. I tried to work around it, but I can't seem to crack the code:
import pandas as pd
import numpy as np
dfp = pd.DataFrame({'A' : [np.nan,np.nan,3,4,5,5,3,1,5,np.nan],
                    'B' : [1,0,3,5,0,0,np.nan,9,0,0],
                    'C' : ['Pharmacy of IDAHO','NY Pharma','NJ Pharmacy','Idaho Rx','CA Herbals','Florida Pharma','AK RX','Ohio Drugs','PA Rx','USA Pharma'],
                    'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.nan],
                    'E' : ['Assign','Unassign','Assign','Ugly','Appreciate','Undo','Assign','Unicycle','Assign','Unicorn',]})
import us
statez = us.states.mapping('abbr', 'name')
inv_map = {v: k for k, v in statez.items()}
def replace_states(company):
    # find all states that exist in the string
    state_found = filter(lambda state: state.lower() in company.lower(), statez.values())
    # replace each state with its abbreviation
    for state in state_found:
        print(state, inv_map[state])
        company = company.replace(state, inv_map[state])
        print("---" , company)
    # return the modified string (or original if no states were found)
    return company
dfp['C'] = dfp['C'].map(replace_states)
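Output: note that 'Pharmacy of IDAHO' is left unchanged.
Is there a way to make this function case-insensitive?

I would find the index of the match and then use that index to replace it, regardless of case: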
    # replace each state with its abbreviation
    for state in state_found:
        print(state, inv_map[state])
        index = company.lower().find(state.lower())
        company = company.replace(company[index:index + len(state)], inv_map[state])
        print("---" , company)
This preserves the case of every other part of the string.
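As an alternative to locating the index by hand, the same case-insensitive replacement can be sketched with a single compiled regular expression. This is only an illustration, not the answerer's code: `NAME_TO_ABBR` is a hypothetical three-entry stand-in for the `inv_map` built from the `us` package, and `replace_states_ci` is a made-up name.

```python
import re

# Hypothetical excerpt of a name -> abbreviation mapping (assumption);
# in the question this role is played by inv_map.
NAME_TO_ABBR = {"Idaho": "ID", "Florida": "FL", "Ohio": "OH"}

# Lowercased lookup so a match in any case can be resolved.
_lookup = {name.lower(): abbr for name, abbr in NAME_TO_ABBR.items()}

# One alternation of all names, longest first so that longer names
# are tried before shorter ones that may be their prefix.
_names = sorted(NAME_TO_ABBR, key=len, reverse=True)
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(n) for n in _names) + r")\b",
    flags=re.IGNORECASE,
)

def replace_states_ci(company):
    """Replace every state name in `company` with its abbreviation, ignoring case."""
    return _pattern.sub(lambda m: _lookup[m.group(0).lower()], company)

print(replace_states_ci("Pharmacy of IDAHO"))
print(replace_states_ci("Florida Pharma"))
```

Because the substitution goes through the match object, the rest of the string keeps its original case, and strings without a state name are returned unchanged.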
Replacing state names with their abbreviations (case-insensitive, vectorized solution):
t1 = dfp.C.str.split(expand=True)
t2 = t1.stack().str.title().map(inv_map).unstack()
t1[t2.notnull()] = t2
dfp['new'] = t1.stack().groupby(level=0).agg(' '.join)
Result:
In [152]: x
Out[152]:
     A    B                  C            D           E             new
0  NaN  1.0  Pharmacy of IDAHO     123456.0      Assign  Pharmacy of ID
1  NaN  0.0          NY Pharma     123456.0    Unassign       NY Pharma
2  3.0  3.0        NJ Pharmacy    1234567.0      Assign     NJ Pharmacy
3  4.0  5.0           Idaho Rx   12345678.0        Ugly           ID Rx
4  5.0  0.0         CA Herbals      12345.0  Appreciate      CA Herbals
5  5.0  0.0     Florida Pharma      12345.0        Undo       FL Pharma
6  3.0  NaN              AK RX   12345678.0      Assign           AK RX
7  1.0  9.0         Ohio Drugs  123456789.0    Unicycle        OH Drugs
8  5.0  0.0              PA Rx    1234567.0      Assign           PA Rx
9  NaN  0.0         USA Pharma          NaN     Unicorn      USA Pharma
Explanation:
In [155]: t1 = dfp.C.str.split(expand=True)
In [156]: t1
Out[156]:
          0         1      2
0  Pharmacy        of  IDAHO
1        NY    Pharma   None
2        NJ  Pharmacy   None
3     Idaho        Rx   None
4        CA   Herbals   None
5   Florida    Pharma   None
6        AK        RX   None
7      Ohio     Drugs   None
8        PA        Rx   None
9       USA    Pharma   None
In [157]: t2 = t1.stack().str.title().map(inv_map).unstack()
In [158]: t2
Out[158]:
     0    1     2
0  NaN  NaN    ID
1  NaN  NaN  None
2  NaN  NaN  None
3   ID  NaN  None
4  NaN  NaN  None
5   FL  NaN  None
6  NaN  NaN  None
7   OH  NaN  None
8  NaN  NaN  None
9  NaN  NaN  None
In [159]: t1[t2.notnull()] = t2
In [160]: t1
Out[160]:
          0         1     2
0  Pharmacy        of    ID
1        NY    Pharma  None
2        NJ  Pharmacy  None
3        ID        Rx  None
4        CA   Herbals  None
5        FL    Pharma  None
6        AK        RX  None
7        OH     Drugs  None
8        PA        Rx  None
9       USA    Pharma  None
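The split / title-case / map / join steps walked through above can be mirrored on a single string in plain Python, which may make the logic easier to follow. This is a rough sketch under assumptions: `NAME_TO_ABBR` is a hypothetical excerpt standing in for `inv_map`, and `abbreviate_tokens` is an illustrative name.

```python
# Hypothetical excerpt of a name -> abbreviation mapping (assumption);
# in the answer this role is played by inv_map.
NAME_TO_ABBR = {"Idaho": "ID", "Florida": "FL", "Ohio": "OH", "New York": "NY"}

def abbreviate_tokens(text, mapping=NAME_TO_ABBR):
    """Split on whitespace, title-case each token for the lookup only,
    and keep the original token when it is not a known state name."""
    tokens = text.split()
    return " ".join(mapping.get(tok.title(), tok) for tok in tokens)

print(abbreviate_tokens("Pharmacy of IDAHO"))
print(abbreviate_tokens("New York Pharma"))
```

One consequence of splitting on whitespace shows up in the second call: a multi-word name such as "New York" is looked up as the separate tokens "New" and "York", so it is left unchanged by this token-based approach.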
Replacing state abbreviations with their names (vectorized solution):
In [88]: dfp['state'] = dfp.C.str.extract(r'\b([A-Z]{2})\b', expand=False)
In [89]: dfp
Out[89]:
     A    B                  C            D           E  state
0  NaN  1.0  Pharmacy of IDAHO     123456.0      Assign    NaN
1  NaN  0.0          NY Pharma     123456.0    Unassign     NY
2  3.0  3.0        NJ Pharmacy    1234567.0      Assign     NJ
3  4.0  5.0           Idaho Rx   12345678.0        Ugly    NaN
4  5.0  0.0         CA Herbals      12345.0  Appreciate     CA
5  5.0  0.0     Florida Pharma      12345.0        Undo    NaN
6  3.0  NaN              AK RX   12345678.0      Assign     AK
7  1.0  9.0         Ohio Drugs  123456789.0    Unicycle    NaN
8  5.0  0.0              PA Rx    1234567.0      Assign     PA
9  NaN  0.0         USA Pharma          NaN     Unicorn    NaN
In [90]: dfp.C = dfp.C.replace(dfp.state.tolist(),
                               dfp.state.map(statez).tolist(),
                               regex=True)
In [91]: dfp
Out[91]:
     A    B                    C            D           E  state
0  NaN  1.0    Pharmacy of IDAHO     123456.0      Assign    NaN
1  NaN  0.0      New York Pharma     123456.0    Unassign     NY
2  3.0  3.0  New Jersey Pharmacy    1234567.0      Assign     NJ
3  4.0  5.0             Idaho Rx   12345678.0        Ugly    NaN
4  5.0  0.0   California Herbals      12345.0  Appreciate     CA
5  5.0  0.0       Florida Pharma      12345.0        Undo    NaN
6  3.0  NaN            Alaska RX   12345678.0      Assign     AK
7  1.0  9.0           Ohio Drugs  123456789.0    Unicycle    NaN
8  5.0  0.0      Pennsylvania Rx    1234567.0      Assign     PA
9  NaN  0.0           USA Pharma          NaN     Unicorn    NaN
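For a single string, the abbreviation-to-name direction could be sketched with the same `\b[A-Z]{2}\b` pattern and a dictionary lookup. This is an illustration rather than the answer's pandas code: `ABBR_TO_NAME` is a hypothetical excerpt standing in for `statez`, and `expand_abbreviations` is a made-up name.

```python
import re

# Hypothetical excerpt of an abbreviation -> name mapping (assumption);
# in the answer this role is played by statez.
ABBR_TO_NAME = {"NY": "New York", "NJ": "New Jersey", "AK": "Alaska"}

# Exactly two uppercase letters delimited by word boundaries.
_two_caps = re.compile(r"\b[A-Z]{2}\b")

def expand_abbreviations(text):
    """Expand known two-letter codes; unknown ones (e.g. "RX") are kept as-is."""
    return _two_caps.sub(lambda m: ABBR_TO_NAME.get(m.group(0), m.group(0)), text)

print(expand_abbreviations("NY Pharma"))
print(expand_abbreviations("AK RX"))
```

The word boundaries keep a two-letter code from matching inside a longer uppercase word such as "USA".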
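I apologize for my confusion, but could you explain where to put this code, or the reasoning behind it? When I put it in the loop, I get crazy output.
@MattR I've added extra code to help you place it. If the output is incorrect, let me know what you get.
I added some sample code so that answerers can use my test DataFrame. But this is my current output: Idaho ID -- IDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDIDXID
That would happen if len(state) == 0 ... so state_found may contain an empty string "". Can you check type(state)? Isn't it just a lowercase string of the state name?
state is a string.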
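The runaway "IDIDID..." output discussed in these comments is what `str.replace` produces when the text to be replaced is an empty string: the replacement is inserted before every character and at the end. The slice `company[index:index + len(state)]` is empty when `state` is empty, and can also be empty when `find` returns -1. A guarded variant might look like the sketch below; `replace_ci_once` is a hypothetical helper, not code from the thread, and it assumes lowercasing does not change the string's length.

```python
# Replacing the empty string inserts the replacement around every character.
print("Idaho".replace("", "ID"))  # IDIIDdIDaIDhIDoID

def replace_ci_once(text, needle, replacement):
    """Replace the first case-insensitive occurrence of `needle` in `text`.

    Returns `text` unchanged when `needle` is empty or not found, which
    avoids ever passing an empty slice to str.replace.
    """
    index = text.lower().find(needle.lower())
    if not needle or index == -1:
        return text
    # Splice by position instead of calling str.replace on a slice.
    return text[:index] + replacement + text[index + len(needle):]

print(replace_ci_once("Pharmacy of IDAHO", "Idaho", "ID"))  # Pharmacy of ID
```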
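I know this is a bit counterintuitive, but I actually want to go from the full state name to the abbreviated version. Example: ohio -> OH
@MattR, hmm..., that makes it more challenging. Let me try another solution...
There have been a few edits, so I'm not sure which one I commented on earlier, but the first part is exactly what I needed. However, I have no idea how you did it! It works brilliantly. Any explanation would be great, but it's not required; I respect your time and appreciate your help.
@MattR, I've added some explanation, please check.
I just came back to this post... your brain is on a whole other level. This is really great; I hope to be able to think this way someday...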