Python: partial matching of a list of strings in a DataFrame
Hi everyone, I am trying to match partial strings in a DataFrame column against a list of abbreviations and return the matching strings (uppercase only). I don't have much programming experience; I have only just started learning.
import pandas as pd

# List of state abbreviations
state_abbrv = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA",
               "ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK",
               "OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]
# Create the DataFrame
d = {"Index": [1, 2, 3, 4, 5, 6, 7], "Description": ["ABNY", "MANY", "NYNY", "DO", "nyNY", "CWARD NY", "HOWARD BEACH NY"]}
df = pd.DataFrame(data=d)
Here is df:
Index Description
1 ABNY
2 MANY
3 NYNY
4 DO
5 nyNY
6 CWARD NY
7 HOWARD BEACH NY
Here is my code:
df = df.assign(State = df["Description"].str.findall(state_abbrv))
Here is the expected result:
Index Description State
1 ABNY NY
2 MANY MA,NY
3 NYNY NY,NY
4 DO
5 nyNY NY
6 CWARD NY WA,NY
7 HOWARD BEACH NY WA,AR,NY
Thanks.

You can try joining the abbreviations with '|'.join and then using str.findall:
statesjoin='|'.join(state_abbrv)
df=df.assign(State = df["Description"].str.findall(statesjoin))
Output:
df
Index Description State
0 1 ABNY [NY]
1 2 MANY [MA, NY]
2 3 NYNY [NY, NY]
3 4 DO []
4 5 nyNY [NY]
5 6 ABALBB [AL]
6 7 ALCA [AL, CA]
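The State column above holds Python lists, whereas the expected result in the question shows comma-separated strings. A minimal sketch of one way to bridge that gap, assuming the same '|'.join pattern and using Series.str.join to collapse each list into a string. The small df is rebuilt here only so the snippet runs on its own, and the abbreviation list is shortened for illustration:

```python
import pandas as pd

# Shortened abbreviation list, for illustration only
state_abbrv = ["AL", "CA", "MA", "NY"]
df = pd.DataFrame({"Description": ["ABNY", "MANY", "NYNY", "DO", "nyNY"]})

statesjoin = "|".join(state_abbrv)
# findall returns one list per row; str.join turns each list into a string
df["State"] = df["Description"].str.findall(statesjoin).str.join(",")
print(df)
```

Rows with no match end up as an empty string rather than an empty list, which matches the blank State cell for "DO" in the expected result.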
For the possible case described by @AkshaySehgal, you can try the following:
import re
df=df.assign(State = df["Description"].apply(lambda x: ','.join(re.findall('..',x))).str.findall(statesjoin))
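To see what the re.findall('..', x) step contributes, it can help to run it on its own. This is only an illustrative sketch: it shows that the string is cut into fixed two-character chunks starting at position 0, so an abbreviation that straddles a chunk boundary is not seen, and a trailing odd character is dropped.

```python
import re

# Each string is split into consecutive 2-character chunks
for text in ["ABNY", "NYNY", "CWARD NY"]:
    chunks = re.findall("..", text)
    print(text, "->", chunks)

# ABNY -> ['AB', 'NY']
# NYNY -> ['NY', 'NY']
# CWARD NY -> ['CW', 'AR', 'D ', 'NY']
```

Note that "CWARD NY" no longer contains a "WA" chunk after splitting, so this variant would report AR and NY for that row but not WA.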
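Instead of combining all the state abbreviations into a single pattern string (which can give wrong results when one abbreviation ends with the character that another begins with), you can use the following approach: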
def get_common(s):
    parts = set(map(''.join, zip(*[iter(s)]*2)))  # break the string into 2-character tokens
    common = ', '.join(list(parts.intersection(set(state_abbrv))))  # intersection of tokens and abbreviations
    return common
df['State'] = df['Description'].apply(get_common)
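Because get_common builds a set, repeated abbreviations collapse into one and the order of the result is not guaranteed, so "NYNY" yields NY rather than NY,NY. A sketch of a variant that keeps both order and repeats follows. The name get_common_ordered is hypothetical, the abbreviation list is shortened for illustration, and the str(s) call is an assumption that some cells may not be strings:

```python
import pandas as pd

state_abbrv = ["AL", "CA", "MA", "NY"]  # shortened list, for illustration only
state_set = set(state_abbrv)

def get_common_ordered(s):
    s = str(s)  # assumption: cells may hold non-string values
    # Consecutive 2-character tokens; a trailing odd character is dropped
    tokens = [s[i:i + 2] for i in range(0, len(s) - 1, 2)]
    # Keep tokens that are abbreviations, in their original order
    return ",".join(t for t in tokens if t in state_set)

df = pd.DataFrame({"Description": ["ABNY", "MANY", "NYNY", "DO", "nyNY"]})
df["State"] = df["Description"].apply(get_common_ordered)
print(df)
```

Like the original function, this only looks at tokens aligned to even positions, so it shares the same blind spot for abbreviations that start at an odd offset.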
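This can sometimes fail when a string such as BN happens to be a state abbreviation (it is not in this case). ABNY would then produce BN,NY when it should only produce NY.

Sure, that's true. I just added a solution for that case.

Thanks. I just found that it doesn't work when a description contains both digits and text (e.g. ny1NY), so I converted the column to str type.

Hi, after I changed some of the description data to "CWARD NY" and "HOWARD BEACH NY", some results went missing. I have updated the code from my table.

@JustStartLearningCode But why does "CWARD NY" get [WA,NY] while "HOWARD BEACH NY" gets [WA,AR,NY]? Shouldn't "CWARD NY" also get [WA,AR,NY], because of the substring "WARD"?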
Index Description State
1 ABNY NY
2 MANY MA,NY
3 NYNY NY,NY
4 DO
5 nyNY NY
6 ABALBB AL
7 ALCA AL,CA
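On the comment about "CWARD NY" and "HOWARD BEACH NY": a plain regex alternation returns non-overlapping matches, so once WA has been consumed the A is no longer available to start AR, and both strings give WA,NY. If overlapping matches are what is wanted, one option (a sketch, not taken from the answers above) is to wrap the alternation in a lookahead with a capture group, so each position is tested without consuming characters. The abbreviation list is shortened and the column names NonOverlapping and Overlapping are made up for this illustration:

```python
import pandas as pd

state_abbrv = ["AR", "MA", "NY", "WA"]  # shortened list, for illustration only
df = pd.DataFrame({"Description": ["MANY", "CWARD NY", "HOWARD BEACH NY"]})

plain = "|".join(state_abbrv)
# Zero-width lookahead: the match itself is empty, the group captures the abbreviation
overlapping = "(?=(" + plain + "))"

df["NonOverlapping"] = df["Description"].str.findall(plain).str.join(",")
df["Overlapping"] = df["Description"].str.findall(overlapping).str.join(",")
print(df)
```

With the lookahead pattern, both "CWARD NY" and "HOWARD BEACH NY" produce WA,AR,NY, which is consistent with the point raised in the comment.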