Python: merging similar data with Pandas
How can I merge similar entries (such as the variants of "recommendation") into a single value?
df['Why you choose us'].str.lower().value_counts()
location 35
recommendation 23
recommedation 8
confort 7
availability 4
reconmmendation 3
facilities 3
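Assuming the counts are held in a DataFrame df with the columns reason and count, use .groupby() on a partial string (the first four characters of reason) together with .transform() to compute the sum for each group.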
df['groupcount']=df.groupby(df.reason.str[0:4])['count'].transform('sum')
reason count groupcount
0 location 35 35
1 recommendation 23 34
2 recommedation 8 34
3 confort 7 7
4 availability 4 4
5 reconmmendation 3 34
6 facilities 3 3
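The answer starts from a frame with reason and count columns, while the question only shows the output of value_counts(). The sketch below shows one possible way to bridge the two. The raw rows are made up for illustration, and the names raw and counts are assumptions, not part of the original post.

```python
import pandas as pd

# Hypothetical raw survey answers (invented for illustration only)
raw = pd.DataFrame({
    "Why you choose us": [
        "Location", "location", "Recommendation",
        "recommedation", "reconmmendation", "Confort",
    ]
})

# Count the lower-cased answers, then turn the result into a two-column frame
counts = raw["Why you choose us"].str.lower().value_counts()
df = counts.rename_axis("reason").reset_index(name="count")

# Sum the counts of all rows that share the same first four characters
df["groupcount"] = df.groupby(df["reason"].str[0:4])["count"].transform("sum")
print(df)
```

Note that transform('sum') keeps one row per original spelling and only repeats the group total on each of them.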
If needed, you can view the full string and the partial string side by side. Try:
df=df.assign(groupname=df.reason.str[0:4])
df['groupcount']=df.groupby(df.reason.str[0:4])['count'].transform('sum')
print(df)
reason count groupname groupcount
0 location 35 loca 35
1 recommendation 23 reco 34
2 recommedation 8 reco 34
3 confort 7 conf 7
4 availability 4 avai 4
5 reconmmendation 3 reco 34
6 facilities 3 faci 3
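If the goal is a single row per group rather than a repeated total, the groups can be collapsed. The following is only a sketch under the same four-character assumption. Taking the most frequent spelling as the group label is my own choice for illustration, not something the answer prescribes, and the names key and merged are hypothetical.

```python
import pandas as pd

# The counts shown in the question
df = pd.DataFrame({
    "reason": ["location", "recommendation", "recommedation", "confort",
               "availability", "reconmmendation", "facilities"],
    "count": [35, 23, 8, 7, 4, 3, 3],
})

merged = (
    df.sort_values("count", ascending=False)       # most frequent spelling first
      .assign(key=lambda d: d["reason"].str[0:4])  # four-character group key
      .groupby("key", sort=False)
      .agg(reason=("reason", "first"),             # label: most frequent spelling
           count=("count", "sum"))                 # total for the whole group
      .reset_index(drop=True)
)
print(merged)
```

Here the three spellings of "recommendation" end up as one row with a count of 34.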
If a single row contains several items, as in your CSV, then:
import pandas as pd

#Read the csv
df=pd.read_csv(r'path')
#Lowercase the answers, fill in missing ones, and split each cell on commas into a list of reasons
df['Why you choose us']=(df['Why you choose us'].str.lower().fillna('no comment given')).str.split(',')
#Explode the lists so that each reason is in its own row, with all the other attributes intact
df=df.explode('Why you choose us')
#Strip the whitespace around each value, then count the occurrences
print(df['Why you choose us'].str.strip().value_counts())
location 48
no comment given 34
recommendation 25
confort 8
facilities 8
recommedation 8
price 7
availability 6
reputation 5
reconmmendation 3
internet 3
ac 3
breakfast 3
tranquility 2
cleanliness 2
aveilable 1
costumer service 1
pool 1
comfort 1
search engine 1
Name: group, dtype: int64
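To try the split-and-explode steps without the CSV file, here is a small self-contained sketch. The four rows and the guest column are invented for illustration and do not come from the asker's data.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: several reasons per cell, one missing answer
df = pd.DataFrame({
    "guest": [1, 2, 3, 4],
    "Why you choose us": ["Location, Price", "recommendation", None,
                          "price,location , Pool"],
})

# Lowercase, fill missing answers, and split each cell on commas into a list
df["Why you choose us"] = (
    df["Why you choose us"].str.lower().fillna("no comment given").str.split(",")
)

# One row per reason; the guest column is repeated for each exploded value
exploded = df.explode("Why you choose us")

# Strip stray whitespace left over from the split before counting
reason_counts = exploded["Why you choose us"].str.strip().value_counts()
print(reason_counts)
```

Without the str.strip() step, " price" and "price" would be counted as two different reasons.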
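Do you have a way to determine whether words are similar, or are you asking for an algorithm that determines word similarity? This is not a machine learning question; please don't spam unrelated tags (removed).
@Craig I am asking for an algorithm that merges similar entry names (strings), so that, for example, all entries similar to the word "recommendation" are counted together under a single number.
OK, but how does it find the similar words?
It finds similar words by slicing the first 4 characters of the column I called "reason". That is the expression df.reason.str[0:4].
OK, so this is a basic proposal for computing word similarity: words are similar when their first 4 characters are the same.
In this case, slicing the first 4 characters of the string gives the substring needed to classify each entry uniquely.
See my edit. Does this help?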
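The four-character rule discussed above only catches variants that share a prefix. As a different technique, not the one used in the answer, the standard library's difflib can compare whole strings. The sketch below is one possible approach: the cutoff of 0.8 and the rule "the most frequent spelling becomes the label" are assumptions chosen for illustration, and the names canonical and mapping are hypothetical.

```python
from difflib import get_close_matches

import pandas as pd

# The counts shown in the question, indexed by the lower-cased answer
counts = pd.Series({
    "location": 35, "recommendation": 23, "recommedation": 8, "confort": 7,
    "availability": 4, "reconmmendation": 3, "facilities": 3,
})

canonical = []  # labels accepted so far, most frequent spellings first
mapping = {}    # every spelling -> the label it is merged into

for label in counts.sort_values(ascending=False).index:
    # Look for an already accepted label that is close enough to this one
    match = get_close_matches(label, canonical, n=1, cutoff=0.8)
    if match:
        mapping[label] = match[0]
    else:
        canonical.append(label)
        mapping[label] = label

# Relabel the index with the chosen labels and add up the counts per label
merged = counts.rename(index=mapping).groupby(level=0).sum()
print(merged)
```

A lower cutoff merges more aggressively and can join unrelated words, so the resulting mapping is worth checking by eye before relying on the totals.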