Python: Merging similar data with Pandas

python pandas dataframe

How can I merge similar values (for example, the different spellings of "recommendation") into a single value?

df['Why you choose us'].str.lower().value_counts()

location                           35
recommendation                     23
recommedation                       8
confort                             7
availability                        4
reconmmendation                     3
facilities                          3

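Use .groupby() on a partial string (the first four characters of each reason) together with .transform() to find the sum for each group. Here df holds the counts above in two columns, reason and count:
df['groupcount'] = df.groupby(df.reason.str[0:4])['count'].transform('sum')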



            reason  count  groupcount
0         location     35          35
1   recommendation     23          34
2    recommedation      8          34
3          confort      7           7
4     availability      4           4
5  reconmmendation      3          34
6       facilities      3           3
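
The answer's df has two columns, reason and count, while the question starts from the raw survey column. Below is a minimal sketch of how the two might be connected. It uses a small made-up sample in place of the real data: the raw frame and its values are hypothetical, and the column names reason and count follow the output above.

```python
import pandas as pd

# Hypothetical sample standing in for the real survey data.
raw = pd.DataFrame({
    "Why you choose us": ["Location", "location", "Recommendation",
                          "recommedation", "reconmmendation", "Confort"],
})

# Turn the value_counts() result into a two-column table: reason, count.
df = (
    raw["Why you choose us"].str.lower().value_counts()
       .rename_axis("reason")
       .reset_index(name="count")
)

# Sum the counts of all rows whose reason shares the same first four
# characters, and write that total onto every row of the group.
df["groupcount"] = df.groupby(df["reason"].str[0:4])["count"].transform("sum")
print(df)
```

Because .transform('sum') returns one value per input row, the result can be assigned straight back as a new column without any merge step.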

If needed, you can view the full string and the partial string side by side. Try:
df = df.assign(groupname=df.reason.str[0:4])
df['groupcount'] = df.groupby(df.reason.str[0:4])['count'].transform('sum')
print(df)


            reason  count groupname  groupcount
0         location     35      loca          35
1   recommendation     23      reco          34
2    recommedation      8      reco          34
3          confort      7      conf           7
4     availability      4      avai           4
5  reconmmendation      3      reco          34
6       facilities      3      faci           3
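
If the goal is one row per merged reason rather than a repeated group total, the same four-character key can be used to collapse the table. The sketch below is one possible way, not part of the original answer. It assumes that the most frequent spelling in each group is the one to keep as the label, and the names key and merged are made up for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output in the question.
df = pd.DataFrame({
    "reason": ["location", "recommendation", "recommedation", "confort",
               "availability", "reconmmendation", "facilities"],
    "count": [35, 23, 8, 7, 4, 3, 3],
})

merged = (
    df.assign(key=df["reason"].str[0:4])       # hypothetical helper column
      .sort_values("count", ascending=False)   # most frequent spelling first
      .groupby("key", sort=False)
      .agg(reason=("reason", "first"),         # keep the most frequent spelling
           count=("count", "sum"))             # add up all variants
      .reset_index(drop=True)
)
print(merged)
```

Note that this keeps "confort" as a label, because the prefix rule only groups spellings and does not correct them.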

If a single row contains multiple items, as in the CSV, then:
import pandas as pd

# Read the CSV
df = pd.read_csv(r'path')
# Lower-case 'Why you choose us', fill missing values, and split each row into a list of reasons
df['Why you choose us'] = df['Why you choose us'].str.lower().fillna('no comment given').str.split(',')
# Explode so that each reason is in its own row, with all the other attributes intact
df = df.explode('Why you choose us')
# Strip the whitespace around each value, then count the occurrences
print(df['Why you choose us'].str.strip().value_counts())

location            48
no comment given    34
recommendation      25
confort              8
facilities           8
recommedation        8
price                7
availability         6
reputation           5
reconmmendation      3
internet             3
ac                   3
breakfast            3
tranquility          2
cleanliness          2
aveilable            1
costumer service     1
pool                 1
comfort              1
search engine        1
Name: group, dtype: int64
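
The split-and-explode steps above can be tried without the CSV file. This is a minimal sketch on a made-up frame: the guest column and all of the values are hypothetical, and only the 'Why you choose us' column name comes from the question.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: some rows hold several comma-separated
# reasons, and one row has no answer at all.
df = pd.DataFrame({
    "guest": [1, 2, 3, 4],
    "Why you choose us": ["Location, Price", "recommendation", None, "location,Pool"],
})

# Lower-case, fill missing answers, and split each cell into a list.
df["Why you choose us"] = (
    df["Why you choose us"].str.lower().fillna("no comment given").str.split(",")
)

# One row per reason; the other columns are repeated for each new row.
df = df.explode("Why you choose us")

# Remove the spaces left over from splitting on ",".
df["Why you choose us"] = df["Why you choose us"].str.strip()

counts = df["Why you choose us"].value_counts()
print(counts)
```

Storing the stripped values back into the column, instead of stripping only at print time, keeps later group-by steps from treating " price" and "price" as different reasons.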

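Do you have a way to determine whether words are similar, or are you asking for an algorithm that determines word similarity? This is not a machine learning question; please do not spam unrelated tags (removed).

@Craig I am asking for an algorithm that merges similar entry names (strings), so that, for example, all entries similar to the word "recommendation" are counted together as a single number.

OK, but how does it find the similar words?

It finds similar words by slicing the first 4 characters of the column I called "reason". That is the expression df.reason.str[0:4].

OK, so this is a basic proposal for computing word similarity: words are similar when their first 4 characters are the same.

In this case, slicing 4 characters off the string gives the substring needed for a unique classification. See my edit.

Does this help?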
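
The four-character rule discussed in these comments cannot merge spellings that differ inside the prefix, such as "confort" and "comfort" in the output above. One possible alternative, which is not what the answer uses, is the standard library's difflib.get_close_matches. The sketch below assumes a hand-written list of canonical spellings; that list, the normalise helper and the 0.6 cutoff are illustrative choices, not something taken from the thread.

```python
import difflib

import pandas as pd

# Hypothetical list of canonical spellings; not part of the original question.
CANONICAL = ["location", "recommendation", "comfort", "availability", "facilities"]


def normalise(value, choices=CANONICAL, cutoff=0.6):
    """Return the closest canonical spelling, or the value unchanged if none is close enough."""
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value


reasons = pd.Series(["location", "recommedation", "reconmmendation",
                     "confort", "comfort", "pool"])
merged = reasons.map(normalise)
print(merged.value_counts())
```

Values with no sufficiently close canonical spelling, such as "pool" here, are left as they are, so the cutoff decides how aggressive the merging is.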