Python: Creating multiple word-count lists from a DataFrame and exporting them to multiple Excel worksheets

python pandas dataframe machine-learning

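Hoping someone can help me out. I'm running K-Means clustering on some text data. After getting the different cluster groups in a pandas DataFrame, I would like to create a word-count list from the text in the "Processed_Data" column for each cluster group the model assigns. After creating each list, I would like to export it to its own worksheet within a single Excel file. For this particular code I have 17 clusters, so I want to export 17 word-count lists to 17 worksheets in one file.

I have separately managed to export each cluster's data to its own sheet, and to create a word-count list for a single cluster, but neither approach has worked when looping over every cluster group.

Sample data: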

|SN |Processed_Data                 |cluster    |
-------------------------------------------------
|123|hello world good bye world     |    01     |
|111|hello world                    |    01     |
|222|good bye world                 |    02     |
|555|world great                    |    02     |
|543|an african or european swallow?|    03     |
|777|what do you mean?              |    03     |

I would like the results to go into individual Excel sheets, one per cluster number:

cluster 01:
| word | freq|
---------------
|world |  3  |
|hello |  2  |
|good  |  1  |
|bye   |  1  |

cluster 02: 
| word | freq|
--------------
|world |  2  |
|great |  1  |
|good  |  1  |
|bye   |  1  |

etc. for each cluster...

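Here is the code I tried, but it doesn't seem to work for me. I'm not showing all of the preprocessing code, such as lowercasing and removing stop words and punctuation, because I had no problems with it and it would only make the post longer.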

true_k = 17
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=300, n_init=15)
model.fit(X)
labels = model.labels_
data_clusters = pd.DataFrame(list(zip(df['SN'], df['Processed_data'], labels)), columns=['SN', 'Processed_data', 'cluster'])
data_clusters = data_clusters.sort_values(by=['cluster'])
data_clusters['cluster'] = data_clusters['cluster'].astype(str)
uniques = data_clusters['cluster'].unique()
with pd.ExcelWriter('cluster_test.xlsx') as writer:
    for cluster in uniques:
        a = data_clusters.loc[data_clusters['cluster'] == cluster][['Processed_data']].str.cat(sep=' ')
        words = nltk.tokenize.word_tokenize(a)
        word_dist = nltk.FreqDist(words)
        rslt = dict((word, freq) for word, freq in freq_dist.items() if not word.isdigit())
        rslt = pd.DataFrame(list(word_dist.items()),
                            columns=['Word', 'Freq'])
        rslt = rslt.sort_values(by=['Freq'], ascending=False)
        rslt['Cluster'] = cluster
        rslt.to_excel(writer, index=None, sheet_name=cluster)

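One small heads-up: I had to change the cluster column to strings using

data_clusters['cluster'] = data_clusters['cluster'].astype(str)

so that the Excel writer could name the worksheets after the cluster numbers. It had trouble naming worksheets with integers. I wonder whether that is part of the problem.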
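On the worksheet-naming concern: the pandas documentation lists `sheet_name` as a string, so converting the labels first looks like a reasonable step rather than an obvious cause of the failure. One side effect worth knowing is that `astype(str)` turns the integer label 1 into '1', not the '01' shown in the sample data. A minimal sketch of zero-padded names follows; `sheet_name_for` and its `width` parameter are hypothetical names introduced only for illustration.

```python
def sheet_name_for(label, width=2):
    """Return a zero-padded string label, e.g. 1 -> '01'."""
    # int() also accepts NumPy integers and digit strings such as '3'
    return str(int(label)).zfill(width)


labels = [0, 1, 16]  # stand-in for integer labels such as model.labels_
print([sheet_name_for(x) for x in labels])
```

Padding keeps the names the same length, so 17 sheets read in numeric order when the names are compared as text.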

Here is one solution:

import openpyxl
import pandas as pd

df = pd.DataFrame(
    {
        'SN': [123, 111, 222, 555, 543, 777],
        'Processed_Data': ['hello world good bye world', 'hello world', 'good bye world',
                           'world great', 'an african or european swallow?', 'what do you mean?'],
        'cluster': ['01', '01', '02', '02', '03', '03'],
    })

# One row per cluster; column 1 holds that cluster's texts
df1 = pd.DataFrame(df.groupby("cluster")["Processed_Data"])

wb = openpyxl.Workbook()
wb.save('Cluster.xlsx')  # Create an Excel file to append to

for index, row in df1.iterrows():
    print(index)
    temp_list = row[1].str.split(' ').tolist()
    # Flatten the per-row word lists into a single list for the cluster
    flat_temp_list = [item for sublist in temp_list for item in sublist]
    temp_df = pd.DataFrame({'words': flat_temp_list})
    temp_df = temp_df.groupby(["words"])["words"].count().reset_index(name="freq")
    # Append one sheet per cluster to the existing file
    with pd.ExcelWriter('Cluster.xlsx', engine="openpyxl", mode="a") as writer:
        temp_df.to_excel(writer, sheet_name='Sheet' + str(index))

Your Excel sheets will look like this:

    words     freq
0   bye       1
1   good      1
2   hello     2
3   world     3

    words     freq
0   bye       1
1   good      1
2   great     1
3   world     2

    words     freq
0   african   1
1   an        1
2   do        1
3   european  1
4   mean?     1
5   or        1
6   swallow?  1
7   what      1
8   you       1
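The sheets above are sorted alphabetically, while the question asks for the most frequent words first. A minimal sketch of that ordering, using the standard library's `collections.Counter` in place of the pandas `groupby` count; `word_freq` is a hypothetical helper name, and plain whitespace splitting assumes the text has already been cleaned of punctuation.

```python
from collections import Counter


def word_freq(texts):
    """Count whitespace-separated words across texts, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # most_common() orders by count, highest first
    return counts.most_common()


cluster_01 = ["hello world good bye world", "hello world"]
for word, freq in word_freq(cluster_01):
    print(word, freq)
```

For the two cluster 01 sample rows this gives world 3, hello 2, then good and bye with 1 each. In the pandas version, the same ordering can be had by calling `sort_values("freq", ascending=False)` on `temp_df` before writing it.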

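Thanks to Inputvector's help, I was able to solve this. I took their code and just moved a few things around so that each list is exported to its own worksheet in a single Excel file.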

df1 = pd.DataFrame(data_clusters.groupby("cluster")["Processed_data"])
with pd.ExcelWriter('cluster_test_4.xlsx') as writer:
    for index, row in df1.iterrows():
        temp_list = row[1].str.split(' ').tolist()
        flat_temp_list = [item for sublist in temp_list for item in sublist]
        temp_df = pd.DataFrame({'words': flat_temp_list})
        temp_df = temp_df.groupby(["words"])["words"].count().reset_index(name="freq")
        temp_df.to_excel(writer, index=None, sheet_name=str(index))
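The same idea can be sketched more compactly by iterating over the `groupby` result directly and letting `value_counts` do both the counting and the descending sort. This is an illustration under stated assumptions, not the code either poster used: `cluster_word_counts` and `write_cluster_sheets` are hypothetical names, the column names default to those in the sample data, and writing `.xlsx` files assumes an Excel engine such as openpyxl is installed.

```python
import pandas as pd


def cluster_word_counts(df, text_col="Processed_Data", cluster_col="cluster"):
    """Map each cluster label to a word/freq DataFrame, highest count first."""
    tables = {}
    for label, texts in df.groupby(cluster_col)[text_col]:
        # split() breaks on whitespace; explode() yields one word per row
        words = texts.str.split().explode()
        # value_counts() counts the words and sorts by count, descending
        counts = words.value_counts()
        tables[str(label)] = counts.rename_axis("word").reset_index(name="freq")
    return tables


def write_cluster_sheets(tables, path):
    """Write each table to its own worksheet, named after the cluster label."""
    with pd.ExcelWriter(path) as writer:
        for label, table in tables.items():
            table.to_excel(writer, sheet_name=label, index=False)
```

Calling `write_cluster_sheets(cluster_word_counts(data_clusters, text_col="Processed_data"), "cluster_test.xlsx")` would then produce one sheet per cluster. Unlike `str(index)` in the loop above, which names sheets by row position, this names each sheet after the cluster label itself.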

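Thanks! I was able to take what you did and work it out. I needed each list to go into the same Excel file but on a different worksheet within that file, so I just changed a few things. I edited my question to show the answer.

Do you mean the same Excel file with different worksheets? If so, I've updated the code.

That's right. It worked. Thank you very much.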