Python: exploding a column
I have the following dataset:
Date Text
2020/05/12 Include details about your goal
2020/05/12 Describe expected and actual results
2020/05/13 Include any error messages
2020/05/13 The community is here to help you
2020/05/14 Avoid asking opinion-based questions.
I removed punctuation, stop words, and so on, in preparation for the explode:
import re
import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

stop_words = stopwords.words('english')
# punctuation to remove
punctuation = string.punctuation.replace("'", '')  # don't remove apostrophes from strings
punc = r'[{}]'.format(re.escape(punctuation))  # escape so characters such as ] and \ stay literal inside the class
df.Text = df.Text.str.replace(r'\d+', '', regex=True)  # remove numbers
df.Text = df.Text.str.replace(punc, ' ', regex=True)  # remove punctuation except apostrophes
df.Text = df.Text.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into one space
df.Text = df.Text.str.strip()  # remove whitespace from beginning and end of string
df.Text = df.Text.str.lower()  # convert all to lowercase
df.dropna(inplace=True)
df.Text = df.Text.apply(lambda x: [word for word in x.split() if word not in stop_words])  # drop stop words, leaving a list of tokens
However, this works only for the first row, not for all rows.
The next step is:
df_1 = df.explode('Text')
Can you tell me what is wrong?
The first row is split as follows:
Text New_Text (to show the difference after cleaning the text)
Include details about your goal ['include','details','goal']
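I do not get the other rows (so nothing for the rows beginning with "Describe…" or "Avoid…").
My dataset has 1942 rows, but only 1 row is returned after cleaning the text.
Update:
Example output: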
Date Text
2020/05/12 Include
2020/05/12 details
2020/05/12 goal
.... ...
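For reference, here is a minimal sketch of the whole pipeline (clean, tokenize, explode) on the sample data. It is not the asker's exact code: the small `stop_words` set is a hand-written stand-in for `stopwords.words('english')` so the snippet runs without NLTK, and `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string otherwise.

```python
import re
import string

import pandas as pd

# Hand-written stop-word set: an assumption standing in for
# nltk's stopwords.words('english') so the sketch needs no downloads.
stop_words = {"about", "your", "and", "any", "the", "is", "here", "to", "you"}

df = pd.DataFrame({
    "Date": ["2020/05/12", "2020/05/12", "2020/05/13", "2020/05/13", "2020/05/14"],
    "Text": [
        "Include details about your goal",
        "Describe expected and actual results",
        "Include any error messages",
        "The community is here to help you",
        "Avoid asking opinion-based questions.",
    ],
})

# Character class matching every punctuation mark except the apostrophe.
punc = "[{}]".format(re.escape(string.punctuation.replace("'", "")))

text = (
    df["Text"]
    .str.replace(r"\d+", "", regex=True)   # remove numbers
    .str.replace(punc, " ", regex=True)    # punctuation -> space
    .str.replace(r"\s+", " ", regex=True)  # collapse whitespace
    .str.strip()
    .str.lower()
)

# One list of tokens per row, then one output row per token.
df["Text"] = text.apply(lambda s: [w for w in s.split() if w not in stop_words])
df_1 = df.explode("Text").reset_index(drop=True)
print(df_1)
```

Every input row contributes tokens here, so if only one row survives on the real data, comparing `len(df)` before and after each step is a reasonable way to find where the rows are lost.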
Attempted fix (it does not work, but it should):
I think the code below should give me this result:
(pd.melt(test.Text.apply(pd.Series).reset_index(),
id_vars=['Date'],
value_name='Text')
.set_index(['Date'])
.drop('variable', axis=1)
.dropna()
.sort_index()
)
To turn the date into the index, do the following:
test=test.set_index(['Date'])
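Putting the two snippets above together, a sketch under the assumption that `test.Text` already holds lists of tokens (the three rows below are made up from the sample data). The order matters: `Date` has to be the index before `apply(pd.Series)`, otherwise `reset_index()` produces a column called `index` and `id_vars=['Date']` is not found.

```python
import pandas as pd

# Assumed input: Text already tokenized into lists.
test = pd.DataFrame({
    "Date": ["2020/05/12", "2020/05/12", "2020/05/13"],
    "Text": [
        ["include", "details", "goal"],
        ["describe", "expected", "actual", "results"],
        ["include", "error", "messages"],
    ],
})

# Make Date the index first so it survives apply(pd.Series).
test = test.set_index("Date")

# One column per token position; shorter rows are padded with NaN.
wide = test["Text"].apply(pd.Series)

result = (
    pd.melt(wide.reset_index(), id_vars=["Date"], value_name="Text")
    .set_index("Date")
    .drop("variable", axis=1)
    .dropna()          # drop the NaN padding
    .sort_index()
)
print(result)
```

On pandas 0.25 or later, `test.explode('Text')` should give the same rows with less work; the melt version is shown only because it is the approach tried above.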
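The code was revised again after the question was updated. It now produces your desired output, with the Date column and the words expanded vertically.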
import pandas as pd
import numpy as np
import io
data = '''
Date Text
2020/05/12 "Include details about your goal"
2020/05/12 "Describe expected and actual results"
2020/05/13 "Include any error messages"
2020/05/13 "The community is here to help you"
2020/05/14 "Avoid asking opinion-based questions."
'''
test = pd.read_csv(io.StringIO(data), sep=r'\s+')
test.set_index('Date',inplace=True)
expand_df = test['Text'].str.split(' ', expand=True)
expand_df.reset_index(inplace=True)
expand_df = pd.melt(expand_df, id_vars='Date', value_name='text')  # melt every word column; the longest sentence has 7 words
expand_df.dropna(axis=0, inplace=True)
expand_df = expand_df[['Date', 'text']]
expand_df
Date text
0 2020/05/12 Include
1 2020/05/12 Describe
2 2020/05/13 Include
3 2020/05/13 The
4 2020/05/14 Avoid
5 2020/05/12 details
6 2020/05/12 expected
7 2020/05/13 any
8 2020/05/13 community
9 2020/05/14 asking
10 2020/05/12 about
11 2020/05/12 and
12 2020/05/13 error
13 2020/05/13 is
14 2020/05/14 opinion-based
15 2020/05/12 your
16 2020/05/12 actual
17 2020/05/13 messages
18 2020/05/13 here
19 2020/05/14 questions.
20 2020/05/12 goal
21 2020/05/12 results
23 2020/05/13 to
28 2020/05/13 help
33 2020/05/13 you
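Note that the output above lists all first words, then all second words, and so on, because `melt` stacks the columns one after another. If the words should stay in sentence order, one option is to carry the row number and word position through the melt and sort on them. This is only a sketch: `row` and `pos` are helper names introduced here, not part of the original answer, and only two of the sample rows are used.

```python
import pandas as pd

test = pd.DataFrame({
    "Date": ["2020/05/12", "2020/05/13"],
    "Text": ["Include details about your goal", "Include any error messages"],
})

# One column per word position (0, 1, 2, ...).
wide = test["Text"].str.split(" ", expand=True)
wide.insert(0, "Date", test["Date"])
wide.insert(1, "row", range(len(wide)))  # helper: original row number

long_df = (
    wide.melt(id_vars=["Date", "row"], var_name="pos", value_name="text")
    .dropna(subset=["text"])          # drop padding from shorter sentences
    .sort_values(["row", "pos"])      # restore sentence order
    .reset_index(drop=True)[["Date", "text"]]
)
print(long_df)
```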
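After cleaning df.Text I have tokens, for example ['include', 'details', 'goal']. The other rows look the same. Could you walk me through all the steps? (The ones I wrote in the question, just to make sure I apply df.Text.str.split at the right step.) Thanks.
It (the result of df.Text.str.split()) was pasted from the data at the top of the question. Without the data as it was before the transformation, I cannot show you the process.
The problem is that this works well for the first row but, unfortunately, not for the others.
How about adding a demonstration with the first string to the question?
Fixed the code.
Is the code you wrote the code that was added?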