Python 3.x: How do I put POS tags into a DataFrame with a separate column for each tag?
I have tagged my input text with TextBlob and exported the result to a text file. It gives me three pieces of information: POS, Parse Chunker, and Deep Parsing. The tagged output has this format: technology: Plain/NNP/B-NP/O and/CC/I-NP/O. I want to arrange it in a DataFrame with a separate column for each piece of information. This is the code I am using:
import pandas as pd
import csv
from textblob import TextBlob

with open('report1to8_1.txt', 'r') as myfile:
    report = myfile.read().replace('\n', '')
out = TextBlob(report).parse()
tagS = 'taggedop.txt'
f = open('taggedop.txt', 'w')
f.write(str(out))
df = pd.DataFrame(columns=['Words', 'POS', 'Parse chunker', 'Deep Parsing'])
df = pd.read_csv('taggedop.txt', sep=' ', error_bad_lines=False,
                 quoting=csv.QUOTE_NONE)
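My expected result is a DataFrame like this:
However, this is what I currently get:
Please help.

Try this. The example walks you through getting the data into the right format so that you can create the DataFrame. You need to build a list of lists, where each inner list holds the fields for one word. Every inner list must have the same layout. Then you can create the DataFrame. Leave a comment if you need more help.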
from textblob import TextBlob as blob
import pandas as pd
from string import punctuation

def remove_punctuation(text):
    # Drop every ASCII punctuation character from the raw text
    return ''.join(c for c in text if c not in punctuation)

data = []
text = '''
He an thing rapid these after going drawn or.
Timed she his law the spoil round defer.
In surprise concerns informed betrayed he learning is ye.
Ignorant formerly so ye blessing. He as spoke avoid given downs money on we.
Of properly carriage shutters ye as wandered up repeated moreover.
Inquietude attachment if ye an solicitude to.
Remaining so continued concealed as knowledge happiness.
Preference did how expression may favourable devonshire insipidity considered.
An length design regret an hardly barton mr figure.
Those an equal point no years do. Depend warmth fat but her but played.
Shy and subjects wondered trifling pleasant.
Prudent cordial comfort do no on colonel as assured chicken.
Smart mrs day which begin. Snug do sold mr it if such.
Terminated uncommonly at at estimating.
Man behaviour met moonlight extremity acuteness direction. '''
text = remove_punctuation(text)
# Replace newlines with spaces so that words on adjacent lines do not fuse together
text = text.replace('\n', ' ')
text = blob(text).parse()
# Each space-separated token has the form word/POS/chunk/PNP
text = text.split(' ')
for tagged_word in text:
    t_word = tagged_word.split('/')
    data.append([t_word[0], t_word[1], t_word[2], t_word[3]])
df = pd.DataFrame(data, columns=['Words', 'POS', 'Parse Chunker', 'Deep Parsing'])
Result:
Out[18]:
     Words       POS   Parse Chunker  Deep Parsing
0    He          PRP   B-NP           O
1    an          DT    I-NP           O
2    thing       NN    I-NP           O
3    rapid       JJ    B-ADJP         O
4    these       DT    O              O
5    after       IN    B-PP           B-PNP
6    going       VBG   B-VP           I-PNP
7    drawn       VBN   I-VP           I-PNP
8    or          CC    O              O
9    Timed       NNP   B-NP           O
10   she         PRP   I-NP           O
11   his         PRP$  I-NP           O
12   law         NN    I-NP           O
13   the         DT    O              O
14   spoil       VB    B-VP           O
15   round       NN    B-NP           O
16   defer       VB    B-VP           O
17   In          IN    B-PP           B-PNP
18   surprise    NN    B-NP           I-PNP
19   concerns    NNS   I-NP           I-PNP
20   informed    VBN   B-VP           I-PNP
21   betrayed    VBN   I-VP           I-PNP
22   he          PRP   B-NP           I-PNP
23   learning    VBG   B-VP           I-PNP
24   is          VBZ   I-VP           O
25   ye          PRP   B-NP           O
26   Ignorant    NNP   I-NP           O
27   formerly    RB    I-NP           O
28   so          RB    I-NP           O
29   ye          PRP   I-NP           O
..   ...         ...   ...            ...
105  no          DT    O              O
106  on          IN    B-PP           B-PNP
107  colonel     NN    B-NP           I-PNP
108  as          IN    B-PP           B-PNP
109  assured     VBN   B-VP           I-PNP
110  chicken     NN    B-NP           I-PNP
111  Smart       NNP   I-NP           I-PNP
112  mrs         NNS   I-NP           I-PNP
113  day         NN    I-NP           I-PNP
114  which       WDT   O              O
115  begin       VB    B-VP           O
116  Snug        NNP   B-NP           O
117  do          VBP   B-VP           O
118  sold        VBN   I-VP           O
119  mr          NN    B-NP           O
120  it          PRP   I-NP           O
121  if          IN    B-PP           B-PNP
122  such        JJ    B-NP           I-PNP
123  Terminated  NNP   I-NP           I-PNP
124  uncommonly  RB    B-ADVP         O
125  at          IN    B-PP           B-PNP
126  at          IN    I-PP           I-PNP
127  estimating  VBG   B-VP           I-PNP
128  Man         NN    B-NP           I-PNP
129  behaviour   NN    I-NP           I-PNP
130  met         VBD   B-VP           O
131  moonlight   NN    B-NP           O
132  extremity   NN    I-NP           O
133  acuteness   NN    I-NP           O
134  direction   NN    I-NP           O
[135 rows x 4 columns]
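The loop in the answer indexes t_word[0] through t_word[3] directly, so it raises an IndexError as soon as one token has fewer than four fields. The sketch below is one possible guard, not part of the original answer: the helper name tagged_to_frame, the COLUMNS constant, and the malformed token "stray" in the sample are made up for illustration. It also calls str() on the input first, on the assumption that the object returned by parse() may behave differently from a plain string when split() is called without arguments.

```python
import pandas as pd

# Column names taken from the answer above
COLUMNS = ['Words', 'POS', 'Parse Chunker', 'Deep Parsing']

def tagged_to_frame(tagged):
    """Build a DataFrame from 'word/POS/chunk/PNP' tokens (hypothetical helper)."""
    rows = []
    # str() yields a plain string, so split() here always means whitespace splitting
    for token in str(tagged).split():
        parts = token.split('/')
        # Skip tokens that do not have exactly four slash-separated fields
        if len(parts) != len(COLUMNS):
            continue
        rows.append(parts)
    return pd.DataFrame(rows, columns=COLUMNS)

# Sample in the format quoted in the question, plus one malformed token
sample = 'Plain/NNP/B-NP/O and/CC/I-NP/O stray'
df = tagged_to_frame(sample)
print(df)
```

With real data you would pass the result of blob(text).parse() instead of the sample string. Skipping malformed tokens silently is a design choice; collecting them in a separate list may be better if you need to know what was dropped.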
It works when I use text = blob(text).tags. However, I am using text = blob(text).parse() because I also need the other tags. So please show me the same example with parse() instead of tags, since it converts to text.taggedString rather than a list.

I have edited the code to do this. I hope it helps. I should also point out that if your data contains no punctuation or newlines, you don't need to run remove_punctuation() or the text.replace call that strips the newlines.
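The question writes the parse() output to taggedop.txt and then reads it back with pd.read_csv(sep=' '). That call treats every space-separated token on a line as a column of a single row, which is probably not the layout wanted here. Below is a minimal sketch of reading such a file into one row per word instead. It assumes the file contains only whitespace-separated word/POS/chunk/PNP tokens; the function name read_tagged_file is hypothetical, and the temporary file stands in for the real taggedop.txt.

```python
import tempfile
from pathlib import Path

import pandas as pd

COLUMNS = ['Words', 'POS', 'Parse Chunker', 'Deep Parsing']

def read_tagged_file(path):
    """Read whitespace-separated 'word/POS/chunk/PNP' tokens into a DataFrame."""
    raw = Path(path).read_text(encoding='utf-8')
    tokens = pd.Series(raw.split(), dtype='object')
    # Keep only tokens with exactly three slashes, i.e. four fields
    tokens = tokens[tokens.str.count('/') == 3]
    if tokens.empty:
        return pd.DataFrame(columns=COLUMNS)
    # expand=True spreads the four fields over four columns
    df = tokens.str.split('/', expand=True)
    df.columns = COLUMNS
    return df.reset_index(drop=True)

# Demonstration with a temporary file in place of the real taggedop.txt
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'taggedop.txt'
    path.write_text('Plain/NNP/B-NP/O and/CC/I-NP/O\n', encoding='utf-8')
    df = read_tagged_file(path)

print(df)
```

If you write the file yourself, use a with block or close it before reading it back, so that the content is flushed to disk first.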