Python: how to convert words into sentence strings (text classification)
I'm currently working with the Brown Corpus and have a small problem. To apply tokenization features, I first need to convert the Brown Corpus into sentences. This is what I have so far:
from nltk.corpus import brown
import nltk

# File IDs that count as the positive class
target_text = [s for s in brown.fileids()
               if s.startswith('ca01') or s.startswith('ca02')]

# All file IDs under consideration (positive and negative)
total_text = [s for s in brown.fileids()
              if s.startswith('ca01') or s.startswith('ca02') or s.startswith('cp01') or s.startswith('cp02')]

data = []
for text in total_text:
    if text in target_text:
        tag = "pos"
    else:
        tag = "neg"
    # Read the sentences of this one file, not of every file in total_text
    words = list(brown.sents(text))
    data.extend([(tag, word) for word in words])

data
When I do this, I get data that looks like this:
[('pos',
  ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
   'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
   'produced', '``', 'no', 'evidence', "''", 'that', 'any',
   'irregularities', 'took', 'place', '.']),
 ('pos',
  ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments',
   'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had',
   'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves',
   'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta',
   "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was',
   'conducted', '.']),
 ...]
What I need is something like this:
[('pos', "The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election ...."), ('pos', 'The jury further said in term-end presentments that the City...')]
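Is there a way to solve this? This project is taking longer than I expected.

According to the documentation, the .sents method returns a list (the document) of lists (sentences) of strings (words), so there is nothing wrong with your call.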
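To make that nesting concrete, here is a minimal sketch that uses made-up data in place of the corpus. The dictionary `fake_sents` and its contents are assumptions for illustration only; with NLTK installed, `brown.sents(fileid)` would supply the per-file sentence lists instead.

```python
# Stand-in for brown.sents(fileid): each file ID maps to a list of sentences,
# and each sentence is a list of word tokens. The data here is invented.
fake_sents = {
    "ca01": [["The", "jury", "said", "."], ["It", "met", "Friday", "."]],
    "cp01": [["They", "neither", "liked", "nor", "disliked", "him", "."]],
}
target = {"ca01"}  # file IDs treated as the positive class

data = []
for fileid, sentences in fake_sents.items():
    tag = "pos" if fileid in target else "neg"
    # One (tag, sentence) pair per sentence; each sentence is still a word list
    data.extend((tag, sentence) for sentence in sentences)

for tag, sentence in data:
    print(tag, sentence)
```

Each entry pairs a tag with a list of words, which is why the output is a list of tokens and not a string.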
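If you want to reassemble the sentences, you can try joining the words with spaces. Because of punctuation, though, this does not really work:
data.extend([(tag, ' '.join(word)) for word in words])
You will get fragments like this: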
'the', 'election', ',', '``', 'deserves', 'the',
which maps to:
the election , `` deserves the
because join knows nothing about punctuation. Does nltk include some kind of punctuation-aware formatter?
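On that last question, newer NLTK releases include `TreebankWordDetokenizer` in `nltk.tokenize.treebank`, which may be worth trying. Independently of that, a rough regex-based approach can be sketched in plain Python. The helper name `detokenize` and its rules are assumptions for illustration, not part of NLTK; they only handle Brown-style quote tokens and common closing punctuation.

```python
import re

def detokenize(tokens):
    """Join Brown-style tokens into one sentence string (hypothetical helper)."""
    text = " ".join(tokens)
    # Brown marks opening quotes as `` and closing quotes as ''.
    text = re.sub(r"``\s*", '"', text)   # attach the opening quote to the next word
    text = re.sub(r"\s*''", '"', text)   # attach the closing quote to the previous word
    # Drop the space that join() placed before closing punctuation.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text

tokens = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
          'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
          'produced', '``', 'no', 'evidence', "''", 'that', 'any',
          'irregularities', 'took', 'place', '.']
sentence = detokenize(tokens)
print(sentence)
```

In the loop from the question, this would replace the plain join, for example `data.extend([(tag, detokenize(word)) for word in words])`. Brackets, dashes and contractions split into separate tokens are not covered and would need extra rules.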