Python: How to convert words into sentence strings (text classification)


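I'm currently working with the Brown Corpus and have run into a small problem. To apply tokenization features, I first need to convert the Brown corpus into sentences. This is what I have so far: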

from nltk.corpus import brown
import nltk

# Files whose sentences get the "pos" label
target_text = [s for s in brown.fileids()
               if s.startswith(('ca01', 'ca02'))]

# All files to process: the "pos" files plus the "neg" ones
total_text = [s for s in brown.fileids()
              if s.startswith(('ca01', 'ca02', 'cp01', 'cp02'))]

data = []

for text in total_text:
    tag = "pos" if text in target_text else "neg"
    # Pass the current file id rather than the whole list, so each file's
    # sentences are added once and carry that file's tag
    words = list(brown.sents(text))
    data.extend([(tag, word) for word in words])

data
When I do this, I get data that looks like this:

[('pos',
  ['The',
   'Fulton',
   'County',
   'Grand',
   'Jury',
   'said',
   'Friday',
   'an',
   'investigation',
   'of',
   "Atlanta's",
   'recent',
   'primary',
   'election',
   'produced',
   '``',
   'no',
   'evidence',
   "''",
   'that',
   'any',
   'irregularities',
   'took',
   'place',
   '.']),
 ('pos',
  ['The',
   'jury',
   'further',
   'said',
   'in',
   'term-end',
   'presentments',
   'that',
   'the',
   'City',
   'Executive',
   'Committee',
   ',',
   'which',
   'had',
   'over-all',
   'charge',
   'of',
   'the',
   'election',
   ',',
   '``',
   'deserves',
   'the',
   'praise',
   'and',
   'thanks',
   'of',
   'the',
   'City',
   'of',
   'Atlanta',
   "''",
   'for',
   'the',
   'manner',
   'in',
   'which',
   'the',
   'election',
   'was',
   'conducted',
   '.'])
What I need is something like this:

[('pos', "The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election ...."), ('pos', 'The jury further said in term-end presentments that the City...')]
Is there a way to fix this? This project is taking longer than I expected.

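According to the documentation, the .sents method returns a list (document) of lists (sentences) of strings (words), so you are not doing anything wrong in your call.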
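
To make that nested shape concrete, here is a minimal sketch that uses a hand-written stand-in for the corpus rather than NLTK itself. The `fake_sents` dictionary and the `label_sentences` helper are hypothetical names made up for illustration; with the real corpus, a call such as `brown.sents(fileid)` would presumably take the place of the dictionary lookup.

```python
# Hypothetical stand-in for the corpus: each file id maps to a list of
# sentences, and each sentence is a list of word tokens (strings).
fake_sents = {
    "ca01": [["The", "jury", "said", "so", "."], ["It", "adjourned", "."]],
    "cp01": [["She", "smiled", "."]],
}


def label_sentences(sents_by_file, positive_ids):
    """Return (tag, token_list) pairs, one per sentence, tagged per file."""
    data = []
    for fileid, sents in sents_by_file.items():
        tag = "pos" if fileid in positive_ids else "neg"
        # One pair per sentence; the sentence is still a list of tokens here
        data.extend((tag, sent) for sent in sents)
    return data


labeled = label_sentences(fake_sents, positive_ids={"ca01"})
print(labeled[0])
```

Each element of the result is a tag paired with a list of tokens, which is the shape shown in the question's output.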

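If you want to reassemble the sentences, you can try joining the words with spaces. Because of the punctuation tokens, though, this does not quite work: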

data.extend( [(tag, ' '.join(word)) for word in words] )
That will take token sequences such as:

'the',
'election',
',',
'``',
'deserves',
'the',
and map them to:

the election , `` deserves the
because join doesn't know about punctuation. Does nltk include some kind of punctuation-aware formatter?
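
One way to approximate a punctuation-aware join without extra dependencies is to apply a few regular-expression substitutions after the plain join. The sketch below is an illustration under stated assumptions, not an NLTK API: `join_tokens` is a hypothetical helper, and its rules cover only the Brown-style `` and '' quote tokens plus common closing punctuation and brackets. NLTK also ships a `TreebankWordDetokenizer` class in `nltk.tokenize.treebank`, which may be worth trying on real data.

```python
import re


def join_tokens(tokens):
    """Join word tokens into one string, attaching punctuation to words.

    Hypothetical helper; the rules are a rough heuristic, not a full
    detokenizer.
    """
    text = " ".join(tokens)
    # Opening quote token `` becomes " and swallows the space after it
    text = re.sub(r"``\s*", '"', text)
    # Closing quote token '' becomes " and swallows the space before it
    text = re.sub(r"\s*''", '"', text)
    # Drop the space before closing punctuation and closing brackets
    text = re.sub(r"\s+([.,;:!?)\]])", r"\1", text)
    # Drop the space after opening brackets
    text = re.sub(r"([(\[])\s+", r"\1", text)
    return text


tokens = ["the", "election", ",", "``", "deserves", "the", "praise", "''",
          "for", "the", "manner", "."]
sentence = join_tokens(tokens)
print(sentence)
```

In the loop from the question, this would replace the plain join, for example `data.extend([(tag, join_tokens(word)) for word in words])`. Edge cases such as nested quotes or apostrophes split into separate tokens are not handled.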