Python: how to convert words into a sentence string (text classification)

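I'm currently working with the Brown Corpus and have a small problem. To apply tokenization features, I first need to convert the Brown corpus into sentences. This is what I have so far: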

from nltk.corpus import brown
import nltk


target_text = [s for s in brown.fileids()
                   if s.startswith('ca01') or s.startswith('ca02')]

data = []

total_text = [s for s in brown.fileids()
                   if s.startswith('ca01') or s.startswith('ca02') or s.startswith('cp01') or s.startswith('cp02')]


for text in total_text:

    if text in target_text:
        tag = "pos"
    else:
        tag = "neg"
    words = list(brown.sents(text))  # sentences of the current file only, not of every file in total_text
    data.extend( [(tag, word) for word in words] )

data

When I do this, I get data that looks like this:

[('pos',
  ['The',
   'Fulton',
   'County',
   'Grand',
   'Jury',
   'said',
   'Friday',
   'an',
   'investigation',
   'of',
   "Atlanta's",
   'recent',
   'primary',
   'election',
   'produced',
   '``',
   'no',
   'evidence',
   "''",
   'that',
   'any',
   'irregularities',
   'took',
   'place',
   '.']),
 ('pos',
  ['The',
   'jury',
   'further',
   'said',
   'in',
   'term-end',
   'presentments',
   'that',
   'the',
   'City',
   'Executive',
   'Committee',
   ',',
   'which',
   'had',
   'over-all',
   'charge',
   'of',
   'the',
   'election',
   ',',
   '``',
   'deserves',
   'the',
   'praise',
   'and',
   'thanks',
   'of',
   'the',
   'City',
   'of',
   'Atlanta',
   "''",
   'for',
   'the',
   'manner',
   'in',
   'which',
   'the',
   'election',
   'was',
   'conducted',
   '.'])

What I need is something like this:

[('pos', "The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election ...."), ('pos', 'The jury further said in term-end presentments that the City...')]

Is there a way to solve this? This project is taking longer than I expected.

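According to the NLTK documentation, the .sents method returns a list (the document) of lists (the sentences) of strings (the words), so there is nothing wrong with your call.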
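To make that nesting concrete, here is a minimal sketch of the structure. It assumes a small hard-coded sample_doc standing in for what brown.sents(fileid) returns, so it runs without the corpus; label_sentences is a hypothetical helper name, not part of NLTK.

```python
# Hard-coded stand-in (an assumption) for the result of brown.sents(fileid):
# a document is a list of sentences, each sentence a list of tokens.
sample_doc = [
    ['The', 'jury', 'said', 'Friday', '.'],
    ['The', 'election', 'was', 'conducted', '.'],
]


def label_sentences(tag, sentences):
    """Pair every tokenized sentence with the same class tag."""
    return [(tag, sentence) for sentence in sentences]


labelled = label_sentences('pos', sample_doc)
print(labelled[0])
```

Each tuple still carries a list of tokens rather than a string, which is why the data in the question looks the way it does.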

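If you want to reassemble the sentences, you could try joining the words with spaces. However, this does not really work because of the punctuation: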

data.extend( [(tag, ' '.join(word)) for word in words] )

You will get something like this:

'the',
'election',
',',
'``',
'deserves',
'the',

which maps to:

the election , `` deserves the

because join knows nothing about punctuation.

Does nltk include some kind of punctuation-aware formatter?
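One way to work around the punctuation problem without relying on a library formatter is a small post-processing step after the join. The sketch below is only an illustration under stated assumptions: detokenize is a hypothetical helper, it handles just the Brown-style quote tokens `` and '' plus common closing punctuation, and it will not cover every case. Recent NLTK versions also provide TreebankWordDetokenizer in nltk.tokenize.treebank, whose output may differ from this sketch.

```python
import re


def detokenize(tokens):
    """Join Brown-style tokens into one sentence string (hypothetical helper)."""
    text = ' '.join(tokens)
    # An opening `` attaches to the next word, a closing '' to the previous one.
    text = text.replace('`` ', '"').replace(" ''", '"')
    # Drop the space that join() leaves in front of closing punctuation.
    text = re.sub(r'\s+([.,;:!?)])', r'\1', text)
    # Drop the space that join() leaves after an opening parenthesis.
    text = re.sub(r'\(\s+', '(', text)
    return text


tokens = ['the', 'election', ',', '``', 'deserves', 'the', 'praise', "''", '.']
print(detokenize(tokens))  # the election, "deserves the praise".
```

In the loop from the question, this could replace the plain join, for example data.extend([(tag, detokenize(word)) for word in words]).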