Python 2.7: How do I use word_tokenize in nltk and keep the spaces?


As far as I know, the word_tokenize function in nltk takes a sentence as a string and returns a list of all its words:

>>> from nltk import word_tokenize, wordpunct_tokenize
>>> s = ("Good muffins cost $3.88\nin New York.  Please buy me\n"
...      "two of them.\n\nThanks.")
>>> word_tokenize(s) 
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

However, in my program it is important to keep the spaces for further computation, so I would like word_tokenize to return the result like this:

['Good', ' ', 'muffins', ' ', 'cost', ' ', '$', '3.88', ' ', 'in', ' ', 'New', ' ', 'York.', ' ', 'Please', ' ', 'buy', ' ', 'me', ' ', 'two', ' ', 'of', ' ', 'them', '.', 'Thanks', '.' ]

How can I change, replace, or adapt word_tokenize to achieve this?

You can do this in two steps:

Step 1: Split the string on whitespace.

Step 2: Run word_tokenize on each piece and put a space after it.

>>> import itertools
>>> from nltk import word_tokenize
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\n"
>>> # s.split() discards the original whitespace, so every gap becomes a single ' '
>>> ll = [[word_tokenize(w), ' '] for w in s.split()]
>>> list(itertools.chain(*list(itertools.chain(*ll))))
['Good', ' ', 'muffins', ' ', 'cost', ' ', '$', '3.88', ' ', 'in', ' ', 'New', ' ', 'York', '.', ' ', 'Please', ' ', 'buy', ' ', 'me', ' ']
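
Note that the two-step approach above turns every gap into a single space, including newlines and the double space after "York.", and leaves a trailing ' ' at the end. If the exact whitespace matters, one possible variation is to split with a capturing regex so the whitespace runs are kept as they are. This is only a sketch: tokenize_keep_spaces and simple_tokenize are hypothetical names, not part of nltk, and simple_tokenize is a rough stand-in so the example runs without nltk installed.

```python
import re


def tokenize_keep_spaces(text, tokenize):
    """Tokenize text, keeping each run of whitespace as its own list item.

    tokenize is any callable that maps a string to a list of tokens,
    for example nltk's word_tokenize. (Hypothetical helper, not an nltk API.)
    """
    result = []
    # The capturing group makes re.split return the whitespace runs as well.
    for piece in re.split(r"(\s+)", text):
        if not piece:
            # re.split yields '' at the edges when text starts or ends with whitespace.
            continue
        if piece.isspace():
            result.append(piece)
        else:
            result.extend(tokenize(piece))
    return result


def simple_tokenize(word):
    """Rough stand-in tokenizer: decimal numbers, word characters, single punctuation marks."""
    return re.findall(r"\d+\.\d+|\w+|[^\w\s]", word)


tokens = tokenize_keep_spaces("New York.  Please\n", simple_tokenize)
print(tokens)
```

With nltk installed, passing word_tokenize as the second argument in place of simple_tokenize should work the same way. Because each piece is tokenized on its own, the result may differ slightly from calling word_tokenize on the whole sentence.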