Tokenizing \n and \t characters in a Python string

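I'm trying to tokenize a sentence in Python using nltk, but I also want the \n and \t characters to come out as tokens.

For example:

In: "This is a\n test"

Out: ['This', 'is', 'a', '\n', 'test']

Is there a directly supported way to do this?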

You can do this with a regular expression and re.findall.

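The idea is to first split on single spaces and then apply findall to each element of the resulting list. The pattern

[^\t\n]+|[\t\n]+

matches either a run of one or more characters that are not tabs or newlines, or a run of one or more tabs and newlines. Consecutive tabs and newlines therefore end up grouped into one token. If you want each tab and each newline to be its own token, drop the + from the second alternative and change the pattern to: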

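import re

text = "This is a\n test\n\nwith\t\talso"

# Match a run of non-tab/non-newline characters, or a single tab or newline.
pattern = re.compile(r'[^\t\n]+|[\t\n]')

# Split on single spaces, tokenize each piece, then flatten the results.
output = [val for values in map(pattern.findall, text.split(' ')) for val in values]
print(output)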

Output

['This', 'is', 'a', '\n', 'test', '\n', '\n', 'with', '\t', '\t', 'also']
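
To see how the two patterns differ, here is a minimal sketch that runs both on the same input. The helper name tokenize_keep_whitespace and its group_runs flag are made up for illustration; they are not part of nltk or of the answer above.

```python
import re


# Hypothetical helper for illustration only (not an nltk function).
def tokenize_keep_whitespace(text, group_runs=True):
    """Split on single spaces, keeping tabs and newlines as tokens."""
    # group_runs=True: runs of tabs/newlines become one token ([\t\n]+).
    # group_runs=False: every tab or newline is its own token ([\t\n]).
    if group_runs:
        pattern = re.compile(r'[^\t\n]+|[\t\n]+')
    else:
        pattern = re.compile(r'[^\t\n]+|[\t\n]')
    tokens = []
    for piece in text.split(' '):
        # findall returns [] for an empty piece, so nothing is added for it.
        tokens.extend(pattern.findall(piece))
    return tokens


text = "This is a\n test\n\nwith\t\talso"
grouped = tokenize_keep_whitespace(text, group_runs=True)
separate = tokenize_keep_whitespace(text, group_runs=False)
print(grouped)   # ['This', 'is', 'a', '\n', 'test', '\n\n', 'with', '\t\t', 'also']
print(separate)  # ['This', 'is', 'a', '\n', 'test', '\n', '\n', 'with', '\t', '\t', 'also']
```

Because the split is on a single space, consecutive spaces produce empty strings, and findall simply returns no matches for those, so no empty tokens appear in the result.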