Python tokenization based on whitespace and trailing punctuation?
I am trying to use a regular expression to split a string into a list based on whitespace or trailing punctuation. For example, what I want is:
['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
A rough solution using spaCy. It already does a good job of tokenizing words:
import spacy
s = 'hel-lo this has whi(.)te, space. very \n good'
nlp = spacy.load('en')  # 'en' is the spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'
ls = [t.text for t in nlp(s) if t.text.strip()]
>> ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
However, it also splits words joined by -, so I borrowed a solution that merges the words around each - back together:
merge = [(i-1, i+2) for i, s in enumerate(ls) if i >= 1 and s == '-']
for t in merge[::-1]:
    merged = ''.join(ls[t[0]:t[1]])
    ls[t[0]:t[1]] = [merged]
>> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
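The merge step does not depend on spaCy, so it can be tried on a plain list of tokens. Here is a minimal sketch of the same idea written as a single forward pass. The name `merge_hyphenated` is made up for illustration, and unlike the slice-based version it leaves a leading or trailing '-' untouched.

```python
def merge_hyphenated(tokens):
    """Rejoin 'left', '-', 'right' token triples into 'left-right'.

    Hypothetical helper for illustration; not part of spaCy.
    """
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # A hyphen that has a token on both sides is glued to its neighbours.
        if tok == '-' and merged and i + 1 < len(tokens):
            merged[-1] = merged[-1] + '-' + tokens[i + 1]
            i += 2  # skip the hyphen and the token just consumed
        else:
            merged.append(tok)
            i += 1
    return merged


tokens = ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
print(merge_hyphenated(tokens))
# ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
```

Chains such as 'a', '-', 'b', '-', 'c' collapse into a single 'a-b-c' token, because each merge extends the last item already in the output.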
You may need to adjust what counts as "punctuation". I am using Python 3.6.1:
import re
s = 'hel-lo this has whi(.)te, space. very \n good'
a = [] # this list stores the items
for i in s.split():  # split on whitespace
    j = re.split(r'([,.])$', i)  # split on your definition of trailing punctuation marks
    if len(j) > 1:
        a.extend(j[:-1])  # drop the empty string re.split leaves after the match
    else:
        a.append(i)
# a -> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
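If the definition of punctuation needs adjusting, it may help to wrap the loop in a function that takes the punctuation characters as a parameter. The following is only a sketch under some assumptions: the name `tokenize` and its `punctuation` argument are hypothetical, the argument must be a non-empty string, and only one trailing mark is peeled off each word.

```python
import re


def tokenize(text, punctuation=",."):
    """Split on whitespace, then peel one trailing punctuation mark off each word.

    Hypothetical helper; `punctuation` is assumed to be a non-empty string.
    """
    # re.escape keeps characters such as ']' or '^' literal inside the class.
    trailing = re.compile("([" + re.escape(punctuation) + "])$")
    tokens = []
    for word in text.split():
        # With a capturing group, split() also returns the matched mark;
        # the filter drops the empty strings left around the match.
        tokens.extend(part for part in trailing.split(word) if part)
    return tokens


s = 'hel-lo this has whi(.)te, space. very \n good'
print(tokenize(s))
# ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
```

Filtering out empty strings also covers a word that consists of nothing but a punctuation mark, which would otherwise add an empty token to the list.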
Are you also allowed to use other libraries, or do you want to use regular expressions only?
Yes, using any library is fine.