Python: extract only words

Currently, I have been using this function to extract only the valid words from plain English strings and Unicode strings:

s = """\"A must-read for the business leader of today and tomorrow."--John G. O'Neill, Vice President, 3M Canada. High Performance Sales Organizations defined the true nature of market-focused sales and service operations, and helped push sales organizations into the 21st century"""
t = 'Life is life (I want chocolate);&'
w = u'Tú te llamabas de niña Concepción Morales!!'

import re

def clean_words(text, separator=' '):
  # Python 2 code: the `unicode` type does not exist in Python 3
  if isinstance(text, unicode):
    return separator.join(re.findall(r'[\w]+', text, re.U)).rstrip()
  else:
    return re.sub(r'\W+', ' ', text).replace(' ', separator).rstrip()
It seems to have trouble with surnames and apostrophes. For s it returns:

 A must read for the business leader of today and tomorrow John G O Neill Vice President 3M Canada High Performance Sales Organizations defined the true nature of market focused sales and service operations and helped push sales organizations into the 21st century
When I tokenize it, I end up with single characters.

Any suggestions?
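
A minimal Python 3 illustration of why the name gets split, using only the standard library: `\w` matches letters, digits and the underscore, so the period after the initial and the apostrophe inside the surname both act as separators. The variable names here are illustrative and not part of the original function.

```python
import re

name = "John G. O'Neill"

# \w+ stops at anything that is not a letter, digit or underscore,
# so "G." loses its period and "O'Neill" breaks at the apostrophe.
tokens = re.findall(r"\w+", name)
print(tokens)  # ['John', 'G', 'O', 'Neill']

# Hyphenated words are split the same way.
hyphenated = re.findall(r"\w+", "must-read")
print(hyphenated)  # ['must', 'read']
```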

It looks like the Treebank tokenizer is what you want:

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(s)
#['``', 'A', 'must-read', 'for', 'the', 'business', 'leader', 'of',
# 'today', 'and', 'tomorrow.', "''", '--', 'John', 'G.', "O'Neill",
# ',', 'Vice', 'President', ',', '3M', 'Canada.', 'High', 
# 'Performance', 'Sales', 'Organizations', 'defined', 'the', 'true', 
# 'nature', 'of', 'market-focused', 'sales', 'and', 'service', 
# 'operations', ',', 'and', 'helped', 'push', 'sales', 
# 'organizations', 'into', 'the', '21st', 'century']
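
The Treebank output still contains pure punctuation tokens such as '``', ',' and '--'. If only words are wanted, one possible follow-up step is to drop every token that has no letter or digit in it. This is a sketch, not part of NLTK: `keep_words` is a hypothetical helper, and the sample list is a selection of tokens copied from the output above so the snippet runs without NLTK installed.

```python
def keep_words(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# A selection of tokens copied from the Treebank output above.
sample = ['``', 'A', 'must-read', 'for', 'tomorrow.', "''", '--',
          'John', 'G.', "O'Neill", ',', '3M']

words = keep_words(sample)
print(words)
# ['A', 'must-read', 'for', 'tomorrow.', 'John', 'G.', "O'Neill", '3M']
```

Note that tokens such as 'tomorrow.' and 'G.' keep their trailing period with this filter; whether to strip it depends on how abbreviations should be treated.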

Alternatively, you can use spaCy:

import spacy
# The 'en' shortcut works in spaCy 2.x; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')
s_tokenized = [t.text for t in nlp(s)]

# ['"', 'A', 'must', '-', 'read', 'for', 'the', 'business', 'leader', 'of',
#  'today', 'and', 'tomorrow', '."--', 'John', 'G.', "O'Neill", ',', 'Vice',
#  'President', ',', '3', 'M', 'Canada', '.', 'High', 'Performance', 'Sales',
#  'Organizations', 'defined', 'the', 'true', 'nature', 'of', 'market', '-',
#  'focused', 'sales', 'and', 'service', 'operations', ',', 'and', 'helped',
#  'push', 'sales', 'organizations', 'into', 'the', '21st', 'century']

Since you are already using NLTK, why not use nltk.WordPunctTokenizer() or one of the other standard tokenizers? WordPunctTokenizer seems to return similar results:

word_tokenizer.tokenize(s)
# ['"', 'A', 'must', '-', 'read', 'for', 'the', 'business', 'leader', 'of',
#  'today', 'and', 'tomorrow', '."--', 'John', 'G', '.', 'O', "'", 'Neill',
#  ',', 'Vice', 'President', ',', '3M', 'Canada', '.', 'High', 'Performance',
#  'Sales', 'Organizations', 'defined', 'the', 'true', 'nature', 'of',
#  'market', '-', 'focused', 'sales', 'and', 'service', 'operations', ',',
#  'and', 'helped', 'push', 'sales', 'organizations', 'into', 'the', '21st',
#  'century']
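
If adding a tokenizer library is not an option, a regex-only variant of the original function can keep apostrophes and hyphens that sit between word characters. This is a rough sketch for Python 3, where `\w` already matches Unicode letters in `str` patterns; `WORD_RE` and `extract_words` are names made up for this example, and the pattern is an assumption about what counts as a word rather than a replacement for a real tokenizer.

```python
import re

# One or more word characters, optionally followed by groups of
# (apostrophe or hyphen + more word characters).
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")

def extract_words(text, separator=' '):
    """Join the words found in text, keeping internal ' and - intact."""
    return separator.join(WORD_RE.findall(text))

english = extract_words("\"A must-read for tomorrow.\"--John G. O'Neill, 3M Canada.")
print(english)
# A must-read for tomorrow John G O'Neill 3M Canada

spanish = extract_words('Tú te llamabas de niña Concepción Morales!!')
print(spanish)
# Tú te llamabas de niña Concepción Morales
```

Initials such as "G." still come out as a single character, because the period is dropped.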