Python re.sub and re.split fail to split words in a long article


python regex python-2.7


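I'm trying to build a list of words from an HTML document stored on disk. When I try to split the words and add them to my word vector, the result is a mess.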

 def get_word_vector(self):
  line = self.soup.get_text()
  re.sub("s/(\\u[a-e0-9][a-e0-9][a-e0-9]//|\\n)","",line)
  for word in line.split("\s+"):
   for the_word in word.split("[,.\"\\/?!@#$%^&*\{\}\[\]]+"):
    if the_word not in self.word_vector:
     self.word_vector[the_word]=0
    self.word_vector[the_word]+=1
    self.doc_length=self.doc_length+1
  for keys in self.word_vector:
   print "%r: %r" % (keys, self.word_vector[keys]) #So I can see whats happening

When I test it on a wiki page, I get (small sample):

as a single "word". The document is read into BS4 like this:

  self.soup = BeautifulSoup(open(fullpath,"r"))

I don't understand why this happens. I'm guessing the regex fails because it's wrong?

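Just an alternative option: get the text with get_text(), then use nltk to get the list of words from the text. The point here is not to reinvent the wheel, but to use specialized tools for specific jobs: BeautifulSoup for HTML parsing, nltk for text processing: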

from urllib2 import urlopen
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer

soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/Stack_Overflow'))
tokenizer = RegexpTokenizer(r'\w+')
print tokenizer.tokenize(soup.get_text())

Prints:

[u'Stack', u'Overflow', u'Wikipedia', u'the', u'free', u'encyclopedia', ... ]
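If installing nltk is not an option, a similar token list can probably be produced with the standard library alone. The sketch below assumes that RegexpTokenizer(r'\w+') in its default mode behaves like re.findall with the same pattern; the helper name count_words and the sample string are made up for illustration and are not part of the original answer.

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter mapping each word token in text to its frequency."""
    # re.UNICODE only matters on Python 2; on Python 3 it is already the
    # default for str patterns, so passing it is harmless.
    tokens = re.findall(r"\w+", text, re.UNICODE)
    return Counter(tokens)


# Hypothetical sample text standing in for soup.get_text().
counts = count_words("Stack Overflow, the free encyclopedia. Stack!")
print(counts["Stack"])
```

A Counter also replaces the manual "if the_word not in self.word_vector" bookkeeping from the question, since missing keys start at zero.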

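So, basically, you need a list of words from a web page, right?

Thank you, sir. I figured that if someone had actually built a library for this purpose, asking how to do it would be silly. But do you see anything in my regex that looks incorrect?

@jasondancks Well, at the very least, the re.sub() call doesn't modify the string; you need to assign its result back to line.

@jasondancks Plus, instead of line.split(), you should use re.split().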
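The two comments above can be combined into a minimal sketch of a corrected function. It is written as a stand-alone function rather than the asker's method, takes the text as a parameter (an assumption made for illustration), and swaps the original substitution pattern for a plain newline replacement, since the intent of the original pattern is not clear from the question.

```python
import re

# Runs of whitespace or of the punctuation characters listed in the question.
SEPARATORS = re.compile(r'[\s,."\\/?!@#$%^&*{}\[\]]+')


def get_word_vector(text):
    """Map each word in text to the number of times it occurs."""
    word_vector = {}
    # Strings are immutable: re.sub returns a new string, so rebind the name.
    # (Illustrative only: the separator pattern already covers newlines.)
    text = re.sub(r"\n", " ", text)
    # str.split treats its argument as literal text; a compiled pattern's
    # split (or re.split) treats it as a regular expression.
    for word in SEPARATORS.split(text):
        if not word:
            continue  # skip empty strings produced at the edges
        word_vector[word] = word_vector.get(word, 0) + 1
    return word_vector
```

For comparison, "a  b".split(r"\s+") returns the whole string as one element, because str.split looks for the literal characters \s+ rather than for whitespace. That is consistent with the long "words" described in the question.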
