
Python can't correctly convert HTML from a website to text

python python-2.7 text web-scraping


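Edit: I can't believe BeautifulSoup actually can't parse HTML properly. I'm probably doing something wrong, but if I'm not, this is a really amateur module.

I'm trying to get text from the web, but I can't, because I keep getting strange characters in most sentences. I never get a sentence that contains a word such as "isn't" correctly.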

import urllib2
from bs4 import BeautifulSoup

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL', None, useragent)
myreq = urllib2.urlopen(request, timeout=5)
html = myreq.read()

# get paragraphs
soup = BeautifulSoup(html)
textList = soup.find_all('p')
mytext = ""
for par in textList:
    # skip very long paragraphs
    if len(str(par)) < 2000:
        print par
        mytext += " " + str(par)

print "the text is ", mytext

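I think the problem is your system's output encoding: it can't display the encoded characters correctly because they fall outside the range of characters it can show.
BeautifulSoup4 is designed to fully support HTML entities.
Note the strange behavior of these commands: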
>python temp.py
...
ed a blackhead. The plural of ÔÇ£comedoÔÇØ is comedomesÔÇØ.</p>
...

>python temp.py > temp.txt

>cat temp.txt
....
ed a blackhead. The plural of "comedo" is comedomes".</p> <p> </p> <p>Blackheads is an open and wide
....
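The garbled characters in the first run are what you would expect if UTF-8 bytes were displayed by a console using a different code page. The sketch below reproduces the effect; the choice of `cp850` is an assumption (it is consistent with the output shown, but the actual console setting is not stated), and `misread_utf8` is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-

def misread_utf8(text, console_codec="cp850"):
    """Encode text as UTF-8, then decode those bytes the way a console
    using console_codec would interpret them (assumed: cp850)."""
    return text.encode("utf-8").decode(console_codec)

# The word "comedo" wrapped in left and right curly quotes.
original = u"\u201ccomedo\u201d"

# Each curly quote is three bytes in UTF-8; cp850 shows them as three
# unrelated characters, which is the garbling seen in the console.
garbled = misread_utf8(original)

# Reversing the mistake recovers the original text, so no data was lost.
recovered = garbled.encode("cp850").decode("utf-8")
```

The redirected file looks fine because the bytes are written as they are and only interpreted later by a viewer that reads them as UTF-8.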

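I suggest writing the output to a text file, or using a different terminal (or changing your terminal settings) so that it supports a wider range of characters.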
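As a rough illustration of the "write it to a file" suggestion, the sketch below writes text to a file with an explicit encoding instead of relying on the console. The file name, the temporary directory, and the sample text are all made up for the example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Sample text containing curly quotes (illustrative only).
text = u"The plural of \u201ccomedo\u201d"

# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "temp.txt")

# io.open behaves the same on Python 2.7 and Python 3 and lets us
# name the output encoding explicitly.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Read the raw bytes back to confirm what was written.
with io.open(path, "rb") as handle:
    raw = handle.read()
```

Opening the resulting file in an editor that reads UTF-8 should show the quotes intact, regardless of what the terminal can display.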
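Since this is Python 2, the urllib2.urlopen().read() call returns a byte string, most likely encoded in UTF-8. You can look at the HTTP headers to see whether the encoding is stated there. I'm assuming UTF-8.
If you don't decode this external representation before you start processing the content, it will only end in tears. General rule: decode input immediately, encode only on output.
Here is your code with just two modifications: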
import urllib2
from BeautifulSoup import BeautifulSoup

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL',None,useragent)
myreq = urllib2.urlopen(request, timeout = 5)
html = unicode(myreq.read(), "UTF-8")

#get paragraphs
soup = BeautifulSoup(html)
textList = soup.findAll('p')
mytext = ""
for par in textList:
    if len(str(par))<2000: 
      print par
      mytext +=" " +  str(par)

print "the text is ", mytext
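The "decode on input, encode on output" rule can be sketched without a network call. The example below picks the charset from a Content-Type header value and falls back to UTF-8 when none is given; the helper names, the fallback, and the sample header are assumptions for illustration, not part of urllib2.

```python
# -*- coding: utf-8 -*-
import re

def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or the default."""
    match = re.search(r'charset="?([\w-]+)', content_type or "", re.IGNORECASE)
    return match.group(1) if match else default

def decode_body(raw_bytes, content_type):
    """Decode a response body to text as soon as it has been read."""
    return raw_bytes.decode(charset_from_content_type(content_type))

# Stand-in for the bytes returned by read() (illustrative sample).
raw = u"The plural of \u201ccomedo\u201d".encode("utf-8")

# Decode immediately; all further processing works on text.
text = decode_body(raw, "text/html; charset=UTF-8")

# Encode only at the boundary, when the text leaves the program.
out_bytes = text.encode("utf-8")
```

In the code above, the header value would come from the response object rather than a literal string.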

Here is a solution based on the answers people gave here and on my own research:
import html2text
import urllib2
import re

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL', None, useragent)
myreq = urllib2.urlopen(request, timeout=5)
html = myreq.read()
# decode the raw bytes to unicode before any processing
html = html.decode("utf-8")

# grab everything between <p> and </p>
textList = re.findall(r'(?<=<p>).*?(?=</p>)', html, re.MULTILINE|re.DOTALL)
mytext = ""
for par in textList:
    # par is unicode here; str(par) would fail on non-ASCII characters
    if len(par) < 2000:
        # strip any remaining tags inside the paragraph
        par = re.sub('<[^<]+?>', '', par)
        mytext += " " + html2text.html2text(par)

print "the text is ", mytext
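To make the two regular expressions in this solution easier to follow, here is a small sketch that runs them on a made-up HTML fragment. The sample string is an assumption for illustration. Note that the lookbehind only matches a bare `<p>`, so a paragraph tag with attributes is skipped.

```python
# -*- coding: utf-8 -*-
import re

# Made-up fragment: two plain paragraphs and one with an attribute.
sample = (u"<p>A <b>\u201ccomedo\u201d</b>.</p>\n"
          u"<p>Second\nline.</p>\n"
          u'<p class="x">Skipped</p>')

# Text between a literal <p> and the next </p>; DOTALL lets "." span newlines.
paragraphs = re.findall(r'(?<=<p>).*?(?=</p>)', sample, re.DOTALL)

# Remove any tags left inside each paragraph.
stripped = [re.sub(r'<[^<]+?>', '', par) for par in paragraphs]
```

A parser-based approach would also pick up the third paragraph; the regex version trades that robustness for simplicity.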

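Possible duplicate.
It's not a duplicate. I first need to extract all the paragraphs. I think decoding would remove all the tags.
It seems I need to tell BeautifulSoup something so that it stops breaking my HTML. I can't believe such a famous Python module can't parse HTML correctly.
I still get the garbled "comedo" text. Did you use Python 2.7? My Python setup is a mess and I keep getting different output.
Can you tell me which versions of Python and BeautifulSoup you have?
It's Python 2.7.3 and bs4.
I've seen these characters before, but now when I try to run your code I no longer see them. Still trying to figure out where you are running Python from.
After testing, I believe this is a problem with your character encoding. I tried it and saw the same problem.
How do I view the HTTP headers?
Strange. When I run the code above, the HTML I see contains &ldquo;comedo&rdquo; is comedomes&rdquo;, which means the left and right quotes have been handled correctly.
Use myreq.headers.headers to access the headers. I didn't see much of value there, but it confirms that you are dealing with a UTF-8 stream. The content appears to be missing an opening quote.
Yes, I see charset=UTF-8.
Never mind, I've added the solution that works for me. The bs4 I'm using may be a bit odd.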