Python BeautifulSoup does not read a document correctly

python web-scraping

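I am trying to scrape NBA player statistics so that I can run some machine learning on them, and I found these "printable player files", which contain a large amount of clean, well-organized stats. Unfortunately, when I try to parse the HTML with BeautifulSoup, it does not work at all. For example: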

from bs4 import BeautifulSoup
import codecs
import urllib2

url = 'http://www.nba.com/playerfile/ray_allen/printable_player_files.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

# The with block closes the file on exit, so no explicit close() is needed.
with open('ray_allen.txt', 'w') as f:
    f.write(soup.prettify())

This gives me a file that looks like this:

<html>
 <head>
  <!--no description was found-->
  <!--no title was found-->
  <!--no keywords found-->
  <!--not article-->
  <script>
   var site = "nba";
var page = "player";
  </script>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <script language="Javascript">
   &lt;!--
var flashinstalled = 0;
var flashversion = 0;
MSDetect = "false";
if (navigator.plugins &amp;&amp; navigator.plugins.length) {
    x = navigator.plugins["Shockwave Flash"];
    if (x) {
        flashinstalle   d       =       2   ;   

           i   f       (   x   .   d   e   s   c   r   i   p   t   i   o   n   )       {   

               y       =       x   .   d   e   s   c   r   i   p   t   i   o   n   ;   

               f   l   a   s   h   v   e   r   s   i   o   n       =       y   .   c   h   a   r   A   t   (   y   .   i   n   d   e   x   O   f   (   '   .   '   )   -   1   )   ;   

           }   

       }       e   l   s   e   

           f   l   a   s   h   i   n   s   t   a   l   l   e   d       =       1   ;   

       i   f       (   n   a   v   i   g   a   t   o   r   .   p   l   u   g   i   n   s   [   "   S   h   o   c   k   w   a   v   e       F   l   a   s   h       2   .   0   "   ]   )       {   

           f   l   a   s   h   i   n   s   t   a   l   l   e   d       =       2   ;   

           f   l   a   s   h   v   e   r   s   i   o   n       =       2   ;   

       }   
[...]

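This works for me in the terminal, so something is going wrong when it tries to write to the file; this is not a BeautifulSoup problem.

This looks like a bug in BeautifulSoup 4.

I tried your code with BeautifulSoup 3 (as packaged in Ubuntu), changing "from bs4 import BeautifulSoup" to "from BeautifulSoup import BeautifulSoup", and it worked as expected. When I used v4 (running the code unchanged), I reproduced your problem. The bug appears to be in the parser rather than in prettify, because printing the soup object shows the same problem.

Please report it as a bug. In the meantime, use version 3.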
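The check described in this answer (does the plain soup output show the same damage as the prettify output?) can be made less dependent on eyeballing. The sketch below is only a heuristic, assuming the damage always looks like the output in the question: one visible character followed by three spaces. The names is_stretched and stretched_line_numbers are made up for this illustration, and the code is written for Python 3, unlike the Python 2 snippet in the question.

```python
def is_stretched(line, min_chars=4):
    """Heuristic: True if the line looks like 'f   l   a   s   h'.

    Assumption: damaged text has one real character in every fourth
    position and spaces everywhere else, as in the output shown above.
    """
    body = line.strip()
    # Every position that is not a multiple of four must be a space.
    gaps_are_spaces = all(ch == " " for i, ch in enumerate(body) if i % 4)
    # Require a few visible characters so short lines do not match by accident.
    visible = sum(1 for ch in body[::4] if ch != " ")
    return gaps_are_spaces and visible >= min_chars


def stretched_line_numbers(text):
    """Return the 1-based numbers of the lines that look stretched."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if is_stretched(line)
    ]
```

Running stretched_line_numbers on both the string form of the soup object and the result of soup.prettify() would show whether the two outputs are damaged in the same places. If they are, that would be consistent with the answer's conclusion that the parser, not prettify, is at fault.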

This has the same symptoms as a bug that was fixed in 4.0.3. I recommend upgrading to the latest version of Beautiful Soup 4.
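To see whether an installed copy predates that fix, the version string can be compared against 4.0.3. This is a minimal sketch that takes the 4.0.3 figure from the answer above on trust. The helper names parse_version and predates_fix are hypothetical, and the code targets Python 3.

```python
import re

# Assumption: the fix landed in 4.0.3, as stated in the answer above.
FIXED_IN = (4, 0, 3)


def parse_version(version):
    """Turn '4.0.2' or '4.0.2-1' into (4, 0, 2), ignoring any suffix."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("no leading version number in %r" % version)
    return tuple(int(part) for part in match.group().split("."))


def predates_fix(version, fixed_in=FIXED_IN):
    """True if the given version string is older than the fixed release."""
    return parse_version(version) < fixed_in
```

With Beautiful Soup 4 installed, the version string is normally available as bs4.__version__, so predates_fix(bs4.__version__) would report whether an upgrade is worth trying first.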

Which version of Python are you running? And which version of Beautiful Soup? I know the most recent one has had some problems.

What makes you think this is not working? What does the rest of the HTML file look like? When I view the source, that is how the HTML page starts.

Python 2.7.3, BeautifulSoup 4.0.2-1. It carries on like that until the final tags: more than 3200 lines of characters, each separated by 3 spaces, with symbols converted to their HTML entities (if that is the right word), etc.

[...]
   &lt;   /   b   o   d   y   &gt;   

   &lt;   /   h   t   m   l   &gt;
  </script>
 </head>
</html>

print soup.prettify()