Python BeautifulSoup: tags nested inside a tag are not found


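This code does not print the list of companies as intended. Execution never gets inside the first loop: if I put print 'text' inside the first for loop, nothing is printed. BeautifulSoup seems to behave differently on different sites. Any suggestions as to why it is not working?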

from bs4 import BeautifulSoup
import urllib
request = urllib.urlopen('http://www.stockmarketsreview.com/companies_sp500/')
html = request.read()
request.close()
soup = BeautifulSoup(html)
for tags in soup.find_all('div', {'class':'mainContent'}):
    for row in tags.find_all('tr'):
        for column in row.find_all('td'):
            print column.text

I have BeautifulSoup 3, and this seems to work correctly:

import BeautifulSoup as BS
import urllib
request = urllib.urlopen('http://www.stockmarketsreview.com/companies_sp500/')
html = request.read()
request.close()
soup = BS.BeautifulSoup(html)

try:
   tags = soup.findAll('div', attrs={'class':'mainContent'})
   print '# tags = ' + str(len(tags))
   for tag in tags:
      try:         
         tables = tag.findAll('table')
         print '# tables = ' + str(len(tables))
         for table in tables:            
            try:
               rows = table.findAll('tr')
               for row in rows:
                  try:
                     columns = row.findAll('td')
                     for column in columns:
                        print column.text
                  except:
                     e = 1
                  #   print 'Caught error getting td tag under ' + str(row)
                  # This is okay since some rows have <th>, not <td>
            except:
               print 'Caught error getting tr tag under ' + str(table)
      except:
         print 'Caught error getting table tag under ' + str(tag)
except:
   print 'Caught error getting div tag'
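For anyone on Python 3 with bs4, here is a minimal sketch of the same traversal. It is not the code from the answer above: the function name extract_cells and the SAMPLE_HTML snippet are made up for illustration, and the real page's markup may well differ. It only shows the div, tr, td walk using find_all, with the parser named explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<div class="mainContent">
  <table>
    <tr><th>Company</th><th>Symbol</th></tr>
    <tr><td>Example Corp A</td><td>EXA</td></tr>
    <tr><td>Example Corp B</td><td>EXB</td></tr>
  </table>
</div>
"""


def extract_cells(html, parser="html.parser"):
    """Return the text of every <td> inside div.mainContent, one list per row."""
    soup = BeautifulSoup(html, parser)
    rows = []
    for div in soup.find_all("div", class_="mainContent"):
        for tr in div.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # header rows hold only <th>, so they produce no cells
                rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in extract_cells(SAMPLE_HTML):
        print(row)
```

Since find_all returns an empty list when nothing matches, the nested try/except blocks are not needed here. An empty result from extract_cells means the div was not found in the parsed tree.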


I believe you need to replace findAll with find_all.

The output looks like this (the original answer included a screenshot of the script output here):

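This code works for me. Maybe check the indentation in your actual code.

What do you get when you run this? Are you using lxml as the parser? Some versions of lxml, and some versions of the underlying libxml, have real trouble parsing certain HTML. In other words, does BeautifulSoup(html, 'html.parser') make it work? If so, you may need to fix your lxml installation.

@Martijn Pieters I am not using lxml. Does it need to be installed to run this code?

I used your code, but it prints '# tags = 0'. I think something may be missing on my system, so this code does not work for this site, although I have used this kind of code on other sites. Any suggestions on what I need to install to run it?

Maybe BS4 differs from 3.2.1 more than I thought... I added a picture of the script output from my run. I see '# tags = 1', followed by the list of companies you are interested in.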
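One way to test the parser theory from the comments is to parse the same HTML with each parser that happens to be installed and compare how many div.mainContent elements each one finds. This is only a sketch: the helper name count_main_divs and the sample markup are made up for illustration, and it assumes bs4 on Python 3. Parsers that are not installed are reported as None instead of raising.

```python
import bs4
from bs4 import BeautifulSoup


def count_main_divs(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of div.mainContent elements found.

    Parsers that are not installed map to None.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except bs4.FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("div", class_="mainContent"))
    return counts


if __name__ == "__main__":
    # Hypothetical, deliberately sloppy markup standing in for the real page.
    html = '<div class="mainContent"><table><tr><td>Example Corp</td></table></div>'
    for parser, count in count_main_divs(html).items():
        print(parser, "->", "not installed" if count is None else count)
```

If one parser reports 0 while another reports 1 for the real page, that would point to a parser problem and away from the loop logic.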