Unable to get table data from HTML (Python 2.7, BeautifulSoup)

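I am trying to get the "Earnings Announcements" table from the following page:

I have tried different BeautifulSoup options, but none of them returns the table: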

table = soup.find('table', attrs={'class': 'earnings_announcements_earnings_table'})

table = soup.find_all('table')

When I inspect the table in the browser, its elements are there.

Here is part of the code behind the table (JS? JSON?):

document.obj_data = {
"earnings_announcements_earnings_table" :
[ [ ... ], ..., [ "7/23/2009", "6/2009", "-", "-", "-", "-", "After Close" ] ]

How can I get this table?

Thanks!

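The solution is to search the whole HTML document with Python's string and regular-expression functions instead of parsing it with BeautifulSoup, because the data we want is not in the HTML tags but in the JS code.

This code takes the JS array that sits between "earnings_announcements_earnings_table" and "earnings_announcements_webcasts_table". Since a JS array of strings has the same literal syntax as a Python list, I parse it with ast. The result is a list you can loop over, and it holds the data for all pages of the table.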

import urllib2
import re
import ast

# Request the page with a browser-like User-Agent header
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'}
req = urllib2.Request('https://www.zacks.com/stock/research/amzn/earnings-announcements', None, user_agent)
source = urllib2.urlopen(req).read()

# Drop everything up to and including the key that precedes the earnings array
compiled = re.compile(r'"earnings_announcements_earnings_table"\s+:', flags=re.IGNORECASE)
match = compiled.search(source)
if match:
    source = source[match.end():]

# Drop everything from the next key onwards, leaving only the array literal
compiled = re.compile(r'"earnings_announcements_webcasts_table"', flags=re.IGNORECASE)
match = compiled.search(source)
if match:
    source = source[:match.start()]

# Strip surrounding whitespace and the trailing comma, then parse the literal into a list of rows
result = ast.literal_eval(str(source).strip('\r\n\t, '))
print result

Let me know if you need any clarification.
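To try the slicing and `ast.literal_eval` steps without requesting the page, here is a minimal sketch that runs them on a small inline string. `SAMPLE_SOURCE` and `extract_rows` are hypothetical names introduced for illustration, and the sample only imitates the shape of the page's script block, using the one row that is legible in the question's excerpt. It is written to run on Python 3 and should also run on 2.7.

```python
import ast
import re

# Hypothetical stand-in for the downloaded page source (the real page is far larger).
SAMPLE_SOURCE = '''
<script>
document.obj_data = {
"earnings_announcements_earnings_table" :
[ [ "7/23/2009", "6/2009", "-", "-", "-", "-", "After Close" ] ]
, "earnings_announcements_webcasts_table" : [ ]
};
</script>
'''


def extract_rows(source, start_key, end_key):
    """Parse the array literal that sits between two quoted keys into a list."""
    start = re.search(r'"%s"\s*:' % re.escape(start_key), source)
    if start is None:
        raise ValueError("start key not found: %s" % start_key)
    tail = source[start.end():]
    end = re.search(r'"%s"' % re.escape(end_key), tail)
    if end is None:
        raise ValueError("end key not found: %s" % end_key)
    # Remove surrounding whitespace and the comma that separates the two keys.
    literal = tail[:end.start()].strip("\r\n\t, ")
    return ast.literal_eval(literal)


rows = extract_rows(
    SAMPLE_SOURCE,
    "earnings_announcements_earnings_table",
    "earnings_announcements_webcasts_table",
)
for row in rows:
    print(row)
```

Raising an error when a key is missing is a design choice of this sketch: it avoids passing the whole page to `ast.literal_eval` if the site changes its key names.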

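The data is loaded dynamically and is not in the HTML markup, so you have to parse the data you receive.

Thanks!! PhantomJS, Selenium?

I looked at the page source and it looks the same, so I don't think they will help. Still, you can give them a try.

It works with Selenium! Thank you very much!

It works great! Elements 4 and 5 of each list contain HTML code.

Great! If you want to strip that HTML, you can run those elements through BeautifulSoup.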
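The cleanup mentioned in the last comment could look roughly like the sketch below. To keep it free of third-party installs it swaps in the standard library's `HTMLParser` rather than BeautifulSoup. `strip_tags`, `clean_row`, and the sample row (including its `div` markup) are hypothetical and only illustrate the idea.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(cell):
    """Return the text content of one table cell with any HTML tags removed."""
    collector = _TextCollector()
    collector.feed(cell)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts).strip()


def clean_row(row):
    """Apply strip_tags to every cell of a row."""
    return [strip_tags(cell) for cell in row]


# Hypothetical row: the markup in the second cell is made up for illustration.
sample_row = ["7/23/2009", '<div class="right pos">+0.05</div>', "After Close"]
print(clean_row(sample_row))
```

If BeautifulSoup is already installed, `BeautifulSoup(cell, "html.parser").get_text()` does the same job for a single cell, as the comment suggests.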