How can I scrape pages with dynamically generated URLs using Python?

Tags: python, web-scraping, beautifulsoup, urllib2

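I am trying to scrape pages like the ones below, but the usual URL string-building technique doesn't work because the full company name is inserted into the path, and that exact full name isn't known in advance. Only the company symbol, e.g. "IBM", is known.

Essentially, the way I scrape is by looping through an array of company symbols and building the URL string before passing it to urllib2.urlopen(url). In this case, that can't be done.

For example, the URL for CSCO is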

http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios
Another example URL, for AAPL:

http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios
So to get the URL, I have to search for the symbol in the input box on the main page:

http://www.dailyfinance.com/
When I type "CSCO" and inspect the search input in the Network tab of the Firefox Web Developer tools, I see that a GET request is sent to

http://j.foolcdn.com/tmf/predictivesearch?callback=_predictiveSearch_csco&term=csco&domain=dailyfinance.com
and the Referer header actually gives the path I want to capture:

Host: j.foolcdn.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios?source=itxwebtxt0000007
Connection: keep-alive
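Sorry for the long explanation. So the question is: how can I extract the URL in the Referer? If that isn't possible, how should I approach this problem? Is there another way?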


Any help is much appreciated.

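This doesn't answer your specific question, but it does solve your problem. A URL of the following form redirects to the full quote page:

http://www.dailyfinance.com/quotes/{Company Symbol}/{Stock Exchange}

Example:

http://www.dailyfinance.com/quotes/AAPL/NAS

To get to the financial ratios page, you could then use something like this: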

import urllib2

def financial_ratio_url(symbol, stock_exchange):
    # Build the short lookup URL, e.g. .../quotes/AAPL/NAS
    starturl = 'http://www.dailyfinance.com/quotes/'
    starturl += '/'.join([symbol, stock_exchange])
    # urlopen follows the redirect; geturl() returns the final URL,
    # which contains the full company name
    res = urllib2.urlopen(starturl)
    return '/'.join([res.geturl(), 'financial-ratios'])
For example:

financial_ratio_url('AAPL', 'NAS')
'http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios'
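The same idea can be sketched for Python 3, where urllib2 became urllib.request. This version keeps the URL building apart from the network call so the loop over symbols can be tried offline. The names lookup_url, resolve and fake_resolve are illustrative assumptions, not part of the answer above, and fake_resolve only imitates the site's redirect for one symbol.

```python
from urllib.request import urlopen

QUOTES_BASE = "http://www.dailyfinance.com/quotes"  # URL pattern from the answer above


def lookup_url(symbol, stock_exchange):
    """Build the short lookup URL, e.g. .../quotes/AAPL/NAS."""
    return "/".join([QUOTES_BASE, symbol, stock_exchange])


def financial_ratio_url(symbol, stock_exchange, resolve=None):
    """Return the financial-ratios URL for one symbol.

    `resolve` maps a lookup URL to the URL it redirects to. By default it
    performs a real HTTP request; pass a stub to run offline.
    """
    if resolve is None:
        def resolve(url):
            with urlopen(url) as res:
                return res.geturl()  # final URL after redirects
    final_url = resolve(lookup_url(symbol, stock_exchange))
    return final_url.rstrip("/") + "/financial-ratios"


def fake_resolve(url):
    # Hypothetical stand-in for the site's redirect, used only for this demo
    known = {
        lookup_url("AAPL", "NAS"): "http://www.dailyfinance.com/quote/nasdaq/apple/aapl",
    }
    return known[url]


# Loop over (symbol, exchange) pairs, as the question describes
symbols = [("AAPL", "NAS")]
urls = [financial_ratio_url(s, ex, resolve=fake_resolve) for s, ex in symbols]
print(urls)
```

Dropping the resolve argument makes the function hit the real site, which may or may not still answer at these URLs.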

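I like this question, so I'll give a very thorough answer. For this, I'll use my favorite Requests library along with BeautifulSoup4. Porting to Mechanize is up to you if you really want to use it. Requests will save you a lot of headaches.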


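First off, you might be thinking of a POST request. However, a POST request is often unnecessary if the search function takes you straight to the page you're looking for. So let's inspect it, shall we?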

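When I land on the base URL, http://www.dailyfinance.com/, a quick check with Firebug or Chrome's inspect tool shows that when I enter CSCO or AAPL in the search bar and trigger the "jump", the response has a 301 Moved Permanently status code. What does this mean?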

Simply put, I was redirected somewhere else. The URL for this GET request is the following:

http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=CSCO
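As a small Python 3 sketch, the jump URL can also be assembled with query-string encoding instead of plain concatenation, which keeps tickers with unusual characters safe. The helper name jump_url and the constant JUMP_ENDPOINT are illustrative assumptions; the parameter names come from the URL above.

```python
from urllib.parse import urlencode

JUMP_ENDPOINT = "http://www.dailyfinance.com/quote/jump"


def jump_url(ticker, exchange=""):
    """Return the jump URL for a ticker, with parameters percent-encoded."""
    query = urlencode({"exchange-input": exchange, "ticker-input": ticker})
    return JUMP_ENDPOINT + "?" + query


print(jump_url("CSCO"))
```

With Requests, the same effect comes from passing the dictionary through the params argument of rq.get.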
Now let's test whether it works with AAPL, using simple URL manipulation:

import requests as rq

apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)  # Requests follows the redirect automatically
print(r.url)                # final URL after the redirect
The above gives the following result:

http://www.dailyfinance.com/quote/nasdaq/apple/aapl
[Finished in 2.3s]
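See how the URL of the response changed? Let's take the URL manipulation one step further and look for the /financial-ratios page by appending the following to the code above: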

new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
print(p.url)
When run, this gives the following result:

http://www.dailyfinance.com/quote/nasdaq/apple/aapl
http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios
[Finished in 6.0s]
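To see the redirect mechanics without touching the real site, here is a self-contained Python 3 sketch that starts a throwaway local server which answers /quote/jump with a 301 and a Location header, then follows it with the standard library. The PATHS table and the handler are hypothetical stand-ins for whatever the real site does. Only the shape of the exchange (jump URL, 301, final URL, appended /financial-ratios) mirrors the steps above.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit
from urllib.request import ProxyHandler, build_opener

# Hypothetical symbol-to-path table standing in for the real site's lookup
PATHS = {"AAPL": "/quote/nasdaq/apple/aapl"}


class JumpHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parts = urlsplit(self.path)
        if parts.path == "/quote/jump":
            ticker = parse_qs(parts.query).get("ticker-input", [""])[0]
            target = PATHS.get(ticker.upper())
            if target is None:
                self.send_error(404)
                return
            # 301 Moved Permanently: the Location header names the real page
            self.send_response(301)
            self.send_header("Location", target)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), JumpHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

opener = build_opener(ProxyHandler({}))  # ignore any system proxy for the local demo

with opener.open(base + "/quote/jump?exchange-input=&ticker-input=AAPL") as res:
    final_url = res.geturl()  # URL after the redirect was followed

ratios_url = final_url + "/financial-ratios"
with opener.open(ratios_url) as res:
    status = res.status

server.shutdown()
server.server_close()
print(final_url[len(base):], ratios_url[len(base):], status)
```

With Requests, the intermediate 301 response is kept in r.history, and r.url holds the final address, which is what the snippets above print.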
Now we're on the right track. Next, I'll try to parse the data using BeautifulSoup. My complete code is as follows:

from bs4 import BeautifulSoup as bsoup
import requests as rq

apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)

soup = bsoup(p.content)
div = soup.find("div", id="clear").table
rows = div.find_all("tr")
for row in rows:
    print(row)
I then tried running this code, only to run into the following traceback:

  File "C:\Users\nanashi\Desktop\test.py", line 13, in <module>
    div = soup.find("div", id="clear").table
AttributeError: 'NoneType' object has no attribute 'table'
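soup.find returns None when nothing matches, which is why chaining .table onto it raises the AttributeError. A minimal Python 3 sketch of checking for that case follows. It assumes the beautifulsoup4 package is installed, and the function name find_ratio_rows and the two sample documents are made up for illustration.

```python
from bs4 import BeautifulSoup


def find_ratio_rows(html):
    """Return the <tr> rows of the table inside div#clear, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id="clear")
    if div is None or div.table is None:
        return []  # the table is not in this document
    return div.table.find_all("tr")


# Two tiny hand-written documents, used only for illustration
with_table = '<div id="clear"><table><tr><td>Beta</td><td>1.37</td></tr></table></div>'
without_table = '<div id="other"><p>loading...</p></div>'

print(len(find_ratio_rows(with_table)))
print(len(find_ratio_rows(without_table)))
```

An empty result is a hint that the element is filled in later or fetched from a different URL, rather than a reason to retry the same page.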
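In other words, the fetched page contains no div with id="clear", so the table must come from somewhere else. It is served by a third-party source whose URL takes the ticker symbol directly:

http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=AAPL

Let's try using that in our final scraper.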

from bs4 import BeautifulSoup as bsoup
import requests as rq

csco_tick = "CSCO"
url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US="
new_url = url + csco_tick

r = rq.get(new_url)
soup = bsoup(r.content)

table = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
    print(row.get_text())
Our raw results for CSCO's financial ratios data are as follows:

Company
Industry


Valuation Ratios


P/E Ratio (TTM)
15.40
14.80


P/E High - Last 5 Yrs 
24.00
28.90


P/E Low - Last 5 Yrs
8.40
12.10


Beta
1.37
1.50


Price to Sales (TTM)
2.51
2.59


Price to Book (MRQ)
2.14
2.17


Price to Tangible Book (MRQ)
4.25
3.83


Price to Cash Flow (TTM)
11.40
11.60


Price to Free Cash Flow (TTM)
28.20
60.20


Dividends


Dividend Yield (%)
3.30
2.50


Dividend Yield - 5 Yr Avg (%)
N.A.
1.20


Dividend 5 Yr Growth Rate (%)
N.A.
144.07


Payout Ratio (TTM)
45.00
32.00


Sales (MRQ) vs Qtr 1 Yr Ago (%)
-7.80
-3.70


Sales (TTM) vs TTM 1 Yr Ago (%)
5.50
5.60


Growth Rates (%)


Sales - 5 Yr Growth Rate (%)
5.51
5.12


EPS (MRQ) vs Qtr 1 Yr Ago (%)
-54.50
-51.90


EPS (TTM) vs TTM 1 Yr Ago (%)
-54.50
-51.90


EPS - 5 Yr Growth Rate (%)
8.91
9.04


Capital Spending - 5 Yr Growth Rate (%)
20.30
20.94


Financial Strength


Quick Ratio (MRQ)
2.40
2.70


Current Ratio (MRQ)
2.60
2.90


LT Debt to Equity (MRQ)
0.22
0.20


Total Debt to Equity (MRQ)
0.31
0.25


Interest Coverage (TTM)
18.90
19.10


Profitability Ratios (%)


Gross Margin (TTM)
63.20
62.50


Gross Margin - 5 Yr Avg
66.30
64.00


EBITD Margin (TTM)
26.20
25.00


EBITD - 5 Yr Avg
28.82
0.00


Pre-Tax Margin (TTM)
21.10
20.00


Pre-Tax Margin - 5 Yr Avg
21.60
18.80


Management Effectiveness (%)


Net Profit Margin (TTM)
17.10
17.65


Net Profit Margin - 5 Yr Avg
17.90
15.40


Return on Assets (TTM)
8.30
8.90


Return on Assets - 5 Yr Avg
8.90
8.00


Return on Investment (TTM)
11.90
12.30


Return on Investment - 5 Yr Avg
12.50
10.90


Efficiency


Revenue/Employee (TTM)
637,890.00
556,027.00


Net Income/Employee (TTM)
108,902.00
98,118.00


Receivable Turnover (TTM)
5.70
5.80


Inventory Turnover (TTM)
11.30
9.70


Asset Turnover (TTM)
0.50
0.50

[Finished in 2.0s]
Cleaning the data is up to you.
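As one possible starting point for that cleanup, here is a Python 3 sketch that turns each row's text into a (label, company, industry) tuple. It assumes every data row yields exactly three non-empty lines, as the printed output above suggests, and that section headers yield fewer. The names parse_row and to_number are illustrative, and the sample rows are copied from the output above.

```python
def to_number(cell):
    """Convert "637,890.00" to 637890.0 and "N.A." to None."""
    if cell == "N.A.":
        return None
    return float(cell.replace(",", ""))


def parse_row(text):
    """Split one row's text into (label, company, industry), or None.

    Header rows such as "Valuation Ratios" carry no values and yield None.
    """
    cells = [line.strip() for line in text.splitlines() if line.strip()]
    if len(cells) != 3:
        return None
    label, company, industry = cells
    return label, to_number(company), to_number(industry)


# Sample row texts in the shape assumed for row.get_text()
raw_rows = [
    "Company\nIndustry",
    "Valuation Ratios",
    "P/E Ratio (TTM)\n15.40\n14.80",
    "Dividend Yield - 5 Yr Avg (%)\nN.A.\n1.20",
    "Revenue/Employee (TTM)\n637,890.00\n556,027.00",
]

cleaned = [parsed for parsed in map(parse_row, raw_rows) if parsed is not None]
for label, company, industry in cleaned:
    print(label, company, industry)
```

In the scraper itself, each row.get_text() result would be passed to parse_row in place of the sample strings.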


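A good lesson to take from this is that not all data lives in a single page. It's nice that it came from another static site. Had it been generated via JavaScript, AJAX calls, or the like, our approach would likely have run into some difficulties.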


Hopefully you learned something from this. Let us know if it helps, and good luck.

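+1: for me personally, this is a very good question. Any updates on this? Have you seen how I properly handled the referer in my answer?

Wow, you're a web-scraping maniac (in a good sense)! :) No more upvotes left for today, will be getting to +1 tomorrow.

@alecxe: Man, you have no idea how much I've learned from you. Coming from you, that's a huge compliment. Thanks!

Thank you very much. We're all learning from each other here. Such detailed answers really make the web-scraping world better. Hope to hear more from you in the relevant tags.

+1, above and beyond what the OP asked for! I learned a lot from this.

@cdhagmann: If I hadn't seen the scrollbar at the bottom, I was actually kind of inclined to stop there and let the OP figure it out. I really dislike anything dynamically generated because it causes a lot of problems, but I'm glad I refreshed the page again and caught the third-party call. Thanks!

Thank you for your answer. I found it useful and educational.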