Python can't scrape the links of different companies from a website using requests
python, python-3.x, web-scraping

I'm trying to get the links of different companies from a webpage, but the script I tried with throws the error below. In Chrome dev tools I can see that I can grab the IDs of different companies using a POST HTTP request. If I can get the IDs, I'll be able to use the link https://angel.co/startups/{}, adding the ids through string formatting, to create the full company links.

I've tried with:
import requests
link = 'https://angel.co/company_filters/search_data'
base = 'https://angel.co/startups/{}'
payload={'sort':'signal','page':'2'}
r = requests.post(link, data=payload, headers={
    'x-requested-with': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0'
})
print(r.json())
The script above throws the following error:
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
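A JSONDecodeError at char 0 almost always means the response body isn't JSON at all: typically an HTML error or bot-block page, or an empty body. A minimal offline sketch of catching and inspecting it (the html_body string is a made-up stand-in for such a response):

```python
import json

# "Expecting value: line 1 column 1 (char 0)" usually means the body is
# not JSON at all: often an HTML error/block page or an empty response.
html_body = "<html><body>Access denied</body></html>"  # made-up stand-in

try:
    parsed = json.loads(html_body)
except json.JSONDecodeError as exc:
    parsed = None
    print("Not JSON:", exc.msg, "at char", exc.pos)
```

In the script above, printing r.status_code and r.text[:200] before calling r.json() would show whether the server returned such a page instead of the expected JSON.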
How can I scrape the links of different companies from the site above using requests?

You can use selenium:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://angel.co/companies')
links = [i.a['href'] for i in soup(d.page_source, 'html.parser').find_all('div', {'class':'photo'})]
Output:
['https://angel.co/company/orchestra-one', 'https://angel.co/company/workramp', 'https://angel.co/company/alien-labs', 'https://angel.co/company/teamdom', 'https://angel.co/company/focal-systems', 'https://angel.co/company/ripple-co', 'https://angel.co/company/solugen', 'https://angel.co/company/govpredict', 'https://angel.co/company/ring-6', 'https://angel.co/company/radiopublic', 'https://angel.co/company/function-of-beauty', 'https://angel.co/company/kid-koderz-city', 'https://angel.co/company/united-income', 'https://angel.co/company/volara', 'https://angel.co/company/optimus-ride', 'https://angel.co/company/amplitude-analytics', 'https://angel.co/company/nanonets', 'https://angel.co/company/magnar', 'https://angel.co/company/kylieai', 'https://angel.co/company/clipboardhealth']
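The extraction step itself is plain BeautifulSoup and can be sketched offline; the HTML below is a hand-written stand-in for the structure the list comprehension above assumes (a div with class photo wrapping an a tag):

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the markup the comprehension above expects.
html = '''
<div class="photo"><a href="https://angel.co/company/orchestra-one"><img/></a></div>
<div class="photo"><a href="https://angel.co/company/workramp"><img/></a></div>
'''

soup = BeautifulSoup(html, 'html.parser')
links = [div.a['href'] for div in soup.find_all('div', {'class': 'photo'})]
print(links)
# -> ['https://angel.co/company/orchestra-one', 'https://angel.co/company/workramp']
```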
I made the function get_soup(page), which accepts a page argument (starting from 1) and returns the soup containing the relevant data. You can put this function in a loop to scrape more pages:
import requests
from bs4 import BeautifulSoup
def get_soup(page=1):
    headers = {
        'Accept-Language' : 'en-US,en;q=0.5',
        'Host' : 'angel.co',
        'User-Agent' : 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'
    }
    payload = {'sort': 'signal', 'page': str(page)}
    url = 'https://angel.co/company_filters/search_data'
    data = requests.get(url, headers=headers, data=payload).json()
    new_url = 'https://angel.co/companies/startups?' + '&'.join('ids[]={}'.format(_id) for _id in data['ids'])
    new_url += '&sort=' + data['sort']
    new_url += '&total=' + str(data['total'])
    new_url += '&page=' + str(data['page'])
    new_url += '&new=' + str(data['new']).lower()
    new_url += '&hexdigest=' + data['hexdigest']
    data = requests.get(new_url, headers=headers).json()
    return BeautifulSoup(data['html'], 'lxml')
soup = get_soup(1)
rows = []
for company, joined, location, market, website, company_size, stage, raised in zip(
        soup.select('.column.company'),
        soup.select('.column.joined .value'),
        soup.select('.column.location .value'),
        soup.select('.column.market .value'),
        soup.select('.column.website .value'),
        soup.select('.column.company_size .value'),
        soup.select('.column.stage .value'),
        soup.select('.column.raised .value')):
    company = company.get_text(strip=True, separator=" ")
    joined = joined.get_text(strip=True)
    location = location.get_text(strip=True)
    market = market.get_text(strip=True)
    website = website.get_text(strip=True)
    company_size = company_size.get_text(strip=True)
    stage = stage.get_text(strip=True)
    raised = raised.get_text(strip=True)
    rows.append([company, joined, location, market, website, company_size, stage, raised])
from textwrap import shorten
print(''.join('{: <25}'.format(shorten(d, 25)) for d in ['Company', 'Joined', 'Location', 'Market', 'Website', 'Company Size', 'Stage', 'Raised']))
print('-' * (25*8))
for row in rows:
    print(''.join('{: <25}'.format(shorten(d, 25)) for d in row))
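The hand-concatenated query string in get_soup can also be built with urllib.parse.urlencode, which handles the repeated ids[] key and the escaping for you. A sketch, with made-up values standing in for the JSON fields (ids, sort, total, page, new, hexdigest) that search_data returns:

```python
from urllib.parse import urlencode

# Made-up values standing in for the search_data JSON fields.
data = {'ids': [101, 102], 'sort': 'signal', 'total': 500,
        'page': 1, 'new': False, 'hexdigest': 'abc123'}

pairs = [('ids[]', _id) for _id in data['ids']]
pairs += [(k, data[k]) for k in ('sort', 'total', 'page')]
pairs += [('new', str(data['new']).lower()), ('hexdigest', data['hexdigest'])]

query = urlencode(pairs)  # brackets are percent-encoded as %5B%5D
url = 'https://angel.co/companies/startups?' + query
print(url)
```

Note that urlencode percent-encodes the square brackets; servers that understand the ids[]=... convention accept the encoded form as well.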
EDIT: To get only the links, you can do:
soup = get_soup(1)
for a in soup.select('.website a[href]'):
    print(a['href'])
Prints:
Company Joined Location Market Website Company Size Stage Raised
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nutanix Your [...] May ’14 San Jose Virtualization nutanix.com 1001-5000 IPO $312,200,000
EverFi Oct ’12 Washington DC Education everfi.com 51-200 Series C $61,000,000
Butter Make friends [...]Jun ’14 San Francisco Messaging getbutter.me 1-10 Seed $371,500
Fluent The future [...] Mar ’12 Sydney Curated Web fluent.io - - -
Belly Sep ’12 Chicago Small and Medium [...] bellycard.com Series B $24,975,000
Autotech Ventures [...] Apr ’14 Menlo Park Internet of Things autotechvc.com 1-10 - -
Oscar Health [...] Jun ’14 Tempe Technology hioscar.com 1001-5000 $1,267,500,000
Tovala Smart oven [...] Feb ’16 Chicago Home Automation tovala.com 11-50 Series A $10,800,000
GiftRocket Online [...] Mar ’16 San Francisco Gift Card giftrocket.com 1-10 Seed $520,000
Elemeno Health B2B [...] Apr ’16 Oakland Training elemenohealth.com 1-10 Seed $1,635,000
Sudo Technologies [...] Apr ’16 Menlo Park - sudo.ai - -
Stypi Sep ’16 - - Acquired -
Amazon Alexa Amazon [...]Sep ’16 Cambridge Speech Recognition developer.amazon.com 11-50 - -
Altos Ventures A [...] Oct ’16 Menlo Park Technology altos.vc 1-10 - -
Flirtey Making [...] Oct ’16 Reno - flirtey.com 11-50 Series A $16,000,000
SV Liquidity Fund [...] Oct ’16 San Francisco B2B svlq.io 1-10 - -
Princeton Ventures [...] Jan ’17 Princeton Technology princetonventures.com 1-10 - -
hulu - Beijing [...] Jan ’17 Beijing TV Production hulu.com - - -
Distributed Systems [...]Jan ’17 San Francisco Identity pavlov.ai 1-10 - -
Fetch Marketplace [...] May ’17 Atlanta Technology fetchtruck.com 1-10 Seed -
http://www.fuelpowered.com
http://www.slide.com
http://www.mparticle.com
http://www.matter.io
http://www.smartling.com
https://stensul.com
https://avametric.com/
https://ledgerinvesting.com
http://www.relativityspace.com
http://teamdom.co
http://www.wonderschool.com
http://www.upcall.com
http://focal.systems
https://asktetra.com
https://www.subdreamstudios.com/
http://www.stedi.com
http://www.magnarapp.com/
http://www.kylie.ai
http://clipboardhealth.com
The data is loaded asynchronously. You should use a selenium driver rather than scraping the data through requests alone - see my answer.

Comments:

Hi Andrej, I just tried your script but encountered the same error as before, i.e. raise JSONDecodeError("Expecting value", s, err.value) from None: json.decoder.JSONDecodeError: Expecting value, pointing at this line: data = requests.get(url, headers=headers, data=payload).json(). I ran it as is. By the way, did you use data=payload in the GET request intentionally? Thanks.

@MITHU the script works for me when I run it here... The site is using the Cloudflare CDN, so you may be blocked at the IP level. Can you try the script from a different IP? And yes, you can use data= even in a GET request.

I'll definitely try it once I get a chance to activate my VPN. Wasn't my initial attempt on the right track as well? Thanks. Since your answers always lead me in the right direction, I pressed that checkmark in advance. If I run into any issue, I'll let you know. Thanks.

Hi Andrej, I tried your script with the VPN activated, using a different working IP, but still ran into the same error I mentioned in my first comment. Thanks.
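The params= versus data= question from the comments can be checked offline with a prepared request: params= lands in the URL's query string, while data= becomes a form-encoded body, which many servers ignore on a GET. A small sketch against the placeholder example.com:

```python
import requests

# params= lands in the URL's query string.
with_params = requests.Request('GET', 'https://example.com/search',
                               params={'page': '2'}).prepare()

# data= is form-encoded into the request body instead.
with_data = requests.Request('GET', 'https://example.com/search',
                             data={'page': '2'}).prepare()

print(with_params.url)   # https://example.com/search?page=2
print(with_data.body)    # page=2
```

This is why the working get_soup above succeeds only if the server actually reads the GET body; sending the payload via params= is the more conventional choice.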