Scraping a website whose URL does not change [python with beautiful soup]
I am completely new to web scraping. How can I scrape a website whose URL does not change with the page number? Take this website as an example - the URL stays the same no matter which page you are on. How can we do this using Beautiful Soup in Python?
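The page in question is an ASP.NET WebForms site: pagination happens through a POST-back, where the browser re-submits hidden form fields (`__VIEWSTATE`, `__EVENTVALIDATION`, etc.) rather than changing the URL. A minimal sketch of collecting those hidden fields with BeautifulSoup, using a made-up HTML snippet in place of the real page (the input names mirror typical WebForms markup, but the values here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample of an ASP.NET WebForms page; the real page carries
# the same kind of hidden inputs, just with much longer values.
sample_html = """
<form id="aspnetForm" method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="def456" />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtSearch" />
</form>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Collect every <input> name/value pair - this is the state the server
# expects back in the POST body of the next page-change request.
form_data = {i['name']: i.get('value', '') for i in soup.select('input')}
print(form_data)
```

Sending `form_data` back in a `requests.post(...)` call (with the paging parameters added) is what the answer below does against the live site.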
import requests
from bs4 import BeautifulSoup

url = 'https://www.bseindia.com/corporates/Forth_Results.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
page = 1

while True:
    print(page)

    rows = soup.select('.TTRow')
    if not rows:
        break

    # print some data to screen:
    for tr in rows:
        print(tr.get_text(strip=True, separator=' '))

    # to get the next page, you have to do a POST request with the correct
    # form data; the data is located in <input name="..." value=".."> tags
    d = {}
    for i in soup.select('input'):
        d[i['name']] = i.get('value', '')

    # the submit-button parameter needs to be deleted:
    if 'ctl00$ContentPlaceHolder1$btnSubmit' in d:
        del d['ctl00$ContentPlaceHolder1$btnSubmit']

    # set the correct page:
    page += 1
    d['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$gvData'
    d['__EVENTARGUMENT'] = 'Page${}'.format(page)

    soup = BeautifulSoup(requests.post(url, headers=headers, data=d).content, 'html.parser')
Edit: to save the data to a CSV file, you can use the following:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.bseindia.com/corporates/Forth_Results.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
page = 1
all_data = []

while True:
    print(page)

    rows = soup.select('.TTRow')
    if not rows:
        break

    # collect the row data:
    for tr in rows:
        row = tr.get_text(strip=True, separator='|').split('|')
        all_data.append(row)

    # to get the next page, you have to do a POST request with the correct
    # form data; the data is located in <input name="..." value=".."> tags
    d = {}
    for i in soup.select('input'):
        d[i['name']] = i.get('value', '')

    # the submit-button parameter needs to be deleted:
    if 'ctl00$ContentPlaceHolder1$btnSubmit' in d:
        del d['ctl00$ContentPlaceHolder1$btnSubmit']

    # set the correct page:
    page += 1
    d['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$gvData'
    d['__EVENTARGUMENT'] = 'Page${}'.format(page)

    soup = BeautifulSoup(requests.post(url, headers=headers, data=d).content, 'html.parser')

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Screenshot of the generated data.csv opened in LibreOffice: [screenshot not included]
Comments: This website is controlled by JavaScript. I don't think BeautifulSoup can handle such sites; people often use Selenium to scrape them. I think you can find your answer there. - Thank you sir, can I save it in an Excel or CSV file? There are three different columns (id, name, and date); could you also add comments so I can modify the code accordingly? - @Prabhatkumar updated my answer.
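To address the comment about the three columns: when building the DataFrame you can pass explicit column names, so the CSV gets a proper header row. A small sketch, using made-up rows in place of the scraped `all_data`; the column names below are assumptions, so check the table header on the actual page and adjust them to match:

```python
import pandas as pd

# Hypothetical rows standing in for the scraped all_data list;
# each row is [id, name, date] as described in the comment.
all_data = [
    ['500002', 'ABB India', '12 Aug 2021'],
    ['500003', 'Aegis Logistics', '09 Aug 2021'],
]

# Assumed column names - rename to whatever the page's header shows.
df = pd.DataFrame(all_data, columns=['Security Code', 'Company Name', 'Result Date'])
df.to_csv('data.csv', index=False)  # index=False drops pandas' row-number column
print(df)
```

For an Excel file, `df.to_excel('data.xlsx', index=False)` works the same way (it requires the openpyxl package to be installed).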