Python: I am scraping a website with Selenium and BeautifulSoup. I need the total number of pages on the site, or another way to navigate through the pages

python selenium-webdriver beautifulsoup webdriver webdriverwait

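I am using Selenium WebDriver and BeautifulSoup to scrape a website that has a variable number of pages. I did it crudely through xpath. The pager shows links to five pages at a time; after counting five, I click the Next button and reset the xpath count to get the next five pages. For this I need the total number of pages on the site, obtained through code, or a better way to navigate to the different pages.

I think the page uses Angular JavaScript for navigation. The code is as follows: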

import time  # needed for the time.sleep() calls below
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.maximize_window()
spg_index=' '
url = "https://www.bseindia.com/corporates/ann.html"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
html=soup.prettify()
with open('bseann.txt', 'w', encoding='utf-8') as f:
    f.write(html)
time.sleep(1)
i=1  # index of the page being visited; capped at 31 for now
k=1  # position within the current set of page links; at most 5 are shown at a time
while i <31:
    next_pg=9   # xpath index that points to the "next" button
    snext_pg=str(next_pg)
    snext_pg=snext_pg.strip()
    if i> 5:
        next_pg=10  # from the second set of pages onward there is an additional option
        if(i==6) or(i==11)or(i==16):  # reset the xpath index for each new set of pages
            k=2
            path='/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li['
            path=path+snext_pg+']/a'
            next_page_btn_list=driver.find_elements_by_xpath(path)
            next_page_btn=next_page_btn_list[0]
            next_page_btn.click()  # click "next" to show the next set of pages
            time.sleep(1)
    pg_index= k+2
    spg_index=str(pg_index)
    spg_index=spg_index.strip()     
    path= '/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li['
    path=path+spg_index+']/a'
    next_page_btn_list=driver.find_elements_by_xpath(path)
    next_page_btn=next_page_btn_list[0]
    next_page_btn.click()  #click specific pg no. 
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    html=soup.prettify()
    i=i+1
    k=k+1
    with open('bseann.txt', 'a', encoding='utf-8') as f:
        f.write(html)

There is no need to use Selenium here, since you can get the information from the API. This run returned 247 announcements:

import requests
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

url = 'https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

payload = {
'strCat': '-1',
'strPrevDate': '20190423',
'strScrip': '',
'strSearch': 'P',
'strToDate': '20190423',
'strType': 'C'}

jsonData = requests.get(url, headers=headers, params=payload).json()

df = json_normalize(jsonData['Table'])
df['ATTACHMENTNAME'] = '=HYPERLINK("https://www.bseindia.com/xml-data/corpfiling/AttachLive/' + df['ATTACHMENTNAME'] + '")'


df.to_csv('C:/filename.csv', index=False)

Output:

...

GYSCOAL ALLOYS LTD. - 533275 - Announcement under Regulation 30 (LODR)-Code of Conduct under SEBI (PIT) Regulations, 2015
https://www.bseindia.com/xml-data/corpfiling/AttachLive/82f18673-de98-4a88-bbea-7d8499f25009.pdf

INDIAN SUCROSE LTD. - 500319 - Certificate Under Regulation 40(9) Of Listing Regulation For The Half Year Ended 31.03.2019
https://www.bseindia.com/xml-data/corpfiling/AttachLive/2539d209-50f6-4e56-a123-8562067d896e.pdf

Dhanvarsha Finvest Ltd - 540268 - Reply To Clarification Sought From The Company
https://www.bseindia.com/xml-data/corpfiling/AttachLive/f8d80466-af58-4336-b251-a9232db597cf.pdf

Prabhat Telecoms (India) Ltd - 540027 - Signing Of Framework Supply Agreement With METRO Cash & Carry India Private Limited
https://www.bseindia.com/xml-data/corpfiling/AttachLive/acfb1f72-efd3-4515-a583-2616d2942e78.pdf

...
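The `strPrevDate` and `strToDate` values in the payload look like dates in YYYYMMDD form. Assuming that reading is right (it is inferred from the values, not from any API documentation), a small helper could build the payload for other date ranges. `build_payload` is a hypothetical name; all other keys and values are copied unchanged from the answer.

```python
from datetime import date

# URL taken from the answer above.
API_URL = "https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w"


def build_payload(from_date: date, to_date: date) -> dict:
    """Build the query parameters for one date range.

    Assumption: strPrevDate is the start date and strToDate the end date,
    both formatted as YYYYMMDD. The remaining values are copied as-is.
    """
    if from_date > to_date:
        raise ValueError("from_date must not be after to_date")
    return {
        "strCat": "-1",
        "strPrevDate": from_date.strftime("%Y%m%d"),
        "strScrip": "",
        "strSearch": "P",
        "strToDate": to_date.strftime("%Y%m%d"),
        "strType": "C",
    }


payload = build_payload(date(2019, 4, 23), date(2019, 4, 23))
print(payload["strPrevDate"], payload["strToDate"])  # 20190423 20190423
```

The resulting dictionary can be passed as `params=payload` to `requests.get(API_URL, headers=headers, params=payload)` exactly as in the answer.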

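A bit more information about your use case would have helped to answer your question. However, to extract the total number of pages, you can open the site, click the pagination item with the text "Last", and read the number of the page that is then active. You can use the following solution: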

Code block:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_argument("--disable-extensions")
# options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.bseindia.com/corporates/ann.html")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-last ng-scope']/a[@class='ng-binding' and text()='Last']"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-page ng-scope active']/a[@class='ng-binding']"))).get_attribute("innerHTML"))

Console output:
```
17
```
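Once the total is known (17 in the run above), one way to walk through the pages is to start from the first page and click "Next" a fixed number of times instead of recomputing list indexes. The sketch below is only that, a sketch: `collect_pages` is a hypothetical helper, and the `pagination-next` class in `NEXT_LOCATOR` is an assumption made by analogy with the `pagination-last` class used above, so check it in the browser's inspector first.

```python
import time

# Assumption: the "Next" item follows the same class naming as the
# 'pagination-last' item used above. Verify before relying on it.
NEXT_LOCATOR = ("xpath", "//li[contains(@class, 'pagination-next')]/a")


def collect_pages(driver, total_pages, next_locator=NEXT_LOCATOR, pause=1.0):
    """Return the HTML of each page, starting from the page currently shown.

    `driver` is a Selenium WebDriver already showing the first page.
    """
    pages = [driver.page_source]
    for _ in range(total_pages - 1):
        links = driver.find_elements(*next_locator)
        if not links:
            # No "Next" link was found, so stop early.
            break
        links[0].click()
        # Crude fixed wait, as in the question; an explicit
        # WebDriverWait on the active page number is more robust.
        time.sleep(pause)
        pages.append(driver.page_source)
    return pages


# Usage, after reloading the URL so the first page is shown again:
#   driver.get("https://www.bseindia.com/corporates/ann.html")
#   html_pages = collect_pages(driver, total_pages=17)
```

Each entry of the returned list can then be handed to `BeautifulSoup(html, 'html.parser')` as in the question.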

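What kind of sample output are you trying to collect? You can access all the article data through an API. I can provide the API as a solution, but I need to know what your output should be to be sure that is the right approach.

I have heard of the API for BSE data, but I lack the details and the proficiency to use it. Thanks. I don't want to close this question, since I may have more related questions on this.

I want to write these announcements to a file (probably CSV) where I can directly click the PDF link to view the announcements I want to see.

OK. I'll add it tomorrow morning. So, write to CSV. What type of data do you want in the columns? PDF link, title, date, etc.?

Based on your code, I am able to write to a CSV file. I want to write data such as the date, time, announcement, type of announcement (so that it can be segregated), and PDF link. I intend to run a filter before writing the data to the file; the filter terms can come from a text file. The aim is to filter out announcements containing common phrases such as "loss of certificate", "shareholding", etc., so that fewer companies need to be scanned. I am looking for a way to 1) get the missing data such as date, time, and type; 2) remove commas and special characters from the text, since they cause wrong separation in the CSV file; 3) the pdf links do not show as clickable links in the pdf

@Ravi, OK, I'll look at it again tomorrow morning. What do you mean by "the pdf links do not show as clickable links in the pdf"?

When I open the CSV file in Excel, the PDF link shows up in a separate cell/column, but it is not directly clickable (it is not shown as a hyperlink). It needs one or two extra returns (Enter) before it appears as a hyperlink that can be opened in a browser window with a click. If I type a new link in Excel, it shows as a hyperlink, unlike these imported ones. It may be an Excel-related issue. I would also like to post my modified code here, but I don't know where to do that. Thanks.

Thanks. I'll try it and get back.

@Ravi Let me know if you need further help.

Tried this code. NameError: name 'WebDriverWait' is not defined.

Did you add from selenium.webdriver.support.ui import WebDriverWait?

Now NameError: name 'EC' is not defined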
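The filtering and CSV concerns raised in the comments could be handled roughly as follows. This is a minimal sketch using only the standard library: `load_phrases`, `keep`, and `write_rows` are hypothetical helpers, the `HEADLINE` column name is an assumption (only `ATTACHMENTNAME` appears in the answer), and the sample phrase is made up for illustration. Note that `csv.writer` quotes any field containing a comma, so commas do not have to be stripped from the text.

```python
import csv
import io

# Base URL for attachments, taken from the answer above.
ATTACH_BASE = "https://www.bseindia.com/xml-data/corpfiling/AttachLive/"


def load_phrases(lines):
    """Normalise filter phrases: trim, lower-case, and skip empty lines."""
    return [ln.strip().lower() for ln in lines if ln.strip()]


def keep(headline, phrases):
    """Return True if the headline contains none of the unwanted phrases."""
    text = headline.lower()
    return not any(p in text for p in phrases)


def write_rows(rows, phrases, out):
    """Write (headline, attachment) pairs to `out`; return how many were kept."""
    writer = csv.writer(out)  # quotes fields containing commas or quotes
    writer.writerow(["HEADLINE", "ATTACHMENTNAME"])  # HEADLINE is an assumed name
    kept = 0
    for headline, attachment in rows:
        if keep(headline, phrases):
            # Same =HYPERLINK(...) formula as in the answer.
            link = '=HYPERLINK("%s%s")' % (ATTACH_BASE, attachment)
            writer.writerow([headline, link])
            kept += 1
    return kept


# Two rows from the output shown above; the phrase list is illustrative
# and could instead be read with load_phrases(open("filters.txt")).
rows = [
    ("GYSCOAL ALLOYS LTD. - 533275 - Announcement under Regulation 30 (LODR)"
     "-Code of Conduct under SEBI (PIT) Regulations, 2015",
     "82f18673-de98-4a88-bbea-7d8499f25009.pdf"),
    ("INDIAN SUCROSE LTD. - 500319 - Certificate Under Regulation 40(9) Of "
     "Listing Regulation For The Half Year Ended 31.03.2019",
     "2539d209-50f6-4e56-a123-8562067d896e.pdf"),
]
phrases = load_phrases(["Certificate Under Regulation\n", "\n"])

buffer = io.StringIO()
print(write_rows(rows, phrases, buffer))  # 1
```

When writing to a real file rather than a `StringIO`, open it with `open(path, "w", newline="", encoding="utf-8")` so the `csv` module controls the line endings.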