Python Webscraping text returns an empty set

When using Beautiful Soup FindAll, the code does not scrape the text because it returns an empty set. There are other problems with the code after this point, but at this stage I am trying to solve this first issue. I am quite new to this, so I know the code structure may be less than ideal; I come from a VBA background.


As mentioned above, you're not actually feeding the html source into BeautifulSoup. So the first thing is to change:
soup = BeautifulSoup(driver.current_url, features='lxml')
to
soup = BeautifulSoup(driver.page_source, features='lxml')
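To see why the empty set shows up, here is a minimal sketch of the difference (assuming driver is the Selenium instance from the script below and has already loaded a results page):

#Parsing the url string: BeautifulSoup only sees that short piece of text,
#so there are no div tags to find and findAll returns an empty list.
empty = BeautifulSoup(driver.current_url, features='lxml').findAll('div', {'class': 'list-desc-inner'})

#Parsing the rendered html: the real tags are available to search.
soup = BeautifulSoup(driver.page_source, features='lxml')
Details = soup.findAll('div', {'class': 'list-desc-inner'})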


Second issue: some elements don't have an <a> tag with class=detail, so you won't be able to get the href from a NoneType. I added a try/except so those are skipped when it happens (though I'm not sure that will give you the results you want). You could also drop the class and just use
Details_Page = each_Contract.find('a').get('href')
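A sketch of that skip pattern, using the Details list of list-desc-inner divs collected from the results page (as in the full script below):

for each_Contract in Details:
    try:
        #Some result rows have no <a class="detail"> link; find() then returns
        #None and .get('href') raises an AttributeError, so the row is skipped.
        Details_Page = each_Contract.find('a', {'class': 'detail'}).get('href')
    except AttributeError:
        continue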

Next, that href is just the extension of the url; you need to prepend the root, so:
driver.get('https://www.tenders.gov.au' + Details_Page)
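The same join can also be done with the standard library, which is a little safer if the href format ever changes (a small sketch):

from urllib.parse import urljoin

driver.get(urljoin('https://www.tenders.gov.au', Details_Page))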

I also don't see the class=Contact-Heading you're referring to anywhere on the page.

You also reference 'class': 'list-desc-inner' at one point and 'class': 'list_desc_inner' at another. Again, I don't see a class=list_desc_inner on the page.

Next, to append a list to a list, you need
Awarded.append(Combined)
, not
Awarded.append[Combined]
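A quick illustration of why the square-bracket form fails:

Awarded = []
Combined = [['contract info'], ['summary info']]
Awarded.append(Combined)   #works: Awarded now holds one [Contract_Info, Sub_Info] pair
Awarded.append[Combined]   #TypeError: 'builtin_function_or_method' object is not subscriptable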

I also added
.strip()
in there to clean up some of the whitespace in the text.

Anyway, there is a lot to fix and clean up, and I don't know what your expected output should be, but hopefully this gets you started.

Also, as mentioned in the comments, you could just hit the download button and get the results directly, but perhaps you're doing it this way for the practice.

import sys
import requests
from requests import get
from selenium import webdriver
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
#import chromedriver_binary  # Adds chromedriver binary to path

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
#pass the configured Chrome options (headless, incognito) to the driver
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe", options=options)

#click the search button on Austenders to return all Awarded Contracts
import time
#define the starting point: Austenders Awarded Contracts search page
driver.get('https://www.tenders.gov.au/cn/search')
#Find the Search Button and return all search results
Search_Results = driver.find_element_by_name("SearchButton")
if 'inactive' in Search_Results.get_attribute('name'):
    print("Search Button not found")
    sys.exit()  #a bare 'exit;' statement does nothing; actually stop if the button is missing
print('Search Button found')
Search_Results.click()    

#Pause code to prevent blocking by website
time.sleep(1)
i = 0
Awarded = []

#Move to the next search page by finding the Next button at the bottom of the page
#This code will need to be refined as the last search will be skipped currently.
while True:
    Next_Page = driver.find_element_by_class_name('next')
    if 'inactive' in Next_Page.get_attribute('class'):
        print("End of Search Results")
        break  #stop looping so the driver still gets closed below (a bare 'exit;' does nothing)
    i = i + 1
    time.sleep(2)

    #Loop through all the Detail links on the current Search Results Page
    print("Checking search results page " + str(i))
    print(driver.current_url)
    soup = BeautifulSoup(driver.page_source, features='lxml')
    #Find all Contract detail links in the current search results page
    Details = soup.findAll('div', {'class': 'list-desc-inner'})

    for each_Contract in Details:
        #Loop through each Contract details link and scrape all the detailed 
        #Contract information page
        try:
            Details_Page = each_Contract.find('a', {'class': 'detail'}).get('href')       
            driver.get('https://www.tenders.gov.au' + Details_Page)
            #Scrape all the data in the Awarded Contract page
            #r = requests.get(driver.current_url)
            soup = BeautifulSoup(driver.page_source, features='lxml')

            #find a list of all the Contract Info (contained in the 'Contact-Heading'
            #class of the span element)
            Contract = soup.find_all('span', {'class': 'Contact-Heading'})
            Contract_Info = [span.text.strip() for span in Contract]

            #find a list of all the Summary Contract info which is in the text of\
            #the 'list_desc_inner' class
            Sub = soup.find_all('div', {'class': 'list-desc-inner'})
            Sub_Info = [div.text.strip() for div in Sub]

            #Combine the lists into a unified list and append to the Awarded table
            Combined = [Contract_Info, Sub_Info]
            Awarded.append(Combined)

            #Go back to the Search Results page (from the Detailed Contract page)
            driver.back()
        except:
            continue

    #Go to the next Search Page by clicking on the Next button at the bottom of the page
    Next_Page.click()
    #

    time.sleep(3)    
driver.close()
#a plain list has no .Shape attribute; report how many contract pairs were collected
print(len(Awarded))
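If the goal is a table at the end, one possible post-processing step is sketched below (assuming each entry in Awarded is a [Contract_Info, Sub_Info] pair of string lists as built above; the output file name is arbitrary and writing .xlsx needs openpyxl installed):

rows = [{'contract_info': ' | '.join(contract), 'summary': ' | '.join(sub)}
        for contract, sub in Awarded]
df = pd.DataFrame(rows)
print(df.shape)
df.to_excel('awarded_contracts.xlsx', index=False)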

Which line of code is not returning what you expect? Details_Page = each_Contract.find('a', {'class': 'detail'}).get('href'). I am trying to return a collection of all the Detail links on the page. There are multiple issues in the code and logic. The main issue is that you are not passing the page source to Beautiful Soup; driver.current_url only returns the url text. You should use soup = BeautifulSoup(driver.page_source, features='lxml'). It may also be worth removing the bold text and simply asking some variation of "why does my findall return an empty set?". They provide all the results in .xlsx format, why not just use that?
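If you do take the download route instead of scraping, the exported spreadsheet can then be loaded directly (a sketch; the file name below is hypothetical and pandas needs openpyxl to read .xlsx files):

import pandas as pd

awarded_df = pd.read_excel('austenders_export.xlsx')  #hypothetical local file name
print(awarded_df.head())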