Getting inner HTML: Selenium, BeautifulSoup, Python


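This is a complete edit of my question. Judging by the answers, I asked it poorly, so I will try to be clearer.

I have a site to scrape. The code works without problems on my laptop. Since moving it to PythonAnywhere, I can no longer get the information I want.

The code that works on my system is: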

from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import csv
import time
import re

#68 lines of code for another section of the site above this working well on my system and on pythonanywhere.

pageSource = driver.page_source
bsObj = BeautifulSoup(pageSource)

try:
    parcel_number = bsObj.find(id="mParcelnumbersitusaddress_mParcelNumber")
    s_parcel_number =parcel_number.get_text()                         
except AttributeError as e:
    s_parcel_number = "Parcel Number not found"

# same kind of code (all working) that gets 10 more pieces of data

# Tax Year
try:
    pause = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "TaxesBalancePaymentCalculator")))
    taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]
except IndexError as e:
    s_taxes_owed_2015_yr = "No taxes due"

This code runs fine on my laptop with Firefox. On PythonAnywhere, if I print the page source of the page being scraped, this is what appears where my table should be:

<table border="0" cellpadding="5" cellspacing="0" class="WithBorder" width="100%">
<tbody><tr>
<td id="TaxesBalancePaymentCalculator"><!--DONT_PRINT_START-->
<span class="InputFieldTitle" id="mTabGroup_Taxes_mTaxChargesBalancePaymentInjected_mReportProcessingNote">Please wait while your current taxes are calculated.</span><img src="images/progress.gif"/> <!--DONT_PRINT_FINISH--></td>
</tr> <!--DONT_PRINT_START-->
<script type="text/javascript">
                                function TaxesBalancePaymentCalculator_ScriptLoaded( pPageContent )
                                {
                                    element('TaxesBalancePaymentCalculator').innerHTML = pPageContent;
                                }
                                function results_ready()
                                {
                                    element('pay_button_area').style.display = 'block';
                                    element('pay_button_area2').style.display = 'block';
                                    element('pay_additional_things_area').style.display = 'block';
                                }
                                var no_taxes_calculator = '&amp;nbsp;&lt;' + 'span class="MessageTitle"&gt;The tax balance calculator is not available.&lt;' + '/span&gt;';
                                function no_taxes_calculator_available()
                                {
                                    element('TaxesBalancePaymentCalculator').innerHTML = no_taxes_calculator;
                                }
                                function invalid()
                                {
                                    element('TaxesBalancePaymentCalculator').innerHTML = no_taxes_calculator;
                                }
                                loadScript( 'injected/TaxesBalancePaymentCalculator.aspx?parcel_number=15-720-01-01-00-0-00-000' );
                                </script><script id="injected_taxesbalancepaymentcalculator_ScriptTag" type="text/javascript"></script>
<tr id="pay_button_area" style="DISPLAY: none">
<td id="pay_button_area2">
<table border="0" cellpadding="2" cellspacing="0">
<tbody><tr>

That section holds my data. The problem is that I cannot call findAll on a string, and I need certain rows from the table:

taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]

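I need help getting that element as an object rather than a string, so that I can pull my data out of it. I have tried too many things to list them all here. I really need some help.

Thanks in advance.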

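As @Steve pointed out in the comments, get_attribute returns a string, not an HTML element. Try replacing that line with one of Selenium's find_element methods. You can read more in the documentation.

Beyond that, you are using BeautifulSoup incorrectly. You need to create the bs4 object by passing the HTML as an argument, and then call findAll on that object: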

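# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(html_as_plain_text, "html.parser")
for element in soup.findAll(id="mGrid_RealDataGrid"):
    print(element)  # do your thing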

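From what I see in your code, you want to get the innerHTML of an element and feed it to BeautifulSoup for further parsing. First, you probably want outerHTML instead, so that the element itself is included in the resulting HTML. Most importantly, you need to initialize the soup object: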

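from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

# find_element(By.ID, ...) replaces find_element_by_id, which newer Selenium releases no longer provide
demo_div = driver.find_element(By.ID, 'TaxesBalancePaymentCalculator')
demo_html = demo_div.get_attribute('outerHTML')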

soup = BeautifulSoup(demo_html, "html.parser")  # < YOU ARE MISSING THIS PART
s_taxes_owed_2015_yr = soup.find_all(id="mGrid_RealDataGrid")[1].find_all('tr')[1].find_all('td')[0].get_text()
print(s_taxes_owed_2015_yr)
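
The chained indexing above raises IndexError as soon as one of the lookups comes back empty, which is what happens when the grid has not been injected yet. Below is a minimal sketch of a more defensive lookup. The helper name cell_text and the sample markup are hypothetical and only for illustration; the real page will look different.

```python
from bs4 import BeautifulSoup


def cell_text(html, table_id, table_index, row_index, col_index, default=None):
    """Return the text of one table cell, or default if any lookup fails."""
    soup = BeautifulSoup(html, "html.parser")
    try:
        # find_all returns a list, so a missing table, row or cell
        # surfaces as IndexError rather than AttributeError.
        table = soup.find_all(id=table_id)[table_index]
        cell = table.find_all("tr")[row_index].find_all("td")[col_index]
    except IndexError:
        return default
    return cell.get_text(strip=True)


# Hypothetical markup: two grids sharing one id, as the [1] index in the
# question suggests.
sample_html = """
<table id="mGrid_RealDataGrid"><tr><td>first grid</td></tr></table>
<table id="mGrid_RealDataGrid">
  <tr><th>Tax Year</th></tr>
  <tr><td> 2015 </td></tr>
</table>
"""

print(cell_text(sample_html, "mGrid_RealDataGrid", 1, 1, 0, "No taxes due"))
```

Passing the HTML string in, rather than a soup built earlier, makes it harder to query a stale snapshot of the page by accident.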

I think this may be a difference in page load speed. At the beginning of your code, you have

pageSource = driver.page_source
bsObj = BeautifulSoup(pageSource)

So you are creating the BeautifulSoup object from the page content as it is at that moment. Later, you do the following:

pause = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "TaxesBalancePaymentCalculator")))
taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]

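So you tell WebDriver to wait for an element to appear, and then you query the BeautifulSoup object you created earlier. That BeautifulSoup object still holds the page source from when it was built, not the new page source that contains the element you waited for.

After the wait completes, try re-creating bsObj from the new page source.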

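I don't recall any findAll method in Python. It's a bs4 method... Did you import bs4 in your code? What are you trying to do with bsObj?

Yes, it is a bs4 method, and I have imported bs4, a few hundred lines higher up. I am trying to get information from a table in the inner HTML. According to the documentation, get_attribute returns a string, hence the error.

@Raymond, I'm afraid the bs4 module works a bit differently... You should read a bit more about it.

That looks good, but I still get an index out of range error, because the table never loads in the PythonAnywhere Firefox browser.

@Raymond, that is a different issue. Let's avoid solving several problems in one thread. If you need help with it, please consider creating a separate question. Thanks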
