Python: looping through groups of elements (Python, Selenium, BeautifulSoup)


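I'm not sure what the problem is. I have a small script that uses Selenium and BeautifulSoup 4 to visit a specific website with a given input and parse its content. For each search term, I want to append elements to a list. This is the HTML: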

<table class="aClass">
       <tr class="1">
        <td>
         <a href="aLink">
          <span class="aClass">
           Text
          </span>
         </a>
        </td>
        <td>
        </td>
        <td>
        </td>
        <td>
        </td>
       </tr>
       <tr class="2">
        <td>
        </td>
        <td anAttribute="aValue">
         Text
        </td>
        <td>
        </td>
       </tr>
</table>

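So I still think I've got this all wrong, but I don't know what exactly I should be doing. I tried the following, but I keep getting only the first 25 hits. This covers only the "identifiers" shown above: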

    # walk every cell of every row and keep only spans wrapped directly in a link
    for tr in soup.find_all('tr'):
        for td in tr.find_all('td'):
            for span in td.find_all('span', {"class": "aClass"}):
                if span.parent.name == 'a':
                    print(span.text)
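The nested loop above can also be written as a single CSS selector. The following is a minimal sketch rather than the asker's actual code: it assumes the sample table from the question (with its placeholder class names), uses the standard-library "html.parser" backend, and the helper name link_span_texts is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; "aClass" and "aLink" are the question's placeholders.
SAMPLE_HTML = """
<table class="aClass">
  <tr class="1">
    <td><a href="aLink"><span class="aClass">Text</span></a></td>
    <td></td>
  </tr>
  <tr class="2">
    <td></td>
    <td anAttribute="aValue">Text</td>
  </tr>
</table>
"""


def link_span_texts(html, parser="html.parser"):
    """Return the text of every span.aClass that sits directly inside an <a> within a <td>."""
    soup = BeautifulSoup(html, parser)
    # "td a > span.aClass": the span's parent must be an <a> somewhere inside a table cell
    return [span.get_text(strip=True) for span in soup.select("td a > span.aClass")]


if __name__ == "__main__":
    print(link_span_texts(SAMPLE_HTML))
```

Passing the parser name as an argument makes it easy to rerun the same extraction with "lxml" or "html5lib" (if they are installed) and compare how many elements each one finds.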

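Can you share your complete code? Thanks.

Please see the complete code above.

Would adding time.sleep(5000) after submit() make any difference? Also, what happens if you switch the parser: soup = BeautifulSoup(pagehtml, "lxml"), soup = BeautifulSoup(pagehtml, "html.parser"), or soup = BeautifulSoup(pagehtml, "html5lib")?

The lxml parser is the only one that doesn't freeze. It gives the same result as the second loop suggested above: 1450 for 951 items. With the original code I still get only 25 items. In other words, neither time.sleep nor the parser makes any difference.

OK, my bad. It was a parser problem; I got impatient while trying the different parsers. Alex had already pointed this out. The problem is fixed.

Here is the complete code with some improvements (it gets the desired 319 rows in the data list):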

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


searches = ['Norway']
data = [['ID', 'MESSAGE']]

# Note: PhantomJS support was removed in Selenium 4, so this line needs an older Selenium release.
driver = webdriver.PhantomJS()
wait = WebDriverWait(driver, 10)
url = 'your URL here'
driver.get(url)

for search in searches:
    # select 1000 results per page
    # (find_element(By.ID, ...) works in both old and current Selenium releases)
    select = Select(driver.find_element(By.ID, "count"))
    select.select_by_visible_text("1000")

    # type the search query and submit the form
    search_box = driver.find_element(By.ID, "q")
    search_box.clear()
    search_box.send_keys(search)
    search_box.submit()

    # wait until the results page has loaded
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a.top")))

    # parse the search results with BeautifulSoup
    pagehtml = driver.page_source
    soup = BeautifulSoup(pagehtml, "html5lib")
    identifiers = [span.get_text(strip=True)
                   for span in soup.find_all('span', {"class": "glyphicon glyphicon-open-file"})]
    messages = [message.get_text(strip=True)
                for message in soup.find_all('td', {"colspan": "3"})]
    # pair each identifier with its message; zip stops at the shorter list
    data.extend(zip(identifiers, messages))

print(len(data))
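One thing worth checking in code like the above: zip() silently stops at the shorter of its inputs, so if the parser finds a different number of identifiers than messages (the comments mention mismatched counts), rows are dropped without any error. Below is a small standard-library sketch of two ways to make that visible. The helper names pair_strict and pair_padded and the sample values are made up for illustration; they are not part of the original answer.

```python
from itertools import zip_longest


def pair_strict(identifiers, messages):
    """Pair two scraped columns, refusing to silently drop unmatched items."""
    if len(identifiers) != len(messages):
        raise ValueError(
            "column length mismatch: %d identifiers vs %d messages"
            % (len(identifiers), len(messages))
        )
    return list(zip(identifiers, messages))


def pair_padded(identifiers, messages, fill=None):
    """Pair two columns, padding the shorter one so the gaps stay visible."""
    return list(zip_longest(identifiers, messages, fillvalue=fill))


if __name__ == "__main__":
    ids = ["A1", "A2", "A3"]      # made-up sample values
    msgs = ["first", "second"]    # one message is missing on purpose
    print(list(zip(ids, msgs)))   # plain zip stops at the shorter list
    print(pair_padded(ids, msgs))  # the unmatched identifier is kept, paired with None
```

The strict variant suits a scraper where every identifier should have exactly one message; the padded variant is handier while debugging, because it shows which side ran short.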