
Python: filtering a div with BeautifulSoup returns an empty result


Running the algorithm below, I am trying to filter a div:

from bs4 import BeautifulSoup

p = []
for link in soup.select('div > a[href*="/tarefa"]'):
    ref = link.get('href')
    rt = 'https://brainly.com.br' + str(ref)
    p.append(rt)
print(p)
The div looks like this:

<div class="sg-content-box__content"><a href="/tarefa/2254726"> 
How should I check it? Sometimes the page ends up changing, and the href holds a lot of data, so I need something like 'div > a[href*="/tarefa"]' so that I can search by the keyword "tarefa" instead of creating a variable that already contains the content.
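As a sketch (not from the original thread): BeautifulSoup's find_all also accepts a compiled regular expression for an attribute value, which gives exactly this kind of keyword search without hard-coding the full href. The HTML below is a made-up stand-in for the real page:

```python
import re
from bs4 import BeautifulSoup

html = '''
<div class="sg-content-box__content"><a href="/tarefa/2254726"></a></div>
<div class="sg-content-box"><a href="/tarefa/21670613"></a></div>
'''

soup = BeautifulSoup(html, 'html.parser')
# find_all accepts a compiled regex for the href attribute, so any
# anchor whose href contains the keyword "tarefa" is matched
links = ['https://brainly.com.br' + a['href']
         for a in soup.find_all('a', href=re.compile('tarefa'))]
print(links)
```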

Full algorithm:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


browser =webdriver.Firefox(executable_path=r'C:/path/geckodriver.exe')
browser.get('https://brainly.com.br/app/ask?entry=hero&q=jhyhv+vjh')

html = browser.execute_script("return document.documentElement.outerHTML")
p=[]
soup=BeautifulSoup(html,'html.parser')
for link in soup.select('div > a[href*=""]'):
    ref=link.get('href')
    rt = ('https://brainly.com.br'+str(ref))
    p.append(rt)
print(p)

This is probably because the browser takes more time to load the data, so you sometimes get an empty result.

Import WebDriverWait and use EC.visibility_of_all_elements_located() to wait until the matched elements are visible.

Output:

['https://brainly.com.br/tarefa/2254726', 'https://brainly.com.br/tarefa/21670613', 'https://brainly.com.br/tarefa/10188641', 'https://brainly.com.br/tarefa/22664332', 'https://brainly.com.br/tarefa/24152913', 'https://brainly.com.br/tarefa/11344228', 'https://brainly.com.br/tarefa/10888823', 'https://brainly.com.br/tarefa/23525186', 'https://brainly.com.br/tarefa/16838028', 'https://brainly.com.br/tarefa/24494056']

Please check my answer and let me know whether it works.
Your code looks fine. The data may be rendered by JavaScript, which is why you get an empty result. If this URL is public, can you share it?
Eu que estou usando Selenium pra isso, vou pôr o código completo. (I am using Selenium for this; I will post the complete code.)
You can use Selenium, or check whether there is an API for this purpose.
I updated my question, and only updated the filtering code that uses BeautifulSoup.
That is not suitable, because I have to filter other /tarefa links and the div is not specified, so I did not specify it; I am using 'div > a[href*="/tarefa"]'.
Post sample HTML if possible; there are only 2 divs. Is there any way to search by keyword? Just share your desired output.
I need the href content. Is my mistake choosing the right div, or a wrong wait time?
I guess your search key is correct, but Selenium needs some time to load the elements, hence the empty result.
You can test the selector logic on static HTML:

from bs4 import BeautifulSoup

test_html = '''
         <div class="sg-content-box__content"><a href="/tarefa/2254726"> 
         <div class="sg-content-box"><a href="/tarefa/21670613">
         '''

soup = BeautifulSoup(test_html, 'lxml')
p=[]
for link in soup.find_all('div'):
    ref=link.a.get('href')
    rt = ('https://brainly.com.br'+str(ref))
    p.append(rt)
print(p)
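One caveat worth noting (my addition, not from the answer above): link.a is None for a div that has no nested anchor, and .get() would then raise AttributeError. A guarded sketch, using hypothetical test markup:

```python
from bs4 import BeautifulSoup

test_html = '''
<div class="sg-content-box__content"><a href="/tarefa/2254726"></a></div>
<div class="sg-box-without-link"></div>
'''

soup = BeautifulSoup(test_html, 'html.parser')
p = []
for div in soup.find_all('div'):
    a = div.find('a', href=True)  # None when the div has no <a href=...>
    if a is not None:
        p.append('https://brainly.com.br' + a['href'])
print(p)
```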
Full code with the wait added:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


browser =webdriver.Firefox(executable_path=r'C:/path/geckodriver.exe')
browser.get('https://brainly.com.br/app/ask?entry=hero&q=jhyhv+vjh')
WebDriverWait(browser,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,'a[href*="/tarefa"]')))
html=browser.page_source
#html = browser.execute_script("return document.documentElement.outerHTML")
p=[]
soup=BeautifulSoup(html,'html.parser')
for link in soup.select('div.sg-actions-list__hole > a[href*="/tarefa"]'):
    ref=link.get('href')
    rt = ('https://brainly.com.br'+str(ref))
    p.append(rt)
print(p)
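As a side note (my suggestion, not from the thread): the standard library's urllib.parse.urljoin is a slightly more robust way to build the absolute URLs than string concatenation, since it handles leading slashes and already-absolute hrefs correctly:

```python
from urllib.parse import urljoin

base = 'https://brainly.com.br'
hrefs = ['/tarefa/2254726', '/tarefa/21670613']
# urljoin resolves each relative href against the base URL
urls = [urljoin(base, h) for h in hrefs]
print(urls)
```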