
Python: filtering a div with BeautifulSoup returns an empty result


Running the algorithm below, I am trying to filter a div:

from bs4 import BeautifulSoup

p = []
for link in soup.select('div > a[href*="/tarefa"]'):
    ref = link.get('href')
    rt = 'https://brainly.com.br' + str(ref)
    p.append(rt)
print(p)
The div looks like this:

<div class="sg-content-box__content"><a href="/tarefa/2254726"> 
How should I check it? Sometimes the page ends up changing, and the href holds a lot of data, so I need something like 'div > a[href*="/tarefa"]' so that I can search by the keyword "tarefa" instead of creating a variable that already contains the content.
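As a sketch (not from the original thread): BeautifulSoup's find_all also accepts a compiled regular expression for an attribute value, which gives exactly this kind of keyword search without hard-coding the full href. The HTML below is a made-up stand-in for the real page:

```python
import re
from bs4 import BeautifulSoup

html = '''
<div class="sg-content-box__content"><a href="/tarefa/2254726"></a></div>
<div class="sg-content-box"><a href="/tarefa/21670613"></a></div>
'''

soup = BeautifulSoup(html, 'html.parser')
# find_all accepts a compiled regex for the href attribute, so any
# anchor whose href contains the keyword "tarefa" is matched
links = ['https://brainly.com.br' + a['href']
         for a in soup.find_all('a', href=re.compile('tarefa'))]
print(links)
```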

Full algorithm:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


browser =webdriver.Firefox(executable_path=r'C:/path/geckodriver.exe')
browser.get('https://brainly.com.br/app/ask?entry=hero&q=jhyhv+vjh')

html = browser.execute_script("return document.documentElement.outerHTML")
p=[]
soup=BeautifulSoup(html,'html.parser')
for link in soup.select('div > a[href*=""]'):
    ref=link.get('href')
    rt = ('https://brainly.com.br'+str(ref))
    p.append(rt)
print(p)

This is probably because the browser takes more time to load the data, so you sometimes get an empty result.

Import WebDriverWait and use EC.visibility_of_all_elements_located() to wait until the matched elements are visible.

Output:

['https://brainly.com.br/tarefa/2254726', 'https://brainly.com.br/tarefa/21670613', 'https://brainly.com.br/tarefa/10188641', 'https://brainly.com.br/tarefa/22664332', 'https://brainly.com.br/tarefa/24152913', 'https://brainly.com.br/tarefa/11344228', 'https://brainly.com.br/tarefa/10888823', 'https://brainly.com.br/tarefa/23525186', 'https://brainly.com.br/tarefa/16838028', 'https://brainly.com.br/tarefa/24494056']

Please check my answer and let me know whether it works.
Your code looks fine. The data may be rendered by JavaScript, which is why you get an empty result. If this URL is public, can you share it?
Eu que estou usando Selenium pra isso, vou pôr o código completo. (I am using Selenium for this; I will post the complete code.)
You can use Selenium, or check whether there is an API for this purpose.
I updated my question, and only updated the filtering code that uses BeautifulSoup.
That is not suitable, because I have to filter other /tarefa links and the div is not specified, so I did not specify it; I am using 'div > a[href*="/tarefa"]'.
Post sample HTML if possible; there are only 2 divs. Is there any way to search by keyword? Just share your desired output.
I need the href content. Is my mistake choosing the right div, or a wrong wait time?
I guess your search key is correct, but Selenium needs some time to load the elements, hence the empty result.
You can test the selector logic on static HTML:

from bs4 import BeautifulSoup

test_html = '''
         <div class="sg-content-box__content"><a href="/tarefa/2254726"> 
         <div class="sg-content-box"><a href="/tarefa/21670613">
         '''

soup = BeautifulSoup(test_html, 'lxml')
p=[]
for link in soup.find_all('div'):
    ref=link.a.get('href')
    rt = ('https://brainly.com.br'+str(ref))
    p.append(rt)
print(p)
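One caveat worth noting (my addition, not from the answer above): link.a is None for a div that has no nested anchor, and .get() would then raise AttributeError. A guarded sketch, using hypothetical test markup:

```python
from bs4 import BeautifulSoup

test_html = '''
<div class="sg-content-box__content"><a href="/tarefa/2254726"></a></div>
<div class="sg-box-without-link"></div>
'''

soup = BeautifulSoup(test_html, 'html.parser')
p = []
for div in soup.find_all('div'):
    a = div.find('a', href=True)  # None when the div has no <a href=...>
    if a is not None:
        p.append('https://brainly.com.br' + a['href'])
print(p)
```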
Full code with the wait added:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


browser =webdriver.Firefox(executable_path=r'C:/path/geckodriver.exe')
browser.get('https://brainly.com.br/app/ask?entry=hero&q=jhyhv+vjh')
WebDriverWait(browser,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,'a[href*="/tarefa"]')))
html=browser.page_source
#html = browser.execute_script("return document.documentElement.outerHTML")
p=[]
soup=BeautifulSoup(html,'html.parser')
for link in soup.select('div.sg-actions-list__hole > a[href*="/tarefa"]'):
    ref=link.get('href')
    rt = ('https://brainly.com.br'+str(ref))
    p.append(rt)
print(p)
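As a side note (my suggestion, not from the thread): the standard library's urllib.parse.urljoin is a slightly more robust way to build the absolute URLs than string concatenation, since it handles leading slashes and already-absolute hrefs correctly:

```python
from urllib.parse import urljoin

base = 'https://brainly.com.br'
hrefs = ['/tarefa/2254726', '/tarefa/21670613']
# urljoin resolves each relative href against the base URL
urls = [urljoin(base, h) for h in hrefs]
print(urls)
```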