
Python web scraping with a search and a non-dynamic URI

python python-3.x web-scraping


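I'm a beginner in Python and web scraping. I'm used to building scrapers for sites with dynamic URLs, where the URI changes when I put specific parameters into it. Example: Wikipedia. If I search for Stack Overflow, the URI changes to include that search term.

Right now my challenge is to build a web scraper that collects data from the page whose URL appears in the code below.

The field Texto/Termos a serem pesquisados is the search field, but when I submit a search the URL stays the same, so I can't get the right HTML for my research.

I normally use BeautifulSoup and requests for scraping, but they are of no use here because the URL does not change after the search.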

import requests
from bs4 import BeautifulSoup

url = 'http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp'
html = requests.get(url)
bsObj = BeautifulSoup(html.content, 'html.parser')

print(bsObj)
# From this point on I can't get any further

Normally I would do something like this:

url = 'https://en.wikipedia.org/wiki/'
term = input('Input your search: ')  # renamed so the built-in input() is not shadowed
search = url + term

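Then I do all the BeautifulSoup work and use findAll to pull the data out of the HTML.

I have also tried Selenium, but because of the webdriver requirement I'm looking for something different. With the code below I get some odd results, and I still can't scrape the HTML cleanly.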

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup

# Access the page and enter the search term in the field

driver = webdriver.Chrome()
driver.get('http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp')
# The search form lives inside the frame named 'main2'
driver.switch_to.frame('main2')
# find_element(By.ID, ...) replaces find_element_by_id, which Selenium 4 removed
busca = driver.find_element(By.ID, 'txtTermo')
busca.send_keys('GESTAO DE PESSOAS')
#data_inicio = driver.find_element(By.ID, 'dt_publ_ini')
#data_inicio.send_keys('01/01/2018')
#data_fim = driver.find_element(By.ID, 'dt_publ_fim')
#data_fim.send_keys('20/12/2018')
botao = driver.find_element(By.ID, 'ok')
botao.click()

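So, with all of that in mind:

Is there a way to scrape data from these static URLs? Can I enter a search into the field from code? Why can't I scrape the correct source code?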

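The problem is that your initial search page uses frames for the search form and the results, which makes it harder for BeautifulSoup to work with. I was able to get the search results by using a slightly different URL together with MechanicalSoup instead, as in the MechanicalSoup session below.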

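Note that the URL I use here is that of the frame containing the search form, not the page you gave, which embeds that frame. This removes one layer of indirection.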
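If you need to find a frame's URL yourself, one option is to list the `src` attributes of the `frame` and `iframe` tags in the outer page and resolve them against the page URL. The following is a minimal sketch under stated assumptions: `FrameFinder`, `frame_urls`, and `SAMPLE_HTML` are names made up for illustration, and the sample frameset is hypothetical, not the real markup of the Comprasnet page (only the frame name `main2` comes from the question). It uses the standard library so it runs offline; BeautifulSoup's `find_all` could do the same job on a page fetched with requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameFinder(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(html, base_url):
    """Return absolute URLs for all frames found in html."""
    finder = FrameFinder()
    finder.feed(html)
    return [urljoin(base_url, src) for src in finder.sources]


# Hypothetical frameset for illustration only (assumption, not the real page).
SAMPLE_HTML = """
<html>
  <frameset rows="20%,80%">
    <frame name="menu" src="/menu.asp">
    <frame name="main2" src="/ConsultaLicitacoes/ConsLicitacao_texto.asp">
  </frameset>
</html>
"""

urls = frame_urls(SAMPLE_HTML, "http://comprasnet.gov.br/acesso.asp")
for u in urls:
    print(u)
```

Each URL printed this way can then be opened directly, which is what skipping the outer page amounts to.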

MechanicalSoup is built on top of BeautifulSoup and provides some tools for interacting with websites, similar to the old mechanize library.

>>> from mechanicalsoup import StatefulBrowser
>>> sb = StatefulBrowser()
>>> sb.open('http://comprasnet.gov.br/ConsultaLicitacoes/ConsLicitacao_texto.asp')
<Response [200]>
>>> sb.select_form()  # select the search form
<mechanicalsoup.form.Form object at 0x7f2c10b1bc18>
>>> sb['txtTermo'] = 'search text'  # input the text to search for
>>> sb.submit_selected()  # submit the form
<Response [200]>
>>> page = sb.get_current_page()  # get the returned page in BeautifulSoup form
>>> type(page)
<class 'bs4.BeautifulSoup'>
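To see roughly what a form submission involves without MechanicalSoup, the sketch below reads a form's method, action, and named inputs, then fills in the search term. It rests on assumptions: `FormFields`, `build_submission`, and `SAMPLE_FORM` are illustrative names, and the form markup (the `resultados.asp` action and the `pagina` hidden field) is hypothetical; only the `txtTermo` field name and the `ok` button id come from the question. It handles simple text and hidden inputs only, not checkboxes, selects, or textareas.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Input types that a browser does not send unless they are the clicked control.
SKIPPED_TYPES = {"submit", "button", "reset", "image"}


class FormFields(HTMLParser):
    """Record the first form's method, action, and named input values."""

    def __init__(self):
        super().__init__()
        self.method = "get"  # HTML default when no method attribute is given
        self.action = ""
        self.fields = {}
        self._inside = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._inside and not self._done:
            self._inside = True
            self.method = (attrs.get("method") or "get").lower()
            self.action = attrs.get("action") or ""
        elif tag == "input" and self._inside and attrs.get("name"):
            if (attrs.get("type") or "text").lower() not in SKIPPED_TYPES:
                self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._inside:
            self._inside = False
            self._done = True


def build_submission(html, page_url, **overrides):
    """Return (method, absolute target URL, payload dict) for the first form."""
    parser = FormFields()
    parser.feed(html)
    payload = dict(parser.fields)
    payload.update(overrides)
    return parser.method, urljoin(page_url, parser.action), payload


# Hypothetical form markup for illustration only (assumption, not the real page).
SAMPLE_FORM = """
<form method="post" action="resultados.asp">
  <input type="text" name="txtTermo" value="">
  <input type="hidden" name="pagina" value="1">
  <input type="button" id="ok" name="ok" value="OK">
</form>
"""

method, target, payload = build_submission(
    SAMPLE_FORM,
    "http://comprasnet.gov.br/ConsultaLicitacoes/ConsLicitacao_texto.asp",
    txtTermo="GESTAO DE PESSOAS",
)
print(method, target)
print(urlencode(payload))
```

With requests, the equivalent call would be `requests.post(target, data=payload)` for a POST form or `requests.get(target, params=payload)` for a GET form. This is roughly the bookkeeping that `select_form()` and `submit_selected()` take care of in the session above.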