Scraping a website with Python that cannot be accessed directly

Any help is appreciated in advance.

The deal is, I've been trying to scrape data from this website (), but I can't access it directly. Instead of the data I need, I get "Invalid Access". To reach the page, I have to go to (), hover over "Dealer Information", and then click "Dealer Search" in the dropdown menu. I'm looking for a solution in Python. Here is what I've tried; I'm just getting started with web scraping:

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:

    MAIN = "https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do"
    INITIAL = "https://www.mptax.mp.gov.in/mpvatweb/"

    # first request, just to obtain the session cookie
    page = session.get(INITIAL)
    jsession = page.cookies["JSESSIONID"]
    print(jsession)
    print(page.headers)

    # re-post to the landing page with the cookie attached by hand
    result = session.post(INITIAL, headers={"Cookie": "JSESSIONID=" + jsession + "; zoomType=0", "Referer": INITIAL})

    page1 = session.get(MAIN, headers={"Referer": INITIAL})
    soup = BeautifulSoup(page1.content, 'html.parser')

    data = soup.find_all("tr", class_="whitepapartd1")

    print(data)
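
For what it's worth, a requests.Session already persists cookies (including JSESSIONID) across calls, so hand-building the Cookie header as above shouldn't be necessary. A minimal sketch of the same flow relying on the session's own cookie jar (whether the server then serves the data without the browser-side menu navigation is exactly the open question here):

import requests
from bs4 import BeautifulSoup

INITIAL = "https://www.mptax.mp.gov.in/mpvatweb/"
MAIN = "https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do"

with requests.Session() as session:
    session.get(INITIAL)  # the session now holds JSESSIONID automatically
    page = session.get(MAIN, headers={"Referer": INITIAL})
    soup = BeautifulSoup(page.content, "html.parser")
    print(soup.find_all("tr", class_="whitepapartd1"))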

The deal is, I want to collect a company's data based on the company name.

Would you mind using a browser?

You can use a browser to reach the link at the XPath //*[@id="dropmenudiv"]/a[1].
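
A minimal sketch of reaching that link with Selenium (note that, as the comments below show, the dropdown entry only becomes clickable after hovering over its parent menu):

from selenium import webdriver

browser = webdriver.Chrome()  # assumes chromedriver is on your PATH
browser.get("https://www.mptax.mp.gov.in/mpvatweb/")
# the "Dealer Search" entry in the dropdown, by the XPath quoted above
browser.find_element_by_xpath('//*[@id="dropmenudiv"]/a[1]').click()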

If you haven't used chromedriver before, you may need to download it and place it in the directory mentioned. If you want headless browsing (so a browser window doesn't open every time), you can also use selenium + phantomjs.
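
For the headless variant, Chrome itself also accepts a --headless flag, so a minimal sketch without PhantomJS could be:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')  # run without opening a window
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get("https://www.mptax.mp.gov.in/mpvatweb/")
print(browser.title)
browser.quit()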


Thanks for showing me the way, @Arnav and @Arman. Here is the final code:

from selenium import webdriver  # to drive the website
from bs4 import BeautifulSoup  # to scrape the data
from selenium.webdriver.common.action_chains import ActionChains  # to perform hovering
from selenium.webdriver.common.keys import Keys  # to input values

PROXY = "10.3.100.207:8080"  # IP:PORT or HOST:PORT
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)

# ask for input
company_name = input("tell the company name")

# open the website
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get("https://www.mptax.mp.gov.in/mpvatweb/")

# hover over "Dealer Information" to reveal the dropdown menu
element_to_hover_over = browser.find_element_by_css_selector("#mainsection > form:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(3) > a:nth-child(1)")

hover = ActionChains(browser).move_to_element(element_to_hover_over)
hover.perform()

# click on "Dealer Search" in the dropdown menu
browser.find_element_by_css_selector("#dropmenudiv > a:nth-child(1)").click()

# we are now on the left-menu page

# click on the "search by name" radio button
browser.find_element_by_css_selector("#byName").click()

# input the company name
inputElement = browser.find_element_by_css_selector("#showNameField > td:nth-child(2) > input:nth-child(1)")
inputElement.send_keys(company_name)

# submit the form
inputElement.submit()

# now we are on the dealer-search results page

# scrape the data
soup = BeautifulSoup(browser.page_source, "lxml")

# get the table cells we need ("cells" avoids shadowing the built-in list)
cells = soup.find_all('td', class_="tdBlackBorder")

# check the length of 'cells' and decide what to print on that basis
if len(cells) != 0:
    # company name at index 9
    # TIN no. at index 10
    # registration status at index 11
    # circle name at index 15

    # store the values
    name = cells[9].get_text()
    tin = cells[10].get_text()
    status = cells[11].get_text()
    circle = cells[15].get_text()

    # make the dictionary
    Company_Details = {"TIN": tin, "Firm name": name, "Circle_Name": circle, "Registration_Status": status}

    print(Company_Details)
else:
    Company_Details = {"VAT RC No": "Not found in database"}

    print(Company_Details)

# close Chrome (quit() alone would end the session; close()/stop_client() are belt and braces)
browser.stop_client()
browser.close()
browser.quit()

You can use Selenium for the first landing page and then use bs for the scraping.

Thanks, but can the same also be done with the requests package? I've been trying with it but without success. Also thanks, I tried that too, but I kept getting an element-not-found error; I even tried locating the element with a CSS selector, with the same error. Thanks in advance.

Did you try hovering over the //*[@id="mainsection"]/form/table[1]/tbody/tr[5]/td[3]/a element, then letting the webdriver wait until it finds the //*[@id="dropmenudiv"]/a[1] element, and then clicking it? I've edited the main answer to help you with the hovering.

Yes, that's exactly what I did, and it's working; the working code is the final version posted above.
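
For reference, the hover-then-wait pattern described in that comment can be written with Selenium's explicit waits (a sketch using the two XPaths quoted above):

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get("https://www.mptax.mp.gov.in/mpvatweb/")

# hover over the parent menu link so the dropdown renders
menu = browser.find_element_by_xpath('//*[@id="mainsection"]/form/table[1]/tbody/tr[5]/td[3]/a')
ActionChains(browser).move_to_element(menu).perform()

# wait up to 10 seconds for the "Dealer Search" entry, then click it
WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="dropmenudiv"]/a[1]'))
).click()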