
Python: How to scrape information from a directory with Selenium


Scraping contact information from a directory site

I am scraping contact information from a directory site, and I need to use Selenium. The task takes three steps:
1. Get the company URLs from the first page of the site.
2. Get all the company URLs from the next page / all the remaining pages.
3. Scrape all the contact information, such as company name, website, email, etc., from each company page.
My code is below, but I am facing two problems.

# -*- coding: utf-8 -*-
from time import sleep
from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException
import pandas as pd

results = list()
driver = webdriver.Chrome(r'D:\chromedriver_win32\chromedriver.exe')
MAX_PAGE_NUM = 2
for i in range(1, MAX_PAGE_NUM):
    page_num = str(i)
    url = "http://www.arabianbusinesscommunity.com/category/Industrial-Automation-Process-Control/" + page_num
    driver.get(url)
    sleep(5)
    sel = Selector(text=driver.page_source)
    companies = sel.xpath('//*[@id="categorypagehtml"]/div[1]/div[7]/ul/li/b//@href').extract()
    for i in range(0, len(companies)):
        print(companies[i])
        results.append(companies[i])
        print('---')
        for result in results:
            url1 = "http://www.arabianbusinesscommunity.com" + result
            print(url1)
            driver.get(url1)
            sleep(5)

            sel = Selector(text=driver.page_source)

            name = sel.css('h2::text').extract_first()
            country = sel.xpath('//*[@id="companypagehtml"]/div[1]/div[2]/ul[1]/li[1]/span[4]/text()').extract_first()
            if country:
                country = country.strip()
            web = sel.xpath('//*[@id="companypagehtml"]/div[1]/div[2]/ul[1]/li[4]/a/@href').extract_first()
            email = sel.xpath('//a[contains(@href, "mailto:")]/@href').extract_first()

records = []
records.append((web, email, country, name))
df = pd.DataFrame(records, columns=['web', 'email', 'country', 'name'])
I wrote the code as above, but I have two problems:
1. I only get the last company's information.
2. On every iteration of the loop, the script revisits all the previously visited URLs.


Can anyone help fix this?
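Both symptoms trace back to the loop structure: records is filled only once, after all the loops have finished, so it holds just the values from the last company scraped, and the inner "for result in results" loop re-walks the entire (growing) results list on every pass of the outer loop. A minimal sketch of the restructured flow, reusing the XPaths and driver path from the question (untested against the live site): collect the links first, then visit each one exactly once and append a record inside that loop.

from time import sleep
from selenium import webdriver
from scrapy.selector import Selector
import pandas as pd

driver = webdriver.Chrome(r'D:\chromedriver_win32\chromedriver.exe')  # path as in the question
base = "http://www.arabianbusinesscommunity.com"
MAX_PAGE_NUM = 2

# Phase 1: collect every company link exactly once.
links = []
for page_num in range(1, MAX_PAGE_NUM + 1):
    driver.get(base + "/category/Industrial-Automation-Process-Control/" + str(page_num))
    sleep(5)
    sel = Selector(text=driver.page_source)
    links.extend(sel.xpath('//*[@id="categorypagehtml"]/div[1]/div[7]/ul/li/b//@href').extract())

# Phase 2: visit each link once and append one record per company.
records = []
for link in links:
    driver.get(base + link)
    sleep(5)
    sel = Selector(text=driver.page_source)
    name = sel.css('h2::text').extract_first()
    country = sel.xpath('//*[@id="companypagehtml"]/div[1]/div[2]/ul[1]/li[1]/span[4]/text()').extract_first()
    web = sel.xpath('//*[@id="companypagehtml"]/div[1]/div[2]/ul[1]/li[4]/a/@href').extract_first()
    email = sel.xpath('//a[contains(@href, "mailto:")]/@href').extract_first()
    records.append((web, email, country.strip() if country else None, name))

df = pd.DataFrame(records, columns=['web', 'email', 'country', 'name'])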

Here is code that gets all the company details from all pages:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome()
baseUrl = "http://www.arabianbusinesscommunity.com/category/Industrial-Automation-Process-Control"
driver.get(baseUrl)

wait = WebDriverWait(driver, 5)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".search-result-list li")))

# Get the last page number from the "skip to last" pager link
lastPageHref = driver.find_element(By.CSS_SELECTOR, ".PagedList-skipToLast a").get_attribute("href")
lastPageNum = int(lastPageHref.split("/")[-1])

# Get all company URLs on the first page and save them in the companyUrls list
js = 'return [...document.querySelectorAll(".search-result-list li b a")].map(e=>e.href)'
companyUrls = driver.execute_script(js)

# Iterate through the remaining pages (including the last one) and collect the company URLs
for i in range(2, lastPageNum + 1):
    driver.get(baseUrl + "/" + str(i))
    companyUrls.extend(driver.execute_script(js))

# Open each company page, extract its details, and collect one record per company
companies = []
for url in companyUrls:
    driver.get(url)
    company = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#companypagehtml")))
    name = company.find_element(By.CSS_SELECTOR, "h2").text
    email = driver.execute_script('var e = document.querySelector(".email"); if (e!=null) { return e.textContent;} return "";')
    website = driver.execute_script('var e = document.querySelector(".website"); if (e!=null) { return e.textContent;} return "";')
    phone = driver.execute_script('var e = document.querySelector(".phone"); if (e!=null) { return e.textContent;} return "";')
    fax = driver.execute_script('var e = document.querySelector(".fax"); if (e!=null) { return e.textContent;} return "";')
    country = company.find_element(By.XPATH, ".//li[@class='location']/span[last()]").text.replace(",", "").strip()
    address = ''.join(e.text.strip() for e in company.find_elements(By.XPATH, ".//li[@class='location']/span[position() != last()]"))
    companies.append({'name': name, 'email': email, 'website': website,
                      'phone': phone, 'fax': fax, 'country': country, 'address': address})
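To get the results into the DataFrame shape the question was aiming for, a short follow-up (a sketch assuming the companies list of dicts built above, and a hypothetical output filename):

import pandas as pd

# Turn the collected dicts into a DataFrame and persist them to CSV.
df = pd.DataFrame(companies)
df.to_csv("companies.csv", index=False)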

Thanks for your help, but it doesn't work; lastPageHref raises an invalid-syntax error. Also, would you mind telling me where my own code went wrong? Running it step by step causes no problems, but running the whole thing does: File "", line 14: lastPageHref = driver.find_element((By.CSS_SELECTOR, ".PagedList-skipToLast a")).get_attribute("href") ^ SyntaxError: invalid syntax

Yes, there was a syntax error. I have rewritten the code; try it again.

I suggest you read the site's terms and conditions. Scraping the content of the page you are trying to scrape may be illegal/prohibited.
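As a quick way to check what the site itself permits before scraping, Python's standard library includes urllib.robotparser; a minimal sketch (the robots.txt URL below is assumed from the question's base URL, and robots.txt reflects crawl policy, not the full terms and conditions):

from urllib import robotparser

# Fetch and parse the site's robots.txt to see whether crawling this path is allowed.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.arabianbusinesscommunity.com/robots.txt")
rp.read()

target = "http://www.arabianbusinesscommunity.com/category/Industrial-Automation-Process-Control"
print(rp.can_fetch("*", target))  # True if the generic user agent may fetch this path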