Python: Testing for consistency with Selenium scraping (python, datetime, selenium, selenium-webdriver)


This script scrapes data about upcoming sports games from a website into a dictionary. It runs in under 2.5 minutes.

from selenium.common.exceptions import TimeoutException
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import datetime
import time

upcoming = ['http://www.oddsportal.com/basketball/usa/wnba/']
nextgames = []

def rescrape(urls, cs):

    driver = webdriver.PhantomJS(executable_path=r'C:/phantomjs.exe') 
    driver.get('http://www.oddsportal.com/set-timezone/15/')
    # The above link sets the timezone. I believe problem lies here, explicit wait?    
    driver.implicitly_wait(3)

    for url in urls:        
        for i in range(2): 
            # Run the scrape twice within the function; it scrapes the same way both times
            wait = WebDriverWait(driver, 5)            
            driver.get(url)
            # this is to ensure the table with games has appeared            
            try:     
                wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#tournamentTable tr.odd")))
            except TimeoutException:
                continue
            # below is the script to get details from each game            
            for match in driver.find_element_by_css_selector("table#tournamentTable").find_elements_by_tag_name('tr')[3:]:
                try:
                    home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
                except:
                    continue

                date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text
                kickoff = match.find_element_by_class_name("table-time").text
                # following deals with exceptions to a recognized date format
                if "oday" in date:
                    date = datetime.date.today().strftime("%d %b %Y")
                    event = "Not specified"

                elif "omorrow" in date:
                    date = datetime.date.today() + datetime.timedelta(days=1)
                    date = date.strftime("%d %b %Y")                

                elif "esterday" in date:
                    date = datetime.date.today() + datetime.timedelta(days=-1)
                    date = date.strftime("%d %b %Y")                            
                elif " - " in date:
                    date, event = date.split(" - ", 1)                    


                nextgames.append({
                    "current time": time.ctime(),                
                    "home": home.strip(),
                    "away": away.strip(),
                    "date": date,
                    "time": kickoff.strip()})

                time.sleep(3)
                print(len(nextgames))

        print(len(nextgames))
    driver.close()
    df = pd.DataFrame(nextgames)
    df.to_csv(cs, encoding='utf-8')
    return df

for i in range(3):
    rescrape(upcoming, 'trial' + str(i) + '.csv')
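
The relative-date handling inside the loop can be pulled out into a small helper, which makes it possible to test separately from the browser. This is only a sketch: the function name normalize_date_label and the sample header strings are hypothetical, and the branches mirror the ones in the script above.

```python
import datetime


def normalize_date_label(label, today=None):
    """Return (date_string, event) for a date header scraped from the table.

    Hypothetical helper: mirrors the "oday" / "omorrow" / "esterday" / " - "
    branches of the script. `today` can be injected for testing.
    """
    today = today or datetime.date.today()
    event = "Not specified"
    # Substring markers and the day offset each one implies.
    offsets = (("oday", 0), ("omorrow", 1), ("esterday", -1))
    for marker, days in offsets:
        if marker in label:
            day = today + datetime.timedelta(days=days)
            return day.strftime("%d %b %Y"), event
    if " - " in label:
        # Header of the form "<date> - <event>"
        date, event = label.split(" - ", 1)
        return date, event
    return label, event
```

Inside the loop this would be called as date, event = normalize_date_label(date). Note that the helper still relies on the machine's local date for "Today", so it does not by itself fix a timezone mismatch with the site.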
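The problem with the script is that setting the timezone with
driver.get('http://www.oddsportal.com/set-timezone/15/')
does not always work. In roughly 20% of scrapes it reverts to the default timezone, GMT. The output shows the wrong date and time in the third round, after the first two rounds completed correctly. Note that both passes of the last range(2) loop are wrong, but only the second date is wrong, which means the timezone can change in either pass: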

pd.set_option('display.max_colwidth', 10)
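
One way to spot this kind of drift automatically, rather than by reading the output, is to group the scraped rows by game and flag any game that was recorded with more than one date/time. The sketch below assumes rows shaped like the dictionaries appended to nextgames (keys "home", "away", "date", "time"); the function name find_inconsistent_games is hypothetical.

```python
from collections import defaultdict


def find_inconsistent_games(rows):
    """Return {(home, away): [(date, time), ...]} for games seen with
    more than one distinct date/time across repeated scrapes."""
    seen = defaultdict(set)
    for row in rows:
        seen[(row["home"], row["away"])].add((row["date"], row["time"]))
    # Keep only games whose kickoff differs between scrapes.
    return {game: sorted(slots) for game, slots in seen.items() if len(slots) > 1}
```

Calling find_inconsistent_games(nextgames) after the runs would list the affected games. A caveat: if the same two teams genuinely meet twice in the upcoming list, they will be flagged as well, so the result is a hint rather than proof of a timezone reset.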


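So how can I make sure the timezone driver.get call works every time? At the moment I use an implicit wait, and I have also tried an explicit wait, to no avail.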

I noticed that the website creates a cookie for the user's timezone, which you can take advantage of by adding it yourself:

driver.add_cookie({'name': 'op_user_time_zone', 'value': '-4'})
That should do the trick.

If this does not work with your current code, check that the cookie is being added correctly:

driver.add_cookie({'name': 'op_user_time_zone', 'value': '-4'})
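
A minimal sketch of how the cookie could be set and checked before scraping follows. It assumes the driver must already be on a page of the target domain when add_cookie is called (WebDriver generally rejects cookies for a domain other than the current one), and that reading the cookie back with get_cookie is a sufficient check. The helper name set_timezone_cookie is hypothetical; the cookie name and the value '-4' are taken from the answer above.

```python
def set_timezone_cookie(driver, base_url, offset="-4"):
    """Set the site's timezone cookie and confirm the driver stored it.

    Hypothetical helper: `driver` is any Selenium WebDriver instance.
    Returns True if the cookie is present with the expected value.
    """
    # Load a page on the target domain first so the cookie is accepted.
    driver.get(base_url)
    driver.add_cookie({"name": "op_user_time_zone", "value": offset})
    # Read the cookie back; get_cookie returns None if it is missing.
    cookie = driver.get_cookie("op_user_time_zone")
    return cookie is not None and cookie.get("value") == offset
```

In the script this would replace the set-timezone request, for example set_timezone_cookie(driver, 'http://www.oddsportal.com/') before the for url in urls loop. The pages then need to be loaded after the cookie is set so that they are rendered with it.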