Python: scraper captures most of the data but misses a few fields

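I have written a script in Python with Selenium to fetch the complete flight schedule from a web page. When I run it, it mostly works, except that a few fields are not parsed. I inspected the elements that hold the data, and the ones that get scraped look no different from the ones that are missing. How can I get the full content? Thanks in advance.

Here is the script I am trying: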

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")
wait = WebDriverWait(driver, 10)

item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.yvr-flights__table")))
list_of_data = [[item.text for item in data.find_elements_by_css_selector('td')]
                    for data in item.find_elements_by_css_selector('tr')]
for tab_data in list_of_data:
    print(tab_data)

driver.quit()
Here is a partial screenshot of the data [one missing field and one scraped field]:

Here are the td elements of one block:

<tr class="yvr-flights__row  yvr-flights__row--departed " id="226792377">
            <td>
                <time class="yvr-flights__label yvr-flights__scheduled-label yvr-flights__scheduled-label--departed notranslate" datetime="2017-08-24T06:20:00-07:00">
                    06:20
                </time>
            </td>
            <td class="yvr-flights__table-cell--revised notranslate">
                        <time class="yvr-flights__label yvr-flights__revised-label yvr-flights__revised-label--departed" datetime="2017-08-24T06:20:00-07:00">
                            06:19
                        </time>
            </td>
            <td class="yvr-table__cell yvr-flights__flightNumber notranslate">AC560</td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--fade-out yvr-table__cell--nowrap notranslate">Air Canada</td>
            <td class="yvr-table__cell yvr-table__cell--fade-out yvr-table__cell--nowrap notranslate">San Francisco</td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap notranslate">
Main                
            </td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap notranslate">E87</td>

            <td class="yvr-flights__table-cell--status yvr-table__cell--nowrap">
                    <span class="yvr-flights__status yvr-flights__status--departed">Departed</span>
            </td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap">
            </td>
            <td class="visible-until--md">
                <button class="yvr-flights__toggle-flight">Toggle flight</button>
            </td>
        </tr>

You should open this URL and take all the details from it:

http://www.yvr.ca/en/_api/Flights?%24filter=FlightScheduledTime%20gt%20DateTime%272017-08-24T00%3A00%3A00%27%20and%20FlightScheduledTime%20lt%20DateTime%272017-08-25T00%3A00%3A00%27%20and%20FlightType%20eq%20%27D%27&%24orderby=FlightScheduledTime%20asc
URL-decoded, it becomes:

http://www.yvr.ca/en/_api/Flights?$filter=FlightScheduledTime gt DateTime'2017-08-24T00:00:00' and FlightScheduledTime lt DateTime'2017-08-25T00:00:00' and FlightType eq 'D'&$orderby=FlightScheduledTime asc
So you should parameterize it, substituting the dates based on the current date, and fetch all the data as JSON:

{
odata.metadata: "http://www.yvr.ca/_api/$metadata#Flights",
value: [
{
FlightStatus: "Departed",
FlightRemarksAdjusted: "Departed",
FlightScheduledTime: "2017-08-24T06:15:00",
FlightEstimatedTime: "2017-08-24T06:10:00",
FlightNumber: "WS560",
FlightAirlineName: "WestJet",
FlightAircraftType: "73H",
FlightDeskTo: "",
FlightDeskFrom: "",
FlightCarousel: "",
FlightRange: "D",
FlightCarrier: "WS",
FlightCity: "Calgary",
FlightType: "D",
FlightAirportCode: "YYC",
FlightGate: "B14",
FlightRemarks: "Departed",
FlightID: 226790614,
FlightQuickConnect: ""
},
{
FlightStatus: "Departed",
FlightRemarksAdjusted: "Departed",
FlightScheduledTime: "2017-08-24T06:20:00",
FlightEstimatedTime: "2017-08-24T06:19:00",
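
The parameterization described above could be sketched roughly as follows. The endpoint and the field names are taken from the answer; the function names `build_departures_url`, `fetch_departures` and `to_row` are hypothetical, and whether the endpoint still responds in this form has not been verified.

```python
import json
from datetime import date, datetime, timedelta
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Endpoint quoted in the answer; assumed to still accept OData-style queries.
API_BASE = "http://www.yvr.ca/en/_api/Flights"


def build_departures_url(day: date) -> str:
    """Return the query URL for all departures scheduled on `day`."""
    start = f"{day.isoformat()}T00:00:00"
    end = f"{(day + timedelta(days=1)).isoformat()}T00:00:00"
    params = {
        "$filter": (
            f"FlightScheduledTime gt DateTime'{start}' and "
            f"FlightScheduledTime lt DateTime'{end}' and "
            "FlightType eq 'D'"
        ),
        "$orderby": "FlightScheduledTime asc",
    }
    # quote_via=quote encodes spaces as %20 (not "+"), matching the URL above.
    return f"{API_BASE}?{urlencode(params, quote_via=quote)}"


def fetch_departures(day: date) -> list:
    """Download the JSON and return the list under "value" (needs network access)."""
    with urlopen(build_departures_url(day), timeout=30) as response:
        return json.load(response)["value"]


def to_row(record: dict) -> list:
    """Flatten one flight record into a row similar to the table on the page."""
    def hhmm(stamp: str) -> str:
        return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").strftime("%H:%M")

    return [
        hhmm(record["FlightScheduledTime"]),
        hhmm(record["FlightEstimatedTime"]),
        record["FlightNumber"],
        record["FlightAirlineName"],
        record["FlightCity"],
        record["FlightGate"],
        record["FlightStatus"],
    ]


if __name__ == "__main__":
    # Only builds and prints the URL; call fetch_departures() to hit the network.
    print(build_departures_url(date.today()))
```

Calling `build_departures_url(date(2017, 8, 24))` reproduces the encoded URL shown above, and `[to_row(r) for r in fetch_departures(date.today())]` would give one list per flight.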

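As Tarun Lalwani suggested, WebDriver really is the wrong tool for this job.

The problem is that WebDriver only returns the text of elements that are visible on screen. To see the data in every row, you would need to scroll down through the rows and collect the data one row at a time, as discussed elsewhere. That would be painfully slow.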

I think you could also grab the textContent instead of item.text. In Java:

I am sure Python has an equivalent.
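
A possible Python equivalent is sketched below. It assumes the cell behaves like a Selenium WebElement, which exposes a `.text` property and a `get_attribute(name)` method; the helper `cell_text` and the `FakeCell` stand-in are hypothetical and exist only so the sketch runs without a browser.

```python
def cell_text(element):
    """Return the visible text of a cell, or its textContent when nothing is visible.

    `element` is expected to behave like a Selenium WebElement: it needs a
    `.text` property and a `.get_attribute(name)` method.
    """
    visible = element.text
    if visible:
        return visible.strip()
    # textContent is read from the DOM, so it is available even when the
    # element is not displayed.
    raw = element.get_attribute("textContent")
    return (raw or "").strip()


class FakeCell:
    """Hypothetical stand-in for a WebElement, used only for this demo."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


hidden_cell = FakeCell(text="", text_content="\n    E87\n  ")
print(cell_text(hidden_cell))  # prints E87
```

With a real driver, the same helper would be applied to each element returned for the `td` selector.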


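jsoup would be an alternative way to get the data in one shot, and it would be faster.

Since you want to fix your script rather than just scrape the data, here are a few issues I found in your script.

You scan all tr nodes at once, but the tr nodes you are interested in should have the yvr-flights__row class. Some of those rows are hidden and carry no data; they have the yvr-flights__row--hidden class, so you do not want them.

Also, the second column of the table does not always have data. When it does, it looks like the following: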

<td class="yvr-flights__table-cell--revised notranslate">
                        <time class="yvr-flights__label yvr-flights__revised-label yvr-flights__revised-label--early" datetime="2017-08-25T06:30:00-07:00">
                            06:20
                        </time>
            </td>
So, if you put all of this together, the following script does the whole job:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")
wait = WebDriverWait(driver, 10)

# Wait until the flights table is present in the DOM.
table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.yvr-flights__table")))

# Keep only real flight rows; rows marked --hidden carry no data.
rows = table.find_elements(By.CSS_SELECTOR, 'tr.yvr-flights__row:not(.yvr-flights__row--hidden)')

# .text is empty for cells that are not displayed, so fall back to textContent via JavaScript.
list_of_data = [
    [
        cell.text if cell.text else driver.execute_script("return arguments[0].textContent.trim();", cell).strip()
        for cell in row.find_elements(By.CSS_SELECTOR, 'td')
    ]
    for row in rows
]

for tab_data in list_of_data:
    print(tab_data)

driver.quit()
It gives me the following output:

['02:00', '02:20', 'CX889', 'Cathay Pacific', 'Hong Kong', 'Main', 'D64', 'Departed', '', 'Toggle flight']
['05:15', '', 'PR127', 'Philippine Airlines', 'Manila', 'Main', 'D70', 'Departed', '', 'Toggle flight']
['06:00', '', 'AS964', 'Alaska Airlines', 'Seattle', 'Main', 'E73', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'DL4805', 'Delta Air Lines', 'Seattle', 'Main', 'E90', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'WS3114', 'WestJet', 'Kelowna', 'Main', 'A9', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'AA6045', 'American Airlines', 'Los Angeles', 'Main', 'E86', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'AC100', 'Air Canada', 'Toronto', 'Main', 'C45', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:01', '', 'UA618', 'United Airlines', 'San Francisco', 'Main', 'E76', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC8606', 'Air Canada', 'Winnipeg', 'Main', 'C39', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC8190', 'Air Canada', 'Kamloops', 'Main', 'C34', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC200', 'Air Canada', 'Calgary', 'Main', 'C29', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:15', '', 'WS560', 'WestJet', 'Calgary', 'Main', 'B13', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:20', '', 'AC560', 'Air Canada', 'San Francisco', 'Main', 'E87', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:30', '06:20', 'DL2555', 'Delta Air Lines', 'Minneapolis', 'Main', 'E88', 'Early', 'NOTIFY ME', 'Toggle flight']
['06:30', '', 'WS700', 'WestJet', 'Toronto', 'Main', 'B15', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:30', '', 'UA664', 'United Airlines', 'Chicago', 'Main', 'E75', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:40', '', 'AM695', 'AeroMexico', 'Mexico City', 'Main', 'D53', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:40', '', 'WS6110', 'WestJet', 'Mexico City', 'Main', 'D53', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:45', '06:45', 'AC8055', 'Air Canada', 'Victoria', 'Main', '', 
...
['23:25', '', 'AC8269', 'Air Canada', 'Nanaimo', 'Main', '', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AM697', 'AeroMexico', 'Mexico City', 'Main', 'D54', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'WS6108', 'WestJet', 'Mexico City', 'Main', 'D54', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AC8083', 'Air Canada', 'Victoria', 'Main', 'C38', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AC308', 'Air Canada', 'Montreal', 'Main', 'C29', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:26', '', 'WS564', 'WestJet', 'Montreal', 'Main', 'B13', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:30', '', 'AC128', 'Air Canada', 'Toronto', 'Main', 'C47', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:40', '', 'AC33', 'Air Canada', 'Sydney', 'Main', 'D52', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:45', '', 'AC35', 'Air Canada', 'Brisbane', 'Main', 'D65', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:45', '', 'AC344', 'Air Canada', 'Ottawa', 'Main', 'C49', 'On Time', 'NOTIFY ME', 'Toggle flight']

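Thanks, Tarun Lalwani, and thanks for what you found. Basically, that data is not what I am after. I want to learn how to correct the mistake I made with Selenium in the script I pasted above. Thanks again.

Add a sleep after driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights") and see if that helps. You may be fetching the elements faster than they load.

Actually, I tried a hard-coded delay and found the results got worse.

That does not solve the problem, but it makes the results look better.

Thanks for the answer, Tarun. You always come up with new ideas. That JavaScript command will help me in the future. Let me add to your reputation. By the way, I have also used the API you provided, and it is very easy to work with. Thanks for everything.

@Topto, which part of the problem is still unsolved? Let me know and I will tell you what needs to be done.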