How can I extract a table element from a web page's HTML source using Selenium PhantomJS in Python 3?

python-3.x selenium-webdriver

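I am working on a web crawler project that should take two dates as input (such as 2019-03-01 and 2019-03-05) and then append each day between those two dates to the end of a base link (the base link followed by the date). I want to extract the table with the class name "tablesaw sortable" from the page source and save it as a text file or any other similar file format.

I have written the following code: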

from datetime import timedelta, date
from bs4 import BeautifulSoup
import urllib.request
from selenium import webdriver

class webcrawler():
    def __init__(self, st_date, end_date):
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def date_list(self):
        return [str(date1 + timedelta(n)) for n in range(int ((self.end_date - self.st_date).days)+1)]

    def create_link(self, attachment):
        url = str(self.base_url) 
        url += attachment
        return url

    def open_link(self, link):
        driver = webdriver.PhantomJS()
        driver.get(link)
        html = driver.page_source
        return html

    def extract_table(self, html):
        soup = BeautifulSoup(html)
        print(soup.prettify())

    def output_to_csv(self):
        pass

date1 = date(2018, 3, 1)
date2 = date(2019, 3, 5)

test = webcrawler(st_date=date1, end_date=date2)
date_list = test.date_list()
link = test.create_link(date_list[0])
html = test.open_link(link)
test.extract_table(html)

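The problem is that it takes a very long time to get the page_source for a single link. I have also tried urllib.request, but the problem with that approach is that it sometimes fetches the HTML content without waiting for the table to load completely.

How can I speed up the process so that I extract only the table mentioned above and get its HTML source without waiting for the rest of the content? I just want to save the information in the table rows to a text file for each date.

Can anyone help me with this?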

There are quite a few notable mistakes in this code and in how you are using these libraries. Let me try to fix it up.

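First, I don't see you using the urllib.request library anywhere. You can remove the import, or, if you are using it somewhere else in your code, I recommend the highly regarded requests module instead. If all you want is to fetch the HTML source from a site, I would also suggest using requests rather than selenium, because selenium is geared toward navigating a site and acting like a "real" person.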

You can use response = requests.get('https://your.url.here') and then response.text to get the returned HTML.

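Next, I noticed that the open_link() method creates a new instance of the PhantomJS class every time it is called. This is very inefficient, because selenium uses a lot of resources (and starting a driver takes a long time, depending on which driver you use). This is probably a big part of why your code runs slower than expected. You should reuse the driver instance as much as possible, since that is how selenium is designed to be used. A good solution is to create the driver instance in the webcrawler.__init__() method.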

class WebCrawler():
    def __init__(self, st_date, end_date):
        # Create the driver once and reuse it for every page
        self.driver = webdriver.PhantomJS()
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def open_link(self, link):
        self.driver.get(link)
        html = self.driver.page_source
        return html

# Or, using the requests library

class WebCrawler():
    def __init__(self, st_date, end_date):
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def open_link(self, link):
        response = requests.get(link)
        html = response.text
        return html
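
To make the reuse idea concrete, here is a minimal sketch of fetching every date with a single driver and always shutting it down afterwards. The names shared_driver, fetch_pages, and make_driver are hypothetical helpers of mine, not selenium APIs; the only driver members assumed are get(), page_source, and quit().

```python
from contextlib import contextmanager
from datetime import date, timedelta

BASE_URL = "https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/"


@contextmanager
def shared_driver(make_driver):
    """Create one driver, hand it out, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


def fetch_pages(make_driver, start, end, base_url=BASE_URL):
    """Return {iso_date: page_source} for each day from start to end, inclusive."""
    pages = {}
    with shared_driver(make_driver) as driver:
        for offset in range((end - start).days + 1):
            day = (start + timedelta(days=offset)).isoformat()
            # The same driver instance is reused for every request.
            driver.get(base_url + day)
            pages[day] = driver.page_source
    return pages
```

With selenium installed, the call would look like fetch_pages(webdriver.PhantomJS, date(2019, 3, 1), date(2019, 3, 5)). Passing the factory rather than a ready-made driver keeps the startup and quit() in one place and makes it easy to swap in a different driver.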

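Side note: class names should be written in CamelCase rather than lowercase. This is only a suggestion, but the original creator of Python wrote PEP8 to define a common style guide for Python code, and it is worth a look.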

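Another odd thing I noticed is that you are casting a string to... a string. You do this in url = str(self.base_url). It doesn't hurt anything, but it doesn't help either. I couldn't find any resources or links on this, but I suspect it costs the interpreter a little extra time. Since speed is a concern, I suggest just using url = self.base_url, because the base URL is already a string.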
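
If you want to check the suspicion about the extra cast yourself, a quick measurement with the standard timeit module might look like the sketch below. The variable names are mine, and the numbers will differ from machine to machine; whatever the cast costs is likely to be negligible next to the time spent loading a page.

```python
import timeit

base_url = "https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/"

# Time one million evaluations of each expression.
with_cast = timeit.timeit(
    "str(base_url)", globals={"base_url": base_url}, number=1_000_000
)
without_cast = timeit.timeit(
    "base_url", globals={"base_url": base_url}, number=1_000_000
)

print(f"str(base_url): {with_cast:.3f} s per million evaluations")
print(f"base_url:      {without_cast:.3f} s per million evaluations")
```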

I see that you are formatting and building the URL by hand, but if you want more control and fewer bugs, take a look at the furl library.

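def create_link(self, attachment):
    f = furl(self.base_url)
    # The "/=" operator appends to the end of the path. Docs: https://github.com/gruns/furl/blob/master/API.md#path
    f.path /= attachment
    # Clean up and remove invalid characters in the url
    f.path.normalize()
    return f.url  # returns the url as a string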
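
If you would rather not add a dependency, roughly the same join can be sketched with the standard library. This is a swapped-in alternative to furl, not a furl API; create_link here is a standalone function that takes the base URL as a parameter.

```python
from urllib.parse import urljoin


def create_link(base_url, attachment):
    """Append `attachment` as the last path segment of `base_url`."""
    # urljoin replaces the last path segment unless the base ends with "/",
    # so add the trailing slash when it is missing.
    if not base_url.endswith("/"):
        base_url += "/"
    # A leading "/" would make the attachment replace the whole path.
    return urljoin(base_url, attachment.lstrip("/"))
```

Unlike furl's normalize(), this sketch does not clean up invalid characters; it only handles the slash between the two parts.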

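Another potential problem is that the extract_table() method doesn't extract anything; it just formats the HTML in a human-readable way. I won't go into depth on this, but I suggest learning CSS selectors or XPath selectors so that you can easily pull data out of HTML.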
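
As a rough illustration of what extract_table() could do instead of pretty-printing, the sketch below collects the cell text of the first table whose class attribute contains the wanted class names. It deliberately uses the standard library's HTMLParser as a stand-in so that it runs without extra packages; with BeautifulSoup, a CSS selector passed to soup.select() would do the same job in fewer lines. TableRows and extract_rows are hypothetical names, and the sample HTML is made up rather than taken from the real page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text from the first table carrying all wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes.split())
        self.depth = 0      # > 0 while inside the matching table
        self.done = False   # True once the matching table has been closed
        self.rows = []
        self.cell = None    # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth:
                self.depth += 1  # nested table inside the match
            elif not self.done:
                classes = set((dict(attrs).get("class") or "").split())
                if self.wanted <= classes:
                    self.depth = 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th"):
            self.cell = []

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "table":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                self.cell = None
        elif tag in ("td", "th") and self.cell is not None:
            # Collapse runs of whitespace inside the cell.
            text = " ".join("".join(self.cell).split())
            if self.rows:
                self.rows[-1].append(text)
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def extract_rows(html, wanted_classes="tablesaw sortable"):
    """Return the matching table as a list of rows, each a list of cell strings."""
    parser = TableRows(wanted_classes)
    parser.feed(html)
    parser.close()
    return parser.rows


sample_html = """
<table class="tablesaw sortable">
  <tr><th>Time</th><th>Temperature</th></tr>
  <tr><td>12:00 AM</td><td><span>41</span> F</td></tr>
</table>
"""
for row in extract_rows(sample_html):
    print("\t".join(row))
```

Each row comes back as a plain list of strings, so writing one tab-separated line per row to a text file for each date is a short step from here.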

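In date_list(), I would break up the one-line list comprehension and spread it over a few lines, so that you can easily read and understand what it is trying to do.

Here is the complete refactored code I would suggest:
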
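from datetime import timedelta, date
from bs4 import BeautifulSoup
import requests
from furl import furl

class WebCrawler():
    def __init__(self, st_date, end_date):
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def date_list(self):
        dates = []
        total_days = int((self.end_date - self.st_date).days + 1)
        for i in range(total_days):
            day = self.st_date + timedelta(days=i)
            dates.append(day.strftime('%Y-%m-%d'))
        return dates

    def create_link(self, attachment):
        f = furl(self.base_url)
        # The "/=" operator appends to the end of the path. Docs: https://github.com/gruns/furl/blob/master/API.md#path
        f.path /= attachment
        # Clean up and remove invalid characters in the url
        f.path.normalize()
        return f.url  # returns the url as a string

    def open_link(self, link):
        response = requests.get(link)
        html = response.text
        return html

    def extract_table(self, html):
        soup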