Python 缓慢的html解析器。如何提高速度?
我想估计一下新闻消息对道琼斯报价的影响。为此,我使用 BeautifulSoup 库编写了 Python html 解析器:提取一篇文章并将其存储在 XML 文件中,以便之后使用 NLTK 库进行进一步分析。下面的代码能完成所需的任务,但速度非常慢。如何提高解析速度?以下是 html 解析器的代码:
import urllib2
import re
import xml.etree.cElementTree as ET
import nltk
from bs4 import BeautifulSoup
from datetime import date
from dateutil.rrule import rrule, DAILY
from nltk.corpus import stopwords
from collections import defaultdict
def main_parser():
    """Crawl the Reuters daily archive pages between two dates and write
    each article's time, headline, link and stop-word-filtered body text
    to ~/Documents/test.xml.
    """
    import os
    # start / end of the (inclusive) date range to crawl
    a = date(2014, 3, 27)
    b = date(2014, 3, 27)
    articles = ET.Element("articles")
    # loop through the daily archive pages; for each headline link extract
    # the article text and append it to the XML tree
    for dt in rrule(DAILY, dtstart=a, until=b):
        # one archive page per day, e.g. .../us/20140327.html
        url = ("http://www.reuters.com/resources/archive/us/"
               + dt.strftime("%Y%m%d") + ".html")
        page = urllib2.urlopen(url)
        # html5lib is lenient but slow; "lxml" would be much faster here
        soup = BeautifulSoup(page.read(), "html5lib")
        article_date = ET.SubElement(articles, "article_date")
        article_date.text = str(dt)
        for links in soup.find_all("div", "headlineMed"):
            anchor_tag = links.a
            # skip headline blocks without a link, and video pages
            if anchor_tag is None or 'video' in anchor_tag['href']:
                continue
            article_time = ET.SubElement(article_date, "article_time")
            article_time.text = str(links.text[-11:])
            article_header = ET.SubElement(article_time, "article_name")
            article_header.text = str(anchor_tag.text)
            article_link = ET.SubElement(article_time, "article_link")
            article_link.text = str(anchor_tag['href']).encode('utf-8')
            try:
                article_text = ET.SubElement(article_time, "article_text")
                # fetch the body text and remove all stop words; best-effort:
                # one failed download/parse must not abort the whole crawl
                article_text.text = str(
                    remove_stop_words(extract_article(anchor_tag['href']))
                ).encode('ascii', 'ignore')
            except Exception:
                # deliberately ignore articles that fail to download/parse
                pass
    tree = ET.ElementTree(articles)
    # open()/tree.write() do not do shell tilde expansion -- expand it here
    # (the original also leaked an unused file handle opened on this path)
    tree.write(os.path.expanduser("~/Documents/test.xml"), "utf-8")
#getting the article text from the spicific url
def extract_article(url):
    """Download *url* and return the article body as plain text.

    The text is taken from the page's <p> tags.  BeautifulSoup's
    get_text() is used instead of the original regex pass over
    str(list_of_tags), which was fragile: it left the list brackets,
    commas and any unmatched tags behind in the output.
    """
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, "html5lib")
    # concatenate the visible text of every paragraph tag
    paragraphs = soup.find_all("p")
    return ' '.join(p.get_text() for p in paragraphs)
def remove_stop_words(text):
    """Tokenize *text* and return it with English stop words removed.

    Returns the surviving tokens joined by single spaces.
    """
    # Build the stop-word set ONCE: stopwords.words() returns a list, so
    # the original `w in stopwords.words('english')` re-read the corpus
    # and did an O(m) linear scan for every single token -- O(n*m) total.
    stop = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(text)
    return ' '.join(w for w in tokens if w not in stop)
导入urllib2
进口稀土
将xml.etree.cElementTree作为ET导入
导入nltk
从bs4导入BeautifulSoup
起始日期时间导入日期
从dateutil.rrule导入rrule,每日
从nltk.corpus导入停止词
从集合导入defaultdict
def main_解析器():
#开始日期
a=日期(2014年3月27日)
#结束日期
b=日期(2014年3月27日)
条款=等要素(“条款”)
f=打开(“~/Documents/test.xml”,“w”)
#循环浏览链接,并根据每个链接提取文章的文本,将后者存储在xml文件中
对于rrule中的dt(每天,dtstart=a,直到=b):
url=”http://www.reuters.com/resources/archive/us/“+dt.strftime(“%Y”)+dt.strftime(“%m”)+dt.strftime(“%d”)+”.html”
page=urlib2.urlopen(url)
#使用html5lib???使用其他解析器的可能性
soup=BeautifulSoup(page.read(),“html5lib”)
article_date=ET.子元素(articles,“article_date”)
article_date.text=str(dt)
查找汤中的链接。查找所有(“div”、“headlineMed”):
anchor_tag=links.a
如果不是锚定标签['href']中的“视频”:
尝试:
article_time=ET.SubElement(article_日期,“article_time”)
article_time.text=str(links.text[-11:])
article\u header=ET.SubElement(article\u time,“article\u name”)
article_header.text=str(锚定标签.text)
article\u link=ET.SubElement(article\u time,“article\u link”)
article_link.text=str(锚定标签['href'])。编码('utf-8')
尝试:
article\u text=ET.SubElement(article\u time,“article\u text”)
#获取文本并删除所有停止词
article_text.text=str(删除_stop_单词(提取_article(锚定标签['href']))。编码('ascii','ignore'))
除例外情况外:
通过
除例外情况外:
通过
tree=ET.ElementTree(文章)
tree.write(“~/Documents/test.xml”,“utf-8”)
#从spicific url获取文章文本
def摘录文章(url):
纯文本=“”
html=urllib2.urlopen(url.read())
soup=BeautifulSoup(html,“html5lib”)
标签=汤。查找所有(“p”)
#替换所有html标记
explain_text=re.sub(r'||[|]|可以应用多个修复程序(无需更改当前使用的模块):
- 使用
lxml
解析器而不是html5lib
——它的速度要快得多(而且还要快3倍)
- 仅使用解析文档的一部分(请注意,
html5lib
不支持SoupTrainer
——它总是缓慢解析整个文档)
以下是更改后代码的外观。简短的性能测试显示至少有3倍的改进:
import urllib2
import xml.etree.cElementTree as ET
from datetime import date
from bs4 import SoupStrainer, BeautifulSoup
import nltk
from dateutil.rrule import rrule, DAILY
from nltk.corpus import stopwords
def main_parser():
    """Faster crawler variant: parse only the headline <div>s with a
    SoupStrainer and the lxml parser, then write the collected articles
    to ~/Documents/test.xml.
    """
    import os
    a = b = date(2014, 3, 27)
    articles = ET.Element("articles")
    # the strainer is loop-invariant -- build it once, not per page
    headline_divs = SoupStrainer("div", "headlineMed")
    for dt in rrule(DAILY, dtstart=a, until=b):
        url = ("http://www.reuters.com/resources/archive/us/"
               + dt.strftime("%Y%m%d") + ".html")
        # parse only the headline blocks; lxml is much faster than html5lib
        soup = BeautifulSoup(urllib2.urlopen(url), "lxml",
                             parse_only=headline_divs)
        article_date = ET.SubElement(articles, "article_date")
        article_date.text = str(dt)
        for link in soup.find_all('a'):
            # .get() avoids the KeyError the original blanket except hid
            href = link.get('href')
            # skip anchors without an href, and video pages
            if not href or 'video' in href:
                continue
            article_time = ET.SubElement(article_date, "article_time")
            article_time.text = str(link.text[-11:])
            article_header = ET.SubElement(article_time, "article_name")
            article_header.text = str(link.text)
            article_link = ET.SubElement(article_time, "article_link")
            article_link.text = str(href).encode('utf-8')
            try:
                article_text = ET.SubElement(article_time, "article_text")
                # best-effort: one broken article must not stop the crawl
                article_text.text = str(
                    remove_stop_words(extract_article(href))
                ).encode('ascii', 'ignore')
            except Exception:
                pass
    tree = ET.ElementTree(articles)
    # open()/write() do not expand '~' -- do it explicitly
    tree.write(os.path.expanduser("~/Documents/test.xml"), "utf-8")
def extract_article(url):
    """Fetch *url* and return the concatenated text of its <p> tags."""
    # restrict parsing to paragraph tags only -- far less work for lxml
    only_paragraphs = SoupStrainer('p')
    page = urllib2.urlopen(url)
    document = BeautifulSoup(page, "lxml", parse_only=only_paragraphs)
    return document.text
def remove_stop_words(text):
    """Tokenize *text* and drop English stop words; return the rest
    joined by single spaces.
    """
    # Materialize the stop list as a set once: testing membership against
    # stopwords.words('english') inside the comprehension rebuilds the
    # list and scans it linearly for every token.
    stop_words = set(stopwords.words('english'))
    words = nltk.word_tokenize(text)
    return ' '.join(w for w in words if w not in stop_words)
请注意,我已经从 extract_article() 中删除了正则表达式处理——看起来直接从 p 标签中获取整个文本即可。
我可能引入了一些问题——请检查输出是否一切正常。
另一种解决方案是使用lxml
,从解析(替换beautifulSoup
)到创建xml(替换xml.etree.ElementTree
)
另一个解决方案(肯定是最快的)是切换到 Scrapy 这样的网络抓取框架。
它简单又快速,并且"自带电池"(batteries included):例如链接提取器、XML 导出器、数据库管道等等,值得一看。
希望这能有所帮助。另外,您需要选择最合适的解析器:
我们在构建时曾对大多数主流解析器/平台做过基准测试,
在 Medium 上有一篇完整的基准测试文章可供参考。