Fetching data with Python & lxml

Tags: python, web-scraping, lxml, python-2.7

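I have some HTML, shown below. I want to get the text inside the <span class="zzAggregateRatingStat"> elements. For the example below, I would get 3 and 5.

For this task I am using Python 2.7 and lxml.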

<div class="pp-meta-review">
<span class="zrvwidget" style="">
    <span g:inline="true" g:type="NumUsersFoundThisHelpful" g:hideonnoratings="true" g:entity.annotation.groups="maps"    g:entity.annotation.id="http://maps.google.com/?q=Central+Kia+of+Irving++(972)+659-2204+loc:+1600+East+Airport+Freeway,+Irving,+TX+75062&gl=US&sll=32.83624,-96.92526" g:entity.annotation.author="AIe9_BH8MR-1JD_4BhwsKrGCazUyU5siqCtjchckDcg5BAl5rOLd9nvhJJDTrtjL-xFI8D42bD_7">
        <span class="zzNumUsersFoundThisHelpfulActive" zzlabel="helpful">
            <span>
                <span class="zzAggregateRatingStat">3</span>
            </span>
            <span>
                <span>&nbsp;</span>
                      out of
                <span>&nbsp;</span>
            </span>
            <span>
                <span class="zzAggregateRatingStat">5</span>
            </span>
            <span>
                <span>&nbsp;</span>
                    people found this review helpful.
            </span>
       </span>
   </span>
</span>
</div>

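Here is an example:

from lxml.etree import fromstring
from lxml.cssselect import CSSSelector

# Select every element whose class is zzAggregateRatingStat
sel = CSSSelector('.zzAggregateRatingStat')
# The markup in this string was lost in the source; this is a one-element stand-in
text = '<span class="zzAggregateRatingStat">3</span>'
doc = fromstring(text)
el = sel(doc)[0]
print(el.text)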

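The following code works for your input:

import lxml.html

# Parse the saved page and get its root element
root = lxml.html.parse('text.html').getroot()
# Print the text of each span with the rating-stat class
for span in root.xpath('//span[@class="zzAggregateRatingStat"]'):
    print(span.text)

It prints:

3
5

I prefer lxml's XPath support to CSS selectors, although both do the job.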

ChrisP's example prints 3, but if we run it on the actual input we get an error:

$ python chrisp.py
Traceback (most recent call last):
  File "chrisp.py", line 6, in <module>
    doc = fromstring(text)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 3, column 210
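
The EntityRef message most likely points at the bare & characters in the URL inside the g:entity.annotation.id attribute (&gl=US, &sll=...). In XML a literal & has to be written as &amp;, so a strict XML parser stops there. Below is a minimal sketch of the same kind of failure. It assumes a shortened, hypothetical attribute value and uses the standard library's xml.etree.ElementTree in place of lxml.etree, so the exact error text differs from the traceback above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the long <span> in the question.
raw = '<span id="http://maps.google.com/?q=Irving&gl=US">3</span>'
# The same markup with the ampersand escaped the way XML requires.
escaped = raw.replace('&', '&amp;')


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(raw))
print(parses_as_xml(escaped))
```

Escaping the input by hand is rarely practical for scraped pages, which is why switching to an HTML parser is the usual fix.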
ChrisP's code can be changed to use lxml.html.fromstring, which is a more lenient parser, instead of lxml.etree.fromstring.

With this change, it prints 3.
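
To get both numbers from a string rather than a file, the lenient parser can be combined with the XPath query. This is only a sketch: SNIPPET is a shortened, hypothetical stand-in for the question's markup (the unescaped &gl=US is kept on purpose), and helpful_counts is an illustrative helper name, not part of lxml.

```python
import lxml.html

# Hypothetical, shortened stand-in for the question's markup. The bare
# "&gl=US" is left unescaped because that is what the strict parser rejects.
SNIPPET = '''
<div class="pp-meta-review">
  <span g:entity.annotation.id="http://maps.google.com/?q=Central+Kia&gl=US">
    <span><span class="zzAggregateRatingStat">3</span></span>
    <span><span>&nbsp;</span> out of <span>&nbsp;</span></span>
    <span><span class="zzAggregateRatingStat">5</span></span>
  </span>
</div>
'''


def helpful_counts(markup):
    """Return the text of every span whose class is zzAggregateRatingStat."""
    root = lxml.html.fromstring(markup)
    spans = root.xpath('//span[@class="zzAggregateRatingStat"]')
    return [span.text for span in spans]


counts = helpful_counts(SNIPPET)
print(counts)
```

For a page saved to disk, lxml.html.parse('text.html').getroot() from the answer above can replace the fromstring call.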

Comments:

Get the text in ...? Then please complete the question by showing what you have already tried.

I'm really sorry for the typo. Stack Overflow treated it as an HTML tag.

Thanks for your answer. I have been trying this code on the website, but in vain. Could you take a look at it?

@Zulaikha, if you want to get ratings for businesses, you might want to look at the APIs provided by Google and Yelp rather than scraping web pages.

Hey, thanks for the reply. I can't quite get your code to work on the website. It keeps giving different errors.

Changing lxml.etree.fromstring to lxml.html.fromstring worked! Thanks! The only problem is that you don't have the pretty_print option in lxml.html :(