Python Scrapy: extracting text from multiple elements with encoding problems and POSTing it as a JSON array

Tags: python, html, html-parsing, scrapy

I am scraping a weather site and need to extract the comments from a table cell and POST them to a remote API as a JSON array.

Here is the markup:

<td>
    <p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
    <p>Temperature is very warm (> 60 degrees C / 140 degrees F).</p>
    <p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
</td>

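My current code works to some extent, but the output contains many "\r\n" strings, and everything after a "<" character is dropped.

As @alecxe suggested in the comments, lxml's default parser does not seem to handle this HTML input well. The solution is to parse it with a more lenient parser, such as BeautifulSoup or html5lib.

In fact, lxml can use different parsers and still provide the same XPath methods.

Using the BeautifulSoup parser: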

In [1]: from lxml.html import soupparser, html5parser

In [2]: html = """<td>
    <p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
    <p>Temperature is very warm (> 60 degrees C / 140 degrees F).</p>
    <p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
</td>
"""

In [3]: doc = soupparser.fromstring(html)

In [4]: for p in doc.xpath('//p'):
   ...:     print(p.xpath('normalize-space()'))
   ...:     
Temperature is cold (< 4 degrees C / 40 degrees F).
Temperature is very warm (> 60 degrees C / 140 degrees F).
Temperature is cold (< 4 degrees C / 40 degrees F).

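This is actually broken HTML.

Sorry @alecxe, we are fairly heavily invested in Scrapy at this point, so switching to BeautifulSoup is not an option.

Well, you need to preprocess the HTML you currently have. You can combine Scrapy and BeautifulSoup: for example, in the parse() callback, parse the HTML with BeautifulSoup and pass the fixed HTML to a Scrapy Selector instance. Also, where does this HTML come from?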
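The preprocessing idea from the last comment can be sketched without any third-party parser. The snippet below is only an illustration, not the commenter's code: it swaps in a regular expression (a different technique from the BeautifulSoup repair suggested above) that escapes any "<" not followed by a character that can start a tag, so that a strict parser no longer mistakes "< 4" for markup. The name escape_bare_lt and the pattern are assumptions made for this example.

```python
import re

# Hypothetical helper, not part of Scrapy or lxml.
# A "<" starts real markup only when followed by a letter, "/", "!" or "?";
# any other "<" is treated here as literal text and escaped.
BARE_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_bare_lt(html):
    """Return html with stray '<' characters replaced by '&lt;'."""
    return BARE_LT.sub("&lt;", html)


raw = "<td>\r\n    <p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>\r\n</td>"
fixed = escape_bare_lt(raw)
print(fixed)
```

The repaired string could then be handed to a Scrapy selector, for example Selector(text=fixed), as the comment proposes. A regex like this only covers the stray "<" seen in this markup; a lenient parser is likely the more robust choice for arbitrary broken HTML.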
Output of the original attempt:
[
   "Temperature is cold (\r\n \r\n ",
   "Temperature is very warm (> 60 degrees C / 140 degrees F).",
   "Temperature is cold (\r\n \r\n "
]
Using the html5lib parser (same session, so html and html5parser are defined as above):

In [5]: doc = html5parser.fromstring(html)

In [6]: for p in doc.xpath('//xhtml:p', namespaces={"xhtml": "http://www.w3.org/1999/xhtml"}):
   ...:     print(p.xpath('normalize-space()'))
   ...:     
Temperature is cold (< 4 degrees C / 40 degrees F).
Temperature is very warm (> 60 degrees C / 140 degrees F).
Temperature is cold (< 4 degrees C / 40 degrees F).

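Applied inside the Scrapy callback (response, item and api_url come from the surrounding spider code):

import json

import requests
from lxml.html import soupparser

# Parse the raw response with the lenient BeautifulSoup-backed parser.
doc = soupparser.fromstring(response.body)

comments = []
# '//td//p' matches <p> elements inside any <td>, wherever the table sits in the page.
cmnts = doc.xpath('//td//p')

for cmnt in cmnts:
    # normalize-space() collapses the "\r\n" runs and trims each comment.
    comments.append(cmnt.xpath('normalize-space(.)'))

item['comments'] = comments

# Send the item, including the comments list, as a JSON document.
r = requests.post(api_url, data=json.dumps(dict(item)))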
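To make the JSON array step concrete, here is a minimal standard-library sketch of building the request body. The helper name build_payload, the single "comments" key and the header dictionary are assumptions for illustration; the real item fields and the API contract are not shown in the question.

```python
import json


def build_payload(comments):
    """Serialize comments as a JSON object holding a JSON array.

    Whitespace runs (including CR/LF pairs) are collapsed in Python,
    which roughly mirrors what XPath normalize-space() does.
    """
    cleaned = [" ".join(c.split()) for c in comments]
    return json.dumps({"comments": cleaned})


# Assumed header: many JSON APIs expect this content type to be set explicitly.
HEADERS = {"Content-Type": "application/json"}

body = build_payload([
    "Temperature is cold (< 4 degrees C / 40 degrees F).",
    "Temperature is very warm\r\n (> 60 degrees C / 140 degrees F).",
])
print(body)
```

With requests, the body and headers could be passed as requests.post(api_url, data=body, headers=HEADERS), where api_url is the same placeholder used in the answer. Passing a pre-serialized string through data= does not set a JSON content type by itself, which is why the header is spelled out here.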