Python Scrapy css selector re gives broken JSON string


python regex scrapy


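Hey, I am new to Python and especially to Scrapy, which I am trying to use for scraping. But I have a problem. I use this regex to get a JSON string from the response:

__WML_REDUX_INITIAL_STATE__ =*(.*\});\};

But it sometimes gives a broken JSON string, which causes json.loads to fail. Is the problem in the regex or in Scrapy? I don't understand why this happens.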
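The .re() and .re_first() methods of Scrapy/Parsel selectors have the (unfortunate) default behavior of replacing HTML character entity references. This can cause JSON decoding to fail.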
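To see why replacing entities can break the JSON, here is a minimal sketch using only the standard library. It does not call Parsel: html.unescape is used as a stand-in for the entity replacement step, and the payload is a made-up example, not data from the page in question.

```python
import html
import json

# Hypothetical payload: a JSON string value containing an HTML entity,
# as it might appear inside a <script> element.
raw = '{"values": ["21&quot; Inseam"]}'

# The raw text is valid JSON; the entity is just characters inside a string.
parsed = json.loads(raw)
print(parsed["values"][0])

# Stand-in for entity replacement: &quot; becomes a bare double quote,
# which terminates the JSON string too early.
replaced = html.unescape(raw)
print(replaced)

try:
    json.loads(replaced)
    decode_failed = False
except json.JSONDecodeError as exc:
    decode_failed = True
    print("decode failed:", exc)
```

The unescaped quote is what makes the decoder give up, so any fix has to keep the matched text untouched until after json.loads has run.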

Demonstration in scrapy shell with an example URL. The regex works, and it selects the desired data:

$ scrapy shell https://www.walmart.com/ip/Riders-by-Lee-Women-s-On-the-Go-Performance-Capri/145227527 -s USER_AGENT='mozilla'
2017-07-13 15:24:30 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
(..)
2017-07-13 15:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/ip/Riders-by-Lee-Women-s-On-the-Go-Performance-Capri/145227527> (referer: None)
>>> data = response.xpath('//script/text()').re_first('__WML_REDUX_INITIAL_STATE__ =*(.*\});\};')
>>> data[:25], data[-25:]
(' {"uuid":null,"isMobile":', 'nabled":true,"seller":{}}')

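Passing replace_entities=False to .re_first() avoids the replacement. See how &quot; is kept as is in the dataraw[40500:40650] output further below.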

Now you can decode the string as JSON:

>>> import json
>>> d = json.loads(dataraw)
>>> d.keys()
dict_keys(['uuid', 'isMobile', 'isBot', 'isAdsEnabled', 'isEsiEnabled', 'isInitialStateDeferred', 'isServiceWorkerEnabled', 'isShellRequest', 'productId', 'product', 'showTrustModal', 'productBasicInfo', 'fulfillmentOptions', 'feedback', 'backLink', 'offersOrder', 'sellersHeading', 'fdaCompliance', 'recommendationMap', 'header', 'footer', 'addToRegistry', 'addToList', 'ads', 'btvMap', 'postQuestion', 'autoPartFinder', 'getPromoStatus', 'discoveryModule', 'lastAction', 'isAjaxCall', 'accessModeEnabled', 'seller'])
>>>

replace_entities was introduced in parsel v1.2.0.

Please give a sample JSON showing the desired output and the actual output.

Here is a sample JSON; the desired output is valid JSON.

Which Scrapy version are you using, @paul? I am getting TypeError: re_first() got an unexpected keyword argument 'replace_entities'

Please upgrade parsel to 1.2 (pip install --upgrade parsel). parsel is a dependency of Scrapy.
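If you are not sure which parsel version is installed, a quick check along these lines may help. This is only a sketch: supports_replace_entities is a hypothetical helper, the version parsing is deliberately simple, and importlib.metadata needs Python 3.8 or newer (on older Pythons, pip show parsel gives the same information).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def supports_replace_entities(version_string):
    """Return True if the given parsel version is 1.2 or newer."""
    numbers = []
    for piece in version_string.split(".")[:2]:
        # Keep only the leading digits of each component.
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 2:
        numbers.append(0)
    return tuple(numbers) >= (1, 2)


try:
    installed = version("parsel")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("parsel is not installed")
elif supports_replace_entities(installed):
    print(f"parsel {installed}: replace_entities is available")
else:
    print(f"parsel {installed}: run pip install --upgrade parsel")
```

Comparing the (major, minor) tuple instead of the raw string avoids the usual trap where "1.10" sorts before "1.2".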

>>> dataraw = response.xpath('//script/text()').re_first('__WML_REDUX_INITIAL_STATE__ =*(.*\});\};', replace_entities=False)
>>> dataraw[40500:40650]
'{"values":["<br /> <b>Riders by Lee Women\'s On the Go Performance Capri</b> <br /> <ul>  <li>21&quot; Inseam</li>  <li>Rib knit waist with button and '
