Python: getting variables instead of text when scraping


python web-scraping scrapy


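I am scraping a web page that does not load any XHR requests; all of the content is in the page itself. But when I try to scrape the page with the Scrapy shell or a spider, I get variables instead of text. For example, see this page: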

https://lastsecond.ir/tours/24588-%D8%AA%D9%88%D8%B1-%D9%85%D8%B4%D9%87%D8%AF-22-%D8%AF%DB%8C-96-%D8%A7%D8%B2-%D8%A7%D8%B5%D9%81%D9%87%D8%A7%D9%86

I tried the following code in the Scrapy shell:

response.css("table a h3 img").extract()

The result should look like this, as it does in the HTML response:

 <img src="https://lastsecond.ir/site/images/placeholder/hotel.svg" alt="Mehr Reza hotel" class="hotelpic">

But I get:

['<img :src="hotel.imageUrl" class="hotelpic" :alt="hotel.name">']

I can't scrape it.

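Usually, a website gets its data from a back-end call or from a third-party service.

In this case, however, the raw data you want to scrape is embedded in a native JavaScript statement inside the page. Import the re module to filter out and extract that data, then use the json module to parse it and get the values you need.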

var tourcode ={
"id": 24588,
"title": "تور مشهد 22 دی 96 (از اصفهان)",
"slug": "تور-مشهد-22-دی-96-از-اصفهان",
....
"packages": {
    "bundles": {
{
"308892": {
    "id": 308892,
    "hotels": [
        {
            "id": 1298,
            "bundle_id": 308892,
            "link": "https://lastsecond.ir/hotels/1298-mehr-reza",
            "location_id": 410,
            "location_name": "مشهد",
            "name": "Mehr Reza hotel",
            "grade": {
                "id": 80,
                "name": "هتل آپارتمان",
                "icons": [
                    "fa-building"
                ],
                "count": "0",
                "singleIcon": "<i class=\"fa fa-building large-star\"> <label class=\"orange-text\"></label> </i>"
            },
            "decoratedGrade": "<div class=\"d-inline-block ltr hotelGrade\" data-toggle=\"tooltip\" data-placement=\"left\" title=\"هتل آپارتمان\"><i class=\"fa fa-building orange-text\"></i></div>",
            "score": 0,
            "imageUrl": "https://lastsecond.ir/site/images/placeholder/hotel.svg",
            "reviewsCount": 0,
            "decoratedScore": "<div class=\"hotelScore\"><div class=\"score\" style=\"width: 0%\"></div></div>",
            "description": "صبحانه",
            "service_id": 2,
            "service": "bb",
            "serviceName": "B.B",
            "serviceDesc": "با صبحانه",
            "ordering": "1"
        }
    ],
    "prices": {
        "1": {
            "1": "295000"
        },
        "2": {
            "1": "370000"
        },
        "3": {
            "1": "295000"
        },
        "4": {
            "1": "240000"
        }
    }
}
}
...
    }}
        }

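I used this code but did not get correctly encoded text: with codecs.open('lastssecond/page.json', 'w', encoding='utf-8') as f: json.dump(jsonstr[0], f, ensure_ascii=False). It saves all non-English characters as \u0645 or similar.

Per the signature json.dump(obj, fp, *, ...), the first argument should be a JSON object, not the raw string. As mentioned in the answer above, you should first parse the string into a JSON object and then dump that object to the file.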
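The advice in this comment can be sketched roughly as follows. This is only an illustration under assumptions: the `raw` string is a made-up stand-in for `jsonstr[0]`, the helper name `save_parsed_json` and the temporary output path are hypothetical, and the real page data may differ.

```python
import json
import os
import tempfile

# Hypothetical stand-in for jsonstr[0]: raw text with literal \uXXXX escapes.
raw = '{"title": "\\u062a\\u0648\\u0631", "id": 24588}'


def save_parsed_json(raw_text, path):
    """Parse raw JSON text, then write it as UTF-8 with non-ASCII kept readable."""
    obj = json.loads(raw_text)  # decodes the \uXXXX escapes into real characters
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Persian text as-is instead of re-escaping it
        json.dump(obj, f, ensure_ascii=False, indent=2)
    return obj


# Assumed output location for the demo; use your own path in a real project.
out_path = os.path.join(tempfile.gettempdir(), "page.json")
obj = save_parsed_json(raw, out_path)

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Dumping the parsed object rather than the raw string is what lets `ensure_ascii=False` take effect, since the escapes are only decoded by `json.loads`.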

$ scrapy shell https://lastsecond.ir/tours/24588-%D8%AA%D9%88%D8%B1-%D9%85%D8%B4%D9%87%D8%AF-22-%D8%AF%DB%8C-96-%D8%A7%D8%B2-%D8%A7%D8%B5%D9%81%D9%87%D8%A7%D9%86
 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
 ....
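>>> import re
>>> import json
>>> # capture the object literal assigned to `var tourcode`; re.S lets . match newlines
>>> jsonstr = re.findall(r"var tourcode\s*=\s*(.+?);\n", response.body.decode('utf-8'), re.S)
>>> # turn the captured text into Python dicts and lists
>>> jsonobj = json.loads(jsonstr[0])
>>> # parse json object here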
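As a possible continuation of the "parse json object here" step, the sketch below pulls the hotel name and image URL out of the parsed data, which are the values the `:alt` and `:src` bindings in the question refer to. It runs offline against a trimmed sample page. `SAMPLE_HTML`, `extract_tour`, and `iter_hotels` are hypothetical names, and the `packages` > `bundles` > `hotels` nesting is assumed from the excerpt above, so it may need adjusting against the real page.

```python
import json
import re

# Trimmed, made-up stand-in for response.text; nesting assumed from the excerpt.
SAMPLE_HTML = """<html><body>
<img :src="hotel.imageUrl" class="hotelpic" :alt="hotel.name">
<script>
var tourcode = {"id": 24588, "packages": {"bundles": {"308892": {"id": 308892,
  "hotels": [{"name": "Mehr Reza hotel",
              "imageUrl": "https://lastsecond.ir/site/images/placeholder/hotel.svg"}]}}}};
</script>
</body></html>"""


def extract_tour(html):
    """Return the object assigned to `var tourcode`, or None if it is absent."""
    match = re.search(r"var tourcode\s*=\s*(.+?);\n", html, re.S)
    return json.loads(match.group(1)) if match else None


def iter_hotels(tour):
    """Yield name and image URL for every hotel in every bundle."""
    bundles = tour.get("packages", {}).get("bundles", {})
    for bundle in bundles.values():
        for hotel in bundle.get("hotels", []):
            yield {"name": hotel.get("name"), "image": hotel.get("imageUrl")}


tour = extract_tour(SAMPLE_HTML)
hotels = list(iter_hotels(tour))
print(hotels)
```

Inside a spider, the same two helpers could be called from the `parse` callback with `response.text` in place of `SAMPLE_HTML`.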