Scraping hidden elements with BeautifulSoup

html json python-3.x

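I am trying to scrape data from a website for my project, but the output I get does not contain the tags I see in my browser's developer tools. Here is a snapshot of the DOM I want to extract data from: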

<div class="bigContainer">
      <!-- ngIf: products.grid_layout.length > 0 --><div ng-if="products.grid_layout.length > 0">
        <div class="fl">
          <!-- ngRepeat: product in products.grid_layout --><!-- ngIf: $index%3==0 -->
          <div ng-repeat="product in products.grid_layout" ng-if="$index%3==0" class="GridItems">
          <grid-item product="product" gakey="ga_key" idx="$index" ancestors="products.ancestors" is-search-item="isSearchItem" is-filter="isFilter">
              <a ng-href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" ng-click="searchProductTrack(product, idx+1)" tabindex="0" href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" class="" style="">
           </grid-item>

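I can get the div tag with class "bigContainer", but I cannot scrape the tags inside it. For example, if I try to get the grid-item tags, I get an empty list, which suggests that no such tags exist. Why does this happen? Please help.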

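You can extract the grid item details from the underlying web API. The items are rendered by the AngularJS JavaScript framework, so the HTML is not static.

One way to handle this is to use Selenium to fetch the data, but identifying the web API with the browser's developer tools is quite simple.

Edit: I used the Firebug add-on in Firefox and looked at the GET requests listed in its Net tab.

The page issues a GET request (its URL appears in the code further down).

That request returns a JavaScript callback script whose body is almost entirely JSON data.

The returned JSON contains the details of the grid items.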
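A response of this kind is usually called JSONP: a function call whose single argument is the JSON payload. As a minimal sketch, assuming the body has the shape `callbackName(...)` with an optional trailing semicolon, the wrapper can be removed with a regular expression. The helper name `unwrap_jsonp` and the sample body are made up for illustration and are not part of the site's API.

```python
import json
import re

# Matches callbackName( ... ) with an optional trailing semicolon.
# The callback name may contain dots, e.g. angular.callbacks._3.
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(body: str):
    """Return the parsed JSON payload from a JSONP response body."""
    match = _JSONP_RE.match(body)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Hypothetical, shortened response body for illustration only
sample = 'angular.callbacks._3({"grid_layout": [{"brand": "Samsung", "tag": "a);b"}]});'
payload = unwrap_jsonp(sample)
print(payload["grid_layout"][0]["brand"])  # Samsung
```

Because the pattern is anchored at both ends and the capture is greedy, a `);` that happens to occur inside a string value does not cut the payload short.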

Each grid item is described by a JSON object like this:

{
        "product_id": 23491960,
        "complex_product_id": 7287171,
        "name": "Samsung Galaxy Z1 (Black)",
        "short_desc": "",
        "bullet_points": {
            "salient_feature": ["Screen: 10.16 cm (4\")", "Camera: 3.1 MP Rear/VGA Front", "RAM: 768 MB", "ROM: 4 GB", "Dual-core 1.2 GHz Cortex-A7", "Battery: 1500 mAh/Li-Ion"]
        },
        "url": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
        "seourl": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
        "url_type": "product",
        "promo_text": null,
        "image_url": "https://assetscdn.paytm.com/images/catalog/product/M/MO/MOBSAMSUNG-Z1-BSMAR2320696B3C745/2.jpg",
        "vertical_id": 18,
        "vertical_label": "Mobile",
        "offer_price": 5090,
        "actual_price": 5799,
        "merchant_name": "SMARTBUY",
        "authorised_merchant": false,
        "stock": true,
        "brand": "Samsung",
        "tag": "+5% Cashback",
        "product_tag": "+5% Cashback",
        "shippable": true,
        "created_at": "2015-09-17T08:28:25.000Z",
        "updated_at": "2015-12-29T05:55:29.000Z",
        "img_width": 400,
        "img_height": 400,
        "discount": "12"
    }
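Note that the prices in this object are numbers while `discount` is a string. In this one sample, `discount` matches the percentage saved on `actual_price`, rounded to the nearest integer; whether the API always computes it that way is an assumption. A small sketch of that check, using a hypothetical helper `discount_percent`:

```python
def discount_percent(item: dict) -> int:
    """Percentage saved on the list price, rounded to the nearest integer."""
    actual = item["actual_price"]
    offer = item["offer_price"]
    return round((actual - offer) / actual * 100)


# Only the fields used here, copied from the sample object above
item = {
    "name": "Samsung Galaxy Z1 (Black)",
    "offer_price": 5090,
    "actual_price": 5799,
    "discount": "12",
}

print(discount_percent(item))  # 12
# "discount" is a string in the JSON, so convert it before comparing
print(discount_percent(item) == int(item["discount"]))  # True
```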

So you can get the details as follows, without using BeautifulSoup at all:

import json
import requests

# JSONP endpoint that the AngularJS page calls to load the grid items
url = "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error

# The body looks like angular.callbacks._3({...}); so keep only the text
# between the first "(" and the last ")" before parsing it as JSON
text = response.text
json_text = text[text.index("(") + 1:text.rindex(")")]
data = json.loads(json_text)
grid_data = data["grid_layout"]

# Print a few fields of every product in the grid
for grid_item in grid_data:
    print("Brand:", grid_item["brand"])
    print("Product Name:", grid_item["name"])
    print("Current Price: Rs", grid_item["offer_price"])
    print("==================")

You will get output like this:

Brand: Samsung
Product Name: Samsung Galaxy Z1 (Black)
Current Price: Rs 4990
==================
Brand: Samsung
Product Name: Samsung Galaxy A7 (Gold)
Current Price: Rs 22947
==================

Hope this helps.

You can set a user agent to get the complete data. Try something like this:

Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0").timeout(10 * 1000).get();
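For readers following along in Python rather than Java, the same idea can be sketched with the standard library's `urllib.request`. The helper names `build_request` and `fetch_html` and the example URL are placeholders. Sending a browser-like User-Agent only changes what the server returns; it does not execute JavaScript, so content that AngularJS inserts on the client side would still be absent.

```python
import urllib.request

# Same User-Agent string as in the Jsoup snippet above
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0"


def build_request(url: str) -> urllib.request.Request:
    """Create a GET request that identifies itself as a desktop browser."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download the page; the default mirrors the 10-second timeout of the Jsoup call."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Placeholder URL for illustration; building the request makes no network call
request = build_request("https://example.com/")
print(request.get_header("User-agent"))
```

With the requests library used elsewhere on this page, the equivalent call is `requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)`.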

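Please share the code you have written so far.

r = requests.get(url) soup = BeautifulSoup(r.content, "html.parser") plink = soup.find_all("div", {"class": "f1"})[0].find_all("grid-item")[0]

Check the HTML that is passed to BeautifulSoup (i.e. r.content). It may differ from the HTML shown in the developer toolbar. If the tags are missing, the content is probably inserted into the page by JavaScript. If that is the case, you will need to obtain that content some other way.

When I try to print the div tag with class bigContainer, soup shows nothing. I would still like to know how to scrape this data.

Then check whether the page sends any web API requests, which I think is the case here, since the tags indicate that they use AngularJS. If so, we can use that API to collect the data.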
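The check suggested in these comments, whether a tag is present in the HTML the server actually sent, can be sketched as follows. To keep the example free of dependencies it uses the standard library's `html.parser` instead of BeautifulSoup. Both HTML strings are made-up stand-ins: one for a raw server response and one for the rendered DOM shown in the question.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lower case
        self.tags.add(tag)


def has_tag(html: str, tag_name: str) -> bool:
    """Return True if the markup contains at least one <tag_name> element."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return tag_name.lower() in collector.tags


# Hypothetical server response: the container exists but is still empty
raw_html = '<div class="bigContainer"><div ng-view></div></div>'
# Hypothetical DOM after rendering, modelled on the snapshot in the question
rendered_html = (
    '<div class="bigContainer"><grid-item product="product">'
    '<a href="/shop/p/x">x</a></grid-item></div>'
)

print(has_tag(raw_html, "grid-item"))       # False
print(has_tag(rendered_html, "grid-item"))  # True
```

If the tag is missing from `r.content` but visible in the developer tools, the element was added by JavaScript after the page loaded, and the data has to come from the API call or from a real browser session.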