
Converting HTML to JSON with pup/jq and extracting data into a variable


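I have an HTML file containing data that I am trying to match. I am doing this in bash, and since bash alone cannot do it, I pipe the HTML through pup (as suggested on StackOverflow) and use pup to extract some patterns. That leaves me with a lot of JSON and data I don't need, so I then run sed commands to delete the lines I don't want. I am trying to find a way to use jq to select only the data I need, so that I no longer have to run sed to strip the unwanted lines.

So I run the command: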

cat test.html | pup 'div.scene json{}' > out.json
which generates the following:

 [
  {
   "children": [
    {
     "children": [
      {
       "class": "icon-new active",
       "tag": "div"
      },
      {
       "children": [
        {
         "children": [
          {
           "alt": "Album Title - Artist Name - 1",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 2",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 3",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 4",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 5",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "class": "last",
           "tag": "span"
          }
         ],
         "class": "sample-picker clearfix",
         "data-trackid": "bhangra-tracking-id",
         "href": "/bhangra/album/view/2842847/title-of-album/",
         "tag": "a",
         "title": "Album Title"
        }
       ],
       "class": "card-overlay",
       "tag": "div"
      },
      {
       "children": [
       {
         "alt": "Album Title",
         "class": "lazy card-main-img",
         "data-src": "",
         "tag": "img",
         "title": "Album Title"
        }
       ],
       "data-trackid": "bhangra-tracking-id  ",
       "href": "/bhangra/album/view/2842847/title-of-album/",
       "tag": "a",
       "title": "Album Title"
      }
     ],
     "class": "card-image",
     "tag": "div"
    },
    {
     "children": [
      {
       "children": [
        {
         "data-trackid": "scene-card-info-title Album Title ",
         "href": "/bhangra/album/view/2842847/title-of-album/",
         "tag": "a",
         "text": "Album Title",
         "title": "Album Title"
        }
       ],
       "class": "scene-card-title",
       "tag": "div"
      },
      {
       "children": [
        {
         "data-trackid": "scene-card-model name Artist Name modelid=1111 ",
         "href": "/bhangra/profile/view/2842847/artist-name/",
         "tag": "a",
         "text": "Artist Name",
         "title": "Artist Name"
        }
       ],
       "class": "model-names",
       "tag": "div"
      },
      {
       "tag": "time",
       "text": "September 08, 2018"
      },
      {
       "children": [
        {
         "children": [
          {
           "class": "label-left-box",
           "tag": "span",
           "text": "Website Name"
          },
          {
           "class": "label-text",
           "tag": "span",
           "text": "Website URL"
          }
         ],
         "class": "collection label-small",
         "data-trackid": "scene-card-collection",
         "href": "/bhangra/main/id/url/",
         "tag": "a",
         "title": "Website URL"
        },
        {
         "class": "label-hd ",
         "tag": "span"
        },
        {
         "children": [
          {
           "children": [
            {
             "class": "icons like-icon",
             "tag": "span"
            },
            {
             "class": "like-amount",
             "tag": "var",
             "text": "0"
            }
           ],
           "class": "likes",
           "tag": "span"
          },
          {
           "children": [
            {
             "class": "icons dislike-icon",
             "tag": "span"
            },
            {
             "class": "dislike-amount",
             "tag": "var",
             "text": "0"
            }
           ],
           "class": "dislikes",
           "tag": "span"
          }
         ],
         "class": "label-rating",
         "tag": "span"
        }
       ],
       "class": "bhangra-information",
       "tag": "div"
      }
     ],
     "class": "scene-card-info",
     "tag": "div"
    }
   ],
   "class": "bhangra-card scene ",
   "tag": "div"
  }
 ]
I then use jq to return some of the details I want:

 cat out.json | jq '.[] | {"1": .children[1].children[0].children, "2": .children[1].children[1].children, "date": .children[1].children[2].text}'
This returns the following:

 {
   "1": [
     {
       "data-trackid": "scene-card-info-title Album Title ",
       "href": "/bhangra/album/view/2842847/title-of-album/",
       "tag": "a",
       "text": "Album Title",
       "title": "Album Title"
     }
   ],
   "2": [
     {
       "data-trackid": "scene-card-model name Artist Name modelid=1111 ",
       "href": "/bhangra/profile/view/2842847/artist-name/",
       "tag": "a",
       "text": "Artist Name",
       "title": "Artist Name"
     }
   ],
   "date": "September 08, 2018"
 }
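In the case above, the next album (Album2) also gets keys 1 and 2 followed by date. The combined output is therefore not valid as a single JSON document, and because the keys are the same for every album I cannot target the data I want.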

To work around this, I run a series of sed commands to delete the unwanted lines above.
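
The behaviour described above can be reproduced on a much smaller input. The following is only a sketch: the two-element `albums` array and its `title`/`date` fields are hypothetical stand-ins, not the real pup output. It shows that `.[] | {...}` emits one object per element as a stream of separate JSON texts, while wrapping the same filter in `[...]` collects them into one array.

```shell
#!/bin/sh
# Hypothetical, simplified input: one flat object per album.
albums='[{"title":"Album Title","date":"September 08, 2018"},{"title":"Album2 Title","date":"September 09, 2018"}]'

# .[] | {...} emits one object per array element: a stream of two JSON texts,
# each with the same keys, not one combined document.
stream=$(printf '%s' "$albums" | jq -c '.[] | {album: .title, date: .date}')

# Wrapping the filter in [...] collects the stream into a single array.
array=$(printf '%s' "$albums" | jq -c '[.[] | {album: .title, date: .date}]')

printf '%s\n' "$stream"
printf '%s\n' "$array"
```

Each line of `stream` parses on its own, but the lines together are not a single JSON value, which may be why the result looked like invalid syntax.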

Below is what I would like the initial jq query to return, but I am not sure how to return this specific data:

 { 
   "1" : {
            "album": "Album Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist Name",
            "date": "September 08, 2018"
   },
   "2" : {
            "album": "Album1 Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist1 Name",
            "date": "September 08, 2018"
   },
   "3" : {
            "album": "Album2 Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist2 Name",
            "date": "September 09, 2018"
   }
 }
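Update (edited 11 September 2018)

I have made some progress on this. With the query below I managed to extract the data I need, but the results are still separate arrays: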

 cat out.json | jq '.[] | .children[1].children[0].children[], .children[1].children[1].children[], .children[1].children[2] | {WTF: .title, href, text}'
This outputs the following, which gets me a little closer to what I want (the previous example):


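The connection between the input JSON and the JSON described as the desired output seems tenuous, but one way to label objects with sequentially numbered keys is to use the following function: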

def tag(s):
  reduce s as $x ({n:0, o:{}} ;
    .n += 1
    | .o += { (.n|tostring): $x})
  | .o;
Here, s should be a stream of JSON entities, and the result is a single object with the keys "1", "2", and so on.
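
As a quick check of how tag behaves, here is a minimal sketch that runs it on a stream of three strings. The strings are arbitrary placeholders; `-n` tells jq to run without reading any input and `-c` prints compact output.

```shell
#!/bin/sh
# Number a stream of three placeholder strings with the tag function above.
# The comma-separated values form ONE argument (a stream), because jq
# separates function arguments with ';', not ','.
result=$(jq -nc '
  def tag(s):
    reduce s as $x ({n:0, o:{}};
      .n += 1
      | .o += { (.n|tostring): $x })
    | .o;
  tag("a", "b", "c")
')
printf '%s\n' "$result"
# prints {"1":"a","2":"b","3":"c"}
```

The counter `.n` is incremented before the key is built, so numbering starts at "1" rather than "0".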

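So the task now is to generate the stream of desired objects. Since it is not clear what you want, the following may serve as an illustration: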

{date: first(.. | objects | select(.tag == "time" and has("text")) | .text)} as $date
| tag(.. 
      | objects
      | select(has("title") and (has("children")|not) and .title == "Album Title")
      + $date )

By converting it to JSON, you are making things more complicated.
The original task could not be done on the HTML with bash alone, which is why I tried pup/jq.
Thanks. I can't seem to work out the path for the stream to pass in as s.

Output of the filter above:
{
  "1": {
    "alt": "Album Title - Artist Name - 1",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "2": {
    "alt": "Album Title - Artist Name - 2",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "3": {
    "alt": "Album Title - Artist Name - 3",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "4": {
    "alt": "Album Title - Artist Name - 4",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "5": {
    "alt": "Album Title - Artist Name - 5",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "6": {
    "alt": "Album Title",
    "class": "lazy card-main-img",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "7": {
    "data-trackid": "scene-card-info-title Album Title ",
    "href": "/bhangra/album/view/2842847/title-of-album/",
    "tag": "a",
    "text": "Album Title",
    "title": "Album Title",
    "date": "September 08, 2018"
  }
}
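
The output above numbers every childless element whose title is "Album Title", which is not quite the shape asked for in the question. One possible way to get closer is sketched below, under some assumptions. It assumes each top-level card contains a `scene-card-title` element, a `model-names` element and a `time` element, as in the sample. The embedded input is a trimmed stand-in for out.json, and its second card is invented for illustration, reusing the href from the question's desired output.

```shell
#!/bin/sh
# Trimmed, hypothetical stand-in for out.json: two cards, only the nodes used below.
cards=$(cat <<'EOF'
[
 {"tag":"div","class":"bhangra-card scene ","children":[
  {"tag":"div","class":"scene-card-info","children":[
   {"tag":"div","class":"scene-card-title","children":[
    {"tag":"a","href":"/bhangra/album/view/2842847/title-of-album/","text":"Album Title"}]},
   {"tag":"div","class":"model-names","children":[
    {"tag":"a","href":"/bhangra/profile/view/2842847/artist-name/","text":"Artist Name"}]},
   {"tag":"time","text":"September 08, 2018"}]}]},
 {"tag":"div","class":"bhangra-card scene ","children":[
  {"tag":"div","class":"scene-card-info","children":[
   {"tag":"div","class":"scene-card-title","children":[
    {"tag":"a","href":"/bhangra/album/view/2842847/title-of-album/","text":"Album2 Title"}]},
   {"tag":"div","class":"model-names","children":[
    {"tag":"a","href":"/bhangra/profile/view/2842847/artist-name/","text":"Artist2 Name"}]},
   {"tag":"time","text":"September 09, 2018"}]}]}
]
EOF
)

# For each top-level card, search its subtree by class or tag instead of by
# fixed index paths, then number the resulting objects with tag.
# A card missing any of the four fields produces no object at all.
result=$(printf '%s' "$cards" | jq -c '
  def tag(s):
    reduce s as $x ({n:0, o:{}};
      .n += 1
      | .o += { (.n|tostring): $x })
    | .o;
  tag(.[]
      | { album:  first(.. | objects | select(.class == "scene-card-title") | .children[0].text),
          href:   first(.. | objects | select(.class == "scene-card-title") | .children[0].href),
          artist: first(.. | objects | select(.class == "model-names") | .children[0].text),
          date:   first(.. | objects | select(.tag == "time") | .text) })
')
printf '%s\n' "$result"
```

Searching by class keeps the filter independent of the exact nesting depth, at the cost of relying on those class names staying stable in the page.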