Using pup/jq to convert HTML to JSON and extract data into variables
I have an HTML file containing data that I am trying to match against. I am doing this in bash, and since bash cannot parse HTML on its own, I pipe the HTML through pup (as suggested on Stack Overflow) and use pup to extract certain patterns. That leaves me with a large amount of JSON, much of it data I don't need, so I then run sed commands to delete the unwanted lines. I am trying to find a way to use jq to select only the data I need, so that I no longer have to run sed to strip the unwanted lines.

So I run the command:
cat test.html | pup 'div.scene json{}' > out.json
This generates the following:
[
{
"children": [
{
"children": [
{
"class": "icon-new active",
"tag": "div"
},
{
"children": [
{
"children": [
{
"alt": "Album Title - Artist Name - 1",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 2",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 3",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 4",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 5",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"class": "last",
"tag": "span"
}
],
"class": "sample-picker clearfix",
"data-trackid": "bhangra-tracking-id",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"title": "Album Title"
}
],
"class": "card-overlay",
"tag": "div"
},
{
"children": [
{
"alt": "Album Title",
"class": "lazy card-main-img",
"data-src": "",
"tag": "img",
"title": "Album Title"
}
],
"data-trackid": "bhangra-tracking-id ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"title": "Album Title"
}
],
"class": "card-image",
"tag": "div"
},
{
"children": [
{
"children": [
{
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title"
}
],
"class": "scene-card-title",
"tag": "div"
},
{
"children": [
{
"data-trackid": "scene-card-model name Artist Name modelid=1111 ",
"href": "/bhangra/profile/view/2842847/artist-name/",
"tag": "a",
"text": "Artist Name",
"title": "Artist Name"
}
],
"class": "model-names",
"tag": "div"
},
{
"tag": "time",
"text": "September 08, 2018"
},
{
"children": [
{
"children": [
{
"class": "label-left-box",
"tag": "span",
"text": "Website Name"
},
{
"class": "label-text",
"tag": "span",
"text": "Website URL"
}
],
"class": "collection label-small",
"data-trackid": "scene-card-collection",
"href": "/bhangra/main/id/url/",
"tag": "a",
"title": "Website URL"
},
{
"class": "label-hd ",
"tag": "span"
},
{
"children": [
{
"children": [
{
"class": "icons like-icon",
"tag": "span"
},
{
"class": "like-amount",
"tag": "var",
"text": "0"
}
],
"class": "likes",
"tag": "span"
},
{
"children": [
{
"class": "icons dislike-icon",
"tag": "span"
},
{
"class": "dislike-amount",
"tag": "var",
"text": "0"
}
],
"class": "dislikes",
"tag": "span"
}
],
"class": "label-rating",
"tag": "span"
}
],
"class": "bhangra-information",
"tag": "div"
}
],
"class": "scene-card-info",
"tag": "div"
}
],
"class": "bhangra-card scene ",
"tag": "div"
}
]
I then use jq to return some of the details I want:
cat out.json | jq '.[] | {"1": .children[1].children[0].children, "2": .children[1].children[1].children, "date": .children[1].children[2].text}'
This returns the following:
{
"1": [
{
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title"
}
],
"2": [
{
"data-trackid": "scene-card-model name Artist Name modelid=1111 ",
"href": "/bhangra/profile/view/2842847/artist-name/",
"tag": "a",
"text": "Artist Name",
"title": "Artist Name"
}
],
"date": "September 08, 2018"
}
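In the case above, the next album (Album2) also comes back with the keys "1" and "2" followed by "date". The combined result is not valid JSON, and because the keys are identical for every album I cannot target the data I want.
To work around this, I ran a series of sed commands to delete the unwanted lines shown above.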
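On the invalid-syntax point: a filter of the form `.[] | {...}` emits one object per array element, back to back, which is a stream of JSON documents rather than a single document. The sketch below illustrates the difference on a made-up two-element input (the `input` string and the `album` key are assumptions for illustration, not taken from out.json); `map(...)` keeps the results inside one array.

```shell
#!/usr/bin/env bash
# Hypothetical two-element input standing in for out.json.
input='[{"text":"Album Title"},{"text":"Album1 Title"}]'

# One object per element: a stream of two separate JSON documents.
stream=$(printf '%s' "$input" | jq -c '.[] | {album: .text}')

# map(...) collects the results in a single array, i.e. one JSON document.
array=$(printf '%s' "$input" | jq -c 'map({album: .text})')

echo "$stream"
echo "$array"
```

The streamed form is still convenient for line-by-line processing in bash; the array form is the one that can be saved and parsed again as a single JSON file.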
Below is what I would like the initial jq query to return, but I am not sure how to return this specific data:
{
"1" : {
"album": "Album Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist Name",
"date": "September 08, 2018"
},
"2" : {
"album": "Album1 Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist1 Name",
"date": "September 08, 2018"
},
"3" : {
"album": "Album2 Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist2 Name",
"date": "September 09, 2018"
}
}
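Update (edited 11 September 2018)
I have made some progress on this. With the query below I managed to extract the data I need, but the results still come out as separate objects: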
cat out.json | jq '.[] | .children[1].children[0].children[], .children[1].children[1].children[], .children[1].children[2] | {WTF: .title, href, text}'
This gets me slightly closer to what I want (the previous example).
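The connection between the input JSON and the JSON described as the desired output seems tenuous, but one way to solve the problem of labeling objects with sequentially numbered keys is to use the following function: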
def tag(s):
reduce s as $x ({n:0, o:{}} ;
.n += 1
| .o += { (.n|tostring): $x})
| .o;
Here, s should be a stream of JSON entities; the result is a single object with the keys "1", "2", and so on.
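As a quick check of how tag behaves, the sketch below runs it on a stand-in stream of three strings (the strings are illustrative only) using jq's null-input mode, so no input file is needed.

```shell
#!/usr/bin/env bash
# Illustrative input only: three strings stand in for the real stream.
result=$(jq -nc '
  # Number each item of the stream s with the keys "1", "2", ...
  def tag(s):
    reduce s as $x ({n:0, o:{}};
      .n += 1
      | .o += { (.n|tostring): $x })
    | .o;
  tag("a", "b", "c")
')
echo "$result"
```

The counter .n lives in the reduction state alongside the output object .o, and only .o is returned at the end.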
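So the task now is to generate the stream of desired objects. Since it is not clear exactly what you want, the following may serve as an illustration: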
{date: first(.. | objects | select(.tag == "time" and has("text")) | .text)} as $date
| tag(..
| objects
| select(has("title") and (has("children")|not) and .title == "Album Title")
+ $date )
Comments:
By converting it to JSON you are making things more complicated.
The original task could not be done on the HTML directly with bash, which is why I tried pup/jq.
Thanks. I can't seem to work out the path of the stream to pass in as s.

Output of the query above:
{
"1": {
"alt": "Album Title - Artist Name - 1",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"2": {
"alt": "Album Title - Artist Name - 2",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"3": {
"alt": "Album Title - Artist Name - 3",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"4": {
"alt": "Album Title - Artist Name - 4",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"5": {
"alt": "Album Title - Artist Name - 5",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"6": {
"alt": "Album Title",
"class": "lazy card-main-img",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"7": {
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title",
"date": "September 08, 2018"
}
}
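One possible way to get closer to the album/href/artist/date shape from the question is to build one small object per card and pass that stream to tag. The following is a minimal sketch under stated assumptions: the helper first_match is hypothetical (not part of jq), the cut-down two-card `input` only mimics the structure of out.json, and it assumes each card has a div with class "scene-card-title", a div with class "model-names", and a time element, as in the sample above.

```shell
#!/usr/bin/env bash
# Cut-down, made-up input that mimics the shape of out.json (two cards).
input='[
 {"tag":"div","class":"bhangra-card scene ","children":[
  {"tag":"div","class":"scene-card-info","children":[
   {"tag":"div","class":"scene-card-title","children":[
    {"tag":"a","href":"/bhangra/album/view/2842847/title-of-album/","text":"Album Title"}]},
   {"tag":"div","class":"model-names","children":[
    {"tag":"a","href":"/bhangra/profile/view/2842847/artist-name/","text":"Artist Name"}]},
   {"tag":"time","text":"September 08, 2018"}]}]},
 {"tag":"div","class":"bhangra-card scene ","children":[
  {"tag":"div","class":"scene-card-info","children":[
   {"tag":"div","class":"scene-card-title","children":[
    {"tag":"a","href":"/bhangra/album/view/2842847/title-of-album/","text":"Album1 Title"}]},
   {"tag":"div","class":"model-names","children":[
    {"tag":"a","href":"/bhangra/profile/view/2842847/artist-name/","text":"Artist1 Name"}]},
   {"tag":"time","text":"September 08, 2018"}]}]}
]'

result=$(printf '%s' "$input" | jq -c '
  # Number each item of the stream s with the keys "1", "2", ...
  def tag(s):
    reduce s as $x ({n:0, o:{}};
      .n += 1
      | .o += { (.n|tostring): $x })
    | .o;

  # Hypothetical helper: first nested object (at any depth) satisfying f.
  def first_match(f): first(.. | objects | select(f));

  # One object per card; a card missing any of the three elements is skipped.
  tag(.[]
      | (first_match(.class == "scene-card-title") | .children[0]) as $album
      | (first_match(.class == "model-names")      | .children[0]) as $artist
      | { album:  $album.text,
          href:   $album.href,
          artist: $artist.text,
          date:   (first_match(.tag == "time") | .text) })
')
echo "$result"
```

Matching on class and tag rather than on fixed positions such as .children[1].children[0] should make the filter less sensitive to small changes in the markup, at the cost of relying on those class names staying the same.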