JavaScript regex to grab a <script> tag


I am trying to target the entire script tag whose content contains "@type": "NewsArticle".

Something like:

<script type="application\/ld\+json">[^\{]*?{(.*?)\}[^\}]*?<\/script>
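I can target the topmost script tag with the regex above. But the one I am looking for is the script holding the NewsArticle JSON, which is the second one in this example. Some pages have 4+ application/ld+json tags, but "@type": "NewsArticle" is always present on every page, no matter what. So I am looking for a pattern that can target that specific script.

Thanks for your help.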


<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "Organization",
    "@id": "https://www.givemesport.com/#gms",
    "name": "GiveMeSport",
    "url": "https://www.givemesport.com",
    "logo": {
        "@type": "ImageObject",
        "url": "https://gmsrp.cachefly.net/v4/images/logo-gms-black.png"
    },
    "sameAs":[
        "https://www.facebook.com/GiveMeSport",
        "https://www.instagram.com/givemesport",
        "https://twitter.com/GiveMeSport",
        "https://www.youtube.com/user/GiveMeSport"
    ]
}
</script>
    <script type="application/ld+json">
    {
    "@context": "http://schema.org",
    "@type": "NewsArticle",
    "mainEntityOfPage": "https://www.givemesport.com/1612447-man-uniteds-scott-mctominay-delighted-fans-with-reaction-after-third-goal-vs-rb-leipzig",
    "url": "https://www.givemesport.com/1612447-man-uniteds-scott-mctominay-delighted-fans-with-reaction-after-third-goal-vs-rb-leipzig",
    "headline": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
    "datePublished": "2020-10-30T21:52:48.3510000Z",
    "dateModified": "2020-10-30T21:52:48.3510000Z",
    "description": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
    "articleSection": "Football",
    "keywords": ["Football","Manchester United","Marcus Rashford","RB Leipzig","Scott McTominay","UEFA Champions"],
    "creator": ["Scott Wilson"],
    "thumbnailUrl": "https://gmsrp.cachefly.net/images/20/10/30/03a426c8204af5c8d02282afaeed6189/144.jpg",
    "author": {
    "@type": "Person",
    "name": "Scott Wilson",
    "sameAs": "https://www.givemesport.com/scott-wilson-1"
    },
    "publisher": {
    "@id": "https://www.givemesport.com/#gms"
    },
    "image": {
    "@type": "ImageObject",
    "url": "https://gmsrp.cachefly.net/images/20/10/30/03a426c8204af5c8d02282afaeed6189/960.jpg",
    "height": 620,
    "width": 960
    }
    }
</script>



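Unfortunately, you do not want to follow best practice; parsing HTML with regular expressions is fraught with problems. However, if you want a quick and dirty solution, use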

<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>
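A minimal sketch of how a pattern of this shape could be applied in JavaScript. The regex literal is assembled from the token-by-token breakdown given in this answer; `extractNewsArticle` and `sampleHtml` are hypothetical names, and the sample is a shortened stand-in for the page in the question.

```javascript
// Pattern assembled from the token-by-token breakdown in this answer.
const NEWS_ARTICLE_RE =
  /<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>/;

// Hypothetical helper: returns the parsed NewsArticle object, or null.
function extractNewsArticle(html) {
  const match = NEWS_ARTICLE_RE.exec(html);
  if (!match) return null;
  try {
    return JSON.parse(match[1]); // group 1 is the text between the tags
  } catch (err) {
    return null; // the captured block was not valid JSON
  }
}

// Shortened stand-in for the page in the question (an assumption).
const sampleHtml = `
<script type="application/ld+json">
{ "@context": "http://schema.org", "@type": "Organization", "name": "GiveMeSport" }
</script>
<script type="application/ld+json">
{ "@context": "http://schema.org", "@type": "NewsArticle", "articleSection": "Football" }
</script>`;

const article = extractNewsArticle(sampleHtml);
console.log(article && article["@type"]); // NewsArticle
```

Parsing the captured text with `JSON.parse` gives access to the individual fields and also rejects a match whose content is not valid JSON.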

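It doesn't sound like regex is the right tool for this. It will get slow and inaccurate.

If there is no … in the script tag, you can try ….

Parsing HTML with regex is not good practice: you can parse the HTML string in JS and then use getElementsByTagName, or a CSS selector: …

@mkczyk I am not in …. In this case, I am not worried about good practice.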
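The comment above suggests parsing the HTML and selecting elements instead of matching text. A rough sketch of that idea, assuming a DOM is available (for instance through the browser's `DOMParser`); `findNewsArticle` is a hypothetical name, and the function accepts any object that exposes `querySelectorAll`.

```javascript
// Hypothetical helper: scans every JSON-LD script element in a parsed
// document and returns the first object whose "@type" is "NewsArticle".
function findNewsArticle(doc) {
  const scripts = doc.querySelectorAll('script[type="application/ld+json"]');
  for (const script of scripts) {
    let data;
    try {
      data = JSON.parse(script.textContent);
    } catch (err) {
      continue; // skip blocks that are not valid JSON
    }
    if (data && data["@type"] === "NewsArticle") {
      return data;
    }
  }
  return null; // no NewsArticle block found
}

// Browser usage (DOMParser is a browser API and is not defined in plain Node.js):
// const doc = new DOMParser().parseFromString(htmlString, "text/html");
// const article = findNewsArticle(doc);
```

Because the check runs on the parsed value rather than on the raw text, it does not depend on the spacing around the colon or on where the key sits in the object. The breakdown below describes the regex from the answer rather than this DOM-based sketch.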
--------------------------------------------------------------------------------
  <script                  '<script type="application'
  type="application
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  ld                       'ld'
--------------------------------------------------------------------------------
  \+                       '+'
--------------------------------------------------------------------------------
  json">                   'json">'
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        <                        '<'
--------------------------------------------------------------------------------
        \/?                      '/' (optional (matching the most
                                 amount possible))
--------------------------------------------------------------------------------
        script                   'script'
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      [\w\W]                   any character of: word characters (a-
                               z, A-Z, 0-9, _), non-word characters
                               (all but a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
    )*?                      end of grouping
--------------------------------------------------------------------------------
    "@type":                 '"@type":'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    "NewsArticle"            '"NewsArticle"'
--------------------------------------------------------------------------------
    [\w\W]*?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), non-word characters (all
                             but a-z, A-Z, 0-9, _) (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  script>                  'script>'
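To illustrate why the breakdown above includes the negative look-ahead, here is a small comparison. The `naive` pattern is an illustration written for this sketch, not something from the answer; `html` is a two-block stand-in for the page in the question.

```javascript
const html = [
  '<script type="application/ld+json">{"@type": "Organization"}</script>',
  '<script type="application/ld+json">{"@type": "NewsArticle"}</script>',
].join("\n");

// Naive variant (illustration only): a plain lazy wildcard is free to run
// past the first </script> until it reaches "NewsArticle".
const naive =
  /<script type="application\/ld\+json">([\w\W]*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>/;

// Variant from the breakdown: the look-ahead stops the scan at any
// "<script" or "</script", so a match cannot span two script elements.
const tempered =
  /<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>/;

const naiveCapture = html.match(naive)[1];
const temperedCapture = html.match(tempered)[1];

console.log(naiveCapture.includes("Organization"));    // true: spans both blocks
console.log(temperedCapture.includes("Organization")); // false: second block only
console.log(temperedCapture);                          // {"@type": "NewsArticle"}
```

With the look-ahead in place, an attempt that starts at the first script tag fails at its closing tag, and the engine retries from the next opening tag.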