JavaScript regex to capture a <script> tag

javascript json regex

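I'm trying to target the entire script tag whose content includes "@type": "NewsArticle".

For example:

<script type="application\/ld\+json">[^\{]*?{(.*?)\}[^\}]*?<\/script>

I can use the regex above to target the topmost script tag. But the one I'm looking for is the one holding the NewsArticle JSON, which is the second tag in this example. Some pages have 4+ application/ld+json tags, but "@type": "NewsArticle" is always present on every page, no matter what. So I'm looking for a pattern that can target that specific script.

Thanks for your help.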


<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "Organization",
    "@id": "https://www.givemesport.com/#gms",
    "name": "GiveMeSport",
    "url": "https://www.givemesport.com",
    "logo": {
        "@type": "ImageObject",
        "url": "https://gmsrp.cachefly.net/v4/images/logo-gms-black.png"
    },
    "sameAs":[
        "https://www.facebook.com/GiveMeSport",
        "https://www.instagram.com/givemesport",
        "https://twitter.com/GiveMeSport",
        "https://www.youtube.com/user/GiveMeSport"
    ]
}
</script>
    <script type="application/ld+json">
    {
    "@context": "http://schema.org",
    "@type": "NewsArticle",
    "mainEntityOfPage": "https://www.givemesport.com/1612447-man-uniteds-scott-mctominay-delighted-fans-with-reaction-after-third-goal-vs-rb-leipzig",
    "url": "https://www.givemesport.com/1612447-man-uniteds-scott-mctominay-delighted-fans-with-reaction-after-third-goal-vs-rb-leipzig",
    "headline": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
    "datePublished": "2020-10-30T21:52:48.3510000Z",
    "dateModified": "2020-10-30T21:52:48.3510000Z",
    "description": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
    "articleSection": "Football",
    "keywords": ["Football","Manchester United","Marcus Rashford","RB Leipzig","Scott McTominay","UEFA Champions"],
    "creator": ["Scott Wilson"],
    "thumbnailUrl": "https://gmsrp.cachefly.net/images/20/10/30/03a426c8204af5c8d02282afaeed6189/144.jpg",
    "author": {
    "@type": "Person",
    "name": "Scott Wilson",
    "sameAs": "https://www.givemesport.com/scott-wilson-1"
    },
    "publisher": {
    "@id": "https://www.givemesport.com/#gms"
    },
    "image": {
    "@type": "ImageObject",
    "url": "https://gmsrp.cachefly.net/images/20/10/30/03a426c8204af5c8d02282afaeed6189/960.jpg",
    "height": 620,
    "width": 960
    }
    }
</script>
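
For comparison with the regex route, the comments on this question suggest parsing the HTML and selecting the script elements instead. The following is only a minimal sketch of that idea, not code from the thread: `findJsonLdByType` is a hypothetical helper name, and it assumes the wanted "@type" is a single string at the top level of each JSON-LD block.

```javascript
// Hypothetical helper (not from the original thread).
// `doc` is anything that exposes querySelectorAll, for example the browser's
// `document` or a Document returned by DOMParser.parseFromString.
function findJsonLdByType(doc, wantedType) {
  const scripts = doc.querySelectorAll('script[type="application/ld+json"]');
  for (const script of scripts) {
    let data;
    try {
      data = JSON.parse(script.textContent);
    } catch (err) {
      continue; // skip blocks that are not valid JSON
    }
    // Assumption: "@type" is a plain string on the top-level object.
    if (data !== null && typeof data === "object" && data["@type"] === wantedType) {
      return data;
    }
  }
  return null; // no block with the wanted type was found
}

// Usage in a browser (left as comments because it needs a DOM):
//   const doc = new DOMParser().parseFromString(htmlString, "text/html");
//   const article = findJsonLdByType(doc, "NewsArticle");
```

Because each block is parsed as JSON before "@type" is checked, a nested "@type" such as the author's "Person" does not affect which block is returned.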



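Doesn't sound like regex is the right tool for this. It will be slow and inaccurate. If the script tag contains no …, you could try ….

Parsing HTML with regex is not good practice: you can parse the HTML string in JS and then use getElementsByTagName or a CSS selector.

@mkczyk In this case I'm not worried about good practice.

It is a pity you don't want to follow best practice; parsing HTML with regex is fraught with problems. However, if you want the job done quick and dirty, use

<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>

Explanation of the pattern: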
--------------------------------------------------------------------------------
  <script                  '<script type="application'
  type="application
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  ld                       'ld'
--------------------------------------------------------------------------------
  \+                       '+'
--------------------------------------------------------------------------------
  json">                   'json">'
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        <                        '<'
--------------------------------------------------------------------------------
        \/?                      '/' (optional (matching the most
                                 amount possible))
--------------------------------------------------------------------------------
        script                   'script'
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      [\w\W]                   any character of: word characters (a-
                               z, A-Z, 0-9, _), non-word characters
                               (all but a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
    )*?                      end of grouping
--------------------------------------------------------------------------------
    "@type":                 '"@type":'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    "NewsArticle"            '"NewsArticle"'
--------------------------------------------------------------------------------
    [\w\W]*?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), non-word characters (all
                             but a-z, A-Z, 0-9, _) (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  script>                  'script>'
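
To make the breakdown above concrete, here is a small sketch that applies the pattern it describes, written as a JavaScript regex literal, and parses capture group 1 as JSON. The helper name `extractNewsArticle` and the shortened sample HTML are assumptions made for illustration; they are not part of the original answer.

```javascript
// The pattern described above: the tempered token (?:(?!<\/?script)[\w\W])*?
// stops the match from running past a script boundary, so a block without
// "NewsArticle" cannot be matched together with a later block.
const NEWS_ARTICLE_RE =
  /<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>/;

// Hypothetical helper: returns the parsed NewsArticle object, or null when
// nothing matches or the captured text is not valid JSON.
function extractNewsArticle(html) {
  const match = NEWS_ARTICLE_RE.exec(html);
  if (match === null) return null;
  try {
    return JSON.parse(match[1]);
  } catch (err) {
    return null;
  }
}

// Shortened stand-in for the page in the question (assumed sample data).
const sampleHtml = `
<script type="application/ld+json">
{ "@type": "Organization", "name": "GiveMeSport" }
</script>
<script type="application/ld+json">
{ "@type": "NewsArticle", "headline": "Example headline",
  "author": { "@type": "Person", "name": "Scott Wilson" } }
</script>`;

const article = extractNewsArticle(sampleHtml);
console.log(article.headline); // prints "Example headline"
```

One caveat of this approach: the pattern looks for the text "@type": "NewsArticle" anywhere inside a block, so a block of another type that merely nests a NewsArticle object would also match.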