Ruby regex to find iframes whose src is not YouTube, Vimeo, or SoundCloud?

html ruby regex


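I want to write a regular expression that ignores iframes containing URLs from youtube, vimeo, or soundcloud. The input is an HTML-entity-encoded string.

This is what I tried, but it does not work. Some sample text is given below.

Regular expression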

&lt;iframe(^?youtube|soundcloud|vimeo)*\/iframe

Sample text

&lt;p&gt;&lt;iframe src="http://www.3you3tube.com/embed/YoX1yc92MOU" width="500" height="300" frameborder="0" scrolling="auto"&gt;&lt;/iframe&gt;&lt;/p&gt;
29  &lt;p&gt;text daily to place domain staff as volunteers with charity partners, we know all too well that the "V" word can sometimes be misunderstood. Occasionally seen as a dusty, worthy word, it can conjure images of coffee mornings and bric-a-brac stalls. So its not always as easy as you might think to get people to embrace their inner-volunteer. That's why the &lt;a href="http://www.domain.co.uk/sdfn/2010/11/connect-create-domain-volunteers.shtml"&gt;Conne

&lt;p&gt;&lt;iframe src="http://www.youtube.com/embed/YoX1yc92MOU" width="500" height="300" frameborder="0" scrolling="auto"&gt;&lt;/iframe&gt;&lt;/p&gt;
29  &lt;p&gt;text daily to place domain staff as volunteers with charity partners, we know all too well that the "V" word can sometimes be misunderstood. Occasionally seen as a dusty, worthy word, it can conjure images of coffee mornings and bric-a-brac stalls. So its not always as easy as you might think to get people to embrace their inner-volunteer. That's why the &lt;a href="http://www.domain.co.uk/sdfn/2010/11/connect-create-domain-volunteers.shtml"&gt;Conne

Sample output

&lt;iframe src="http://www.3you3tube.com/embed/YoX1yc92MOU" width="500" height="300" frameborder="0" scrolling="auto"&gt;&lt;/iframe&gt;

nil

To be clear:

I want to ignore iframes that contain youtube, vimeo, or soundcloud.

I am testing it on Rubular.

You can use this regular expression:

.*?iframe src=".*?(?:youtube|soundcloud|vimeo).*?".*|(.*?iframe src=".*?".*)

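As you can see, for the first input (highlighted in green) the capture group holds the output you specified in the question. For the match highlighted in blue nothing is captured, because the first alternative consumed it as a valid youtube, soundcloud, or vimeo iframe.

Match information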

MATCH 1
1.  [0-155] `&lt;p&gt;&lt;iframe src="http://www.3you3tube.com/embed/YoX1yc92MOU" width="500" height="300" frameborder="0" scrolling="auto"&gt;&lt;/iframe&gt;&lt;/p&gt;`
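
As a rough illustration of how this alternation behaves in Ruby, here is a minimal sketch. The helper name `wanted_iframe` and the shortened sample strings are assumptions made for the example, not part of the original answer. The idea is that the first branch swallows excluded iframes without capturing anything, so group 1 is only set for the iframes you want to keep.

```ruby
# The first branch consumes iframes from the excluded hosts without capturing;
# only the second branch fills capture group 1.
PATTERN = /.*?iframe src=".*?(?:youtube|soundcloud|vimeo).*?".*|(.*?iframe src=".*?".*)/

# Hypothetical helper: returns the captured line, or nil when the line holds
# an excluded iframe (or no iframe at all).
def wanted_iframe(line)
  match = PATTERN.match(line)
  match && match[1]
end

# Shortened, entity-encoded variants of the sample text from the question.
other   = '&lt;p&gt;&lt;iframe src="http://www.3you3tube.com/embed/YoX1yc92MOU" width="500"&gt;&lt;/iframe&gt;&lt;/p&gt;'
youtube = '&lt;p&gt;&lt;iframe src="http://www.youtube.com/embed/YoX1yc92MOU" width="500"&gt;&lt;/iframe&gt;&lt;/p&gt;'

p wanted_iframe(other)   # prints the whole line
p wanted_iframe(youtube) # prints nil
```

Note that `.` does not cross newlines without the `m` flag, so this works line by line. Also, the `.*?` after `src="` is not limited to the attribute value, so a line that mentions one of the three names anywhere after the `src` would be excluded as well.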

The key here is

iframe.*?src="(?![^"]*(?:youtube|vimeo|soundcloud))

so let me expand it for you:

iframe                          # literally match iframe
.*?                             # lazily match 0+ characters
src="                           # literally match src="
(?!                             # start negative lookahead assertion
  [^"]*                         # match 0+ non-" characters
  (?:youtube|vimeo|soundcloud)  # match one of the domains
)                               # end assertion

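So once the expression reaches the src attribute of the iframe, it makes a negative assertion for one of the domains after any number of non-" characters (in other words, up to the end of the src attribute). As long as none of these domains is found in the attribute, we continue by lazily matching the rest of the iframe (up to the closing tag).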
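
To see the lookahead in action, here is a small sketch that applies the full pattern (the lookahead plus a lazy match up to `/iframe`, with the `m` flag) to two strings. The constant name `NOT_EXCLUDED` and the shortened samples are assumptions for illustration only.

```ruby
# Negative lookahead: right after src=", refuse to continue if one of the
# excluded names appears before the closing quote of the attribute.
NOT_EXCLUDED = /iframe.*?src="(?![^"]*(?:youtube|vimeo|soundcloud)).*?\/iframe/m

# Shortened, entity-encoded variants of the sample text from the question.
other   = '&lt;p&gt;&lt;iframe src="http://www.3you3tube.com/embed/YoX1yc92MOU" width="500"&gt;&lt;/iframe&gt;&lt;/p&gt;'
youtube = '&lt;p&gt;&lt;iframe src="http://www.youtube.com/embed/YoX1yc92MOU" width="500"&gt;&lt;/iframe&gt;&lt;/p&gt;'

p other[NOT_EXCLUDED]   # prints the text from "iframe" through "/iframe"
p youtube[NOT_EXCLUDED] # prints nil
```

One thing to watch for: because `.*?` may cross newlines with the `m` flag, a match can begin inside an excluded iframe and run on to the `src` of a later, allowed one. If the text holds several iframes, it may be safer to split it per iframe before matching.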

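It is well known that parsing HTML with regular expressions is difficult unless you control how the HTML is generated, and even then it is painful.

Instead, for anything beyond the simplest uses, use a parser, which normalizes many of the problems that cause patterns to fail.

The submitted patterns will fail because they make assumptions about tag-name case, whitespace, and the string delimiters used for the src parameter. These can be handled in a pattern, but it is easier not to bother. In the following code, all of the strings being checked are valid HTML: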

require 'htmlentities'
require 'nokogiri'

[
  %#&lt;p&gt;&lt;iframe\nsrc="http://www.youtube.com/embed/YoX1yc92MOU_1"&lt;/iframe&gt;&lt;/p&gt;#,
  %#&lt;p&gt;&lt;iframe\nsrc= "http://www.youtube.com/embed/YoX1yc92MOU_2"&lt;/iframe&gt;&lt;/p&gt;#,
  %#&lt;p&gt;&lt;iframe\nsrc = "http://www.youtube.com/embed/YoX1yc92MOU_3"&lt;/iframe&gt;&lt;/p&gt;#,
  %#&lt;p&gt;&lt;iframe\nsrc = 'http://www.youtube.com/embed/YoX1yc92MOU_4'&lt;/iframe&gt;&lt;/p&gt;#,
  %#&lt;p&gt;&lt;Iframe\nsrc = 'http://www.youtube.com/embed/YoX1yc92MOU_5'&lt;/iframe&gt;&lt;/p&gt;#,
  %#&lt;p&gt;&lt;IFRAME\nsrc = 'http://www.youtube.com/embed/YoX1yc92MOU_6'&lt;/iframe&gt;&lt;/p&gt;#,
  %#&lt;p&gt;&lt;IFRAME\nsrc =
  'http://www.youtube.com/embed/YoX1yc92MOU_7'&lt;/iframe&gt;&lt;/p&gt;#,
].each do |text|
  html = HTMLEntities::Decoder.new('html4').decode(text)
  doc = Nokogiri::HTML::DocumentFragment.parse(html)

  iframe = doc.at('iframe')
  puts "Ignoring: #{ iframe['src'] }" if iframe['src'][/\b(?:youtube|soundcloud|vimeo)\b/i]
end
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_1
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_2
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_3
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_4
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_5
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_6
# >> Ignoring: http://www.youtube.com/embed/YoX1yc92MOU_7

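There is an obligatory Stack Overflow link whenever this kind of question comes up. The most famous answer there is, of course, tongue-in-cheek, but it underscores why you should not do this with patterns.

In the code above, /\b(?:youtube|soundcloud|vimeo)\b/i is a regular expression, but it is short and sweet and is not applied to the HTML at all. Instead, it is used against the content of the src parameter, which has to be correct in the (encoded) HTML and cannot be mangled, or the iframe itself would not work.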
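
To make the role of that short pattern concrete, here is a small sketch that runs it against a few `src` values on their own, with no HTML involved. The constant name `HOSTS` and the URLs other than the two from the question are made up for illustration.

```ruby
# The same short pattern as in the answer, applied to src values only.
HOSTS = /\b(?:youtube|soundcloud|vimeo)\b/i

srcs = [
  "http://www.youtube.com/embed/YoX1yc92MOU",   # from the question
  "http://www.3you3tube.com/embed/YoX1yc92MOU", # from the question
  "https://player.vimeo.com/video/1",           # hypothetical
  "http://www.notyoutube.com/embed/1",          # hypothetical
]

srcs.each do |src|
  # String#[] with a Regexp returns the matched text, or nil when nothing matches.
  verdict = src[HOSTS] ? "ignore" : "keep"
  puts "#{verdict}: #{src}"
end
```

The `\b` anchors require the name to stand as a whole word, so a host such as `notyoutube.com` is kept, while the `i` flag lets differently cased hosts still match.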

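This is not a good use of regular expressions. HTML can vary too much for a pattern to handle. Instead, decode the entities back to HTML and then use a parser such as Nokogiri, which normalizes the HTML so that it is easy to ignore order, whitespace, capitalization, and so on.

I tried the solution you mentioned, but the data does not seem very consistent. Several broken tags keep Nokogiri from parsing the HTML string correctly. One example is this question:

@QambarRaza and..?

This is exactly what I wanted! Just wanted to let you know the solution is: /iframe.*?src="(?![^"]*(?:youtube|vimeo|soundcloud)).*?\/iframe/m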