PHP: Filtering words from text when they are disguised with workarounds

php regex

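I have a filter that removes profanity such as "ass" and "fuck". Now I am trying to handle workarounds like "f*ck" and "sh/t".

One thing I could do is match each word against a dictionary of bad words that includes such variants. But that is quite static and not a good approach.

Another thing I could do is use the Levenshtein distance: any word with a Levenshtein distance of 1 from a bad word would be blocked. But this approach is also prone to false positives.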

if (!ctype_alpha($text) && levenshtein('shit', $text) === 1)
{
    // match
}
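
To make the false-positive risk concrete, here is a minimal sketch of the check above wrapped in a function. The name `looksObfuscated` and the sample inputs are illustrative assumptions, not part of the original post.

```php
<?php
// Hypothetical helper that mirrors the check from the question.
function looksObfuscated(string $text, string $badword): bool
{
    // Words made only of letters are skipped, so "shot" is never flagged.
    return !ctype_alpha($text) && levenshtein($badword, $text) === 1;
}

var_dump(looksObfuscated('sh/t', 'shit')); // one substitution, has a non-letter
var_dump(looksObfuscated('shot', 'shit')); // distance 1, but letters only
var_dump(looksObfuscated('as,', 'ass'));   // harmless "as" with a trailing comma
var_dump(looksObfuscated('f**k', 'fuck')); // two substitutions
```

The third call is flagged only because of the trailing comma, which is the kind of false positive the question worries about, while the fourth is missed because its distance is 2.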

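I am looking for a way to do this with regular expressions. Maybe I could combine the Levenshtein distance with regex, but I can't figure out how.

Any suggestions are much appreciated.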

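As said in the comments, it is hard to get this right. The code below is far from perfect: it checks for matches in which letters of a bad word have been replaced by the same number of other characters.

It may give you a general idea of how to tackle this, but it needs a lot more logic if you want to make it smarter. For example, this filter will not catch "fukk", "f ck", "f**ck", "fck", ".fuck" (with a leading dot) or "fück", while it will probably replace "+++++" with "beep". But it does catch "f*ck", "f**k" and "sh1t", so it could do worse. :)

One simple improvement is to split the string in a smarter way, so that punctuation does not stick to adjacent words. Another is to strip all non-letter characters from each word and check whether the remaining letters appear in the same order as in the bad word. That way "f\/ck" would also match "fuck". Anyway, let your imagination run wild, but be careful about false positives. And trust me, "they" will always find a way to express themselves that gets around your filter.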
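
The two improvements just described could be sketched roughly as follows. This is separate from the answer's own code and only one possible reading of the idea: the helper names (`tokenize`, `isSubsequence`, `matchesBadword`, `censor`), the rule that the first letter must match, and the minimum of two remaining letters are all assumptions added here to limit false positives. It assumes PHP 7.4 or later for arrow functions.

```php
<?php
// Sketch only: every helper name below is hypothetical.

// Split on whitespace, then trim punctuation stuck to either end of a token,
// so "f*ck," becomes "f*ck" and ".fuck" becomes "fuck".
function tokenize(string $text): array
{
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $trimmed = array_map(
        fn(string $t): string => preg_replace('/^[^a-z0-9]+|[^a-z0-9]+$/i', '', $t),
        $tokens
    );
    return array_values(array_filter($trimmed, fn(string $t): bool => $t !== ''));
}

// True when the characters of $needle appear in $haystack in the same order.
function isSubsequence(string $needle, string $haystack): bool
{
    $pos = 0;
    $len = strlen($haystack);
    for ($i = 0; $i < strlen($needle); $i++) {
        while ($pos < $len && $haystack[$pos] !== $needle[$i]) {
            $pos++;
        }
        if ($pos === $len) {
            return false;
        }
        $pos++;
    }
    return true;
}

function matchesBadword(string $token, string $badword): bool
{
    $lower = strtolower($token);
    if (ctype_alpha($lower)) {
        // Plain words: prefix match, so derivatives such as "fucking" count.
        return strpos($lower, $badword) === 0;
    }
    // Disguised words: drop non-letters and compare what remains.
    $letters = preg_replace('/[^a-z]/', '', $lower);
    return strlen($letters) >= 2
        && $letters[0] === $badword[0]
        && isSubsequence($letters, $badword);
}

function censor(string $text, array $badwords): string
{
    $out = [];
    foreach (tokenize($text) as $token) {
        $hit = false;
        foreach ($badwords as $badword) {
            if (matchesBadword($token, $badword)) {
                $hit = true;
                break;
            }
        }
        $out[] = $hit ? 'beep' : $token;
    }
    return implode(' ', $out);
}

echo censor('Man, I shot this f*ck, sh/t! f\/ck fukk.', ['shit', 'fuck']), "\n";
```

Note that this sketch throws away the trimmed punctuation when it rebuilds the sentence, and it still lets "fukk" through, because swapped letters are a different problem from inserted symbols.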

<?php 
$badwords = array('shit', 'fuck');
$text = 'Man, I shot this f*ck, sh/t! fucking fucker sh!t fukk. I love this. ;)';
$words = explode(' ', $text);

// Loop through all words.
foreach ($words as $word)
{
  $naughty = false;
  // Match each bad word against each word.
  foreach ($badwords as $badword)
  {
    // A word shorter than the bad word cannot match, but a longer one can.
    // Longer words are allowed mainly because, in the example given,
    // 'f*ck,' contains the trailing comma. Splitting the string more
    // carefully would solve that, but comparing only the prefix has the
    // added benefit of matching derivatives such as 'f*cking' or 'f*cker',
    // at the cost of more false positives.
    if (strlen($word) >= strlen($badword))
    {
      $wordOk = false;
      // Check each character in the string.
      for ($i = 0; $i < strlen($badword); $i++)
      {
        // If the letters don't match, and the letter is an actual 
        // letter, this is not a bad word.
        if ($badword[$i] !== $word[$i] && ctype_alpha($word[$i]))
        {
          $wordOk = true;
          break;
        }
      }
      // If the word is not okay, break the loop.
      if (!$wordOk)
      {
        $naughty = true;
        break;
      }
    }
  }

  // Echo the censored word.
  echo $naughty ? 'beep ' : ($word . ' ');
}
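
Since the question asked about regular expressions, the same position-by-position comparison can also be written as a pattern. The following is a sketch under assumptions, not the answerer's method: `buildPattern` and `censorRegex` are hypothetical names, and "one non-letter, non-space character per letter" is simply a restatement of what the loop above accepts.

```php
<?php
// Build a pattern in which each letter of the bad word may be replaced by
// exactly one character that is neither a letter nor whitespace.
function buildPattern(string $badword): string
{
    $parts = [];
    foreach (str_split($badword) as $ch) {
        $parts[] = '(?:' . preg_quote($ch, '/') . '|[^a-z\s])';
    }
    // The lookbehind keeps the match from starting in the middle of a word;
    // the trailing [a-z]* swallows derivatives such as "f*cking".
    return '/(?<![a-z])' . implode('', $parts) . '[a-z]*/i';
}

function censorRegex(string $text, array $badwords): string
{
    $patterns = array_map('buildPattern', $badwords);
    return preg_replace($patterns, 'beep', $text);
}

echo censorRegex('Man, I shot this f*ck, sh/t! fucking sh!t fukk.', ['shit', 'fuck']), "\n";
```

Unlike the loop, this keeps surrounding punctuation in place. It shares the loop's weaknesses, though: a run of symbols or digits such as "+++++" or "1234" still matches, and "fukk" or "f**ck" still slip through.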

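So you want to filter out words that have already been censored? I second @AmalMurali's link, but this might get you started (it is definitely flawed, which is why it is only a quick comment):

Sam: I don't have just one such word, there are many.

"fukk" is obviously a workaround, but "shot" is a valid word that is entirely different from "shit". Also, "ass" can mean "donkey" (African horse). What I'm saying is that you will never get this completely right.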