Python: Finding image tags outside of Markdown code blocks

python regex python-3.x bash markdown

I have a few hundred Markdown files containing code blocks, and they look like this:

```html
<img src="fil.png">
```

- [ ] Here is another image <img src="fil.png"> and another `<img src="fil.png">`

  ```html
  <a href="scratch/index.html" id="scratch" data-original-title="" title="" aria-describedby="popover162945">
    <div class="logo-wrapper">
    </div>
    <div class="name">
      <span>Scratch</span>
    </div>
    <img src="fil.png">
  </a>
  ```

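However, I feel a cleaner approach might be something like this:

1. Filter out every code block
2. Find every img tag
3. Find every img tag without an alt attribute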

I am a bit stuck on the first step. I can find every code block using the following regex:

```[a-z]*\n[\s\S]*?\n```

However, I don't know how to invert it, i.e. find all the text outside of those matches. I'm open to any solution that can run in a bash script or in Python.
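One reading of "inverting" the match is to delete whatever the pattern matches and work with what is left. The sketch below assumes that reading. `FENCED_BLOCK` and `text_outside_code_blocks` are hypothetical names, and the pattern is an illustrative variant that also tolerates indented fences (the pattern above needs the closing fence to start right after a newline, so it would miss the indented block in the sample).

```python
import re

# Build the fence from single backticks so this snippet contains no literal fence.
FENCE = "`" * 3

# Hypothetical pattern: an opening fence line, anything (lazily), a closing fence line.
# Both fence lines may be indented with spaces or tabs.
FENCED_BLOCK = re.compile(
    r"^[ \t]*`{3}[^\n]*\n.*?^[ \t]*`{3}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

def text_outside_code_blocks(markdown):
    """Return the text with every fenced code block removed."""
    return FENCED_BLOCK.sub("", markdown)

sample = (
    f"{FENCE}html\n"
    '<img src="in-block.png">\n'
    f"{FENCE}\n"
    "\n"
    '- [ ] An image <img src="outside.png">\n'
    "\n"
    f"  {FENCE}html\n"
    '  <img src="indented-block.png">\n'
    f"  {FENCE}\n"
)

outside = text_outside_code_blocks(sample)
print(outside)
```

Inline code spans (single backticks) are not handled here and would need a second substitution.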

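You're quite right, this is a classic case for the regex trashcan approach: we match (and then discard) what we want to avoid in the overall match, and use a capturing group to grab what we really want, i.e. `what_I_want_to_avoid|(what_I_want_to_match)`. The overall match is ignored; only group 1, when it is set, holds a hit.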

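Strictly speaking, adding just a … to the full match would have been enough. However, the form used here is arguably more readable and demonstrates the idea more clearly.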
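The trashcan idea can be wrapped in a small helper that also reports line numbers, which may be convenient when scanning many files. This is only a sketch: `imgs_without_alt` is a hypothetical name, and the pattern is the answer's pattern with the triple backticks written as `` `{3} `` (an equivalent spelling).

```python
import re

# Alternatives 1 and 2 are the "trash" (fenced blocks and inline code).
# Group 1 plus the closing ">" in group 3 is what we keep.
PATTERN = re.compile(
    r"`{3}.*?`{3}|`.*?`|(<img(?!.*?alt=(['\"]).*?\2)[^>]*)(>)",
    re.DOTALL,
)

def imgs_without_alt(markdown):
    """Return (line_number, tag) pairs for img tags found outside code."""
    found = []
    for match in PATTERN.finditer(markdown):
        if match.group(1) is None:
            continue  # a code block or inline code span: discard
        # Count the newlines before the tag to get a 1-based line number.
        line_number = markdown.count("\n", 0, match.start(1)) + 1
        found.append((line_number, match.group(1) + match.group(3)))
    return found

FENCE = "`" * 3  # avoids a literal fence inside this snippet
sample = (
    f"{FENCE}html\n"
    '<img src="in-block.png">\n'
    f"{FENCE}\n"
    "\n"
    '- [ ] Image <img src="outside.png"> and `<img src="inline.png">`\n'
)

print(imgs_without_alt(sample))
```

To cover a directory of Markdown files, the helper could be called on the text of each file in turn.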

My approach is to remove everything between backticks and then feed the text to BeautifulSoup for parsing (I find all img tags without an alt attribute and print their src).

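@ØisteinSøvik As a quick follow-up, here is an optimized version of the pattern using negated character classes and possessive quantifiers. To make use of it, you need Python's alternative regex package: import regex as re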
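The optimized pattern itself is not reproduced in this thread, so the following is not the commenter's pattern. It is a minimal sketch of the negated-character-class idea only, using the standard `re` module, with hypothetical names. Confining the lookahead to `[^>]*` keeps the alt check inside the current tag, so a later tag that does have an alt attribute cannot hide an earlier one that lacks it. Possessive quantifiers such as `[^>]*+` are left out here because they need the third-party `regex` package or, as far as I know, Python 3.11 or newer.

```python
import re

PATTERN = re.compile(
    r"`{3}.*?`{3}"                         # fenced block: trash
    r"|`[^`]*`"                            # inline code: trash
    r"|(<img\b(?![^>]*\salt\s*=)[^>]*>)",  # img tag with no alt attribute: keep
    re.DOTALL,
)

def imgs_without_alt(markdown):
    """Return every img tag outside code that has no alt attribute."""
    return [m.group(1) for m in PATTERN.finditer(markdown) if m.group(1)]

sample = (
    'First <img src="a.png"> then <img src="b.png" alt="b">\n'
    '`<img src="inline.png">`\n'
)

print(imgs_without_alt(sample))
```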

- [ ] Here is another image `<img src="fil.png">` and another <img src="dog.png" title: "re
aaaaaaaaaaaaaaaallllyl long title">

<img(\s*(?!alt)([\w\-])+=([\"\'])[^\"\']+\3)*\s*\/?>

```.*?```|`.*?`|(<img(?!.*?alt=(['\"]).*?\2)[^>]*)(>)

import re
regex = r"```.*?```|`.*?`|(<img(?!.*?alt=(['\"]).*?\2)[^>]*)(>)"
test_str = ("```html\n"
    "<img src=\"fil.png\">\n"
    "```\n\n"
    "- [ ] Here is another image <img src=\"fil.png\"> and another `<img src=\"fil.png\">`\n\n"
    "  ```html\n"
    "  <a href=\"scratch/index.html\" id=\"scratch\" data-original-title=\"\" title=\"\" aria-describedby=\"popover162945\">\n"
    "    <div class=\"logo-wrapper\">\n"
    "    </div>\n"
    "    <div class=\"name\">\n"
    "      <span>Scratch</span>\n"
    "    </div>\n"
    "    <img src=\"fil.png\">\n"
    "  </a>\n"
    "  ```")

matches = re.finditer(regex, test_str, re.DOTALL)
for match in matches:
    # Group 1 is only set for img tags outside code; every other match is discarded.
    if match.group(1):
        print("Found at {start}-{end}: {group}".format(start=match.start(1), end=match.end(1), group=match.group(1)))

data = """
```html
<img src="fil.png">
```

- [ ] Here is another image <img src="fil.png"> and another `<img src="fil.png">`

  ```html
  <a href="scratch/index.html" id="scratch" data-original-title="" title="" aria-describedby="popover162945">
    <div class="logo-wrapper">
    </div>
    <div class="name">
      <span>Scratch</span>
    </div>
    <img src="fil.png">
  </a>
  ```
  """

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(re.sub(r'`+[^`]+`+', '', data), 'lxml')
for img in soup.find_all(lambda t: t.name == 'img' and 'alt' not in t.attrs):
    print(img['src'])

fil.png
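If installing BeautifulSoup and lxml is not an option, the same two steps can be sketched with the standard library only. This swaps in `html.parser.HTMLParser` for BeautifulSoup, so it is a variant of the approach above rather than the answer's own code. `ImgWithoutAlt` and `img_sources_without_alt` are hypothetical names.

```python
import re
from html.parser import HTMLParser

class ImgWithoutAlt(HTMLParser):
    """Collect the src of every img tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == "img" and "alt" not in attr_map:
            self.sources.append(attr_map.get("src"))

def img_sources_without_alt(markdown):
    # Same substitution as above: drop everything between backticks.
    stripped = re.sub(r"`+[^`]+`+", "", markdown)
    parser = ImgWithoutAlt()
    parser.feed(stripped)
    parser.close()
    return parser.sources

FENCE = "`" * 3  # avoids a literal fence inside this snippet
sample = (
    f"{FENCE}html\n"
    '<img src="in-block.png">\n'
    f"{FENCE}\n"
    "\n"
    '- [ ] Image <img src="outside.png"> and `<img src="inline.png">`\n'
    '<img src="described.png" alt="ok">\n'
)

print(img_sources_without_alt(sample))
```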