Remove HTML comments from a Markdown file

html bash awk markdown


This comes in handy when converting Markdown to HTML, for example when comments must not appear in the final HTML source.

Example input my.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...
<!--
    ... due to a general shortage in the Y market
    TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->

Example output my-filtered.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

On Linux, I would do something like this:

cat my.md | remove_html_comments > my-filtered.md

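I was also able to write an AWK script that handles some common cases, but as far as I understand, neither AWK nor any other common tool for simple text manipulation (such as sed) is really up to this job; an HTML parser is needed.

How do I write a proper remove_html_comments script, and with which tools?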
If you open the file in vim, you can run:
:%s/<!--\_.\{-}-->//g


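Here \_. lets the pattern match any character, including newlines, and \{-} makes the match lazy; without it you would lose everything from the start of the first comment to the end of the last one.
I tried the same expression with sed, but it did not work.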
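The same lazy, newline-spanning match can be sketched outside vim. The following is a minimal Python sketch and not part of the original answer; the name strip_html_comments is made up for illustration. re.DOTALL plays the role of \_. and .*? the role of \{-}. Like the vim command, it knows nothing about Markdown code blocks or HTML syntax, so it is only suitable for simple inputs.

```python
import re

# Lazy match from "<!--" to the nearest "-->"; DOTALL lets "." cross newlines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(text: str) -> str:
    """Return text with every <!-- ... --> span removed (hypothetical helper)."""
    return COMMENT_RE.sub("", text)


if __name__ == "__main__":
    sample = "Dear X,\n<!--\n  TODO verify\n-->\nbest,\nme <!-- formal? -->\n"
    print(strip_html_comments(sample), end="")
```

Dropping the ? from the pattern would make the match greedy and reproduce the problem described above: everything between the first opener and the last closer would disappear.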
This might be a bit counterintuitive, but I would use an HTML parser.
Example with Python and BeautifulSoup:
import sys
from bs4 import BeautifulSoup, Comment

md_input = sys.stdin.read()

soup = BeautifulSoup(md_input, "html5lib")

for element in soup.find_all(string=lambda text: isinstance(text, Comment)):
    element.extract()

# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:

output = "".join(map(str, soup.find("body").contents))

print(output)

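It should not break any other HTML in the .md file (it may change the code formatting slightly, but not its meaning).

Of course, test it thoroughly if you decide to use it.
Edit: this can also be tried online (there, the input is read from input.md instead of stdin).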
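This awk should work (it relies on an empty FS splitting each line into single characters, as gawk does):
$ awk -v FS="" '{ for (i = 1; i <= NF; i++) { if ($i $(i+1) $(i+2) $(i+3) == "<!--") { i += 3; p = 1 } else if (p && $i $(i+1) $(i+2) == "-->") { i += 2; p = 0 } else if (!p) { printf "%s", $i } } printf RS }' file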
Dear Contractor X, due to delays in our imports, we would like to ...



best,
me

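My awk solution is probably easier to understand than @batMan's, at least for high-level developers. The functionality should be about the same.
File remove_html_comments (the full script is listed at the end):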
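I see from your comments that you mostly use Pandoc.
Pandoc 2.0, released on October 29, 2017, added a --strip-comments option. The related issue provides some context for this change.
Upgrading to the latest version and adding --strip-comments to your command should remove HTML comments as part of the conversion.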
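As a rough sketch of how this could be wired into an automated build, the following Python snippet builds and runs the pandoc command. The helper names and file names are illustrative assumptions; the only pandoc options used are --strip-comments and -o, and pandoc 2.0 or later is assumed to be on PATH.

```python
import subprocess
from typing import List


def pandoc_command(source: str, target: str) -> List[str]:
    """Build a pandoc invocation that drops HTML comments (hypothetical helper)."""
    return ["pandoc", "--strip-comments", source, "-o", target]


def convert(source: str, target: str) -> None:
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(source, target), check=True)
```

Calling convert("my.md", "my.html") would then produce the HTML file without a separate filtering step.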
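Write your personal notes in a separate file.
I changed the example slightly. There are definitely benefits to keeping comments in the document rather than in a separate file, especially when there are many revisions of it, shared by multiple people.
Which Markdown processor are you using?
@Chris I am mostly using pandoc.
How is this better than the awk script?
It would not hurt to use it here, but for my purposes (automated build/generation) adding a manual step makes no sense.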

# Commented version of the awk one-liner above.
# FS="" makes every character its own field (a gawk extension),
# so the loop can walk each line character by character.
awk -v FS="" '{
    for (i = 1; i <= NF; i++) {                      # iterate over each character
        if ($i $(i+1) $(i+2) $(i+3) == "<!--") {     # 4 chars form a comment opener:
            i += 3; p = 1                            # raise flag p and skip the opener
        } else if (p && $i $(i+1) $(i+2) == "-->") { # 3 chars form a comment closer:
            i += 2; p = 0                            # reset the flag and skip the closer
        } else if (!p) {
            printf "%s", $i                          # outside a comment: print the character
        }
    }
    printf RS                                        # keep the original line break
}' file
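
For readers more at home in Python, the character-by-character state machine of the awk loop can be sketched as follows. This is an illustrative port, not code from the thread; the name filter_comments is hypothetical. As in the awk version, the flag survives across lines and every input line still produces an output line.

```python
from typing import Iterable, Iterator


def filter_comments(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line with HTML comment text removed.

    The in_comment flag persists across lines, like p in the awk loop.
    """
    in_comment = False
    for line in lines:
        kept = []
        i = 0
        while i < len(line):
            if not in_comment and line.startswith("<!--", i):
                in_comment = True
                i += 4  # skip the opener
            elif in_comment and line.startswith("-->", i):
                in_comment = False
                i += 3  # skip the closer
            else:
                # Keep characters outside comments; always keep newlines.
                if not in_comment or line[i] == "\n":
                    kept.append(line[i])
                i += 1
        yield "".join(kept)
```

It could be attached to a pipe with sys.stdout.writelines(filter_comments(sys.stdin)), which mirrors the cat my.md | remove_html_comments usage from the question.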

#!/usr/bin/awk -f
# Removes common, simple cases of HTML comments.
#
# Example:
# > cat my.html | remove_html_comments > my-filtered.html
#
# This is useful, for example, to pre-process Markdown before generating
# an HTML or PDF document, to make sure the commented-out content
# does not end up in the final document, not even as a comment
# in the source code.
#
# Example:
# > cat my.markdown | remove_html_comments | pandoc -o my-filtered.html
#
# Source: hoijui
# License: CC0 1.0 - https://creativecommons.org/publicdomain/zero/1.0/

BEGIN {
    com_lvl = 0;
}

/<!--/ {
    if (com_lvl == 0) {
        line = $0
        sub(/<!--.*/, "", line)
        printf "%s", line
    }
    com_lvl = com_lvl + 1
}

com_lvl == 0

/-->/ {
    if (com_lvl == 1) {
        line = $0
        sub(/.*-->/, "", line)
        print line
    }
    com_lvl = com_lvl - 1;
}