Bash command to convert an HTML page to a text file


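I am a beginner with Linux. Could you help me convert an HTML page to a text file? The text file should leave out any images and links from the web page. I want to use only bash commands, not an HTML-to-text conversion tool. As an example, I want to convert the first page of Google search results for "computers".

Thank you.

You have html2text.py on the command line.

Usage:

html2text.py [(filename|url) [encoding]]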

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --ignore-links        don't include any formatting for links
  --ignore-images       don't include any formatting for images
  -g, --google-doc      convert an html-exported Google Document
  -d, --dash-unordered-list
                        use a dash rather than a star for unordered list items
  -b BODY_WIDTH, --body-width=BODY_WIDTH
                        number of characters per output line, 0 for no wrap
  -i LIST_INDENT, --google-list-indent=LIST_INDENT
                        number of pixels Google indents nested lists
  -s, --hide-strikethrough
                        hide strike-through text. only relevent when -g is
                        specified as well

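I think links is the most common tool for this. Check man links and search for plain text or similar. -dump is my guess, so search for that too. The software ships with most distributions.

The easiest way is to use something like the following dump option (in short, it produces the text version of the viewable HTML).

Remote file: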

lynx --dump www.google.com > file.txt
links -dump www.google.com

Local file:

lynx --dump ./1.html > file.txt
links -dump ./1.htm
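The question asks for text without links, and by default lynx appends a list of link URLs to its dump. A minimal sketch of one way to suppress it, assuming lynx is installed; dump_page is a hypothetical helper name, not part of lynx, while -dump and -nolist are existing lynx options:

```shell
#!/bin/sh
# dump_page TARGET: render a URL or a local HTML file as plain text on stdout.
# The function name is illustrative only.
dump_page() {
    # -dump   : print the rendered page to stdout and exit
    # -nolist : leave out the list of link URLs normally appended to the dump
    lynx -dump -nolist "$1"
}
```

For example, dump_page www.google.com > file.txt would write the rendered page to file.txt.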

Using sed:

sed -e 's/<[^>]*>//g' foo.html
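Because sed processes its input one line at a time, the expression above leaves a tag in place when the tag is split across two lines. A rough sketch of one workaround is to join the lines first; strip_tags is a hypothetical helper name, and the result is a single line of text rather than a faithful rendering of the page:

```shell
#!/bin/sh
# strip_tags: read HTML on stdin, write the tag-free text as one line on stdout.
# The function name is illustrative only.
strip_tags() {
    # Join all input lines so a tag broken across lines becomes one match,
    # and add a final newline so sed sees a complete line.
    { tr '\n' ' '; echo; } |
        sed -e 's/<[^>]*>/ /g' \
            -e 's/  */ /g' \
            -e 's/^ //' \
            -e 's/ $//'
    # sed steps: replace each tag with a space, squeeze runs of spaces,
    # then trim one leading and one trailing space.
}
```

Usage would look like strip_tags < foo.html > foo.txt. This still prints the contents of script and style elements, since only the tags themselves are removed.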


I have used it, and it has worked very well so far...

You can get it and install the module globally.

Then use it like this:

html-to-text < stuff.html > stuff.txt


Batch mode for local htm and html files (lynx required):

#!/bin/sh
# h2t: convert all .htm and .html files in the current directory to text

for file in *.htm *.html
do
    [ -e "$file" ] || continue              # skip a pattern that matched nothing
    lynx -dump "$file" > "${file%.*}.txt"   # page.html -> page.txt
done

On Ubuntu/Debian, html2text is a good choice.

On OS X, you can use a command-line tool called textutil to batch-convert html files to txt format:

textutil -convert txt *.html

A Bash script to recursively convert HTML pages to text files, applied here to the httpd manual. It makes grep -Rhi 'LoadModule ssl' /usr/share/httpd/manual_dump -A 10 work conveniently.

#!/bin/bash
# Adapted from ewwink, recursive html to txt dump
# Recursively (4 levels deep) dumps the /usr/share/httpd manual into a
# manual_dump directory of txt files, keeping the directory layout.
# Put this script in /usr/share/httpd for it to work (after installing the httpd-manual rpm).
# Needs bash rather than plain sh because of the brace expansion below.

for file in ./manual/*{,/*,/*/*,/*/*/*}.html
do
    [ -e "$file" ] || continue              # skip a pattern that matched nothing
    out=./manual_dump/${file#./manual/}     # same relative path under manual_dump
    out=${out%.html}.txt
    mkdir -p "$(dirname "$out")"
    lynx --dump "$file" > "$out"
done
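The glob above stops at a fixed depth. For trees of arbitrary depth, find can walk every subdirectory instead. A sketch under stated assumptions: dump_tree is a hypothetical helper name, lynx is installed, and no file name contains a newline:

```shell
#!/bin/sh
# dump_tree SRC DEST: mirror every .htm/.html file under SRC as a .txt file
# under DEST, keeping the directory layout. The function name is illustrative.
dump_tree() {
    src=${1%/}      # drop a trailing slash, if any
    dest=${2%/}
    find "$src" -type f \( -name '*.html' -o -name '*.htm' \) |
    while IFS= read -r file
    do
        rel=${file#"$src"/}                 # path relative to SRC
        out=$dest/${rel%.*}.txt             # swap the extension for .txt
        mkdir -p "$(dirname "$out")"
        # Redirect stdin so lynx cannot consume the file list from the pipe.
        lynx -dump "$file" < /dev/null > "$out"
    done
}
```

For the httpd manual this would be called as dump_tree ./manual ./manual_dump.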

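You probably cannot do this using only "bash commands"; you will need at least sed or awk. It is not impossible with plain bash builtins alone, but it is certainly not practical. If you fetch the page with curl, you can pipe it to | lynx -stdin -dump.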
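As an illustration of the awk route, here is a rough sketch that splits the input on "<", so each record is a tag followed by the text after it. That copes with tags spanning several lines, skips the contents of script and style elements, and starts a new line at a few block-level tags. The name html_to_text and the chosen tag list are assumptions for this example. Character entities such as &amp;amp; are left undecoded, and a comment containing ">" will leak text.

```shell
#!/bin/sh
# html_to_text: read HTML on stdin, write approximate plain text on stdout.
# The function name and the list of block-level tags are illustrative choices.
html_to_text() {
    awk -v RS='<' '
        NR == 1 { printf "%s", $0; next }       # text before the first tag
        {
            gt = index($0, ">")
            if (gt == 0) {                      # a bare "<" that opens no tag
                if (!skip) printf "<%s", $0
                next
            }
            tag  = tolower(substr($0, 1, gt - 1))
            text = substr($0, gt + 1)
            sub(/[ \t\n].*/, "", tag)           # keep the tag name, drop attributes
            sub(/\/$/, "", tag)                 # treat "br/" like "br"
            if (tag == "script" || tag == "style") skip = 1
            else if (tag == "/script" || tag == "/style") skip = 0
            if (tag ~ /^\/?(p|div|br|li|tr|h[1-6])$/) printf "\n"
            if (!skip) printf "%s", text
        }
        END { printf "\n" }
    '
}
```

Combined with curl this could be used as curl -s www.google.com | html_to_text > file.txt.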

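Thanks, this was really helpful.

However, this does not work when the pattern spans more than one line. It also outputs the contents of unwanted elements.

How should I change this script so that it puts a new line wherever it removes something? sed -e 's/<[^>]*>/\n/g' foo.html > foo.txt

Read the man page: "-dump: dumps the formatted output of the default document". By "formatted" I mean, all html tags... Look at the "Assumed document character set" field via DISPLAY_CHARSET_CHOICE and ASSUMED_DOC_CHARSET_CHOICE. As I said, it may be related to your version; try links instead.

Just to clarify: this answer was edited in 2014 to link to the Python project, which points to a different program (html2text.py) than the one I think the original author intended (the Ubuntu/Debian C++ program). It is too late to change it now, but I thought I would point this out since it is a bit confusing.

Note that this is a different program from the html2text.py of the other answer.

How can this be applied to all subdirectories?

It doesn't work for me! The output is exactly the same as the XHTML I used!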
