Extracting content from HTML tags


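I have a directory containing more than 100 HTML files. I only need to extract the content of the <title> and <body> tags and then format it as:

title, "body content" (i.e. one line per document)

It would be helpful if the results for every file could be written to one giant text file. I found the following command, which formats a document into a single line:

grep'^[^

You can do this with C# and LINQ. A quick example that loads the files: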
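    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Xml.Linq;

    IDictionary<string, string> parsed = new Dictionary<string, string>();

    foreach ( string file in Directory.GetFiles( @"your directory here" ) )
    {
        // XDocument only accepts well-formed XML (XHTML); it throws on ordinary "tag soup" HTML.
        var html = XDocument.Load( file ).Root;

        // <title> sits inside <head>, so search the descendants; LocalName ignores any XHTML namespace.
        string title = html.Descendants().First( e => e.Name.LocalName == "title" ).Value.Trim();
        string body = html.Descendants().First( e => e.Name.LocalName == "body" ).Value;

        // Collapse line breaks and runs of whitespace so each document fits on one line.
        body = Regex.Replace( body, @"\s+", " " ).Trim();

        // The indexer overwrites an existing entry; Add would throw on a duplicate title.
        parsed[title] = body;
    }

    using ( StreamWriter writer = new StreamWriter( @"your file path" ) )
    {
        foreach ( KeyValuePair<string, string> pair in parsed )
        {
            writer.WriteLine( string.Format( "{0}, \"{1}\"", pair.Key, pair.Value ) );
        }
    }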

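I haven't tested this specific code, but it should work. HTH
Edit: if you have the base directory path, you can use Directory.GetFiles to retrieve the names of the files in the directory.

Here's something in Ruby using Nokogiri: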
require 'rubygems' # This line isn't needed on Ruby 1.9
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text
  puts %Q(#{title}, "#{body}")
end

Save this to a .rb file, for example extractor.rb. Then make sure Nokogiri is installed by running gem install nokogiri.
Use the script like this:
ruby extractor.rb /path/to/yourhtmlfiles/*.html > out.txt

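Note that I don't handle newlines in this script, but you seem to have figured that part out already.
Update:
This version strips newlines and leading/trailing whitespace: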
require 'rubygems' # This line isn't needed on Ruby 1.9
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text.gsub("\n", '').strip
  puts %Q(#{title}, "#{body}")
end
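
One thing to watch with the updated script: deleting "\n" outright joins the last word of one line to the first word of the next, and a double quote inside the body would make the title, "body" line ambiguous. Below is a minimal sketch of one way to handle both, using only the Ruby standard library. The helper names (squish, format_record, write_records) are made up for illustration, and doubling embedded quotes is an assumed convention, not something the question asked for.

```ruby
# Collapse every run of whitespace (newlines, tabs, repeated spaces) into a
# single space, so words on adjacent lines stay separated.
# to_s guards against nil, e.g. a document without a <title>.
def squish(text)
  text.to_s.gsub(/\s+/, ' ').strip
end

# Build one output line in the form: title, "body"
# Double quotes inside the body are doubled (assumed escaping convention).
def format_record(title, body)
  %Q(#{squish(title)}, "#{squish(body).gsub('"', '""')}")
end

# Write one line per [title, body] pair to out_path, instead of relying on
# shell redirection.
def write_records(records, out_path)
  File.open(out_path, 'w') do |out|
    records.each { |title, body| out.puts format_record(title, body) }
  end
end
```

In the script above, the puts line could then become puts format_record(doc.title, doc.xpath('//body').inner_text), or the [title, body] pairs could be collected into an array and handed to write_records.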

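Thanks, but the code seems a bit short. How should I write the results to a file? Sorry, I have no experience with C#.

@user1290757 Please give me a minute to create a better example, but in the meantime, this might help:

Thanks, I have never used C#. Here is what I tried: foreach (string file in Directory.GetFiles(@"C:\")) { var html = XDocument.Load("test").Element("html"); string title = html.Element("title").Value; string body = html.Element("body").Value; body = XElement.Parse(body).ToString(SaveOptions.DisableFormatting); parsed.Add(title, body); } using (StreamWriter file = new StreamWriter(@"C:\")

I get 4 errors. Error 1: The name 'Directory' does not exist in the current context. Error 3: The type or namespace name 'StreamWriter' could not be found (are you missing a using directive or an assembly reference?) c:\Program.cs 27 20 LINQConsoleApplication1. Error 4: The type or namespace name 'StreamWriter' could not be found (are you missing a using directive or an assembly reference?) c:\Program.cs 27 44 LINQConsoleApplication1

StreamWriter needs System.IO, IDictionary and Dictionary need System.Collections.Generic, and XDocument needs System.Xml.Linq. The easiest fix is to add them at the top of the file with using directives.

Thanks, this is very close. The only thing missing is a bit of formatting: how do I get the result for each file onto a single line (one result per line)? For example, by applying grep'^[^

I tried it on multiple files; I may not have explained it clearly. The HTML files are mostly formatted as paragraphs. The script extracts <title> and <body> but does not reformat the text so that each document appears on one line in the form title, "content". It shows up like this: Title 1, "xxxxxxxxxxxxxxxx" Title 2, "XXXXXX" instead of Title 1 "XXXXXXXXXX"

Maybe you could paste an actual example of the input and output?

Sir, I had not replied, or had not finished, before you updated the script. It works great. Thank you for saving me hours of manual work. Haha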