Parsing a structured file with Ruby

I want to parse a large log file (about 500 MB). If Ruby isn't the right tool for this job, please let me know.

I have a log file whose contents are structured as follows. Each section can contain additional key-value pairs:

requestID: saldksadk
time: 92389389
action: foobarr
----------------------
requestID: 2393029
time: 92389389
action: helloworld
source: email
----------------------
requestID: skjflkjasf3
time: 92389389
userAgent: mobile browser
----------------------
requestID: gdfgfdsdf
time: 92389389
action: randoms

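I'd like to know whether there is a simple way to process the data in each section of the log. A section can span multiple lines, so I can't just split the string on line breaks. For example, is there an easy way to do something like this: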

for(section in log){
   // handle section contents
}

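This looks like YAML, although it isn't quite YAML. (YAML separates documents with exactly three dashes.) You could try mangling the document so that every line consisting solely of hyphens collapses to three hyphens, which makes it valid YAML. After that, you can feed it to a YAML parser.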
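A minimal sketch of that idea, assuming the text is small enough to hold in memory (which a 500 MB log probably is not). The name sample and its contents are placeholders taken from the question; YAML.load_stream is the standard-library call that parses every document in a multi-document string.

```ruby
require 'yaml'

# Placeholder sample taken from the question; a real log would come from disk.
sample = <<~LOG
  requestID: saldksadk
  time: 92389389
  action: foobarr
  ----------------------
  requestID: 2393029
  time: 92389389
  action: helloworld
  source: email
LOG

# Collapse every line made up solely of three or more hyphens into "---".
normalized = sample.gsub(/^-{3,}[ \t]*$/, '---')

# load_stream parses each "---"-separated document and returns them as an array.
sections = YAML.load_stream(normalized)
sections.each { |section| p section }
```

Note that the YAML parser applies its own typing rules, so all-digit values such as time (and the second requestID) come back as integers rather than strings.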

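I saved your sample text to a file named "test.txt". Reading it the following way groups the lines into one array per section, as the result underneath shows: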

File.foreach('test.txt').slice_before(/^---/).to_a

[
  ["requestID: saldksadk\n", "time: 92389389\n", "action: foobarr\n"], 
  ["----------------------\n", "requestID: 2393029\n", "time: 92389389\n", "action: helloworld\n", "source: email\n"], 
  ["----------------------\n", "requestID: skjflkjasf3\n", "time: 92389389\n", "userAgent: mobile browser\n"], 
  ["----------------------\n", "requestID: gdfgfdsdf\n", "time: 92389389\n", "action: randoms\n"]
]

Running each sub-array through a filter lets us strip the leading "---" line:
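A minimal sketch of such a filter. It assumes the sample text is held in a string named sample, a hypothetical stand-in so the snippet runs on its own; with the real file, File.foreach('test.txt') would take the place of sample.each_line.

```ruby
# Hypothetical in-memory copy of test.txt.
sample = <<~LOG
  requestID: saldksadk
  time: 92389389
  action: foobarr
  ----------------------
  requestID: 2393029
  time: 92389389
  action: helloworld
  source: email
  ----------------------
  requestID: skjflkjasf3
  time: 92389389
  userAgent: mobile browser
  ----------------------
  requestID: gdfgfdsdf
  time: 92389389
  action: randoms
LOG

# Start a new group at every separator line, then clean each group up.
blocks = sample.each_line.slice_before(/^---/).map { |ary|
  ary.shift if ary.first[/^---/]  # drop the leading separator, if present
  ary.map(&:chomp)                # strip the trailing newline from each line
}
```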

After running this, blocks is:

[
  ["requestID: saldksadk", "time: 92389389", "action: foobarr"],
  ["requestID: 2393029", "time: 92389389", "action: helloworld", "source: email"],
  ["requestID: skjflkjasf3", "time: 92389389", "userAgent: mobile browser"],
  ["requestID: gdfgfdsdf", "time: 92389389", "action: randoms"]
]

With a little more tweaking:

blocks = File.foreach('test.txt').slice_before(/^---/).map { |ary|
  ary.shift if ary.first[/^---/]               # drop the leading separator line
  Hash[ary.map{ |s| s.chomp.split(':', 2) }]   # split on the first ':' only, so values may contain ':'
}

and blocks will be:

[
  {"requestID"=>" saldksadk", "time"=>" 92389389", "action"=>" foobarr"},
  {"requestID"=>" 2393029", "time"=>" 92389389", "action"=>" helloworld", "source"=>" email"},
  {"requestID"=>" skjflkjasf3", "time"=>" 92389389", "userAgent"=>" mobile browser"},
  {"requestID"=>" gdfgfdsdf", "time"=>" 92389389", "action"=>" randoms"}
]

Building on icktoofay's idea and passing a custom line separator to each_line, I got this:

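require 'yaml'

File.open("path/to/file") do |f|
  # Read one section at a time, using the full separator line as the delimiter.
  f.each_line("\n----------------------\n") do |section|
    # Rewrite only a line made up entirely of hyphens, never hyphens inside a value.
    puts YAML.load(section.sub(/^-{3,}$/, "---")).inspect
  end
end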

Output:

{"requestID"=>"saldksadk", "time"=>92389389, "action"=>"foobarr"}
{"requestID"=>2393029, "time"=>92389389, "action"=>"helloworld", "source"=>"email"}
{"requestID"=>"skjflkjasf3", "time"=>92389389, "userAgent"=>"mobile browser"}
{"requestID"=>"gdfgfdsdf", "time"=>92389389, "action"=>"randoms"}
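A variant of the same approach, sketched with File.foreach, which opens the file and closes it again once the block has finished. The temporary file, the constants DASHES and SEPARATOR, and the 22-hyphen separator length are assumptions made so the snippet is self-contained. Because the separator includes the surrounding newlines, a run of hyphens inside a value does not split a section.

```ruby
require 'tempfile'
require 'yaml'

DASHES = '-' * 22               # assumed length of the separator line
SEPARATOR = "\n#{DASHES}\n"     # separator including its surrounding newlines

sections = []
Tempfile.create('sample_log') do |tmp|
  # Hypothetical sample; the second line has hyphens inside a value.
  tmp.write(<<~LOG)
    requestID: saldksadk
    userAgent: Mozilla--------------------Firefox
    #{DASHES}
    requestID: 2393029
    source: email
  LOG
  tmp.flush

  # Each chunk is one section; chomp removes the trailing separator if present.
  File.foreach(tmp.path, SEPARATOR) do |chunk|
    sections << YAML.safe_load(chunk.chomp(SEPARATOR))
  end
end

sections.each { |section| p section }
```

Removing the separator with chomp, rather than rewriting it to "---", means each chunk is already a single plain YAML document.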

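You can read the file line by line. For each line, check whether it is a record separator or a key: value pair. If it is a separator, append the current record to the list of records. If it is a pair, add the k: v pair to the current record.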

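records = []
record = {}
# File.foreach streams the file line by line and closes it when done.
File.foreach("data.txt") do |line|
  if line.start_with? "-"
    # Separator: store the finished record and start a new one.
    records << record unless record.empty?
    record = {}
  else
    # Limit the split to 2 parts so values containing ":" stay intact.
    k, v = line.split(":", 2).map(&:strip)
    record[k] = v
  end
end
# The last record has no trailing separator, so add it here.
records << record unless record.empty?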
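For a file of around 500 MB it may be preferable not to accumulate every record in an array. A minimal sketch of the same loop that yields each record as soon as it is complete follows; each_section is a hypothetical helper name, and a StringIO stands in for the real file so the snippet runs on its own.

```ruby
require 'stringio'

# Hypothetical helper: yields one Hash per section, keeping only the
# current section in memory.
def each_section(io)
  record = {}
  io.each_line do |line|
    if line.start_with?('-')
      # Separator: hand the finished record to the caller.
      yield record unless record.empty?
      record = {}
    else
      key, value = line.split(':', 2).map(&:strip)
      # Skip blank lines, which produce no usable key.
      record[key] = value unless key.nil? || key.empty?
    end
  end
  # The last section has no trailing separator.
  yield record unless record.empty?
end

log = StringIO.new(<<~LOG)
  requestID: saldksadk
  time: 92389389
  action: foobarr
  ----------------------
  requestID: 2393029
  time: 92389389
  action: helloworld
  source: email
LOG

each_section(log) do |section|
  puts "#{section['requestID']}: #{section.keys.join(', ')}"
end
```

With a real file, the same helper could be called inside File.open("data.txt") { |f| each_section(f) { |section| ... } }, which is close to the for(section in log) loop asked for in the question.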

This is a very basic approach that keeps things simple and efficient:

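blocks = []
current_block = {}

# A separator is any line whose first four characters are hyphens.
sep_range = 0..3
sep_value = "----"

# Split on the first colon plus any whitespace that follows it.
split_pattern = /:\s*/

File.open("filename.txt", 'r') do |f|
  f.each_line do |line|
    if line[sep_range] == sep_value
      # Separator: store the finished block and start a new one.
      blocks << current_block unless current_block.empty?
      current_block = {}
    else
      # chomp drops the trailing newline so it does not end up in the value.
      key, value = line.chomp.split(split_pattern, 2)
      current_block[key] = value
    end
  end
end

# The last block has no trailing separator, so add it here.
blocks << current_block unless current_block.empty?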

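Comments:

No downvoting if you're not going to give a specific reason. Upvoting.

First of all, don't try to load 500 MB into memory at once, which is what you would have to do to split the file. That is not a scalable solution.

I never said I wanted to load it all into memory... that's why I posted here asking for advice.

Nice edit... so what is the suggestion?

I believe many of these solutions could be extended to stream the data using yield, rather than collecting everything in memory at once.

... it separates the documents. Voilà.

@lc2817: Actually, pardon me; it seems only three dashes are allowed. I mistakenly thought more than three dashes were allowed. You're right; it is not valid YAML.

@icktoofay Good idea, +1. I used it for my answer.

Instead of f = File.new "path/to/file" followed by f.each_line("-------------------------")… use File.foreach('path/to/file', "-------------------------")…. Ruby will close the file automatically after the block exits.

I think the argument to each_line should have a newline at the beginning and at the end; otherwise it will split on dashes in the middle of a line.

If the separators vary in length, you will need to do a bit more work inside the block to weed out hyphen-only records, but that isn't so bad, maybe next if line.start_with? "-" or something like that.

@iain: I was referring to dashes inside values, such as userAgent: Mozilla--------------------Firefox

@icktoofay Ah, I see. I tested it and it still works, so I updated the answer with it.
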
With the sample data, the records array built by the line-by-line approach above looks like this:

[{"requestID"=>"saldksadk", "time"=>"92389389", "action"=>"foobarr"},
 {"requestID"=>"2393029", "time"=>"92389389", "action"=>"helloworld", "source"=>"email"},
 {"requestID"=>"skjflkjasf3", "time"=>"92389389", "userAgent"=>"mobile browser"}, 
 {"requestID"=>"gdfgfdsdf", "time"=>"92389389", "action"=>"randoms"}]
