Parsing a structured file with Ruby

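I want to parse a large log file (about 500 MB). If Ruby is not the right tool for this job, please tell me.

I have a log file structured as follows. Each section can contain additional key-value pairs: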

requestID: saldksadk
time: 92389389
action: foobarr
----------------------
requestID: 2393029
time: 92389389
action: helloworld
source: email
----------------------
requestID: skjflkjasf3
time: 92389389
userAgent: mobile browser
----------------------
requestID: gdfgfdsdf
time: 92389389
action: randoms
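I would like to know whether there is a simple way to process the data of each section in the log. A section can span multiple lines, so I can't just split the string on newlines. For example, is there a simple way to do something like this: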

for(section in log){
   // handle section contents
}

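This looks like YAML, though it isn't quite YAML. (YAML separates documents with exactly three dashes.) You could munge the document so that every line consisting only of hyphens is collapsed to three hyphens, which makes it valid YAML. After that, you can feed it to a YAML parser.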
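As a rough sketch of that idea, the snippet below collapses hyphen-only lines to `---` and hands the result to `YAML.load_stream`. The names `YAML_SAMPLE_LOG`, `normalize_separators`, and `parse_as_yaml` are made up for illustration, and the sample is a shortened copy of the log from the question. It reads the whole text into memory, so it only suits a small file or a chunk of a large one.

```ruby
require 'yaml'

# Hypothetical sample standing in for the real log file.
YAML_SAMPLE_LOG = <<~LOG
  requestID: saldksadk
  time: 92389389
  action: foobarr
  ----------------------
  requestID: 2393029
  time: 92389389
  action: helloworld
  source: email
LOG

# Collapse every line made up only of hyphens to the YAML document marker.
def normalize_separators(text)
  text.gsub(/^-{3,}[ \t]*$/, '---')
end

# Parse each YAML document in the normalized text; returns an array of hashes.
def parse_as_yaml(text)
  YAML.load_stream(normalize_separators(text))
end

parse_as_yaml(YAML_SAMPLE_LOG).each { |section| p section }
```

Note that the YAML parser converts values that look like numbers, so a `requestID` such as `2393029` comes back as an Integer rather than a String.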

I saved your sample text to a file named "test.txt". Reading it like this:

File.foreach('test.txt').slice_before(/^---/).to_a
returns:

[
  ["requestID: saldksadk\n", "time: 92389389\n", "action: foobarr\n"], 
  ["----------------------\n", "requestID: 2393029\n", "time: 92389389\n", "action: helloworld\n", "source: email\n"], 
  ["----------------------\n", "requestID: skjflkjasf3\n", "time: 92389389\n", "userAgent: mobile browser\n"], 
  ["----------------------\n", "requestID: gdfgfdsdf\n", "time: 92389389\n", "action: randoms\n"]
]
Running each sub-array through a filter, we can strip the leading "---" line:
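The filter code itself is missing from this copy of the answer. One possible version, sketched under that assumption, is shown here. `FILTER_SAMPLE_LINES` and `strip_separators` are hypothetical names, and the array of lines stands in for `File.foreach('test.txt')`.

```ruby
# Hypothetical in-memory lines standing in for File.foreach('test.txt').
FILTER_SAMPLE_LINES = [
  "requestID: saldksadk\n", "time: 92389389\n", "action: foobarr\n",
  "----------------------\n",
  "requestID: 2393029\n", "time: 92389389\n", "action: helloworld\n",
  "source: email\n"
].freeze

# Group the lines into sections, drop each section's separator line,
# and remove the trailing newline from the remaining lines.
def strip_separators(lines)
  lines.slice_before(/^---/).map do |section|
    section.reject { |line| line.start_with?('---') }.map(&:chomp)
  end
end

p strip_separators(FILTER_SAMPLE_LINES)
```

Using `reject` instead of `shift` keeps the sub-arrays unmodified, which is only a stylistic choice here.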

After running it, the result is:

[
  ["requestID: saldksadk", "time: 92389389", "action: foobarr"],
  ["requestID: 2393029", "time: 92389389", "action: helloworld", "source: email"],
  ["requestID: skjflkjasf3", "time: 92389389", "userAgent: mobile browser"],
  ["requestID: gdfgfdsdf", "time: 92389389", "action: randoms"]
]
With a bit more tweaking:

blocks = File.foreach('test.txt').slice_before(/^---/).map { |ary|
  ary.shift if ary.first[/^---/]
  Hash[ary.map{ |s| s.chomp.split(':', 2) }]  # limit of 2 keeps colons inside values intact
}
the result will be:

[
  {"requestID"=>" saldksadk", "time"=>" 92389389", "action"=>" foobarr"},
  {"requestID"=>" 2393029", "time"=>" 92389389", "action"=>" helloworld", "source"=>" email"},
  {"requestID"=>" skjflkjasf3", "time"=>" 92389389", "userAgent"=>" mobile browser"},
  {"requestID"=>" gdfgfdsdf", "time"=>" 92389389", "action"=>" randoms"}
]
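The values above still carry the space that followed the colon. A minimal sketch of one way to trim them follows. `section_to_hash` is a hypothetical helper, the `url` line is invented to show a value that itself contains a colon, and `Array#to_h` with a block assumes Ruby 2.6 or later.

```ruby
# Turn one section's "key: value" lines into a hash with trimmed keys and values.
def section_to_hash(lines)
  lines.to_h do |line|
    # Split on the first colon only, so values containing ':' stay whole.
    key, value = line.chomp.split(':', 2)
    [key.strip, value.to_s.strip]
  end
end

p section_to_hash(["requestID: 2393029\n", "source: email\n", "url: http://example.com\n"])
```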

Taking icktoofay's idea and passing a custom separator to each_line, I got this:

require 'yaml'

File.open("path/to/file") do |f|
  f.each_line("\n----------------------\n") do |line|
    puts YAML::load(line.sub(/\-{3,}/, "---")).inspect
  end
end
Output:

{"requestID"=>"saldksadk", "time"=>92389389, "action"=>"foobarr"}
{"requestID"=>2393029, "time"=>92389389, "action"=>"helloworld", "source"=>"email"}
{"requestID"=>"skjflkjasf3", "time"=>92389389, "userAgent"=>"mobile browser"}
{"requestID"=>"gdfgfdsdf", "time"=>92389389, "action"=>"randoms"}
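The separator passed to `each_line` above has a newline on both sides, so a run of dashes inside a value does not end a record. The sketch below shows that behaviour without YAML. `SECTION_SEPARATOR`, `SECTION_SAMPLE_TEXT`, and `each_section` are illustrative names, a `StringIO` stands in for the opened file, and `delete_suffix` assumes Ruby 2.5 or later.

```ruby
require 'stringio'

# Separator with a newline on both sides, as in the answer above.
SECTION_SEPARATOR = "\n----------------------\n"

# Hypothetical sample; the userAgent value deliberately contains dashes.
SECTION_SAMPLE_TEXT = "requestID: a1\n" \
                      "userAgent: Mozilla--------------------Firefox" \
                      "#{SECTION_SEPARATOR}" \
                      "requestID: b2\naction: randoms\n"

# Yield each section as an array of chomped lines, reading chunk by chunk.
def each_section(io, separator = SECTION_SEPARATOR)
  return to_enum(:each_section, io, separator) unless block_given?

  io.each_line(separator) do |chunk|
    # Drop the trailing separator (if present), then split into lines.
    yield chunk.delete_suffix(separator).lines.map(&:chomp)
  end
end

each_section(StringIO.new(SECTION_SAMPLE_TEXT)) { |section| p section }
```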

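You can read the file line by line. For each line, check whether it is a record separator or a key:value pair. If it is a separator, add the current record to the list of records. If it is a pair, add the k:v pair to the current record.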

records = []
record = {}
File.foreach("data.txt") do |line|  # reads line by line and closes the file when done
  if line.start_with? "-"
    records << record unless record.empty?
    record = {}
  else
    k, v = line.split(":", 2).map(&:strip)
    record[k] = v
  end
end
records << record unless record.empty?
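For a file of around 500 MB, the same loop can hand records to the caller one at a time instead of collecting them in an array. The following is a sketch under that assumption, not part of the original answer. `each_record` and `STREAM_SAMPLE_TEXT` are hypothetical names, and the `StringIO` stands in for something like `File.open("data.txt")`.

```ruby
require 'stringio'

# Hypothetical sample standing in for the real log file.
STREAM_SAMPLE_TEXT = <<~LOG
  requestID: saldksadk
  time: 92389389
  action: foobarr
  ----------------------
  requestID: skjflkjasf3
  time: 92389389
  userAgent: mobile browser
LOG

# Yield one record (a hash) at a time from any IO-like object, so only the
# current record is held in memory. Returns an Enumerator if no block is given.
def each_record(io)
  return to_enum(:each_record, io) unless block_given?

  record = {}
  io.each_line do |line|
    if line.start_with?('-')
      # Separator line: emit the finished record and start a new one.
      yield record unless record.empty?
      record = {}
    else
      key, value = line.split(':', 2).map(&:strip)
      # Skip blank lines, which produce no usable key.
      record[key] = value unless key.nil? || key.empty?
    end
  end
  # Emit the last record, which has no trailing separator.
  yield record unless record.empty?
end

each_record(StringIO.new(STREAM_SAMPLE_TEXT)) { |record| p record }
```

Because the method returns an Enumerator when called without a block, it can also be combined with `lazy`, `first(n)`, and similar methods without reading the whole input.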

This is a very basic approach that keeps things simple and efficient:

blocks = []
current_block = {}

sep_range = 0..3
sep_value = "----"

split_pattern = /:\s*/

File.open("filename.txt", 'r') do |f|
  f.each_line do |line|
    if line[sep_range] == sep_value
      blocks << current_block unless current_block.empty?
      current_block = {}
    else
      key, value = line.chomp.split(split_pattern, 2)  # chomp so values don't keep the trailing newline
      current_block[key] = value
    end
  end
end

blocks << current_block unless current_block.empty?

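Don't downvote if you aren't going to give a specific reason. Upvoting.

First, don't try to load 500 MB into memory at once, which is what you would have to do to split the file. That is not a scalable solution.

I never said I wanted to load it all into memory... that's why I posted here asking for advice.

Nice edit... so what is the suggestion?

I believe many of these solutions can be extended to stream the data using yield, rather than collecting it all in memory at once.

It separates the documents. Voilà.

@lc2817: Actually, my apologies; it seems only three dashes are allowed. I mistakenly thought more than three dashes were permitted. You're right; it isn't valid YAML.

@icktoofay Good idea, +1. I used it for my answer.

Instead of f = File.new "path/to/file"; f.each_line("-------------------------")… use File.foreach('path/to/file', "-------------------------")…. Ruby will close the file automatically once the block exits.

I think the argument to each_line should have a newline at both the beginning and the end; otherwise it would split on dashes in the middle of a line.

If the separators vary in length, you would need to do a bit more work inside the block to discard the hyphen-only records, but that isn't so bad; maybe next if line.start_with? "-" or something similar.

@iain: I was referring to dashes inside a value, such as userAgent: Mozilla--------------------Firefox

@icktoofay Ah, I see. I tested it and it still works, so I have updated the answer with it.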
[{"requestID"=>"saldksadk", "time"=>"92389389", "action"=>"foobarr"},
 {"requestID"=>"2393029", "time"=>"92389389", "action"=>"helloworld", "source"=>"email"},
 {"requestID"=>"skjflkjasf3", "time"=>"92389389", "userAgent"=>"mobile browser"}, 
 {"requestID"=>"gdfgfdsdf", "time"=>"92389389", "action"=>"randoms"}]