Organizing plain-text table data in Ruby

ruby web-scraping

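I have a plain-text table of race results: a header row, a divider row made of runs of "=", and then one row per runner. I need to group each result row so that every value ends up in its own column.

I can split a row on single spaces, which gives an array like this:

["2", "1/47", "M4044", "25:03*", "856", "12:22", "12:41", "17.52", "Some", "Name", "Yo", "Prairie", "Inn", "Harriers", "Runni", "25:03"]

I can also split on two spaces, which gets closer but is still inconsistent, as you can see from the name:

["2", "1/47", "M4044", "25:03*", "856", "12:22", "12:41", "17.52 Some Name Yo", "", "", "", ...]

I could hard-code which indices to join, but I need to process thousands of files like this, and the columns are not always in the same order.

The one constant is that a column's data is never wider than the divider (the run of "=" between the column name and the data). I tried to take advantage of that, but ran into some holes.

I need an algorithm that can tell what belongs to the Name column and what belongs to the other "word" columns. Any ideas?

This should work: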
divider = "===== ======= ===== =======  ==== ======= ======= ====== ========================= ========================== ======="
str     = "    1   1/24  M3034   24:46   866   12:11   12:35  15.88 Andy Bas                  Prairie Inn Harriers         24:46"

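# Take each "=" run together with the spaces that follow it, so a wider gap
# (such as the two spaces before the PerF column) does not shift every later
# column by one character. Note that slice! consumes str as it goes.
divider.scan(/=+\s*/) { |column| puts str.slice!(0, column.size).strip }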
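
A non-destructive variant of the same idea, sketched on a shortened, made-up table in the same layout. It records where each "=" run starts and cuts every row at those offsets, so one set of ranges can be reused for all rows of a file. The helper names column_ranges and split_row are illustrative assumptions, not part of the answer above.

```ruby
# Hypothetical helpers, not part of the original answer.

# Returns one Range per column, taken from the runs of "=" in the divider row.
# Each column extends up to the start of the next one, so a character that
# overflows its "=" run (such as the "*" after a gun time) is still captured.
def column_ranges(divider)
  starts = []
  divider.scan(/=+/) { starts << Regexp.last_match.begin(0) }
  starts.each_with_index.map do |start, i|
    stop = starts[i + 1] # nil for the last column
    stop ? (start...stop) : (start..-1)
  end
end

# Cuts one data row into stripped fields without modifying the row.
def split_row(row, ranges)
  ranges.map { |range| row[range].to_s.strip }
end

# Shortened sample: Place, Guntime, Name, Club.
divider = "===== =======  ======== ======"
row     = "    2   25:03* Andy Bas Inn"

fields = split_row(row, column_ranges(divider))
p fields
```

Because the row is only read, the same ranges can be computed once per file and applied to every data line.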

First, we set up the problem:
data = <<EOF
Place Div/Tot Div   Guntime  PerF 1sthalf 2ndhalf 100m   Name                      Club                       Nettime 
===== ======= ===== =======  ==== ======= ======= ====== ========================= ========================== ======= 
    1   1/24  M3034   24:46   866   12:11   12:35  15.88 Andy Bas                  Prairie Inn Harriers         24:46 
    2   1/47  M4044   25:03*  856   12:22   12:41  17.52 Some Name Yo              Prairie Inn Harriers Runni   25:03 
EOF
lines = data.split "\n"

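Next, derive an unpack format from the divider row. The definition of format is not shown here; a version consistent with the output below would be:
format = lines[1].scan(/(=+)(\s*)/).map { |run, gap| "A#{run.size}" + "x" * gap.size }.join
# starts with "A5xA7xA5xA7xxA4x": "A<n>" reads n characters, each "x" skips one

The rest is easy: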
headers = lines[0].unpack format
lines[2..-1].each do |line|
  puts Hash[headers.zip line.unpack(format).map(&:strip)]
end
#=> {"Place"=>"1", "Div/Tot"=>"1/24", "Div"=>"M3034", "Guntime"=>"24:46", "PerF"=>"866", "1sthalf"=>"12:11", "2ndhalf"=>"12:35", "100m"=>"15.88", "Name"=>"Andy Bas", "Club"=>"Prairie Inn Harriers", "Nettime"=>"24:46"}
#=> {"Place"=>"2", "Div/Tot"=>"1/47", "Div"=>"M4044", "Guntime"=>"25:03", "PerF"=>"856", "1sthalf"=>"12:22", "2ndhalf"=>"12:41", "100m"=>"17.52", "Name"=>"Some Name Yo", "Club"=>"Prairie Inn Harriers Runni", "Nettime"=>"25:03"}
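
One caveat with unpack, assuming the format is built from "A<n>" and "x" directives derived from the divider row: an "x" that would skip past the end of a row raises ArgumentError, which can happen when a file's trailing whitespace has been trimmed. The sketch below pads each row to the divider's width first. unpack_format and unpack_row are hypothetical helpers, and the table is a shortened, made-up one.

```ruby
# Hypothetical helper: turns a divider row into an unpack format.
# "A<n>" reads n bytes and drops trailing spaces; each "x" skips one byte.
def unpack_format(divider)
  divider.scan(/(=+)(\s*)/).map { |run, gap| "A#{run.size}#{'x' * gap.size}" }.join
end

# Pads the row to the divider's width so the "x" directives never run past
# the end of a row whose trailing whitespace was trimmed.
def unpack_row(row, divider)
  row.ljust(divider.size).unpack(unpack_format(divider)).map(&:strip)
end

# Shortened sample: Place, Guntime, Name. The divider keeps its trailing space.
divider = "===== =======  ======== "
trimmed = "    2   25:03* Some"

p unpack_format(divider)
p unpack_row(trimmed, divider)

# The same row without padding fails on the final "x".
begin
  trimmed.unpack(unpack_format(divider))
rescue ArgumentError => e
  puts "without padding: #{e.message}"
end
```

As in the output above, a character that sits outside its "=" run (the "*" here) is skipped rather than captured.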

Are you the one who created this?

Here is a working solution (based on the given file, though I expect it to generalize to all files of this form):
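# plain_text_table is assumed to hold the whole table as one string
# (header row, divider row, then the data rows).
header, format, *data = plain_text_table.split($/)
h = {}
format.scan(/=+/) do
  # The inclusive range covers the "=" run plus one character after it.
  range = $~.begin(0)..$~.end(0)
  h[header[range].strip] = data.map{|s| s[range].strip}
end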

h # => {
  "Place" => ["1", "2"],
  "Div/Tot" => ["1/24", "1/47"],
  "Div" => ["M3034", "M4044"],
  "Guntime" => ["24:46", "25:03*"],
  "PerF" => ["866", "856"],
  "1sthalf" => ["12:11", "12:22"],
  "2ndhalf" => ["12:35", "12:41"],
  "100m" => ["15.88", "17.52"],
  "Name" => ["Andy Bas", "Some Name Yo"],
  "Club" => ["Prairie Inn Harriers", "Prairie Inn Harriers Runni"],
  "Nettime" => ["24:46", "25:03"]
}
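
If one record per runner is more convenient than one array per column, the hash can be transposed. A small sketch using a subset of the columns above; records is an illustrative name, and the approach assumes every column array has the same length.

```ruby
# A subset of the column-oriented hash shown above.
h = {
  "Place" => ["1", "2"],
  "Name"  => ["Andy Bas", "Some Name Yo"],
  "Club"  => ["Prairie Inn Harriers", "Prairie Inn Harriers Runni"]
}

# transpose turns the per-column arrays into per-row arrays;
# zipping each row with the header names gives one hash per runner.
records = h.values.transpose.map { |row| h.keys.zip(row).to_h }

records.each { |record| p record }
```

Since each record is keyed by header name, the lookup does not depend on the order in which a given file lists its columns.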