Ruby: organizing plain-text table data
I have a plain-text table of results: a header row, a divider row made of runs of "=", and data rows. I need to group the values in each row so that the data ends up in its respective column. I can split a row on a single space, which gives an array like this:
["2", "1/47", "M4044", "25:03*", "856", "12:22", "12:41", "17.52", "Some", "Name", "Yo", "Prairie", "Inn", "Harriers", "Runni", "25:03"]
I can also split on two spaces, which gets me close, but the result is still inconsistent, as you can see from the names:
["2", "1/47", "M4044", "25:03*", "856", "12:22", "12:41", "17.52 Some Name Yo", "", "", "", "", "", "", ...]
I could specify which indexes to join, but I need to process thousands of files like this, and the columns are not always in the same order.
The one constant is that a column's data is never wider than the divider (the run of "=") between the column name and the data. I tried to take advantage of that, but ran into some holes.
I need an algorithm that detects what belongs in the name column and what belongs in the other "word" columns. Any ideas?
This should work:
divider = "===== ======= ===== ======= ==== ======= ======= ====== ========================= ========================== ======="
str = "    1    1/24 M3034   24:46  866   12:11   12:35  15.88 Andy Bas                  Prairie Inn Harriers         24:46"
divider.split(/\s+/).each { |delimiter| puts str.slice!(0..delimiter.size).strip }
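The one-liner above consumes str as it goes, because slice! mutates the string, and each slice also takes the separator character that follows the column. A minimal sketch of a non-destructive variant is shown here. The name split_by_divider is hypothetical, and the sketch assumes the runs of "=" in the divider are separated by exactly one space and that the row keeps its original column alignment.

```ruby
DIVIDER = "===== ======= ===== ======= ==== ======= ======= ====== ========================= ========================== ======="
ROW = "    2    1/47 M4044   25:03* 856   12:22   12:41  17.52 Some Name Yo              Prairie Inn Harriers Runni   25:03"

# Hypothetical helper, not part of the original answer.
# Assumes the "=" runs in the divider are separated by exactly one space.
def split_by_divider(divider, row)
  offset = 0
  divider.split.map do |run|
    field = row[offset, run.size].to_s.strip # column text only, separator excluded
    offset += run.size + 1                   # move past the column and its separator
    field
  end
end

p split_by_divider(DIVIDER, ROW)
p ROW # the input string is left untouched
```

Because the separator position is excluded here, a marker that sits in it (the "*" after 25:03) is dropped. Whether that marker matters decides which variant fits.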
First, we set up the problem:
data = <<EOF
Place Div/Tot Div   Guntime PerF 1sthalf 2ndhalf 100m   Name                      Club                       Nettime
===== ======= ===== ======= ==== ======= ======= ====== ========================= ========================== =======
    1    1/24 M3034   24:46  866   12:11   12:35  15.88 Andy Bas                  Prairie Inn Harriers         24:46
    2    1/47 M4044   25:03* 856   12:22   12:41  17.52 Some Name Yo              Prairie Inn Harriers Runni   25:03
EOF
lines = data.split "\n"
Next, build an unpack template from the divider line. The line that defines format is missing from this copy of the answer; the version below is a reconstruction, chosen because it reproduces the output shown:
# Assumption: "A<width>" reads one column, "x" skips the one-character separator.
format = lines[1].scan(/=+/).map { |run| "A#{run.size}" }.join("x")
The rest is simple:
headers = lines[0].unpack format
lines[2..-1].each do |line|
  puts Hash[headers.zip line.unpack(format).map(&:strip)]
end
#=> {"Place"=>"1", "Div/Tot"=>"1/24", "Div"=>"M3034", "Guntime"=>"24:46", "PerF"=>"866", "1sthalf"=>"12:11", "2ndhalf"=>"12:35", "100m"=>"15.88", "Name"=>"Andy Bas", "Club"=>"Prairie Inn Harriers", "Nettime"=>"24:46"}
#=> {"Place"=>"2", "Div/Tot"=>"1/47", "Div"=>"M4044", "Guntime"=>"25:03", "PerF"=>"856", "1sthalf"=>"12:22", "2ndhalf"=>"12:41", "100m"=>"17.52", "Name"=>"Some Name Yo", "Club"=>"Prairie Inn Harriers Runni", "Nettime"=>"25:03"}
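Since the question mentions thousands of files, the unpack approach above may be easier to reuse as a method that takes the whole text of one table. The following is only a sketch under stated assumptions: parse_table is a hypothetical name, the divider's "=" runs are assumed to be separated by single spaces, and each line is padded to the full table width because the "x" directive raises an ArgumentError when it would skip past the end of the string.

```ruby
# Sample table from the question; the column alignment matters.
TABLE = <<EOF
Place Div/Tot Div   Guntime PerF 1sthalf 2ndhalf 100m   Name                      Club                       Nettime
===== ======= ===== ======= ==== ======= ======= ====== ========================= ========================== =======
    1    1/24 M3034   24:46  866   12:11   12:35  15.88 Andy Bas                  Prairie Inn Harriers         24:46
    2    1/47 M4044   25:03* 856   12:22   12:41  17.52 Some Name Yo              Prairie Inn Harriers Runni   25:03
EOF

# Hypothetical helper: parse a fixed-width table into one hash per data row.
def parse_table(text)
  header, divider, *rows = text.lines.map(&:chomp)
  widths = divider.split.map(&:size)
  # "A<n>" reads n bytes and drops trailing spaces; "x" skips the separator.
  template = widths.map { |w| "A#{w}" }.join("x")
  total = widths.sum + widths.size - 1
  names = header.ljust(total).unpack(template).map(&:strip)
  rows.reject { |row| row.strip.empty? }.map do |row|
    # Pad short lines so that "x" never runs past the end of the string.
    values = row.ljust(total).unpack(template).map(&:strip)
    names.zip(values).to_h
  end
end

parse_table(TABLE).each { |record| puts record["Name"] }
```

Looking values up by header name keeps the calling code independent of column order. One caveat: unpack counts bytes, not characters, so a non-ASCII name would shift every column after it.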
Here is a working solution (based on the given file, but I think it should generalize to all files of this form):
# plain_text_table holds the whole table as one string:
# header line, divider line, then the data rows.
header, format, *data = plain_text_table.split($/)
h = {}
format.scan(/=+/) do
  # Inclusive range: it also covers the separator character after the column,
  # which is why the "*" after 25:03 is kept below.
  range = $~.begin(0)..$~.end(0)
  h[header[range].strip] = data.map { |s| s[range].strip }
end
h # => {
#   "Place" => ["1", "2"],
#   "Div/Tot" => ["1/24", "1/47"],
#   "Div" => ["M3034", "M4044"],
#   "Guntime" => ["24:46", "25:03*"],
#   "PerF" => ["866", "856"],
#   "1sthalf" => ["12:11", "12:22"],
#   "2ndhalf" => ["12:35", "12:41"],
#   "100m" => ["15.88", "17.52"],
#   "Name" => ["Andy Bas", "Some Name Yo"],
#   "Club" => ["Prairie Inn Harriers", "Prairie Inn Harriers Runni"],
#   "Nettime" => ["24:46", "25:03"]
# }
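The hash above is keyed by column, which suits per-column work. If one record per runner is more convenient, the same data can be transposed. This is a minimal sketch, not part of the original answer: COLUMNS and ROWS are illustrative names, the values are copied from the output above, and every column is assumed to hold the same number of values.

```ruby
# Column-oriented result, copied from the output shown above.
COLUMNS = {
  "Place" => ["1", "2"],
  "Div/Tot" => ["1/24", "1/47"],
  "Div" => ["M3034", "M4044"],
  "Guntime" => ["24:46", "25:03*"],
  "PerF" => ["866", "856"],
  "1sthalf" => ["12:11", "12:22"],
  "2ndhalf" => ["12:35", "12:41"],
  "100m" => ["15.88", "17.52"],
  "Name" => ["Andy Bas", "Some Name Yo"],
  "Club" => ["Prairie Inn Harriers", "Prairie Inn Harriers Runni"],
  "Nettime" => ["24:46", "25:03"]
}

# Turn the columns into one hash per data row.
# Array#transpose raises IndexError if the columns differ in length.
names = COLUMNS.keys
ROWS = COLUMNS.values.transpose.map { |values| names.zip(values).to_h }

ROWS.each { |row| puts "#{row["Name"]} (#{row["Club"]}): #{row["Nettime"]}" }
```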