Ruby: a strategy for parsing this raw text?
I have the following raw text:
________________________________________________________________________________________________________________________________
Pos Car Competitor/Team Driver Vehicle Cap CL Laps Race.Time Fastest...Lap
1 6 Jason Clements Jason Clements BMW M3 3200 10 9:48.5710 3 0:57.3228*
2 42 David Skillender David Skillender Holden VS Commodore 6000 10 9:55.6866 2 0:57.9409
3 37 Bruce Cook Bruce Cook Ford Escort 3759 10 9:56.4388 4 0:58.3359
4 18 Troy Marinelli Troy Marinelli Nissan Silvia 3396 10 9:56.7758 2 0:58.4443
5 75 Anthony Gilbertson Anthony Gilbertson BMW M3 3200 10 10:02.5842 3 0:58.9336
6 26 Trent Purcell Trent Purcell Mazda RX7 2354 10 10:07.6285 4 0:59.0546
7 12 Scott Hunter Scott Hunter Toyota Corolla 2000 10 10:11.3722 5 0:59.8921
8 91 Graeme Wilkinson Graeme Wilkinson Ford Escort 2000 10 10:13.4114 5 1:00.2175
9 7 Justin Wade Justin Wade BMW M3 4000 10 10:18.2020 9 1:00.8969
10 55 Greg Craig Grag Craig Toyota Corolla 1840 10 10:18.9956 7 1:00.7905
11 46 Kyle Orgam-Moore Kyle Organ-Moore Holden VS Commodore 6000 10 10:30.0179 3 1:01.6741
12 39 Uptiles Strathpine Trent Spencer BMW Mini Cooper S 1500 10 10:40.1436 2 1:02.2728
13 177 Mark Hyde Mark Hyde Ford Escort 1993 10 10:49.5920 2 1:03.8069
14 34 Peter Draheim Peter Draheim Mazda RX3 2600 10 10:50.8159 10 1:03.4396
15 5 Scott Douglas Scott Douglas Datsun 1200 1998 9 9:48.7808 3 1:01.5371
16 72 Paul Redman Paul Redman Ford Focus 2lt 9 10:11.3707 2 1:05.8729
17 8 Matthew Speakman Matthew Speakman Toyota Celica 1600 9 10:16.3159 3 1:05.9117
18 74 Lucas Easton Lucas Easton Toyota Celica 1600 9 10:16.8050 6 1:06.0748
19 77 Dean Fuller Dean Fuller Mitsubishi Sigma 2600 9 10:25.2877 3 1:07.3991
20 16 Brett Batterby Brett Batterby Toyota Corolla 1600 9 10:29.9127 4 1:07.8420
21 95 Ross Hurford Ross Hurford Toyota Corolla 1600 8 9:57.5297 2 1:12.2672
DNF 13 Charles Wright Charles Wright BMW 325i 2700 9 9:47.9888 7 1:03.2808
DNF 20 Shane Satchwell Shane Satchwell Datsun 1200 Coupe 1998 1 1:05.9100 1 1:05.9100
Fastest Lap Av.Speed Is 152kph, Race Av.Speed Is 148kph
R=under lap record by greatest margin, r=under lap record, *=fastest lap time
________________________________________________________________________________________________________________________________
Issue# 2 - Printed Sat May 26 15:43:31 2012 Timing System By NATSOFT (03)63431311 www.natsoft.com.au/results
Amended
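I need to parse this into an object with the obvious fields: position, car, driver, and so on. The problem is that I don't know what strategy to use. If I split on whitespace, I get a list like this: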
["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
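You can see the problem. I can't simply interpret this list by index, because a person might have only one name, or three words in their name, and a vehicle can have any number of words. That makes it impossible to reference the list by index alone.
What about using the offsets defined by the column names? I don't quite see how to use them, though.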
Edit: the algorithm I'm currently using splits wherever whitespace lines up in every row. If a column contains names like these:
Jason Adams
Bobby Sacka
Jerry Louis
then it interprets them as two separate items (["Jason", "Adams", "Bobby", "Sacka", "Jerry", "Louis"]).
However, if the names differ enough in length:
Dominic Bou
Bob Adams
Jerry Seinfeld
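then it correctly splits after the final "d" of Seinfeld, so we get a collection of three names (["Dominic Bou", "Bob Adams", "Jerry Seinfeld"]).
It's also fragile. I'm looking for a better solution.
Unless there is a clear rule for how the columns are separated, you can't really do this. Assuming you know that every column value is properly aligned under its column header, the approach you have taken is fine.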
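Another approach is to group together words that are separated by only a single space (from the text you provided, this rule appears to hold as well). Depending on how consistent the format is, you may be able to use a regular expression for this. Here is an example regex that works on the current data. It may need adjusting to the exact rules, but it gives the idea: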
^
# Pos
(\d+|DNF)
\s+
#Car
(\d+)
\s+
# Team
([\w-]+(?: [\w-]+)+)
\s+
# Driver
([\w-]+(?: [\w-]+)+)
\s+
# Vehicle
([\w-]+(?: ?[\w-]+)+)
\s+
# Cap
(\d{4}|\dlt)
\s+
# CL Laps
(\d+)
\s+
# Race.Time
(\d+:\d+\.\d+)
\s+
# Fastest Lap
(\d+)
\s+
# Fastest Lap Time
(\d+:\d+\.\d+\*?)
\s*
$
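The "words separated by a single space belong together" rule can also be tried without a full regex, by splitting on runs of two or more spaces. A minimal sketch, assuming the real file separates columns with at least two spaces; the sample line is typed with made-up spacing, and field_names is a hypothetical list chosen for illustration:

```ruby
# Assumed sample row: columns separated by 2+ spaces, words inside a field by 1.
line = "1    6    Jason Clements        Jason Clements        BMW M3        3200  10   9:48.5710   3  0:57.3228*"

# Hypothetical field names, one per column.
field_names = %i[pos car competitor driver vehicle cap laps race_time fast_lap_no fast_lap_time]

# Split wherever two or more whitespace characters appear in a row.
values = line.strip.split(/\s{2,}/)

row = field_names.zip(values).to_h
puts row[:vehicle]        # BMW M3
puts row[:fast_lap_time]  # 0:57.3228*
```

This would misalign if a cell is empty or if two adjacent columns are ever separated by only one space, so it is worth checking values.length for each row.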
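If you can verify that the whitespace consists of space characters rather than tabs, and that overlong text is always truncated to fit the column structure, then I would hard-code the slice boundaries: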
parsed = [rawLine[0:3],rawLine[4:7],rawLine[9:38], ...etc... ]
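Depending on the data source, this may be fragile (for example, if the column widths differ from run to run).
If the header row is always the same, you can extract the slice boundaries by searching the header row for its known words.
Assuming the spacing of the text is always the same, you can split the string by position and then strip the extra whitespace around each part. For example, in Python: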
pos=row[0:3].strip()
car=row[4:7].strip()
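And so on. Alternatively, you can define a regex that captures each part:
([[:alnum:]]+)\s([[:digit:]]+)\s(([[:alpha:]]+ )+)\s(([[:alpha:]]+ )+)\s(([[:alpha:]]* )+)\s
And so on. (The exact syntax depends on your regexp flavor.) Note that the car regexp needs to handle the added spaces.
This is not a good case for a regex. You really want to discover the format and then unpack the lines: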
lines = str.split "\n"
# you know the field names so you can use them to find the column positions
fields = ['Pos', 'Car', 'Competitor/Team', 'Driver', 'Vehicle', 'Cap', 'CL Laps', 'Race.Time', 'Fastest...Lap']
header = lines.shift until header =~ /^Pos/
positions = fields.map{|f| header.index f}
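# use that to construct an unpack format string; the trailing 'A*' takes the
# last column, which has no following header to measure its width against
format = 1.upto(positions.length-1).map{|x| "A#{positions[x] - positions[x-1]}"}.join + 'A*'
# A4A5A31A25A21A6A12A10A*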
lines.each do |line|
next unless line =~ /^(\d|DNF)/ # skip lines you're not interested in
data = line.unpack(format).map{|x| x.strip}
puts data.join(', ')
# or better yet...
car = Hash[fields.zip data]
puts car['Driver']
end
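The column positions can also be read off the header without listing the field names, by recording where each header label starts. A rough sketch, assuming header labels are separated by at least two spaces and every value sits between its own label's offset and the next one; the sample text and its spacing are made up for illustration:

```ruby
text = <<~RESULTS
  Pos  Car  Driver           Vehicle    CL Laps
  1    6    Jason Clements   BMW M3     10
  DNF  13   Charles Wright   BMW 325i   9
RESULTS

header, *rows = text.lines.map(&:chomp)

# Record each label and the offset where it starts. A label may contain
# single spaces ("CL Laps") but not runs of two or more.
labels = []
starts = []
header.scan(/\S+(?: \S+)*/) do |label|
  labels << label
  starts << Regexp.last_match.begin(0)
end

# Slice every row between consecutive offsets; the last column runs to the end.
records = rows.map do |row|
  values = starts.each_with_index.map do |from, i|
    to = starts[i + 1] || row.length
    (row[from...to] || "").strip
  end
  labels.zip(values).to_h
end

puts records.first["Vehicle"]  # BMW M3
puts records.last["Pos"]       # DNF
```

Unlike the unpack format, this needs no width for the last column, but it relies on the same alignment assumption.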
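This may solve your problem.
There are also a few examples on GitHub.
Hope this helps!
You can use the fixed_width gem.
The given file can be parsed with the following code: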
require 'fixed_width'
require 'pp'
FixedWidth.define :cars do |d|
d.head do |head|
head.trap { |line| line !~ /\d/ }
end
d.body do |body|
body.trap { |line| line =~ /^(\d|DNF)/ }
body.column :pos, 4
body.column :car, 5
body.column :competitor, 31
body.column :driver, 25
body.column :vehicle, 21
body.column :cap, 5
body.column :cl_laps, 11
body.column :race_time, 11
body.column :fast_lap_no, 4
body.column :fast_lap_time, 10
end
end
pp FixedWidth.parse(File.open("races.txt"), :cars)
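The trap method identifies the lines that belong to each section. The head regex looks for lines that contain no digits; the body regex looks for lines that start with a digit or "DNF".
Each column definition only states how many characters to take. The library strips the whitespace for you. If you wanted to generate a fixed-width file you could add alignment parameters, but that doesn't seem to be needed here.
The result is a hash that starts like this: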
{:head=>[{}, {}, {}],
:body=>
[{:pos=>"1",
:car=>"6",
:competitor=>"Jason Clements",
:driver=>"Jason Clements",
:vehicle=>"BMW M3",
:cap=>"3200",
:cl_laps=>"10",
:race_time=>"9:48.5710",
:fast_lap_no=>"3",
:fast_lap_time=>"0:57.3228"},
{:pos=>"2",
:car=>"42",
:competitor=>"David Skillender",
:driver=>"David Skillender",
:vehicle=>"Holden VS Commodore",
:cap=>"6000",
:cl_laps=>"10",
:race_time=>"9:55.6866",
:fast_lap_no=>"2",
:fast_lap_time=>"0:57.9409"},
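The values come back as strings. If numeric lap times are needed, a small conversion step can follow the parse. This is only a sketch: to_seconds is a hypothetical helper, and it assumes times always look like m:ss.ffff with an optional trailing "*":

```ruby
# One parsed row, with values copied from the output above.
row = { pos: "1", car: "6", race_time: "9:48.5710", fast_lap_no: "3", fast_lap_time: "0:57.3228" }

# Hypothetical helper: convert "m:ss.ffff" into seconds as a Float.
def to_seconds(time)
  minutes, seconds = time.delete("*").split(":")
  minutes.to_i * 60 + seconds.to_f
end

typed = row.merge(
  pos: row[:pos] == "DNF" ? nil : row[:pos].to_i, # DNF rows have no finishing position
  car: row[:car].to_i,
  race_time: to_seconds(row[:race_time]),
  fast_lap_no: row[:fast_lap_no].to_i,
  fast_lap_time: to_seconds(row[:fast_lap_time])
)

puts typed[:pos]  # 1
puts typed[:race_time]
```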
OK, I figured it out:
Edit: I forgot to mention that this assumes you have stored the input text in the variable input_string.
# Choose a delimiter that is unlikely to occur
DELIM = '|||'
# DRY -> extend String
class String
def split_on_spaces(min_spaces = 1)
self.strip.gsub(/\s{#{min_spaces},}/, DELIM).split(DELIM)
end
end
# just get the data lines
lines = input_string.split("\n")
lines = lines[2...(lines.length - 4)].delete_if { |line|
line.empty?
}
# Grab all the entries into a nice 2-d array
entries = lines.map { |line|
[
line[0..8].split_on_spaces,
line[9..85].split_on_spaces(3).map{ |string|
string.gsub(/\s+/, ' ') # replace whitespace with 1 space
},
line[85...line.length].split_on_spaces(2)
].flatten
}
# BONUS
# Make nice hashes
keys = [:pos, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest_lap]
objects = entries.map { |entry|
Hash[keys.zip entry]
}
Output:
entries # =>
["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3 0:57.3228*"]
["2", "42", "David Skillender", "David Skillender", "Holden VS Commodore", "6000", "10", "9:55.6866", "2 0:57.9409"]
...
# all of length 9, no extra spaces
In case arrays don't cut it:
objects # =>
{:pos=>"1", :car=>"6", :team=>"Jason Clements", :driver=>"Jason Clements", :vehicle=>"BMW M3", :cap=>"3200", :cl_laps=>"10", :race_time=>"9:48.5710", :fastest_lap=>"3 0:57.3228*"}
{:pos=>"2", :car=>"42", :team=>"David Skillender", :driver=>"David Skillender", :vehicle=>"Holden VS Commodore", :cap=>"6000", :cl_laps=>"10", :race_time=>"9:55.6866", :fastest_lap=>"2 0:57.9409"}
...
I'll leave refactoring it into nice functions to you.
I think using fixed widths on each line is simple enough:
#!/usr/bin/env ruby
# ruby parsing_winner.rb winners_list.txt
args = ARGV
puts "ruby parsing_winner.rb winners_list.txt " if args.empty?
winner_file = open args.shift
array_of_race_results, array_of_race_results_array = [], []
class RaceResult
attr_accessor :position, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest, :fastest_lap
def initialize(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
@position = position
@car = car
@team = team
@driver = driver
@vehicle = vehicle
@cap = cap
@cl_laps = cl_laps
@race_time = race_time
@fastest = fastest
@fastest_lap = fastest_lap
end
def to_a
# ["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
[position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap]
end
end
# Pos Car Competitor/Team Driver Vehicle Cap CL Laps Race.Time Fastest...Lap
# 1 6 Jason Clements Jason Clements BMW M3 3200 10 9:48.5710 3 0:57.3228*
# 2 42 David Skillender David Skillender Holden VS Commodore 6000 10 9:55.6866 2 0:57.9409
# etc...
winner_file.each_line do |line|
next if line[/^____/] || line[/^\w{4,}|^\s|^Pos/] || line[0..3][/\=/]
position = line[0..3].strip
car = line[4..8].strip
team = line[9..39].strip
driver = line[40..64].strip
vehicle = line[65..85].strip
cap = line[86..91].strip
cl_laps = line[92..101].strip
race_time = line[102..113].strip
fastest = line[114..116].strip
fastest_lap = line[117..-1].strip
racer = RaceResult.new(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
array_of_race_results << racer
array_of_race_results_array << racer.to_a
end
puts "Race Results Objects: #{array_of_race_results}"
puts "Race Results: #{array_of_race_results_array.inspect}"
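The RaceResult class above mostly stores fields, so Ruby's built-in Struct could generate the constructor, accessors and to_a. A possible shortening, where RANGES simply restates the slice ranges used above, parse_result is a hypothetical helper name, and the sample line is padded in code so that it matches those ranges:

```ruby
# Struct generates the constructor, accessors and to_a for the listed members.
RaceResult = Struct.new(:position, :car, :team, :driver, :vehicle,
                        :cap, :cl_laps, :race_time, :fastest, :fastest_lap)

# Column ranges copied from the slicing code above, in member order.
RANGES = [0..3, 4..8, 9..39, 40..64, 65..85, 86..91, 92..101, 102..113, 114..116, 117..-1]

# Hypothetical helper: slice one line by the ranges and build a RaceResult.
def parse_result(line)
  RaceResult.new(*RANGES.map { |range| line[range].to_s.strip })
end

# Build a padded sample row; the widths equal the sizes of the ranges above.
cells  = ["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
widths = [4, 5, 31, 25, 21, 6, 10, 12, 3, 10]
line   = cells.zip(widths).map { |cell, width| cell.ljust(width) }.join

result = parse_result(line)
puts result.driver  # Jason Clements
p result.to_a
```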
# Split one result row on whitespace, then index from both ends.
# Assumes competitor and driver are always exactly two words each.
some_array  = line.split
pos         = some_array[0]
car         = some_array[1]
competitor  = some_array[2..3].join(' ')
driver      = some_array[4..5].join(' ')
vehicle     = some_array[6..-6].join(' ')   # whatever is left in the middle
cap         = some_array[-5]
cl_laps     = some_array[-4]
race_time   = some_array[-3]
fastest_lap = some_array[-2..-1].join(' ')