Ruby: Parsing this raw text - a strategy?

I have the following raw text:

________________________________________________________________________________________________________________________________
Pos Car  Competitor/Team                Driver                   Vehicle              Cap   CL Laps     Race.Time Fastest...Lap

1     6  Jason Clements                 Jason Clements           BMW M3               3200       10     9:48.5710   3 0:57.3228*
2    42  David Skillender               David Skillender         Holden VS Commodore  6000       10     9:55.6866   2 0:57.9409 
3    37  Bruce Cook                     Bruce Cook               Ford  Escort         3759       10     9:56.4388   4 0:58.3359 
4    18  Troy Marinelli                 Troy Marinelli           Nissan  Silvia       3396       10     9:56.7758   2 0:58.4443 
5    75  Anthony Gilbertson             Anthony Gilbertson       BMW M3               3200       10    10:02.5842   3 0:58.9336 
6    26  Trent Purcell                  Trent Purcell            Mazda RX7            2354       10    10:07.6285   4 0:59.0546 
7    12  Scott Hunter                   Scott Hunter             Toyota  Corolla      2000       10    10:11.3722   5 0:59.8921 
8    91  Graeme Wilkinson               Graeme Wilkinson         Ford  Escort         2000       10    10:13.4114   5 1:00.2175 
9     7  Justin Wade                    Justin Wade              BMW M3               4000       10    10:18.2020   9 1:00.8969 
10   55  Greg Craig                     Grag Craig               Toyota  Corolla      1840       10    10:18.9956   7 1:00.7905 
11   46  Kyle Orgam-Moore               Kyle Organ-Moore         Holden VS Commodore  6000       10    10:30.0179   3 1:01.6741 
12   39  Uptiles Strathpine             Trent Spencer            BMW Mini Cooper S    1500       10    10:40.1436   2 1:02.2728 
13  177  Mark Hyde                      Mark Hyde                Ford  Escort         1993       10    10:49.5920   2 1:03.8069 
14   34  Peter Draheim                  Peter Draheim            Mazda RX3            2600       10    10:50.8159  10 1:03.4396 
15    5  Scott Douglas                  Scott Douglas            Datsun  1200         1998        9     9:48.7808   3 1:01.5371 
16   72  Paul Redman                    Paul Redman              Ford  Focus          2lt         9    10:11.3707   2 1:05.8729 
17    8  Matthew Speakman               Matthew Speakman         Toyota  Celica       1600        9    10:16.3159   3 1:05.9117 
18   74  Lucas Easton                   Lucas Easton             Toyota  Celica       1600        9    10:16.8050   6 1:06.0748 
19   77  Dean Fuller                    Dean Fuller              Mitsubishi  Sigma    2600        9    10:25.2877   3 1:07.3991 
20   16  Brett Batterby                 Brett Batterby           Toyota  Corolla      1600        9    10:29.9127   4 1:07.8420 
21   95  Ross Hurford                   Ross Hurford             Toyota  Corolla      1600        8     9:57.5297   2 1:12.2672 
DNF  13  Charles Wright                 Charles Wright           BMW 325i             2700        9     9:47.9888   7 1:03.2808 
DNF  20  Shane Satchwell                Shane Satchwell          Datsun  1200 Coupe   1998        1     1:05.9100   1 1:05.9100 

Fastest Lap Av.Speed Is 152kph, Race Av.Speed Is 148kph
R=under lap record by greatest margin, r=under lap record, *=fastest lap time
________________________________________________________________________________________________________________________________
Issue# 2 - Printed Sat May 26 15:43:31 2012                     Timing System By NATSOFT (03)63431311 www.natsoft.com.au/results
Amended 
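I need to parse this into an object with the obvious fields such as position, car, driver, and so on. The problem is that I don't know what strategy to use. If I split on whitespace, I end up with a list like this: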

["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
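Can you see the problem? I can't just interpret this list by position, because a person might have only one name, or three words in their name, or a car might have any number of words in it. That makes it impossible to reference the list by index alone.

What about using the offsets defined by the column names? I don't quite see how to use them, though.

Edit: the algorithm I'm currently using is as follows: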

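  • Split the text on newlines, giving a collection of lines
  • Find the common whitespace characters across the lines, i.e. the positions (indices) at which every line contains a space
  • Split the lines at those common positions
  • Trim the resulting pieces

There are a few problems with this.

If the names happen to have the same lengths, like this: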

    Jason Adams
    Bobby Sacka
    Jerry Louis
    
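Then it interprets them as two separate columns: ["Jason", "Adams", "Bobby", "Sacka", "Jerry", "Louis"]

However, if they all differ in length:

    Dominic Bou
    Bob Adams
    Jerry Seinfeld
    
Then it correctly splits after the final "d" of "Seinfeld", so we get a collection of three names: ["Dominic Bou", "Bob Adams", "Jerry Seinfeld"]


It's also brittle. I'm looking for a better solution.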
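For reference, the common-whitespace approach described in the question can be sketched in Ruby roughly as follows. This is a minimal sketch and not the asker's actual code: the method names `common_space_columns` and `split_on_common_columns` are made up for illustration, and it assumes a non-empty array of lines that are padded with spaces rather than tabs.

```ruby
# Indices at which every line holds a space (or has already ended).
def common_space_columns(lines)
  width = lines.map(&:length).max
  (0...width).select { |i| lines.all? { |line| (line[i] || ' ') == ' ' } }
end

# Cut every line at the common-space columns and trim the pieces.
def split_on_common_columns(lines)
  gaps  = common_space_columns(lines)
  width = lines.map(&:length).max
  # Group the remaining (non-gap) indices into runs of consecutive columns.
  runs = ((0...width).to_a - gaps).slice_when { |a, b| b != a + 1 }.to_a
  lines.map { |line| runs.map { |run| line[run.first..run.last].to_s.strip } }
end

p split_on_common_columns(["Jason Adams", "Bobby Sacka", "Jerry Louis"])
p split_on_common_columns(["Dominic Bou", "Bob Adams", "Jerry Seinfeld"])
```

The first call splits every name in two, because all three lines share a space at index 5. The second keeps each name whole, which is exactly the inconsistency described above.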

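Unless there is a clear rule for how the columns are separated, you can't really do this.

Assuming you know that every column value is aligned under its column header, the approach you have taken is fine.


Another approach is to group together words that are separated by only a single space (from the text you provided, this rule appears to hold as well).

Depending on how consistent the format is, you may be able to use a regular expression for this.

Here is a sample regex that works for the current data. It may need adjusting to the exact rules, but it gives the idea: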

    ^
    
    # Pos
    (\d+|DNF)
    \s+
    
    #Car
    (\d+)
    \s+
    
    # Team
    ([\w-]+(?: [\w-]+)+)
    \s+
    
    # Driver
    ([\w-]+(?: [\w-]+)+)
    \s+
    
    # Vehicle
    ([\w-]+(?:  ?[\w-]+)+)
    \s+
    
    # Cap
    (\d{4}|\dlt)
    \s+
    
    # CL Laps
    (\d+)
    \s+
    
    # Race.Time
    (\d+:\d+\.\d+)
    \s+
    
    # Fastest Lap
    (\d+)
    \s+
    
    # Fastest Lap Time
    (\d+:\d+\.\d+\*?)
    \s*
    
    $
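One way to apply a pattern like this from Ruby is sketched below. The per-field patterns are copied from the regex above, but the field names, `FIELD_PATTERNS`, `ROW_REGEX`, and `parse_row` are illustrative names that do not come from the original answer. The regex is assembled from strings rather than written with the `x` flag, so the literal spaces inside the name and vehicle patterns stay significant.

```ruby
# Per-field patterns copied from the regex above; the key names are illustrative.
FIELD_PATTERNS = {
  pos:           '\d+|DNF',
  car:           '\d+',
  team:          '[\w-]+(?: [\w-]+)+',
  driver:        '[\w-]+(?: [\w-]+)+',
  vehicle:       '[\w-]+(?:  ?[\w-]+)+',
  cap:           '\d{4}|\dlt',
  cl_laps:       '\d+',
  race_time:     '\d+:\d+\.\d+',
  fast_lap_no:   '\d+',
  fast_lap_time: '\d+:\d+\.\d+\*?'
}.freeze

# Wrap each pattern in a named group and join the groups with \s+.
ROW_REGEX = Regexp.new(
  '^' +
  FIELD_PATTERNS.map { |name, pattern| "(?<#{name}>#{pattern})" }.join('\s+') +
  '\s*$'
)

# Returns a hash of field name => captured text, or nil for non-data lines.
def parse_row(line)
  match = ROW_REGEX.match(line)
  match && match.names.map(&:to_sym).zip(match.captures).to_h
end

row = parse_row("2    42  David Skillender               David Skillender         Holden VS Commodore  6000       10     9:55.6866   2 0:57.9409 ")
puts row[:vehicle] if row
```

Lines that are not result rows, such as the header or the footer, simply return `nil`, so they can be skipped with `lines.map { |l| parse_row(l) }.compact`.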
    

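If you can verify that the whitespace consists of space characters rather than tabs, and that overly long text is always truncated to fit the column structure, then I would hard-code the slice boundaries: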

    parsed = [rawLine[0:3],rawLine[4:7],rawLine[9:38], ...etc... ]
    
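Depending on the data source, this may be brittle (for example, if the column widths differ from run to run).


If the header line is always the same, you could extract the slice boundaries by searching for the known words of the header line.

Assuming the text is always spaced the same way, you can split the string by position and then strip the extra whitespace around each part. For example, in Python: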

    pos=row[0:3].strip()
    car=row[4:7].strip()
    
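And so on. Alternatively, you can define a regular expression that captures each part:

    ([[:alnum:]]+)\s+([[:digit:]]+)\s+(([[:alpha:]]+ )+)\s+(([[:alpha:]]+ )+)\s+(([[:alpha:]]* )+)\s+
    

And so on. (The exact syntax depends on your regexp flavor.) Note that the regexp for the car has to handle the extra spaces.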

This isn't a good case for regex. What you really want is to discover the format and then unpack the lines:

    lines = str.split "\n"
    
    # you know the field names so you can use them to find the column positions
    fields = ['Pos', 'Car', 'Competitor/Team', 'Driver', 'Vehicle', 'Cap', 'CL Laps', 'Race.Time', 'Fastest...Lap']
    header = lines.shift until header =~ /^Pos/
    positions = fields.map{|f| header.index f}
    
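    # use that to construct an unpack format string; the trailing A* takes the
    # last column, which the position differences alone would leave out
    format = 1.upto(positions.length-1).map{|x| "A#{positions[x] - positions[x-1]}"}.join + 'A*'
    # A4A5A31A25A21A6A12A10A*
    # note: Race.Time is right-aligned, so a time of 10 minutes or more starts one
    # character to the left of the header word and gets cut at that boundary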
    
    lines.each do |line|
      next unless line =~ /^(\d|DNF)/ # skip lines you're not interested in
      data = line.unpack(format).map{|x| x.strip}
      puts data.join(', ')
      # or better yet...
      car = Hash[fields.zip data]
      puts car['Driver']
    end
    
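This might solve your problem.

There are a few more examples on GitHub as well.


Hope this helps!

You can use the fixed_width gem.

The given file can be parsed with the following code: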

    require 'fixed_width'
    require 'pp'
    
    FixedWidth.define :cars do |d|
      d.head do |head|
        head.trap { |line| line !~ /\d/ }
      end
      d.body do |body|
        body.trap { |line| line =~ /^(\d|DNF)/ }
        body.column :pos, 4
        body.column :car, 5
        body.column :competitor, 31
        body.column :driver, 25
        body.column :vehicle, 21
        body.column :cap, 5
        body.column :cl_laps, 11
        body.column :race_time, 11
        body.column :fast_lap_no, 4
        body.column :fast_lap_time, 10
      end
    end
    
    pp FixedWidth.parse(File.open("races.txt"), :cars)
    
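The trap method identifies the lines that belong to each section:

  • head: the regex matches lines that contain no digits
  • body: the regex matches lines that start with a digit or "DNF"

Each section must include the line immediately after the last.
The column definitions only specify how many characters to grab. The library strips the whitespace for you. If you wanted to generate a fixed-width file you could add alignment parameters, but that does not seem to be needed here.

The result is a hash that starts like this: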

    {:head=>[{}, {}, {}],
     :body=>
      [{:pos=>"1",
        :car=>"6",
        :competitor=>"Jason Clements",
        :driver=>"Jason Clements",
        :vehicle=>"BMW M3",
        :cap=>"3200",
        :cl_laps=>"10",
        :race_time=>"9:48.5710",
        :fast_lap_no=>"3",
        :fast_lap_time=>"0:57.3228"},
       {:pos=>"2",
        :car=>"42",
        :competitor=>"David Skillender",
        :driver=>"David Skillender",
        :vehicle=>"Holden VS Commodore",
        :cap=>"6000",
        :cl_laps=>"10",
        :race_time=>"9:55.6866",
        :fast_lap_no=>"2",
        :fast_lap_time=>"0:57.9409"},
    
OK, I've got it:

Edit: I forgot to mention that this assumes you have stored the input text in the variable input_string.

    # Choose a delimiter that is unlikely to occur
    DELIM = '|||'
    
    # DRY -> extend String
    class String
      def split_on_spaces(min_spaces = 1)
        self.strip.gsub(/\s{#{min_spaces},}/, DELIM).split(DELIM)
      end
    end
    
    # just get the data lines
    lines = input_string.split("\n")
    lines = lines[2...(lines.length - 4)].delete_if { |line|
      line.empty?
    }
    
    # Grab all the entries into a nice 2-d array
    entries = lines.map { |line|
      [
        line[0..8].split_on_spaces,
        line[9..85].split_on_spaces(3).map{ |string| 
          string.gsub(/\s+/, ' ')  # replace whitespace with 1 space
        },
        line[85...line.length].split_on_spaces(2)
    
      ].flatten
    }
    
    # BONUS
    
    # Make nice hashes
    keys = [:pos, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest_lap]
    objects = entries.map { |entry|
      Hash[keys.zip entry]
    }
    
Output:

    entries # =>
    ["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3 0:57.3228*"]
    ["2", "42", "David Skillender", "David Skillender", "Holden VS Commodore", "6000", "10", "9:55.6866", "2 0:57.9409"]
    ...
    # all of length 9, no extra spaces
    
In case arrays don't cut it:

    objects # =>
    {:pos=>"1", :car=>"6", :team=>"Jason Clements", :driver=>"Jason Clements", :vehicle=>"BMW M3", :cap=>"3200", :cl_laps=>"10", :race_time=>"9:48.5710", :fastest_lap=>"3 0:57.3228*"}
    {:pos=>"2", :car=>"42", :team=>"David Skillender", :driver=>"David Skillender", :vehicle=>"Holden VS Commodore", :cap=>"6000", :cl_laps=>"10", :race_time=>"9:55.6866", :fastest_lap=>"2 0:57.9409"}
    ...
    

I'll leave refactoring this into nice functions to you.

I think using fixed widths on each line is simple enough:

    #!/usr/bin/env ruby
    
    # ruby parsing_winner.rb winners_list.txt 
    args = ARGV
    abort "usage: ruby parsing_winner.rb winners_list.txt" if args.empty?
    winner_file = open args.shift
    array_of_race_results, array_of_race_results_array  = [], []
    
    class RaceResult
    
      attr_accessor :position, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest, :fastest_lap
      def initialize(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
        @position    = position 
        @car         = car 
        @team        = team  
        @driver      = driver  
        @vehicle     = vehicle  
        @cap         = cap  
        @cl_laps     = cl_laps  
        @race_time   = race_time 
        @fastest     = fastest
        @fastest_lap = fastest_lap 
      end
    
      def to_a
        # ["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
        [position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap]
      end
    end
    
    # Pos Car  Competitor/Team                Driver                   Vehicle              Cap   CL Laps     Race.Time Fastest...Lap
    
    # 1     6  Jason Clements                 Jason Clements           BMW M3               3200       10     9:48.5710   3 0:57.3228*
    # 2    42  David Skillender               David Skillender         Holden VS Commodore  6000       10     9:55.6866   2 0:57.9409
    # etc...
    winner_file.each_line do |line|
      next if line[/^____/] || line[/^\w{4,}|^\s|^Pos/] || line[0..3][/\=/]
      position    = line[0..3].strip
      car         = line[4..8].strip
      team        = line[9..39].strip
      driver      = line[40..64].strip
      vehicle     = line[65..85].strip
      cap         = line[86..91].strip
      cl_laps     = line[92..101].strip
      race_time   = line[102..113].strip
      fastest     = line[114..116].strip
      fastest_lap = line[117..-1].strip
      racer = RaceResult.new(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
      array_of_race_results << racer
      array_of_race_results_array << racer.to_a
    end
    
    puts "Race Results Objects: #{array_of_race_results}"
    puts "Race Results: #{array_of_race_results_array.inspect}"
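The same fixed-width idea can be written more compactly by keeping the column ranges in one table. The sketch below is one possible condensation, not part of the original answer: the ranges are the slices used in the script above, while `COLUMNS`, `ResultRow`, and `parse_result` are hypothetical names. It assumes every data line follows the layout shown in the question.

```ruby
# Column ranges taken from the slices in the script above.
COLUMNS = {
  position:    0..3,
  car:         4..8,
  team:        9..39,
  driver:      40..64,
  vehicle:     65..85,
  cap:         86..91,
  cl_laps:     92..101,
  race_time:   102..113,
  fastest:     114..116,
  fastest_lap: 117..-1
}.freeze

# A Struct gives named accessors plus to_a and to_h without a hand-written class.
ResultRow = Struct.new(*COLUMNS.keys)

# Slice one data line by the ranges above and strip the padding from each piece.
def parse_result(line)
  ResultRow.new(*COLUMNS.values.map { |range| line[range].to_s.strip })
end
```

With this in place, `parse_result(line).driver` returns the driver name and `parse_result(line).to_a` gives the flat array. If the layout changes, only the `COLUMNS` table needs editing.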
    
    someArray = array of strings that were split by white space
    
    Pos = someArray[0]
    Car = someArray[1]
    Competitor/Team = someArray[2] + " " + someArray[3]
    Driver = someArray[4] + " " + someArray[5]
    Vehicle = someArray[6] + " " + ... + " " + someArray[someArray.length - 6]
    Cap = someArray[someArray.length - 5]
    CL Laps = someArray[someArray.length - 4]
    Race.Time = someArray[someArray.length - 3]
    Fastest...Lap = someArray[someArray.length - 2] + " " + someArray[someArray.length - 1]
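
The pseudocode above translates to Ruby fairly directly, because negative indices count from the end of the array. This is only a sketch: `parse_by_token_position` is a hypothetical name, and like the pseudocode it assumes that the competitor and the driver are each exactly two words, so that only the vehicle varies in length.

```ruby
# Index the whitespace-split tokens from both ends; only the vehicle may vary in length.
def parse_by_token_position(line)
  tokens = line.split
  {
    pos:         tokens[0],
    car:         tokens[1],
    team:        tokens[2..3].join(' '),
    driver:      tokens[4..5].join(' '),
    vehicle:     tokens[6..-6].join(' '),
    cap:         tokens[-5],
    cl_laps:     tokens[-4],
    race_time:   tokens[-3],
    fastest_lap: tokens[-2..-1].join(' ')
  }
end

p parse_by_token_position("2    42  David Skillender               David Skillender         Holden VS Commodore  6000       10     9:55.6866   2 0:57.9409 ")
```

A one-word or three-word name would shift every later field, which is the limitation raised in the question.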