Unable to split data correctly using Ruby regex and Rubular


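I'm trying to organize and break down the content of emails retrieved via Net::POP3. In my code, when I use

p mail.pop

I get the raw message text.

So far I've been using Rubular, but because I'm still learning how to use regex, gsub and split properly, my results vary. My code is as follows: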

  p mail.pop.scan(/Summary: (.+) Name:/)
  p mail.pop.scan(/Name: (.+) Category:/)
  p mail.pop.scan(/Category: (.+) Email:/) 
  p mail.pop.scan(/Email: (.+) Journal News:/)     
  p mail.pop.scan(/Journal News: (.+) Deadline:/)       
  p mail.pop.scan(/Deadline: (.+) Questions:/)    
  p mail.pop.scan(/Questions:(.+) Requirements:/) 
  p mail.pop.scan(/Requirements:(.+) Back to Top/)

But all I get are empty arrays:

[]
[]
[]
[]
[]
[]
[]
[]

I'd like to know how I can do this better. Thanks in advance.

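Oh my! What a mess.

Of course there are many ways to tackle this, but I expect they all involve multiple steps and a lot of trial and error. I can only say how I would go about it.

Lots of small steps are a good thing, for a couple of reasons. First, they break the problem into manageable tasks whose solutions can be tested individually. Second, the parsing rules may change in the future. With several steps, you may only need to change and/or add one or two operations. With few steps and a complex regex, it may be best to start over, especially if the code was written by someone else.

Suppose text is a variable holding the string.

First, I don't like all those newlines, because they complicate the regexes, so the first thing I'd do is get rid of them: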

s1 = text.gsub(/\n/, '')
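As a small illustration of this first step, here is a sketch on a made-up string (the real email text is not shown in the question, so `sample` is purely hypothetical). It also shows that String#delete gives the same result when the thing being removed is a fixed character.

```ruby
# Hypothetical stand-in for the message text; not the real email.
sample = "Summary: Working with\nVars Name: Megumi\nLindon"

s1  = sample.gsub(/\n/, '')  # regex-based removal, as in the step above
alt = sample.delete("\n")    # String#delete removes every "\n" as well

p s1         # => "Summary: Working withVars Name: MegumiLindon"
p s1 == alt  # => true
```

Note that the words on either side of a line break end up joined, so if line breaks separate words in your data, replacing "\n" with a space may be the safer choice.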

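Next, there are many occurrences of "20\r", which could be troublesome because we may want to keep other text that contains digits, so we can remove just those (along with "7941\r"):

s2 = s1.gsub(/\d+\r/, '')

Now let's look at the fields we want, together with the text immediately before and after each:

puts s2.scan(/.{4}(?:\w+\s+)*\w+:.{15}/)
  # <> Summary: Working with V
  #=>> Name: Megumi Lindon 
  #=>> Category: Social Psychol
  #=>> Email: information@ex
  #<mailto:information@exa
  #=>> Journal News: Saving Grace 
  #=>> Deadline: 10:00 PM EST -
  #=>> Query:=>>=>> Lorem ip
  #=>> Requirements:=>>=>> Psycholo
  # <x-msg://30/#top> Back
  #<x-msg://30/#SocialPsy

In the regex for s3, (?…

There may be a better way, but for starters you could add /m to the scan, like this: str.scan(/Summary: (.+) Name:/m)

Hey, thanks for taking the time to write such a detailed answer. It really helped me understand regex better. Although I didn't use your solution (because I was able to make the content less messy), your walkthrough let me design my own. Thanks again. =)

Glad to help. Helping someone learn something new is always more satisfying than solving the specific problem they asked about.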
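The /m suggestion from the comments can be sketched as follows. In Ruby, `.` does not match a newline unless the regex carries the /m flag, which is one likely reason the original scans returned empty arrays. The `sample` string is hypothetical, and the non-greedy `.+?` is my own addition so the capture stops at the first " Name:" rather than the last.

```ruby
# Hypothetical sample: the value of "Summary" is broken across two lines.
sample = "Summary: Working with\nVars Name: Megumi\nLindon Category: Social Psychology"

# Without /m, `.` cannot cross the "\n", so nothing matches.
without_m = sample.scan(/Summary: (.+) Name:/)

# With /m, `.` also matches "\n"; `.+?` keeps the capture as short as possible.
with_m = sample.scan(/Summary: (.+?) Name:/m)

p without_m  # => []
p with_m     # => [["Working with\nVars"]]
```

The literal spaces in the pattern still have to be present in the text, so /m alone will not help if a label such as "Name:" starts right after a line break.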

s3 = s2.gsub(/(?<=\w):=/, ": ")
s4 = s3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
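To see what these two substitutions do, here is a sketch on a short, made-up string shaped like the scan output above. `(?<=\w)` is a lookbehind that requires a word character just before ":=", and `(?=...)` is a lookahead that requires a label ending in ": " right after the whitespace; neither consumes any text. The `sample` value is an assumption for illustration only.

```ruby
# Hypothetical string imitating the field separators seen in the scan output.
sample = "<> Summary: Vars=>> Name: Megumi=>> Query:=>> Lorem"

# ":=" directly after a word character becomes ": ".
s3 = sample.gsub(/(?<=\w):=/, ": ")

# ">" plus whitespace becomes " :" only when a "Label: " follows.
s4 = s3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")

p s3  # => "<> Summary: Vars=>> Name: Megumi=>> Query: >> Lorem"
p s4  # => "< :Summary: Vars=> :Name: Megumi=> :Query: >> Lorem"
```

After this step every label is preceded by " :", which gives the later split a reliable marker to look for.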

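# Inside the class, !-( is a range (! through "(" in ASCII), so " # $ % & ' survive and a literal "-" is removed.
s5 = s4.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")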
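One detail of this character class is easy to miss: an unescaped hyphen between two characters forms a range. The sketch below compares the class as written with a variant where the hyphen is escaped; the variant and the `sample` string are my own illustration, not part of the answer.

```ruby
as_written = /[^a-zA-Z0-9 :;.?!-()\[\]{}]/   # "!-(" is the range ! " # $ % & ' (
escaped    = /[^a-zA-Z0-9 :;.?!\-()\[\]{}]/  # "\-" is a literal hyphen

# Hypothetical sample containing a hyphen, slashes, "#", "@" and angle brackets.
sample = "x-msg://30/#top <a@b.com>"

p sample.gsub(as_written, "")  # => "xmsg:30#top ab.com"
p sample.gsub(escaped, "")     # => "x-msg:30top ab.com"
```

The outputs shown in this answer ("xmsg:30#top", and "EST  15 February" with its hyphen gone) are consistent with the class as written, so switching to the escaped form would change those results.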

a1 = s5.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
  # => ["11)  :", "Summary: ", "Working with Vars on Social Influence platform :",
  #     "Name: ", "Megumi Lindon  :",
  #     "Category: ", "Social Psychology :",
  #     "Email: ", "informationexample.com mailto:informationexample.com :",
  #     "Journal News: ", "Saving Grace  :",
  #     "Deadline: ", "10:00 PM EST  15 February :",
  #     "Query:  ", "Lorem ipsum ...laborum. :",
  #     "Requirements:  ", "Psychologists; anyone...psychology...Top xmsg:30#top...Psychology"] 
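The split above keeps the labels because the whole pattern is wrapped in a capture group: String#split includes captured text in the result. A minimal sketch on a hypothetical, already-cleaned string:

```ruby
# Hypothetical cleaned string: every label is preceded by " :".
sample = " :Summary: Vars  :Name: Megumi  :Query: Lorem"

with_group    = sample.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
without_group = sample.split(/(?<= :)(?:\w+\s+)*\w+:\s+/)

p with_group
# => [" :", "Summary: ", "Vars  :", "Name: ", "Megumi  :", "Query: ", "Lorem"]
p without_group
# => [" :", "Vars  :", "Megumi  :", "Lorem"]
```

The lookbehind `(?<= :)` only checks that " :" precedes the label, so the marker stays at the end of the previous piece, where the later chomp(':') removes it.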

a2 = a1.map { |s| s.chomp(':') }
a2[0] = a2.shift + a2.first
  #=> "11)  Summary: "
a3 = a2.each_slice(2).to_a
  #=> [["11)  Summary: ", "Working with Vars on Social Influence platform "],
  #    ["Name: ", "Megumi Lindon  "],
  #    ["Category: ", "Social Psychology "],
  #    ["Email: ", "informationexample.com mailto:informationexample.com "],
  #    ["Journal News: ", "Saving Grace  "],
  #    ["Deadline: ", "10:00 PM EST  15 February "],
  #    ["Query:  ", "Lorem...est laborum. "],
  #    ["Requirements:  ", "Psychologists;...psychology. Please...xmsg:30#SocialPsychology"]] 

idx = a3.index { |n,_| n =~ /Email: / }
  #=> 3 
a3[idx][1] = a3[idx][1][/.*?\s/] if idx
  #=> "informationexample.com " 

a4 = a3.map { |b| b.join(' ').split.join(' ') }
  #=> ["11) Summary: Working with Vars on Social Influence platform",
  #    "Name: Megumi Lindon",
  #    "Category: Social Psychology",
  #    "Email: informationexample.com",
  #    "Journal News: Saving Grace",
  #    "Deadline: 10:00 PM EST 15 February",
  #    "Query: Lorem...laborum.",
  #    "Requirements: Psychologists...psychology. Please...well. Thank...Psychology"] 
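The join/split/join chain is a whitespace normalizer: String#split with no argument splits on any run of whitespace and drops leading and trailing blanks. A sketch using one pair from the output above, with `squeeze(' ').strip` shown as a possible alternative that only handles plain spaces:

```ruby
pair = ["Deadline: ", "10:00 PM EST  15 February "]

line = pair.join(' ').split.join(' ')     # collapse any run of whitespace
alt  = pair.join(' ').squeeze(' ').strip  # collapses runs of spaces only

p line         # => "Deadline: 10:00 PM EST 15 February"
p line == alt  # => true
```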

idx = a4.index { |s| s =~ /Requirements: / }
  #=> 7
a4[idx] = a4[idx][/.*?[.!?]/] if idx
  # => "Requirements: Psychologists; anyone with good knowsledge with sociology and psychology."
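The slice `[/.*?[.!?]/]` relies on a non-greedy `.*?`, which stops at the first sentence-ending character. The sketch below contrasts it with the greedy form on a made-up line.

```ruby
# Hypothetical line with three sentences.
line = "Requirements: Psychologists; anyone with sociology. Please reply. Thank you!"

first_sentence = line[/.*?[.!?]/]  # shortest match ending in . ! or ?
greedy         = line[/.*[.!?]/]   # longest match: runs to the last . ! or ?

p first_sentence  # => "Requirements: Psychologists; anyone with sociology."
p greedy == line  # => true
```

This is a heuristic: a period inside an abbreviation or a domain name would end the "sentence" early.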

def parse_it(text)
  a1 = text.gsub(/\n/, '')
           .gsub(/\d+\r/, '') 
           .gsub(/(?<=\w):=/, ": ")
           .gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
           .gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
           .split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
           .map { |s| s.chomp(':') }

  a1[0] = a1.shift + a1.first

  a2 = a1.each_slice(2).to_a
  idx = a2.index { |n,_| n =~ /Email: / }
  a2[idx][1] = a2[idx][1][/.*?\s/] if idx

  a3 = a2.map { |b| b.join(' ').split.join(' ') }    
  idx = a3.index { |s| s =~ /Requirements: / }
  a3[idx] = a3[idx][/.*?[.!?]/] if idx

  a3
end
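If the goal is to look fields up by name, the array of "Label: value" strings that parse_it returns could be turned into a Hash. This is only a sketch of one possible follow-up step, not part of the answer; the `lines` array below is a shortened copy of the output shown earlier, and `fields` is a hypothetical name.

```ruby
# Shortened copy of the kind of array parse_it returns.
lines = ["11) Summary: Working with Vars",
         "Name: Megumi Lindon",
         "Deadline: 10:00 PM EST 15 February"]

# Split each line once, at the first ": ", into a [label, value] pair.
fields = lines.map { |line| line.split(": ", 2) }.to_h

p fields["Name"]      # => "Megumi Lindon"
p fields["Deadline"]  # => "10:00 PM EST 15 February"
p fields.keys.first   # => "11) Summary"
```

The limit of 2 keeps any later ": " inside the value, and the first key still carries the "11) " prefix, which would need its own cleanup step.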