Unable to split data correctly with Ruby regex (Rubular)
I am trying to organize and break apart the content of emails retrieved through Net::POP3. In my code, when I use
p mail.pop
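I get the raw message text.
So far I have been using Rubular, but since I am still learning how to use regex, gsub, and split properly, my results have varied. My code is as follows: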
p mail.pop.scan(/Summary: (.+) Name:/)
p mail.pop.scan(/Name: (.+) Category:/)
p mail.pop.scan(/Category: (.+) Email:/)
p mail.pop.scan(/Email: (.+) Journal News:/)
p mail.pop.scan(/Journal News: (.+) Deadline:/)
p mail.pop.scan(/Deadline: (.+) Questions:/)
p mail.pop.scan(/Questions:(.+) Requirements:/)
p mail.pop.scan(/Requirements:(.+) Back to Top/)
But all I get are empty arrays:
[]
[]
[]
[]
[]
[]
[]
[]
I would like to know how I can do this better. Thanks in advance.
Oh my! What a mess.
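There are certainly many ways to approach this, but I expect they all involve multiple steps and a lot of trial and error. I can only say how I went about it.
Many small steps are a good thing, for a couple of reasons. First, they break the problem into manageable tasks whose solutions can be tested individually. Second, the parsing rules may change in the future. With several steps, you may only need to change and/or add one or two operations. With few steps and a complicated regex, it may be best to start over, particularly if the code was written by someone else.
Suppose text is a variable that holds the string.
First, I don't like all those newlines, because they complicate the regexes, so the first thing I would do is get rid of them: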
s1 = text.gsub(/\n/, '')
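Next, there are many "20\r" strings, which could be troublesome. Because we may want to keep other text that contains digits, we remove only digits that are immediately followed by "\r" (which also takes care of "7941\r"):
s2 = s1.gsub(/\d+\r/, '')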
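To see what these two cleanup passes do, here is a minimal sketch on a made-up string. `RAW_SAMPLE` and the helper names `strip_newlines` and `strip_chunk_markers` are hypothetical; the real mail body is not reproduced in this thread, so the layout of the sample is an assumption.

```ruby
# Hypothetical sample: digit-only lines such as "20\r" are assumed from the
# description above; the real mail body is not shown.
RAW_SAMPLE = "20\r\nSummary: Top 10 tips\r\n7941\r\nName: Megumi Lindon\r\n"

# Step 1: drop every newline so later regexes work on a single line.
def strip_newlines(text)
  text.gsub(/\n/, '')
end

# Step 2: drop runs of digits that are immediately followed by "\r".
# Digits elsewhere (the "10" in "Top 10 tips") are left alone.
def strip_chunk_markers(text)
  text.gsub(/\d+\r/, '')
end

s1 = strip_newlines(RAW_SAMPLE)
s2 = strip_chunk_markers(s1)
p s1  # => "20\rSummary: Top 10 tips\r7941\rName: Megumi Lindon\r"
p s2  # => "Summary: Top 10 tips\rName: Megumi Lindon\r"
```

One caveat with this approach: a field value that happens to end in digits right before "\r" would lose those digits too.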
Now let's look at the desired fields, together with the text immediately before and after each one:
puts s2.scan(/.{4}(?:\w+\s+)*\w+:.{15}/)
# <> Summary: Working with V
#=>> Name: Megumi Lindon
#=>> Category: Social Psychol
#=>> Email: information@ex
#<mailto:information@exa
#=>> Journal News: Saving Grace
#=>> Deadline: 10:00 PM EST -
#=>> Query:=>>=>> Lorem ip
#=>> Requirements:=>>=>> Psycholo
# <x-msg://30/#top> Back
#<x-msg://30/#SocialPsy
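The pattern used in that scan can be read piece by piece. The sketch below runs it on a short made-up line; `FIELD_SAMPLE`, `FIELD_CONTEXT`, and `field_contexts` are hypothetical names, and the sample only imitates the shape of the output shown above.

```ruby
# Hypothetical one-line sample standing in for s2.
FIELD_SAMPLE = "=>> Name: Megumi Lindon =>> Journal News: Saving Grace today =>>"

# .{4}          any four characters in front of the label
# (?:\w+\s+)*   zero or more leading words of a multi-word label ("Journal ")
# \w+:          the last word of the label plus its colon
# .{15}         the fifteen characters that follow the colon
FIELD_CONTEXT = /.{4}(?:\w+\s+)*\w+:.{15}/

def field_contexts(text)
  text.scan(FIELD_CONTEXT)
end

p field_contexts(FIELD_SAMPLE)
# => ["=>> Name: Megumi Lindon ", "=>> Journal News: Saving Grace t"]
```

Because the pattern has no capture groups, scan returns whole matches, which makes it a quick way to eyeball what surrounds each label before writing the real substitutions.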
In the regex for s3, (?...
There may be a better way, but for a beginner, you can add /m to the scan like this: str.scan(/Summary:(.+)Name:/m)
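A minimal sketch of why /m matters here, assuming the fields sit on separate lines (the real mail body is not shown, so `SAMPLE_BODY` is made up). The second method is a variation on the suggestion above, not the exact pattern: it uses a lazy `.+?` and `\s*` so the capture stops at the first "Name:" and carries no surrounding whitespace.

```ruby
# Hypothetical sample: one field per line, separated by "\r\n".
SAMPLE_BODY = "Summary: Working with Vars\r\nName: Megumi Lindon\r\nCategory: Social Psychology\r\n"

# Without /m, "." does not match "\n", and the pattern also demands a literal
# space before "Name:", so nothing matches.
def scan_single_line(body)
  body.scan(/Summary: (.+) Name:/)
end

# With /m, "." also matches newlines; the lazy ".+?" stops at the first "Name:".
def scan_multiline(body)
  body.scan(/Summary:\s*(.+?)\s*Name:/m)
end

p scan_single_line(SAMPLE_BODY)  # => []
p scan_multiline(SAMPLE_BODY)    # => [["Working with Vars"]]
```

Note that scan returns an array of arrays when the pattern contains a capture group, so the value itself is at `result[0][0]`.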
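Hey, thanks for taking the time to give such a detailed answer. It really helped me understand regex better. Although I didn't use your solution (I was able to make the content less messy), your guide let me design one of my own. Thanks again. =)
Glad I could help. Helping someone learn something new is always more satisfying than solving the specific problem they posed.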
s2 = s1.gsub(/\d+\r/, '')
s3 = s2.gsub(/(?<=\w):=/, ": ")
s4 = s3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
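# Caution: inside the character class, !-( is a range ("!" through "("), not
# three literals, so "#" is kept and a literal "-" is stripped (hence "xmsg:30#top" below).
s5 = s4.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")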
a1 = s5.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
# => ["11) :", "Summary: ", "Working with Vars on Social Influence platform :",
# "Name: ", "Megumi Lindon :",
# "Category: ", "Social Psychology :",
# "Email: ", "informationexample.com mailto:informationexample.com :",
# "Journal News: ", "Saving Grace :",
# "Deadline: ", "10:00 PM EST 15 February :",
# "Query: ", "Lorem ipsum ...laborum. :",
# "Requirements: ", "Psychologists; anyone...psychology...Top xmsg:30#top...Psychology"]
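The split above relies on two regex features: a lookbehind and a capture group. Here is a small sketch on a made-up string in the shape s5 is expected to have; `CLEAN_SAMPLE`, `LABEL`, and the two method names are hypothetical.

```ruby
# Hypothetical cleaned-up string: every label is preceded by " :".
CLEAN_SAMPLE = "11) :Summary: Working with Vars :Name: Megumi Lindon"

# (?<= :) is a lookbehind: the label must follow " :", but those two
# characters are not part of the match. The outer parentheses form a capture
# group, which makes String#split keep each matched label in the result.
LABEL = /((?<= :)(?:\w+\s+)*\w+:\s+)/

def split_on_labels(text)
  text.split(LABEL)
end

# Same pattern without the capture group: the labels are discarded.
def split_dropping_labels(text)
  text.split(/(?<= :)(?:\w+\s+)*\w+:\s+/)
end

p split_on_labels(CLEAN_SAMPLE)
# => ["11) :", "Summary: ", "Working with Vars :", "Name: ", "Megumi Lindon"]
p split_dropping_labels(CLEAN_SAMPLE)
# => ["11) :", "Working with Vars :", "Megumi Lindon"]
```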
a2 = a1.map { |s| s.chomp(':') }
a2[0] = a2.shift + a2.first
#=> "11) Summary: "
a3 = a2.each_slice(2).to_a
#=> [["11) Summary: ", "Working with Vars on Social Influence platform "],
# ["Name: ", "Megumi Lindon "],
# ["Category: ", "Social Psychology "],
# ["Email: ", "informationexample.com mailto:informationexample.com "],
# ["Journal News: ", "Saving Grace "],
# ["Deadline: ", "10:00 PM EST 15 February "],
# ["Query: ", "Lorem...est laborum. "],
# ["Requirements: ", "Psychologists;...psychology. Please...xmsg:30#SocialPsychology"]]
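Once the tokens alternate between label and value, each_slice(2) pairs them up. As an optional extra that is not part of the answer, the pairs could also be turned into a Hash for lookup by field name. A sketch, with `TOKENS`, `pair_up`, and `to_fields` as hypothetical names:

```ruby
# Hypothetical token list in label/value order, like a2 above.
TOKENS = ["Summary: ", "Working with Vars ", "Name: ", "Megumi Lindon "]

# Group consecutive elements into [label, value] pairs.
def pair_up(tokens)
  tokens.each_slice(2).to_a
end

# Assumption: a Hash keyed by the bare label is convenient for later lookups.
def to_fields(pairs)
  pairs.map { |label, value| [label.strip.chomp(':'), value.strip] }.to_h
end

pairs = pair_up(TOKENS)
p pairs
# => [["Summary: ", "Working with Vars "], ["Name: ", "Megumi Lindon "]]
p to_fields(pairs)
# => {"Summary"=>"Working with Vars", "Name"=>"Megumi Lindon"}
```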
idx = a3.index { |n,_| n =~ /Email: / }
#=> 3
a3[idx][1] = a3[idx][1][/.*?\s/] if idx
#=> "informationexample.com "
a4 = a3.map { |b| b.join(' ').split.join(' ') }
#=> ["11) Summary: Working with Vars on Social Influence platform",
# "Name: Megumi Lindon",
# "Category: Social Psychology",
# "Email: informationexample.com",
# "Journal News: Saving Grace",
# "Deadline: 10:00 PM EST 15 February",
# "Query: Lorem...laborum.",
# "Requirements: Psychologists...psychology. Please...well. Thank...Psychology"]
idx = a4.index { |n,_| n =~ /Requirements: / }
#=> 7
a4[idx] = a4[idx][/.*?[.!?]/] if idx
# => "Requirements: Psychologists; anyone with good knowsledge with sociology and psychology."
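Both trimming steps use String#[] with a regex, which returns the first match or nil when nothing matches. A short sketch with made-up inputs; `first_token` and `first_sentence` are hypothetical helper names.

```ruby
# Keep everything up to and including the first whitespace character.
def first_token(value)
  value[/.*?\s/]
end

# Keep everything up to and including the first ".", "!" or "?".
def first_sentence(value)
  value[/.*?[.!?]/]
end

p first_token("informationexample.com mailto:informationexample.com ")
# => "informationexample.com "
p first_sentence("Requirements: Psychologists only. Please reply.")
# => "Requirements: Psychologists only."
p first_sentence("no terminator here")
# => nil
```

The `if idx` guard only covers a missing label. If the label is present but the value has no whitespace or sentence terminator, the element is replaced with nil, and an abbreviation such as "e.g." would cut the sentence short.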
def parse_it(text)
  a1 = text.gsub(/\n/, '')
           .gsub(/\d+\r/, '')
           .gsub(/(?<=\w):=/, ": ")
           .gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
           .gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
           .split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
           .map { |s| s.chomp(':') }
  a1[0] = a1.shift + a1.first
  a2 = a1.each_slice(2).to_a
  idx = a2.index { |n,_| n =~ /Email: / }
  a2[idx][1] = a2[idx][1][/.*?\s/] if idx
  a3 = a2.map { |b| b.join(' ').split.join(' ') }
  idx = a3.index { |n,_| n =~ /Requirements: / }
  a3[idx] = a3[idx][/.*?[.!?]/] if idx
  a3
end
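A quick way to exercise parse_it is to feed it a small made-up message. The method is repeated unchanged so the snippet runs on its own; `SAMPLE_MAIL` is hypothetical and only imitates the markers visible in the outputs above ("20\r", "<>", "=>>", ":=>>"), since the real mail body is not reproduced in this thread.

```ruby
def parse_it(text)
  a1 = text.gsub(/\n/, '')
           .gsub(/\d+\r/, '')
           .gsub(/(?<=\w):=/, ": ")
           .gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
           .gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
           .split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
           .map { |s| s.chomp(':') }
  a1[0] = a1.shift + a1.first
  a2 = a1.each_slice(2).to_a
  idx = a2.index { |n,_| n =~ /Email: / }
  a2[idx][1] = a2[idx][1][/.*?\s/] if idx
  a3 = a2.map { |b| b.join(' ').split.join(' ') }
  idx = a3.index { |n,_| n =~ /Requirements: / }
  a3[idx] = a3[idx][/.*?[.!?]/] if idx
  a3
end

# Hypothetical message body; the layout is an assumption.
SAMPLE_MAIL = "20\r\n11) <> Summary: Working with Vars\r\n" \
              "=>> Name: Megumi Lindon\r\n" \
              "=>> Email: info@example.com <mailto:info@example.com>\r\n" \
              "=>> Requirements:=>>=>> Psychologists only. Please reply.\r\n"

p parse_it(SAMPLE_MAIL)
# => ["11) Summary: Working with Vars", "Name: Megumi Lindon",
#     "Email: infoexample.com", "Requirements: Psychologists only."]
```

As in the walkthrough, the "@" disappears from the address because the character filter strips it; adding "@" to the allowed set would be one way to keep it.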