Ruby: a regular expression to return a numeric range (e.g. a salary) from unstructured text


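I am trying to extract salary information from emails and job adverts.

I need a regular expression that returns the first instance of a salary range or a single salary figure (I would also like to avoid matching phone numbers that appear later in the string), for example:

This is as far as I have got: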

/(£|$)?[0-9][0-9]{0,5}[0-9,k]?(-| - | to )?((£|$)?[0-9][0-9]{0,5}[0,5,k]?)?/
Or, slightly enhanced:

/(£|$)?[0-9][0-9]{0,5}[0-9,k]?(-| - | to )?((£|$)?[0-9][0-9][0,5,k]?)? ?(ph|pw|pa| per|per)? ?(hour|annum|week|month)?/
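This works to a degree, but it does not match the whole number-range substring; it seems to produce many matches on individual fragments instead,

e.g.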


What am I missing (and is there a more elegant way to do this)?
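A few general Ruby regex behaviours may account for the fragmentary matches described above. The following is only a minimal sketch of those behaviours, not a diagnosis confirmed by the asker; the variable names are hypothetical and chosen for illustration.

```ruby
# `$` is an end-of-line anchor unless it is escaped.
dollar_escaped   = "$50" =~ /\A\$\d+/   #=> 0   (literal dollar sign matched)
dollar_unescaped = "$50" =~ /\A$\d+/    #=> nil (`$` only matches at a line end)

# [0,5,k] is a character class: exactly one of "0", ",", "5" or "k".
class_hits = "0,5k9".scan(/[0,5,k]/)    #=> ["0", ",", "5", "k"]

# String#scan returns the captures, not the whole match, when the
# pattern contains capturing groups.
with_groups    = "£500-650".scan(/(£)?(\d+)/)   #=> [["£", "500"], [nil, "650"]]
without_groups = "£500-650".scan(/(?:£)?\d+/)   #=> ["£500", "650"]
```

If the attempts above were tested with scan or a similar tool, switching the groups to non-capturing (?: ) and escaping the dollar sign would be a reasonable first thing to try.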

Here is the latest update:

((?:£)\d{1,}\-(?:\d{1,}(?:\-|\w))|\d{1,} - \d{1,}|\d{1,} to \d{1,}|(?<=\s)\d{1,}\-\d{1,}(?:k|(?=\s))|(?:£)\d{1,}(?:\w)|(?<=\s)\d{1,}(?=\s))

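((?:£)\d{1,}\-(?:\d{1,}(?:\-|\w))|\d{1,} - \d{1,}|\d{1,} to \d{1,}|(?<=\s)\d{1,}\-\d{1,}(?:k|(?=\s))|(?:£)\d{1,}(?:\w| per hour)|(?<=\s)\d{1,}(?=\s))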

r0 = /
     \d+        # Match one or more digits
     (?:        # Begin a non-capture group
       \s*\-\s* # Match a hyphen optionally surrounded with spaces
       \d+      # Match one or more digits
     ){2}       # Close non-capture and perform it twice
    /x          # Extended/free-spacing mode

arr0 = ["blah blah £500 blah",
        "balh blah £500-650 blah",
        "£50 per hour",
        "blah blah £50k blah",
        "bblah blah 50-60k",
        "blah blah 50000 blahblahblah",
        "blah 50000 - £60000 blablahblah 0207-123-4567",
        "bblah blah 50k-£60k"
       ]

arr1 = arr0.map { |str| str.gsub(r0,'') }
  #=> ["blah blah £500 blah",
  #    "balh blah £500-650 blah",
  #    "£50 per hour",
  #    "blah blah £50k blah",
  #    "bblah blah 50-60k",
  #    "blah blah 50000 blahblahblah",
  #    "blah 50000 - £60000 blablahblah ",
  #    "bblah blah 50k-£60k"] 
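I have assumed that every phone number consists of three strings of digits separated by hyphens, each hyphen optionally surrounded by spaces. If that assumption is incorrect, you will of course need to modify r0 accordingly.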
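As one illustration of such a modification, the sketch below loosens r0 so that the digit groups may be separated by spaces as well as hyphens. R0_LOOSE is a hypothetical name, and the assumption that phone numbers may be space-separated is mine, not something stated in the question.

```ruby
# Hypothetical looser variant of r0 (an assumption, not the original answer):
# the three digit runs may be separated by hyphens and/or whitespace.
R0_LOOSE = /
  \d+            # First run of digits
  (?:            # Begin a non-capture group
    [\s\-]+      # One or more hyphens or whitespace characters
    \d+          # Next run of digits
  ){2}           # Exactly two more runs, three in total
/x

phone_removed = "call 0207 123 4567 now".gsub(R0_LOOSE, '')
  #=> "call  now"
salary_kept = "blah 50000 - £60000 blah".gsub(R0_LOOSE, '')
  #=> "blah 50000 - £60000 blah"
```

The trade-off is that a looser pattern can also swallow a salary range that is followed directly by another number, so it should be checked against real messages before use.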

Now extract the desired values from the elements of arr1:

r1 = /
     £?         # Optionally begin with a pound sign
     \d+k?      # Match one or more digits optionally followed by k
     (?:        # Begin non-capture group
       \s*\-\s* # Match a hyphen optionally surrounded with spaces
       \d+k?    # Match one or more digits optionally followed by k
     )?         # End non-capture group and make the match optional
     \b         # word break
     /x         # Extended/free-spacing mode

arr1.map { |s| s[r1] }
  #=> ["£500", "£500-650", "£50", "£50k", "50-60k", "50000", "50000", "50k"] 
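Note that r1 returns only "50000" and "50k" for the last two test strings, because the second end of the range begins with a pound sign. If a pound sign is allowed on both ends, and the word "to" may also act as the separator, the two steps could be combined along the following lines. This is a sketch under those assumptions; PHONE, SALARY and first_salary are hypothetical names, not part of the original answer.

```ruby
# Same shape as r0: three hyphen-separated runs of digits.
PHONE = /\d+(?:\s*-\s*\d+){2}/

# Variant of r1 (an assumption): "£" is allowed on the second amount too,
# and the separator may be a hyphen or the word "to".
SALARY = /
  £?\d+k?            # First amount: optional pound sign, digits, optional k
  (?:                # Begin non-capture group
    \s*(?:-|to)\s*   # Range separator, optionally surrounded by spaces
    £?\d+k?          # Second amount
  )?                 # The second amount is optional
  \b                 # Word boundary
/x

# Remove phone-like strings first, then return the first salary or range
# (nil if there is none).
def first_salary(text)
  text.gsub(PHONE, '')[SALARY]
end

first_salary("blah 50000 - £60000 blablahblah 0207-123-4567")
  #=> "50000 - £60000"
first_salary("bblah blah 50k-£60k")
  #=> "50k-£60k"
```

Because String#[] with a regex returns only the first match, a number appearing later in the message is ignored; a number appearing before the salary would still be returned first.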

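This is very elegant, but it does not work for the 50000-60000 example. That is, I am trying to match the whole range (interestingly, it is the latter value, 60000, that matters most, and I would like the range as the string value "50000-60000"). But I do not want to match later instances, because I may run into things that look a bit like numbers but are not what I really want... For example, if a phone number in the email starts with 800 and I ignore whether it is "connected" to the first number I found, I will be in trouble.

Please clarify the above, as I do not think the question is clear.

Got it. I need to finish some work, then I will come back with a new solution. Give me an hour or so... In the meantime, could you post those lines with actual (more or less) examples? That would help me debug.

That is a bit tricky, as the strings can be quite long (each is a full email, including the description, for example). "Senior PHP developer needed, around £50000, London" would be one example... or "Hi Bob, I am recruiting a project manager... Rate: £500-750 per day..."

This looks like progress, but I have clarified the question with an edit. The strings containing => were my attempt to show the input string first, followed by the requested output after the arrow... sorry that was not clear. I will also clarify that sometimes it is a single number and sometimes it is a range of numbers.

The examples I am looking at are job postings. Sometimes people list the salary as £50k. Sometimes they list it as £50k to £60k. Either the single number or the range matters in that case.

Would they enter £50k to £60k? £50k is not bad. But isn't "£500 to £650" a hardship, especially in central London?

Not if they are paying £500-650 per day (as they are in this case...).

凯兰, don't forget the first sentence of my comment (I asked whether both ends of a range can begin with "£").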