Ruby: split a string at regex matches without removing any characters


I want to split this text by date, but without removing the dates from the string:

sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
   at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
   at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **

The first element of the resulting array would be:

sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
   at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @

Entries span a variable number of lines, so I can't simply split on newlines.

The dates have this format:

month_abbreviation + space(or two) + day_number

Something like this pseudocode:

three_letter_word + whitespace(s) + one_or_two_digit_number

would do the job.

Assuming the OP's description, "three letters + whitespace + one or two digits would do", is correct:

text.split(/(?=\w{3} +\d{1,2})/)
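
As a hedged sketch of the lookahead split above, the snippet below compares the pattern as posted with a variant that adds `\b` on both sides. The `\b` variant is an assumed refinement, not part of the answer, and `text`, `naive`, and `anchored` are illustrative names. Because a lookahead consumes nothing, joining the pieces should give back the original text.

```ruby
text = <<TEXT
sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
   at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
   at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **
TEXT

# Pattern as posted: any three word characters, spaces, one or two digits.
naive = text.split(/(?=\w{3} +\d{1,2})/)

# Assumed refinement: the three characters must form a whole word.
anchored = text.split(/(?=\b\w{3} +\d{1,2}\b)/)

puts naive.size            # includes an extra piece from "ico 18" in "Chico 18"
puts anchored.size         # one piece per entry
puts anchored.join == text # nothing was removed by the split
```

The word boundary only narrows the match; it still accepts any three-letter word followed by a number, so it relies on the OP's description of the data being accurate.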

There are only 12 months and 7 weekdays, so you can spell them out:

text = <<txt
sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
       at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 The Holdup, The Wheeland Brothers
       at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **
txt

# Split on month + day; the outer capture group keeps each date in the result.
text.split(/((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s+[123]?\d)/).each{|part|
  p part
}
p '-------------'
# Same split, with an optional weekday abbreviation included in the date.
text.split(/((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s+[123]?\d(?:\s*(?:mon|tue|wed|thu|fri|sat|sun))?)/).each{|part|
  p part
}

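Some details about the regexes:

`(?:…)` keeps the matched part out of the result (no `$1`, `$2`, … is created for it).
Only the group around the whole date has no `?:`, so it is captured and becomes part of the result.
Without the outermost `()`, the matches would be dropped from the result.
The regexes in my example are case-sensitive.
`[123]?\d` matches an optional 1, 2, or 3 followed by another digit. This also admits day numbers such as 32 or 33.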
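
To make the capture-group behaviour concrete, here is a small sketch on a shortened, made-up input. `MONTH_ALTERNATION`, `with_capture`, `without_capture`, and `entries` are illustrative names, and the last step (gluing each date back onto the text after it) is an assumed follow-up rather than something the answer spells out.

```ruby
MONTH_ALTERNATION = 'jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec'
text = "sep 25 fri A\n   at X\nsep 26 sat B\n"

# Outer capturing group: the dates are returned between the other pieces.
with_capture = text.split(/((?:#{MONTH_ALTERNATION})\s+[123]?\d)/)

# No capturing group: the dates are consumed as separators and lost.
without_capture = text.split(/(?:#{MONTH_ALTERNATION})\s+[123]?\d/)

# Drop the leading "" and join each date with the text that follows it.
entries = with_capture.drop(1).each_slice(2).map(&:join)

p with_capture
p without_capture
p entries
```

The leading empty string appears because the text starts with a match; `drop(1)` assumes that is always the case for this kind of input.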

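Since you need to split at every occurrence of a date, you need to pin down where the regex engine is positioned during matching. You can do that with a lookahead, `(?=…)`, wrapped around the tokens you want to keep.

For example, this pattern:

(?=[a-zA-Z]{3}\s+\d{1,2}\s+[a-zA-Z]{6,9})

Here the regex engine stops at the start of any three letters that are followed by one or more whitespace characters, one or two digits, one or more whitespace characters, and a word of 6 to 9 letters, for example sep 25 Friday. In this example the engine is positioned just before the s in sep. With that knowledge you can split the string in whatever programming language you choose:

line.split(/(?=[a-zA-Z]{3}\s+\d{1,2}\s+[a-zA-Z]{6,9})/)

`?=`: a lookahead; it matches the position just before the tokens you want to keep, without consuming them.

`[a-zA-Z]{3}`: matches 3 letters, since the month is written as a word rather than a number, e.g. sep.

`\s+\d{1,2}`: matches one or more whitespace characters followed by one or two digits.

`\s+[a-zA-Z]{6,9}`: matches one or more whitespace characters followed by at least 6 and at most 9 letters, because the shortest weekday name is Friday (6 letters) and the longest is Wednesday (9 letters).
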
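
A brief sketch of that lookahead in Ruby, on two made-up one-line inputs (`full` and `short` are illustrative, not the OP's data). It also shows a consequence of the `{6,9}` quantifier: the pattern expects full weekday names, so input that uses three-letter abbreviations such as fri is not split at all.

```ruby
DATE_AHEAD = /(?=[a-zA-Z]{3}\s+\d{1,2}\s+[a-zA-Z]{6,9})/

full  = 'sep 25 Friday The Phenomenauts at Jub Jubs sep 26 Saturday The Holdup in Chico'
short = 'sep 25 fri The Phenomenauts at Jub Jubs sep 26 sat The Holdup in Chico'

full_parts  = full.split(DATE_AHEAD)
short_parts = short.split(DATE_AHEAD)

p full_parts        # two pieces, each still starting with its date
p short_parts.size  # abbreviated weekdays do not satisfy {6,9}
```

For data like the OP's, the last part of the pattern would have to be relaxed, for instance to `[a-zA-Z]{3,9}`; that change is a suggestion here, not part of the answer.
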
Ruby has a wonderful method, available on Array (inherited from Enumerable), called `slice_before`. I'd use it like this:
str = <<EOT
sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **
EOT

MONTHS = %w[jan feb mar apr may jun jul aug sep oct nov dec]
MONTH_PATTERN = Regexp.union(MONTHS).source # => "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
MONTH_REGEX = /^(?:#{ MONTH_PATTERN })\b/i # => /^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/i

schedule = str.lines.slice_before(MONTH_REGEX).to_a
# => [["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n",
#      "    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n"],
#     ["sep 25 fri The Holdup, The Wheeland Brothers\n",
#      "    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]]

schedule[0]
# => ["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n",
#     "    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n"]

schedule[1]
# => ["sep 25 fri The Holdup, The Wheeland Brothers\n",
#     "    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]

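This is a technique we use in-house to break huge configuration files into sections: split the file into lines, then use a regex to find the lines that mark the start of each section.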
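
If strings are wanted rather than arrays of lines, each chunk can be joined back together. The sketch below also assumes a stricter start-of-entry marker (month abbreviation plus a day number) so that an ordinary line beginning with "May" does not open a new entry. `ENTRY_START` and the sample text are illustrative assumptions.

```ruby
MONTHS = %w[jan feb mar apr may jun jul aug sep oct nov dec]

# Assumed marker: month abbreviation at line start, then a 1-2 digit day.
ENTRY_START = /^(?:#{Regexp.union(MONTHS).source})\s+\d{1,2}\b/i

str = <<EOT
sep 25 fri The Holdup, The Wheeland Brothers
    at the El Rey Theatre, Chico 18+
May Beth come too?
oct 3 sat Someone Else
EOT

# slice_before starts a new chunk at every line matching ENTRY_START;
# join turns each chunk of lines back into a single string.
entries = str.lines.slice_before(ENTRY_START).map(&:join)

p entries.size
p entries.join == str
```

Since `lines` keeps the trailing newlines, joining the entries reproduces the input exactly.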
I can see that every line except the first line of each record is indented by a few spaces, so you could use str.split(/\n(?!\s+)/)
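
A minimal sketch of the indentation idea on a shortened, made-up input. The first split is the one suggested above, which consumes the newline at each boundary. The second is an assumed variant that uses a lookbehind and a lookahead so that no character is removed.

```ruby
str = "sep 25 fri The Shames\n   at Jub Jubs, Reno\n" \
      "sep 25 fri The Holdup\n   at the El Rey Theatre, Chico\n"

# As suggested: split on a newline that is not followed by whitespace.
dropping = str.split(/\n(?!\s+)/)

# Assumed variant: split at the empty position after a newline and
# before a non-whitespace character, so the newline is kept.
keeping = str.split(/(?<=\n)(?=\S)/)

p dropping
p keeping
p keeping.join == str
```

Both versions depend on continuation lines always being indented and entry lines never being indented.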
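You specified that you want to split on dates. Accordingly, I do not split on any string that has the specified date format but cannot be converted to a date, including "sep 31 sat" and "sep 26 wed" (the latter falls on a "sat" this year). I assume that date substrings can appear anywhere in the string. If you want to require that they start at the beginning of a line, that is of course a simple modification.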
str =
"sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
       at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 31 mon at some other place 
oct 26 sat The Holdup, The Wheeland Brothers
       at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **"

require 'date'

arr = str.split.
          map(&:capitalize).
          each_cons(3).
          map { |a| a.join(' ') }.
          select { |s| Date.strptime(s, '%b %d %a') rescue nil }
  #=> ["Sep 25 Fri", "Oct 26 Sat"]

r = /(#{ arr.join('|') })/i
  #=> /(Sep 25 Fri|Oct 26 Sat)/i

str.split(r)
  #=> ["",
  #    "sep 25 fri",
  # " The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n\
  #  at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n    sep 31\
  #   mon at some other place \n    ",
  # "oct 26 sat",
  # " The Holdup, The Wheeland Brothers\n           at the El Rey Theatre,\
  #   Chico 18+ (a/a with adult) 7:30pm/8:30pm **"]

To avoid empty strings at the beginning and end of the returned array, use:
str.split(r).delete_if(&:empty?)
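
The validation step can be isolated as below. This is a sketch under stated assumptions: `real_month_day?` is a hypothetical helper, the weekday is deliberately left out of the format so the result does not depend on the current year, and the sample strings are made up. It relies only on `Date.strptime` raising an `ArgumentError` (or a subclass) for input that is not a real date.

```ruby
require 'date'

# Hypothetical helper: true when "mon dd" names a real calendar day.
def real_month_day?(candidate)
  Date.strptime(candidate.capitalize, '%b %d')
  true
rescue ArgumentError
  false
end

candidates = ['sep 25', 'sep 31', 'oct 26', 'cat 99']
valid = candidates.select { |c| real_month_day?(c) }

# Build the splitting regex from the dates that passed validation.
r = /(#{Regexp.union(valid).source})/i
parts = 'sep 25 fri A sep 31 mon B oct 26 sat C'.split(r).reject(&:empty?)

p valid
p parts
```

As in the answer, "sep 31" stays inside the preceding piece because it is never added to the regex.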

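What are the rules? How are the dates delimited, and what is their format? If every element is on its own line, you may only need to split by line: string.split("\n")

@hwnd I have edited the question in response.

@alexey-shein Each entry has an unknown number of lines, so that won't work. I have edited the question.

You need to edit to: 1) make your input a valid string ('sep 25….'), 2) assign your input to a variable so readers can refer to it without having to define it (str = 'sep 25…'), and 3) show the desired output for the given input.

You need to put \b before \w{3}, because otherwise it will match in the middle of a word. In fact, with the OP's example it splits at Chico 18.

@AlexeyShein I know. As I emphasized, my code only works under the assumption that what the OP wrote is correct. Whether that is the case is the OP's problem.

Splitting on month abbreviations alone is not very discriminating. For example: str = "Can Tom come to your party?\nMay Beth come too?"; schedule = str.lines.slice_before(MONTH_REGEX).to_a #=> [["Can Tom come to your party?\n"], ["May Beth come too?"]].

Incorporating the day and year is simple, but since the remaining lines start with whitespace, this isn't a problem.

Yes, it's simple, but it may not be obvious to an inexperienced Rubyist reading your answer.

Doesn't this also match "cat 99 dog"?

Yes, it does, but the OP just needed a pattern to do this. He could also substitute explicit lists of all the months and weekdays for [a-zA-Z]+.

I accepted this answer because being able to specify the split boundaries with a block is useful in many situations, and it worked for me.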