Using a regular expression to split a Ruby string into sections

ruby regex

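I have a text file divided into sections, and I want to break it into an array with one string element per section. Each section's content will then be processed differently depending on which section it is. I'm currently working in irb and will most likely move this into a separate Ruby script file.

I've created both a String object and a File object from the input file ("sample" and "sample_file") to test different approaches. I'm sure a file-reading loop would work here, but I believe a simple match is all I need.

The file looks like this: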

*** Section Header ***

randomly formatted content
multiple lines

 *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)

This sections info
       **** sub headers sometime occur***
           I'm okay with treating this as normal headers for now.
           I think sub headers may have something consistent about them.


*** Header ***
  info for this section

Example output:

[*** Section Header ***\r\n\r\n randomly formatted content\r multiple lines, **** Another Header\r this sections info,*** sub header and its info, ...etc.]

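That is, [section string, section string, section string]. Most of my attempts have failed because of complications from inconsistent opening and closing conditions, or from the multiline nature of what I need.

Below are my most recent attempts. They either create unwanted elements (such as a string made of one header's closing asterisks and the next header's opening asterisks) or capture only a header.

This matches the headers: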

sample.scan(/\*{3}.*/)

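This matches headers and sections, but it creates elements from the closing and opening asterisks. I don't fully understand lookahead and lookbehind assertions, but based on my search for a solution, I think the answer will look something like this: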

sample.scan(/(?<=\*{3}).*?(?=\*{3})/m)

Any direction is greatly appreciated.

^((?:[ ]+|[ ]*\*)+.+)$

You can try this. Use [ ] instead of \s, since \s also covers newlines. Grab the captures.
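To make the difference between \s and [ ] concrete, here is a minimal sketch. The helper name first_match and the sample text are made up for illustration and do not come from the answer. It only shows that \s* can swallow the newline of a preceding blank line, while [ ]* stays on the header's own line.

```ruby
# Returns the first substring of `text` matched by `pattern` (nil if none).
# `first_match` is a hypothetical helper used only for this illustration.
def first_match(text, pattern)
  text[pattern]
end

# Illustrative input: a blank line followed by an indented header.
text = "intro\n\n *** Header"

# \s* matches the blank line's newline as well as the leading space.
p first_match(text, /^\s*\*{3}/)   # => "\n ***"

# [ ]* matches only literal spaces, so the match starts on the header line.
p first_match(text, /^[ ]*\*{3}/)  # => " ***"
```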

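There are many ways to accomplish what you're trying to do, but if you want to use a regex pattern, something like this might work (depending on the exact text, you may need to tweak it a little):

(.*[*].+[^*]*)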

Ruby's Enumerable includes slice_before, which is very useful for this kind of task:

str = "*** Section Header ***

randomly formatted content
multiple lines

 *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)

This sections info
       **** sub headers sometime occur***
           I'm okay with treating this as normal headers for now.
           I think sub headers may have something consistent about them.


*** Header ***
  info for this section
"
str.split("\n").slice_before(/^\s*\*{3}/).to_a
# => [["*** Section Header ***",
#      "",
#      "randomly formatted content",
#      "multiple lines",
#      ""],
#     [" *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)",
#      "",
#      "This sections info"],
#     ["       **** sub headers sometime occur***",
#      "           I'm okay with treating this as normal headers for now.",
#      "           I think sub headers may have something consistent about them.",
#      "",
#      ""],
#     ["*** Header ***", "  info for this section"]]

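Using slice_before lets me use a very simple pattern to locate a landmark/target that indicates where a sub-array break should occur. /^\s*\*{3}/ finds lines that start with an optional run of whitespace followed by three '*'. Each time one is found, a new sub-array begins.

If you'd like each sub-array to be a single string rather than an array of the lines in the chunk, map(&:join) is your friend: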
str.split("\n").slice_before(/^\s*\*{3}/).map(&:join)
# => ["*** Section Header ***randomly formatted contentmultiple lines",
#     " *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)This sections info",
#     "       **** sub headers sometime occur***           I'm okay with treating this as normal headers for now.           I think sub headers may have something consistent about them.",
#     "*** Header ***  info for this section"]

And, if you want to remove leading and trailing whitespace, you can combine strip with map:
str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip }
# => ["*** Section Header ***randomly formatted contentmultiple lines",
#     "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)This sections info",
#     "**** sub headers sometime occur***           I'm okay with treating this as normal headers for now.           I think sub headers may have something consistent about them.",
#     "*** Header ***  info for this section"]

Other variations are possible, depending on what you want to do.
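One possible variation, sketched here under assumptions: joining each chunk with "\n" instead of an empty separator keeps the line breaks inside every section. The helper name split_sections and the sample text are invented for illustration and are not part of the answer.

```ruby
# `split_sections` is a hypothetical helper name, not something from the answer.
# It returns one string per section, with the section's line breaks preserved.
def split_sections(text)
  text.split("\n")
      .slice_before(/^\s*\*{3}/)                 # start a new chunk at each header line
      .map { |lines| lines.join("\n").strip }    # rejoin the lines, trim outer whitespace
end

# Illustrative input only.
sample = "*** Section Header ***\n\ncontent\nmore\n\n *** Another Header\n\ninfo\n"

split_sections(sample).each do |section|
  puts section
  puts "---"
end
```

Because strip only trims the outer edges of each string, blank lines and indentation inside a section are left alone.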

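Splitting on "\r" produces better output on my real file than "\n" does.

Use /\r?\n/, which is a regular expression that looks for an optional carriage return followed by a newline. Windows uses the "\r\n" combination to mark the end of a line, while Mac OS and *nix use only "\n". Doing it this way means your code isn't tied to Windows only.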
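As a rough illustration of the /\r?\n/ suggestion, the sketch below splits the same content with Windows and Unix line endings and gets identical chunks. The helper names are hypothetical. The second helper is an extra assumption that goes beyond the answer: if a file also contains lone "\r" characters, a broader pattern such as /\r\n|\r|\n/ might be needed.

```ruby
# Hypothetical helper: split into lines on "\n" or "\r\n", then chunk by header.
def chunk_by_header(text)
  text.split(/\r?\n/).slice_before(/^\s*\*{3}/).to_a
end

# Assumption beyond the answer: also treat a lone "\r" as a line ending.
def chunk_by_header_any_eol(text)
  text.split(/\r\n|\r|\n/).slice_before(/^\s*\*{3}/).to_a
end

windows_text = "*** A ***\r\nline one\r\n*** B ***\r\nline two\r\n"
unix_text    = "*** A ***\nline one\n*** B ***\nline two\n"
old_mac_text = "*** A ***\rline one\r*** B ***\rline two\r"

p chunk_by_header(windows_text) == chunk_by_header(unix_text)          # => true
p chunk_by_header_any_eol(old_mac_text) == chunk_by_header(unix_text)  # => true
```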
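I don't know whether slice_before was developed for this particular use, but I've used it to take apart text files and break them into paragraphs, and to break network device configurations into chunks, which made parsing much easier in both cases.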
A more readable idea might be to split using a lookahead placed before the pattern:
str.split(/(?=\n *\*{3})/)
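A small sketch of how the lookahead split behaves, using invented sample text and a hypothetical helper name. Because a lookahead consumes no characters, the header stays in the output, whereas a plain split on the header pattern would drop the matched asterisks.

```ruby
# `split_keeping_headers` is a hypothetical helper name used for illustration.
def split_keeping_headers(text)
  # (?=...) is zero-width: split at the position just before "\n", optional
  # spaces, and three asterisks, without removing any of those characters.
  text.split(/(?=\n *\*{3})/)
end

sample = "*** A ***\nbody a\n *** B\nbody b\n"

p split_keeping_headers(sample)
# => ["*** A ***\nbody a", "\n *** B\nbody b\n"]

# Each piece after the first begins with the newline that preceded its header,
# so a strip pass tidies the result.
p split_keeping_headers(sample).map(&:strip)
# => ["*** A ***\nbody a", "*** B\nbody b"]
```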

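What is your expected output?
tom (or is it "twintur"), whenever you give an example (which is almost always a good thing), please always show your desired output. Even when it isn't needed to make the question clear, at a quick glance the example and the desired output usually tell the reader what you're trying to do.
Try sample.split(/^\s*\*{3}/). It's not exactly what you want, but it's a good starting point.
Mark, that works. Split removes the delimiter from the output (w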
str.split(/\r?\n/).slice_before(/^\s*\*{3}/).to_a
