Ruby - ARGF & RegEx: how to split on the paragraph "carriage return" "\r\n" but not on the end-of-line "\r\n"

Tags: ruby, regex, hadoop-streaming

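I'm trying to preprocess some text with regular expressions in Ruby before feeding it into a mapper job, and I'd like to split on the carriage returns that mark paragraphs.

The text will come into the mapper via ARGF.each as part of a Hadoop streaming job.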

"\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
"daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
"1789\"\r\n"
"\r\n"    # <----- this is where I would like to split
"Precisely such had the paragraph originally stood from the printer's\r\n"
Update:

The desired output is then an array like this:

[
  [0]  "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
    "daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
    "Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
    "June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
    "1789\"\r\n"
  [1] "Precisely such had the paragraph originally stood from the printer's\r\n"
]
Ultimately, what I want is the following array, with no carriage returns in it:

[
  [0]  "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,"
    "daughter of James Stevenson, Esq. of South Park, in the county of"
    "Gloucester, by which lady (who died 1800) he has issue Elizabeth, born"
    "June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,"
    "1789\""
  [1] "Precisely such had the paragraph originally stood from the printer's"
]
Thanks in advance for any insight you can provide.

To split the text, use:

result = text.gsub(/(?<!\")\\r\\n|(?<=\\\")\\r\\n/, '').split(/[\r\n]+\"\\r\\n\".*?[\r\n]+/)
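If the text is held as an ordinary Ruby string containing real CRLF characters (rather than the quoted, escaped form shown in the question), a simpler pattern may be enough. The sketch below covers that assumed case and is not the answer's method: it splits on two or more consecutive "\r\n" sequences, then breaks each paragraph into its lines. The names split_paragraphs and sample are made up for illustration.

```ruby
# Hypothetical helper: split CRLF text into paragraphs, each an array of lines.
def split_paragraphs(text)
  text
    .split(/(?:\r\n){2,}/)             # two or more CRLFs in a row end a paragraph
    .map { |para| para.split("\r\n") } # remaining single CRLFs are line endings
    .reject(&:empty?)                  # drop empty paragraphs from leading blank lines
end

sample = "first line,\r\nsecond line\r\n\r\nnext paragraph\r\n"
p split_paragraphs(sample)
# => [["first line,", "second line"], ["next paragraph"]]
```

This yields one inner array per paragraph, which matches the shape of the second desired output in the question.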

Be careful when you do ARGF.each do |text|: text will be each individual line, not the whole block of text.
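To see why this matters, here is a small sketch that uses StringIO as a stand-in for ARGF (an assumption made so the example runs without piped input; both yield one line per iteration by default). The blank "\r\n" line arrives as its own item, so a split inside the block never sees more than one line at a time.

```ruby
require "stringio"

# StringIO stands in for ARGF here; by default it also splits on "\n".
input = StringIO.new("line one\r\nline two\r\n\r\nnew paragraph\r\n")

chunks = input.each_line.to_a
p chunks
# => ["line one\r\n", "line two\r\n", "\r\n", "new paragraph\r\n"]
```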

You can give ARGF.each a special line separator, and it will then return two "lines", which in your case are the two paragraphs.

Try this:

paragraphs = ARGF.each("\r\n\r\n").map{|p| p.gsub("\r\n","")}

This first splits the input into the two paragraphs, then uses gsub to remove the unwanted line breaks.
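A quick way to try this without a Hadoop job is to swap ARGF for a StringIO, which accepts the same separator argument (the stand-in and the sample text are assumptions for illustration). The sketch also shows a variant that replaces each "\r\n" with a space, since deleting it outright glues the last word of one line to the first word of the next.

```ruby
require "stringio"

# StringIO stands in for ARGF so the example runs without piped input.
input = StringIO.new("born March 1,\r\nmarried July 15\r\n\r\nPrecisely such\r\n")

# Each chunk runs up to and including the "\r\n\r\n" separator.
raw = input.each_line("\r\n\r\n").to_a

joined_tight  = raw.map { |para| para.gsub("\r\n", "") }
joined_spaced = raw.map { |para| para.gsub("\r\n", " ").strip }

p joined_tight   # => ["born March 1,married July 15", "Precisely such"]
p joined_spaced  # => ["born March 1, married July 15", "Precisely such"]
```

Which variant is appropriate depends on whether the mapper later tokenizes on whitespace.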

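So what exactly is the output you want in the example above? An array of two elements?

This is a sample of the file I'm working with.

One way to simplify the problem would be to preprocess the file itself before feeding it into ARGF, removing the "\r\n" at the end of each line but keeping the "\r\n" that marks a paragraph.

Thanks, this is very helpful! In the example, I want to do ARGF.each do |text|; paragraphs = text.split(INSERT_REGEX_HERE) ## more code blocks to follow. Is there a way to keep this flow? That is, is it possible to pull in the full text following the pattern you describe above?

Are you talking about a flow at the full-text-input level, the paragraph level, or the line level? Either way, you can do text = ARGF.read to get all of the input and then do whatever you want with it.

I was talking about the paragraph level, but that can be done with paragraphs.each do |p|. One last thing! With p.gsub("\r\n", ""), is it also possible to strip non-alphanumeric characters as part of the same regex?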
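On the last question in the thread, one possible approach (a sketch, not something confirmed in the discussion) is to put both patterns into a single alternation: "\r\n" becomes a space, and any character that is neither alphanumeric nor whitespace is dropped. The name clean_paragraph is hypothetical, and whether punctuation such as apostrophes should really be removed depends on what the mapper needs.

```ruby
# Hypothetical helper: flatten line breaks and drop punctuation in one gsub call.
def clean_paragraph(para)
  para
    .gsub(/\r\n|[^[:alnum:][:space:]]/) { |m| m == "\r\n" ? " " : "" }
    .squeeze(" ") # collapse runs of spaces left by consecutive CRLFs
    .strip
end

p clean_paragraph("\"Walter Elliot, born March 1, 1760,\r\nmarried, July 15\r\n\r\n")
# => "Walter Elliot born March 1 1760 married July 15"
```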