Ruby on Rails: how to use a regexp to split a string on every space except the spaces between words

I am using split to separate this string:

@output = "5 490 'Msci Italy' 'Msci Germany' 'Msci France' 'Msci Spain' 'Msci Emu' '05/01/2007' '12/01/2007' '19/01/2007' '26/01/2007' '02/02/2007' 0.2000 0.1996 0.1994 0.2001 0.1983"
I would like to get an array of the following form:

@array_output = ["5", "490", "'Msci Italy'", "'Msci Germany'", "'Msci France'", "'Msci Spain'", "'Msci Emu'", "'05/01/2007'", "'12/01/2007'", "'19/01/2007'", "'26/01/2007'", "'02/02/2007'", "0.2000", "0.1996", "0.1994", "0.2001", "0.1983"]
I tried using:

@array_output = @output.split(/\s(?!\w)|\s(?=\d)/)
This works on Rubular, but when I try to print an element of @array_output (at any index) to an html.erb page in Rails, I get nothing.

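The @output string can have different lengths; this is just a small sample that shows all the possible formats. The order of the formats is always the same, though.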

I initialized @array_output with @array_output = Array.new, but that does not affect the result.

I also tried scan instead of split, but nothing changed.


What is going wrong?

I just tried CSV and it works, except that the quotes are dropped. If that is acceptable to you:

require 'csv'

@output = "5 490 'Msci Italy' 'Msci Germany' 'Msci France' 'Msci Spain' 'Msci Emu' '05/01/2007' '12/01/2007' '19/01/2007' '26/01/2007' '02/02/2007' 0.2000 0.1996 0.1994 0.2001 0.1983"

@array_output = CSV.parse_line(@output, col_sep: " ", quote_char: "'")
#=> ["5", "490", "Msci Italy", "Msci Germany", "Msci France", "Msci Spain", "Msci Emu", "05/01/2007", "12/01/2007", "19/01/2007", "26/01/2007", "02/02/2007", "0.2000", "0.1996", "0.1994", "0.2001", "0.1983"]
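If the surrounding quotes need to be kept, one alternative is scan with a pattern that matches whole tokens rather than the gaps between them. The following is a minimal sketch, not part of the original answer; the constant name TOKEN is made up for illustration, and it assumes that quoted fields never contain a single quote themselves.

```ruby
# Sample input copied from the question.
output = "5 490 'Msci Italy' 'Msci Germany' 'Msci France' 'Msci Spain' 'Msci Emu' " \
         "'05/01/2007' '12/01/2007' '19/01/2007' '26/01/2007' '02/02/2007' " \
         "0.2000 0.1996 0.1994 0.2001 0.1983"

# A token is either a complete single-quoted field or a run of non-whitespace.
# The quoted alternative is listed first so a quoted field is consumed whole,
# including any spaces inside it. (TOKEN is a hypothetical name.)
TOKEN = /'[^']*'|\S+/

# With no capture groups, scan returns a flat array of the matched strings.
tokens = output.scan(TOKEN)

puts tokens.inspect
```

Because the pattern describes the fields instead of the separators, it does not depend on how many fields of each kind the line contains.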

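First, split is not the right tool: defining a pattern for split that produces the correct output would be a nightmare. Instead, here is how I would break it down: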

regex = /
(\d+)
\s+
(\d+)
\s+
('[^']+')
\s+
('[^']+')
\s+
('[^']+')
\s+
('[^']+')
\s+
('[^']+')
\s+
('[^']+')
\s+
('[^']+')
\s+
('[^']+')
\s+
('[^']+')
\s+
('[^']+')
\s+
([\d.]+)
\s+
([\d.]+)
\s+
([\d.]+)
\s+
([\d.]+)
\s+
([\d.]+)
/x
mat = regex.match("5 490 'Msci Italy' 'Msci Germany' 'Msci France' 'Msci Spain' 'Msci Emu' '05/01/2007' '12/01/2007' '19/01/2007' '26/01/2007' '02/02/2007' 0.2000 0.1996 0.1994 0.2001 0.1983")
Which results in:

require 'ap'

ap mat.captures 

# >> [
# >>   [ 0] "5",
# >>   [ 1] "490",
# >>   [ 2] "'Msci Italy'",
# >>   [ 3] "'Msci Germany'",
# >>   [ 4] "'Msci France'",
# >>   [ 5] "'Msci Spain'",
# >>   [ 6] "'Msci Emu'",
# >>   [ 7] "'05/01/2007'",
# >>   [ 8] "'12/01/2007'",
# >>   [ 9] "'19/01/2007'",
# >>   [10] "'26/01/2007'",
# >>   [11] "'02/02/2007'",
# >>   [12] "0.2000",
# >>   [13] "0.1996",
# >>   [14] "0.1994",
# >>   [15] "0.2001",
# >>   [16] "0.1983"
# >> ]
Rearranged a little for readability:

regex = /
(\d+) \s+

(\d+) \s+

('[^']+') \s+
('[^']+') \s+
('[^']+') \s+
('[^']+') \s+
('[^']+') \s+

('[^']+') \s+
('[^']+') \s+
('[^']+') \s+
('[^']+') \s+
('[^']+') \s+

([\d.]+) \s+
([\d.]+) \s+
([\d.]+) \s+
([\d.]+) \s+
([\d.]+)
/x
mat = regex.match("5 490 'Msci Italy' 'Msci Germany' 'Msci France' 'Msci Spain' 'Msci Emu' '05/01/2007' '12/01/2007' '19/01/2007' '26/01/2007' '02/02/2007' 0.2000 0.1996 0.1994 0.2001 0.1983")

require 'ap'

ap mat.captures 
We still get:

# >> [
# >>   [ 0] "5",
# >>   [ 1] "490",
# >>   [ 2] "'Msci Italy'",
# >>   [ 3] "'Msci Germany'",
# >>   [ 4] "'Msci France'",
# >>   [ 5] "'Msci Spain'",
# >>   [ 6] "'Msci Emu'",
# >>   [ 7] "'05/01/2007'",
# >>   [ 8] "'12/01/2007'",
# >>   [ 9] "'19/01/2007'",
# >>   [10] "'26/01/2007'",
# >>   [11] "'02/02/2007'",
# >>   [12] "0.2000",
# >>   [13] "0.1996",
# >>   [14] "0.1994",
# >>   [15] "0.2001",
# >>   [16] "0.1983"
# >> ]
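Even that is a bit messy. For complex patterns, or in code that needs to be easy to maintain and understand, people often break the pattern generation into small steps and let the language build the complex pattern, like this: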

SINGLE_QUOTED_PATTERN = "('[^']+')"
INTEGER_PATTERN = '(\d+)'
FLOAT_PATTERN = '([\d.]+)'
WHITE_SPACE_PATTERN = '\s+'

REGEX_STRING = [
  [INTEGER_PATTERN] * 2,
  [SINGLE_QUOTED_PATTERN] * 10,
  [FLOAT_PATTERN] * 5
].flatten.join(WHITE_SPACE_PATTERN) 

REGEX = /#{REGEX_STRING}/
# => /(\d+)\s+(\d+)\s+('[^']+')\s+('[^']+')\s+('[^']+')\s+('[^']+')\s+('[^']+')\s+('[^']+')\s+('[^']+')\s+('[^']+')\s+('[^']+')\s+('[^']+')\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)/

data = "5 490 'Msci Italy' 'Msci Germany' 'Msci France' 'Msci Spain' 'Msci Emu' '05/01/2007' '12/01/2007' '19/01/2007' '26/01/2007' '02/02/2007' 0.2000 0.1996 0.1994 0.2001 0.1983"

mat = REGEX.match(data)
Which again results in:

require 'ap'

ap mat.captures 

# >> [
# >>   [ 0] "5",
# >>   [ 1] "490",
# >>   [ 2] "'Msci Italy'",
# >>   [ 3] "'Msci Germany'",
# >>   [ 4] "'Msci France'",
# >>   [ 5] "'Msci Spain'",
# >>   [ 6] "'Msci Emu'",
# >>   [ 7] "'05/01/2007'",
# >>   [ 8] "'12/01/2007'",
# >>   [ 9] "'19/01/2007'",
# >>   [10] "'26/01/2007'",
# >>   [11] "'02/02/2007'",
# >>   [12] "0.2000",
# >>   [13] "0.1996",
# >>   [14] "0.1994",
# >>   [15] "0.2001",
# >>   [16] "0.1983"
# >> ]
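Since the question says the string can vary in length while the order of the formats stays the same, the same building blocks can be assembled from counts. This is only a sketch under that assumption: build_row_regex and its keyword arguments are hypothetical names, not part of the original answer, and the caller is assumed to know how many fields of each kind to expect.

```ruby
SINGLE_QUOTED = "('[^']+')"
INTEGER       = '(\d+)'
FLOAT         = '([\d.]+)'
WHITE_SPACE   = '\s+'

# Hypothetical helper: builds an anchored regex for a row made of
# `integers` integer fields, then `quoted` single-quoted fields,
# then `floats` float fields, all separated by whitespace.
def build_row_regex(integers:, quoted:, floats:)
  parts = [INTEGER] * integers + [SINGLE_QUOTED] * quoted + [FLOAT] * floats
  Regexp.new('\A' + parts.join(WHITE_SPACE) + '\z')
end

# A shorter row than the one in the question, in the same field order.
row   = "1 2 'Msci Italy' '05/01/2007' 0.2000"
regex = build_row_regex(integers: 2, quoted: 2, floats: 1)
match = regex.match(row)

# match is nil when the row does not have the expected shape.
captures = match ? match.captures : []
puts captures.inspect
```

Anchoring with \A and \z makes a row with extra or missing fields fail to match instead of matching partially.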

You can use a negative lookbehind together with a negative lookahead:

/(

Let's break this regex down:

(? This is the negative lookbehind

\s+
One or more whitespace characters


(?![a-zA-Z])
This is the negative lookahead
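A minimal sketch of this idea follows. The separator pattern here is an assumption, not the original answer's exact regex: it treats whitespace as a separator only when there is no ASCII letter on either side, with both character classes written as [A-Za-z]. SEPARATOR is a hypothetical name.

```ruby
# Sample input copied from the question.
output = "5 490 'Msci Italy' 'Msci Germany' 'Msci France' 'Msci Spain' 'Msci Emu' " \
         "'05/01/2007' '12/01/2007' '19/01/2007' '26/01/2007' '02/02/2007' " \
         "0.2000 0.1996 0.1994 0.2001 0.1983"

# Assumed pattern: one or more whitespace characters that are
# not preceded by a letter (negative lookbehind) and
# not followed by a letter (negative lookahead).
SEPARATOR = /(?<![A-Za-z])\s+(?![A-Za-z])/

# The space inside 'Msci Italy' has letters around it, so it is not a split point.
fields = output.split(SEPARATOR)

puts fields.inspect
```

This relies on the shape of the sample data: a quoted value such as '10 20', where the inner space has no letter on either side, would still be split in two.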

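I just tried your code and it works fine for me. Can you give more details about your code? One small correction: the 'Msci Spain' value is missing from your expected result. Please format more carefully; it helps us help you, and helps others understand what you are asking.

Yes @knut, sorry, I forgot 'Msci Spain' in the expected result. I have edited the question to correct it, thanks. Thanks @Tin Man for editing my post to improve the formatting; I am fairly new to StackOverflow.

Will the format/order of the values in @array_output vary, or are the fields fixed?

@Tin Man I edited the question to include that information, thanks for your comment.

Missing quotes are fine. I tried this one, but it only fills the array at index 0. So @array_output[0] gives me 5, but @array_output[1] or any other index gives me nothing. I also used @array_output = Array.new because I was worried that @array_output was a plain variable rather than an array, but nothing changed.

That is strange. Something else is probably wrong; for example, the two @array_output might not actually be the same object.

If I use split(), it fills the array beyond index 0 (although it also splits incorrectly between words). Could there be an error in the parse_line arguments?

Your question did not say that. When you ask, we expect you to provide everything we need to know. If the data is variable, you are better off using CSV with the separator changed to a space, and accepting that you will not get the wrapping single quotes.

To clarify for readers: I deleted my earlier comment, in which I said I could not accept a fixed solution, because I saw that the new edits to the answer seem to offer a more flexible solution. In the meantime, the user had replied to my deleted comment.

The solution above will work if the order/format of the output string is always the same. Field lengths do not matter, only that the fields appear in the specified order. It is important to wait at least a few hours after asking a question. This is a global site, so you want to allow time for people all over the world to help and to work out their answers. Writing a detailed, educational answer takes time, and there are often several edits along the way.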